Beyond Local Explanations: A Framework for Global Concept-Based Interpretation in Image Classification
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Very good and interesting paper. The authors presented a novel framework for generating global explanations in terms of human-adapted concepts, which is applicable to any image-based classifier. It is widely architecture-independent, which is very important. The proposed methods provide both local and global explanations for multiple images from multiple classes. Good level of the paper; very well done on the editing side too.
Comments for author File: Comments.pdf
Author Response
Comment 1: "Maybe more detailed time analysis (runtime benchmarks), especially when scaling to larger datasets. Considering parallel/GPU implementation."
Response 1: "We avoid providing runtime benchmarks because runtimes vary significantly across GPUs, as the approach involves a forward pass through a deep network. The beam search procedure takes about 0.1 seconds per image. We have added additional details about GPU consumption and runtime in lines 305-308."
Comment 2: "Post-processing (e.g., logical minimization) or rule hierarchy could be applied."
Response 2: "Logical minimization yields a ~80% reduction in DNF size, but only 34% of the test coverage is preserved. Although logical minimization would yield a much more manageable DNF for viewing, from the explainability perspective we would like to uncover spurious combinations; minimized expressions can lose the granularity needed for uncovering spurious correlations.
Great point about rule hierarchy; we are currently exploring that possibility for a follow-up paper."
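To make the trade-off above concrete, one classic minimization step, absorption, can be sketched in a few lines. This is a toy illustration with hypothetical concept names, not the pipeline's actual minimization:

```python
# Hypothetical sketch of one DNF-minimization step (absorption): a
# conjunction is dropped when another conjunction uses a strict subset
# of its literals. Concept names are illustrative only.
def absorb(dnf):
    """Remove clauses subsumed by a strictly smaller clause."""
    clauses = {frozenset(c) for c in dnf}
    return [c for c in clauses if not any(o < c for o in clauses)]

rules = [{"beak", "wing"}, {"beak", "wing", "tail"}, {"beak"}]
print(absorb(rules))  # [frozenset({'beak'})]
```

Note how the surviving rule no longer mentions `wing` or `tail`: exactly the loss of detail that can hide a spurious concept combination.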
Comment 3: "A sensitivity analysis could be added: how many annotated images are needed to obtain reasonable results?"
Response 3: "For the CUB and STF datasets we manually annotated about 15% of the images, using distance-based measures to maximize viewpoints, and were able to achieve as high as 99% coverage on STF and 59% on CUB. Conducting a sensitivity analysis would require us to annotate more examples, which was not possible within the review period. Note that the amount of annotated data versus test coverage is highly dependent on the complexity of the dataset. For example, in the ADE case we use 100% annotations yet achieve a lower test coverage than STF and CUB, which had 15% annotations."
Reviewer 2 Report
Comments and Suggestions for Authors
(1) The abstract needs to be further condensed, and necessary conclusive data support needs to be added. Currently, there are almost no quantitative descriptions in the abstract, so the advance of the proposed methodology cannot be reflected.
(2) This manuscript has a certain degree of innovation in theory, and the proposed methodology is relatively solid from a mathematical theory point of view.
(3) The experimental part is not very comprehensive; it is suggested to expand the evaluation indicators and add standardized machine learning (ML) indicators.
(4) In some of the line-chart figures in this manuscript, the curves and legends are superimposed; modifications must be made.
Author Response
Comment 1: "The abstract needs to be further condensed, and necessary conclusive data support needs to be added. Currently, there are almost no quantitative descriptions in the abstract, so the advance of the proposed methodology cannot be reflected."
Response 1: "Thank you for the suggestion; we have added quantitative descriptions to the abstract at line 12."
Comment 2: "The experimental part is not very comprehensive; it is suggested to expand the evaluation indicators and add standardized machine learning (ML) indicators."
Response 2: "Explainability is very difficult to measure without a human in the loop. We used coverage as our primary metric, but we are not sure what other standardized metric could be adopted for measuring global explanations; metrics do exist for measuring local explanations."
Comment 3: "In some of the line-chart figures in this manuscript, the curves and legends are superimposed; modifications must be made."
Response 3: "Thank you for pointing this out; we agree and have made changes to Figure 5 to make sure the legends are not superimposed."
Reviewer 3 Report
Comments and Suggestions for Authors
The authors propose a model-agnostic black-box explanation framework that extracts minimal sufficient explanations (MSX), symbolically encodes them, and applies a greedy set cover algorithm to generate human-interpretable, concept-based global explanations for image classifiers.
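The greedy set-cover step named in this summary can be sketched as follows; the identifiers and toy data are hypothetical and this is not the authors' implementation:

```python
# Hypothetical sketch of greedy set cover: repeatedly pick the candidate
# MSX that explains the most still-uncovered images, until every image is
# covered or no candidate adds coverage.
def greedy_set_cover(universe, candidates):
    """candidates: dict mapping an MSX id to the set of images it explains."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda m: len(candidates[m] & uncovered))
        if not candidates[best] & uncovered:
            break  # remaining images cannot be covered by any candidate
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

images = {1, 2, 3, 4, 5}
msx = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}}
print(greedy_set_cover(images, msx))  # ['A', 'C']
```

The greedy rule gives the standard logarithmic approximation guarantee for set cover, which is why it is a common choice for selecting a compact set of global explanations.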
Major Issues
1. The way HyperPixelFlow fuses spatial consistency weights with semantic consistency weights is unclear; please provide a clear formula or pseudocode to illustrate how appearance constraints and geometric constraints are combined.
2. The beam search used to extract the minimal sufficient explanation (MSX) may have redundant paths or fail to converge; please discuss its worst-case complexity or provide empirical evidence of convergence and reproducibility.
3. The MSX masking algorithm uses a fixed Gaussian blur and a confidence interval of 0.95, and no sensitivity analysis is performed; please evaluate the impact of different blur widths and score thresholds on the stability of the explanation.
4. Table 1 omits the number of samples per category, and the average test coverage lacks standard deviation or confidence interval; add category distribution and statistical indicators to reveal bias and result stability.
5. Insufficient analysis is provided; for example, visualization examples of low-coverage ADE20k classes should be included, and comparisons should be made with random or frequency-based MSX selection to quantify the advantages of the greedy approach.
Minor issues
1. Symbolic descriptions are missing in equations (1) and (2). The meanings of the subscripts l, c, i, and j in ∆lcij and wlcij are not centrally explained. It is recommended to add brief comments before and after the equations.
2. When citing "Algorithm 1" in the main text, the algorithm title is not given in the previous text. It is recommended to add a title "Algorithm 1: Greedy Set Cover" above the algorithm box.
3. The description of SLIC superpixel parameters needs to be supplemented. Chapter 4 mentions a fixed number of segmentations "k = 60", but it is not stated whether it is used uniformly on the three datasets ADE20k, CUB-200, and Cars. It is recommended to add a description or set it separately for different complexities.
4. The second reference in the reference list at the end of the article contains redundant abbreviations "GDPR, G.D.P.R.". It is recommended to correct it to "GDPR: General Data Protection Regulation".
5. [23] and [7] are both cited as Selvaraju et al. Grad-CAM. The same work appears twice. It is recommended to check and merge.
6. "GitHub" is mentioned in the article, but the actual URL is not provided. It is recommended to add the specific repository address.
Author Response
Comment 1: "The way HyperPixelFlow fuses spatial consistency weights with semantic consistency weights is unclear; please provide a clear formula or pseudocode to illustrate how appearance constraints and geometric constraints are combined."
Response 1: "We have added additional description regarding HyperPixelFlow in lines 231-245."
Comment 2: "The beam search used to extract the minimal sufficient explanation (MSX) may have redundant paths or fail to converge; please discuss its worst-case complexity or provide empirical evidence of convergence and reproducibility"
Response 2: "The beam search process is reproducible, as the search is deterministic; we will provide the worst-case complexity. For empirical evidence of convergence, we experimented with beam widths larger than 5 and depths larger than 20 and found no significant benefit, therefore showing convergence. We added the worst-case complexity in lines 306-308."
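A deterministic beam search of the kind described can be sketched as below; `expand` and `score` stand in for the model-dependent parts (candidate-mask generation and the classifier's confidence) and are hypothetical:

```python
# Hypothetical sketch of a deterministic, reproducible beam search.
# With branching factor b, each of the `depth` iterations scores at most
# width * b successors, so the worst case is O(depth * width * b) calls
# to `score` (plus sorting).
def beam_search(start, expand, score, width=5, depth=20):
    beam = [start]
    for _ in range(depth):
        pool = [n for s in beam for n in expand(s)]
        if not pool:
            break
        # Tie-break on a fixed key so repeated runs give identical results.
        pool.sort(key=lambda s: (-score(s), repr(s)))
        beam = pool[:width]
    return beam[0]

# Toy usage: maximize the value reached in 3 steps of +1/+2 moves.
print(beam_search(0, lambda s: [s + 1, s + 2], lambda s: s, width=2, depth=3))  # 6
```

Determinism here comes from the fixed expansion order and the explicit tie-breaking key, which is what makes the procedure reproducible across runs.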
Comment 3: "The MSX masking algorithm uses a fixed Gaussian blur and a confidence interval of 0.95, and no sensitivity analysis is performed; please evaluate the impact of different blur widths and score thresholds on the stability of the explanation."
Response 3: "For explainability, the goal is to understand the model's behavior as completely as possible. Reducing the threshold below 95% does not benefit explainability, as any such explanations do not track the model's original decision and therefore lower test coverage. Ideally the threshold would be 100%, but factoring in some variance due to blurring we allow a 5% error compared to the model's actual decision. Although the confidence threshold is proportional to test coverage, we will add a sensitivity analysis in the final version of the paper.
For the ResNet101 model we observed about a ~1.5% change in overall test coverage with respect to Gaussian kernel sizes ranging from 11 to 91, with 51 (the value we chose) having the best test coverage."
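The kernel-size sweep described here can be sketched as below; `coverage_for` stands in for the full test-coverage evaluation over the dataset, and the toy scoring function is hypothetical (it merely mirrors a peak at 51, not real measurements):

```python
# Hypothetical sweep over odd Gaussian kernel sizes (11, 21, ..., 91),
# picking the one with the highest measured test coverage.
def best_kernel(coverage_for, sizes=range(11, 92, 10)):
    """Return the kernel size whose evaluated test coverage is highest."""
    return max(sizes, key=coverage_for)

# Toy stand-in scorer that peaks at 51; a real run would re-blur the
# masked regions and recompute test coverage for each kernel size.
toy_coverage = lambda k: 1.0 - abs(k - 51) / 100.0
print(best_kernel(toy_coverage))  # 51
```

In practice each call to `coverage_for` requires re-running the explanation pipeline, so the sweep's cost is dominated by the coverage evaluation rather than the `max` itself.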
Comment 4: "Table 1 omits the number of samples per category, and the average test coverage lacks standard deviation or confidence interval; add category distribution and statistical indicators to reveal bias and result stability."
Response 4: "Thank you for the suggestion. We avoided adding the number of samples per category because CUB and STF have hundreds of categories. For revealing bias, our initial version did contain the standard deviation along with the average test coverage in Table 1."
Comment 5: "Insufficient analysis is provided; for example, visualization examples of low-coverage ADE20k classes are included, and comparisons are made with random or frequency-based MSX selection to quantify the advantages of the greedy approach"
Response 5: "We are not sure where we have added a comparison with random or frequency-based MSXs. We included the classes with both the highest and lowest test coverage because ADE20k has ~200 classes, and it is infeasible to visualize coverage for all of them."
We have taken care of the minor revisions, thank you for catching them.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have made proper revisions according to the comments in the last round; it is recommended to accept this manuscript.