Quantifying AI Model Trust as a Model Sureness Measure by Bidirectional Active Processing and Visual Knowledge Discovery
Round 1
Reviewer 1 Report (Previous Reviewer 1)
Comments and Suggestions for Authors
- The updated version of the manuscript has been improved and takes into consideration some of the points from the previous review report, which have been adequately addressed in the updated manuscript. However, points 5, 7, 9, 10, 11, 13, and 14 are not addressed or only partially addressed, especially the lack of true experimental benchmarking (Point 5), the lack of a quantitative VKD explanation (Points 13–14), the weak justification of novelty, and the high-level, unmeasured computational complexity (Point 10 requires improvement). The points that are not or only partially addressed are core points that must be adequately addressed in the manuscript.
- The updated manuscript relies heavily on non-peer-reviewed references (e.g., references [24], [26], [27], [31], [32], [41], [46], [47], [55], [58], [67], [20], [34], [35]), which reduces the credibility of the manuscript. Please kindly use peer-reviewed alternatives.
- More than 43% of the references are dated before 2020, which is considered a bit old given the recency of the topic. Additionally, more than 20% of the references are dated before 2015, which is definitely old. Please consider using recent alternatives.
There are still many grammatical/spelling mistakes in the updated manuscript, such as:
- “and ensuring prediction consistency” should be “and ensures prediction consistency”
- “NP hard” should be “NP-hard”
- “A most updated review” should be “The most updated review”
- …
- The paper really needs full proofreading.
Author Response
Thank you for your review, please see the attached pdf responding to each point in detail.
Author Response File:
Author Response.pdf
Reviewer 2 Report (Previous Reviewer 2)
Comments and Suggestions for Authors
My previous concerns have been adequately addressed in this version. I have two minor suggestions regarding the current status.
- The explanations of the results of the experiments could be enriched.
- The concluding comments are a little too long and could be refined slightly.
Author Response
Thank you for your review, please see the attached pdf responding to each point in detail.
Author Response File:
Author Response.pdf
Reviewer 3 Report (New Reviewer)
Comments and Suggestions for Authors
1. The term “Active Learning” (AL) has a very specific meaning in machine learning: an algorithm actively queries an oracle (a human) to obtain labels for new, unlabeled data points (as noted in the paper’s own citation [80] and Section 3.1). The proposed BAL method (Section 3.1) explicitly assumes all labels are already known and operates on a fully labeled dataset. It does not query an oracle. This method is a form of “iterative instance selection” or “iterative subset evaluation.” Using the term “Active Learning” is confusing and misrepresents the methodology.
2. The paper repeatedly claims BAL is “simpler” (Sec 1.3) and “more scalable” (Sec 2.3.4) than other methods. These claims are directly contradicted by the authors’ own complexity analysis (Section 4.4) and the limitations section of the conclusion (Section 6), which correctly states the approach “requires extensive computational resources.” The BAL method requires retraining the model t * (N/m) times (per the complexity analysis). This is computationally massive and is a weakness, not a strength. The authors should reframe this: the method is not scalable, but (as they also argue in Sec 2.3.4) it is thorough and fine-grained, allowing it to avoid local minima found by k-fold CV or binary search. (A back-of-the-envelope retraining-count comparison is sketched after these comments.)
3. The title and abstract promise a method that combines BAL and VKD. However, the VKD component is poorly integrated. Section 2.3.3 describes VKD, but the case studies in Section 5 use visualization (PCA, t-SNE, Parallel Coordinates) only for post-hoc analysis of the results generated by the computational BAL. The “interactive” discovery aspect of VKD is not demonstrated or used in the main methodology. The VKD part feels tacked on.
4. The figure numbering in Section 5 (Case Studies) is completely broken.
5. The paper’s premise is that finding a small, sufficient subset (high MS) increases trust (Sec 2.4). This is a strong assumption. The paper fails to discuss the primary counterargument: what if the 80% of data removed (the MUTS) contained the most difficult edge cases? The resulting model might achieve 95% accuracy on the test set (which is drawn from the same distribution) but be less robust to real-world, out-of-distribution (OOD) data, making it less trustworthy. This major limitation is ignored. (A minimal sketch of such an out-of-distribution check is given after these comments.)
6. Typos: check the attached PDF.
7. Section 3.2 (Methodology - VC Dimension) spends a long time explaining VC-dimension only to conclude that Model Sureness is completely different. VC-dimension is about the capacity of an algorithm class over all possible datasets, while MS is about the performance of a specific algorithm on subsets of one given dataset. The comparison is forced. Recommend removing or heavily shortening this section.
8. Section 3.4 (Methodology - Conformal Prediction) is not methodology; it is a proposal for future work. The paper does not combine MS with Conformal Prediction.
9. Table 3 is confusing. “Cases in Training data” just increments by 5. “Next Misclassified” is always 0. It does not show the results of the MS analysis (i.e., the minimal set). The text then makes an interesting point (incremental growth produced a simpler model (2 HBs) than training on all data (4 HBs)), but this finding is disconnected from the main MS metric.
10. The finding that k-NN (a simple model) required 16% of the MNIST data (Table 5) while a CNN (a complex model) required only 4-5% (Table 8) is the most interesting result in the paper. It implies CNNs are more “sure” (data-efficient) on this task. This should be highlighted and discussed prominently in the Conclusion, as it is a non-obvious finding.
11. The motivation correctly identifies high-risk domains like healthcare. The authors can significantly broaden the impact and generalization of the premise by showing that this “trust gap” is also a critical barrier in complex industrial and engineering applications, where data-driven models are rapidly replacing traditional physics-based models that are too slow or complex. Check the attached PDF.
12. The review of trustworthiness (Section 2.1) and principled metrics (Section 2.4) is a good foundation. To improve scientific soundness and clarify the novelty, it’s recommended to clearly distinguish your data-centric approach (MS) from model-centric and XAI-centric approaches. This shows you are aware of the field’s breadth and highlights what your method does differently. Check the attached pdf.
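Regarding point 2, a back-of-the-envelope sketch of the retraining counts is given below. The values of N, m, t, and k are hypothetical illustrative choices rather than figures from the paper, and the k-fold and binary-search counts are the usual rough estimates.

```python
# Rough comparison of how many model retrainings each strategy implies.
# All numbers below are illustrative assumptions, not results from the paper.
import math

N = 60_000   # hypothetical training set size (MNIST-scale)
m = 100      # hypothetical number of cases added/removed per BAL step
t = 3        # hypothetical number of BAL passes over the data
k = 10       # folds for k-fold cross-validation

bal = t * (N // m)                     # t * (N/m), per the quoted complexity analysis
kfold = k                              # one retraining per fold
bisect = math.ceil(math.log2(N // m))  # bisecting over candidate subset sizes

print(f"BAL: {bal} retrainings | k-fold CV: {kfold} | binary search: {bisect}")
```

Even with these modest assumptions, the fine-grained BAL schedule implies orders of magnitude more retrainings than the coarser alternatives, which is the trade-off the reframing should make explicit.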
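Regarding point 5, a minimal sketch of how the out-of-distribution concern could be probed is given below. The dataset, classifier, 20% retention rate, and noise-based shift are illustrative assumptions, not the paper's MS/MUTS procedure.

```python
# Compare a model trained on a small retained subset with one trained on all data,
# on both the in-distribution test set and a noise-corrupted (OOD-like) copy of it.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
keep = rng.choice(len(X_tr), size=int(0.2 * len(X_tr)), replace=False)  # crude stand-in for a retained subset

full_model = KNeighborsClassifier().fit(X_tr, y_tr)
subset_model = KNeighborsClassifier().fit(X_tr[keep], y_tr[keep])

X_te_shifted = X_te + rng.normal(0.0, 4.0, X_te.shape)  # crude distribution shift

for name, model in [("full data", full_model), ("20% subset", subset_model)]:
    print(name,
          "| in-distribution acc:", round(model.score(X_te, y_te), 3),
          "| shifted acc:", round(model.score(X_te_shifted, y_te), 3))
```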
Comments for author File:
Comments.pdf
Author Response
Thank you for your review, please see the attached pdf responding to each point in detail.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report (Previous Reviewer 1)
Comments and Suggestions for Authors
The updated version of the manuscript has been improved; however, core scientific issues are still not resolved, and important methodological details and clarifications are missing. In other words, several important issues from the last two review reports remain inadequately addressed in the updated manuscript, such as:
- The manuscript still does not provide an actual experimental benchmark comparison. The updates in the manuscript concerning this issue are descriptive rather than a real experimental comparison.
- The paper still lacks a clear mathematical definition of representativeness and redundancy.
- The manuscript shows the effect of redundancy removal; however, it does not provide definite metrics for measuring it (e.g., noise score, information gain, etc.), which is important to provide and quantify.
- Still, more than 25% of the references are dated before 2020, which can be a bit old given the recency of the topic.
- Several references are non-peer-reviewed, such as [23], [25], [26], [30], [40], and [51], which can considerably reduce the credibility of the paper.
Still, there are many grammatical/spelling mistakes in the updated manuscript, such as:
- “NP hard” should be “NP-hard”
- “A most updated review” should be “The most updated review”
- There are several sentences that are excessively long and need restructuring, such as: “It is in contrast with a common motivation of studies for finding smaller training datasets for faster model computation on the smaller data …” on page 5.
- …
- The paper really needs full proofreading.
Author Response
Please see the attached letter. Thank you.
Author Response File:
Author Response.pdf
Reviewer 3 Report (New Reviewer)
Comments and Suggestions for Authors
Thanks for the extensive revision as well as the detailed response; this manuscript should be accepted.
Author Response
We are grateful to the reviewer for their time and comments.
Round 3
Reviewer 1 Report (Previous Reviewer 1)
Comments and Suggestions for Authors
None of the comments in the previous review report are adequately addressed (except for minor English corrections), and the authors’ responses are merely defensive, without addressing the comments scientifically either in the response or in the updated manuscript. Here are the comments again. They were clear in the previous review report; however, I have now explained them in a very detailed manner, taking into consideration the authors’ response to each of them.
Point a: Lack of Experimental Benchmark Comparison
The authors’ response is scientifically insufficient for the following reasons:
- Benchmarking does not require identical objectives: in experimental machine learning research, benchmark comparisons are routinely performed across methods with partially overlapping goals, provided that the experimental protocol is shared and the evaluation metric is clearly defined. For example, feature selection methods are compared using accuracy even if their objectives differ. Nevertheless, the authors claim that “no benchmark is possible”.
- The paper already uses a shared evaluation metric: the authors define a target accuracy threshold (95%) and a training set size reduction. These are directly comparable quantities. Thus, the authors could compare against coreset selection, active learning (pool-based), random subsampling baselines, etc. Even if those are not trust measures, they still provide quantitative insight into how much data can be removed while preserving accuracy.
- Descriptive comparison is not experimental validation; narrative comparison does not establish empirical superiority, equivalence, or limitations. Without a benchmark, it is impossible to assess effect size, claims of novelty remain unquantified, and the added value cannot be judged.
Hence, what is still required, at a minimum, is that the paper provide one or more of the following: a baseline comparison (random removal versus BAP), a simple coreset or active learning baseline, an ablation study showing BAP versus naive reduction, etc.
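For illustration only, a minimal sketch of the random-subsampling baseline described above is given here; the dataset, the k-NN classifier, the step size, and the 95% target are placeholder assumptions rather than the paper's setup.

```python
# Find the smallest randomly drawn training fraction that still reaches a target
# accuracy; this gives a reference point against which BAP's subset sizes can be judged.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

TARGET = 0.95
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rng = np.random.default_rng(0)

for frac in np.arange(0.02, 1.01, 0.02):
    idx = rng.choice(len(X_tr), size=max(1, int(frac * len(X_tr))), replace=False)
    acc = KNeighborsClassifier().fit(X_tr[idx], y_tr[idx]).score(X_te, y_te)
    if acc >= TARGET:
        print(f"random removal baseline: {acc:.3f} accuracy with {frac:.0%} of the training data")
        break
```

Averaging over several random seeds would make such a baseline more robust, but even this single-seed version would quantify how much of the reported reduction is attributable to BAP rather than to redundancy in the data itself.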
Point b: Missing Mathematical Definitions of Representativeness and Redundancy
- The authors’ response contradicts foundational scientific grounds, mainly because the paper explicitly builds on these concepts. It repeatedly uses “Representative cases”, “Redundant cases”, “Unnecessary cases”, and “Noise or redundancy removal”. These are operational concepts driving algorithm design, model evaluation, etc. Hence, using a concept operationally requires defining it.
- Occam’s Razor is misapplied in this portion of the response: Occam’s Razor argues against unnecessary complexity, not against necessary definitions. The omission directly contradicts the paper’s goal of quantifying trust.
- “Reverse definability” is not scientifically acceptable: the claim that “representativeness can be defined later using sureness”, as indicated in the authors’ response, is scientifically invalid because a metric cannot be both a definition and a consequence, which creates a circular definition. Hence, it violates basic principles of mathematical modeling.
- “Decades of debate”, as indicated in the authors’ response, is not a justification for omission; many debated concepts (e.g., robustness, fairness, bias) are still formally defined per paper. The expectation is local, task-specific definitions, not universal ones.
Hence, the paper must include at least one task-dependent formal definition; even a weak or approximate definition is scientifically superior to none.
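As an illustration of what such a local, task-specific definition could look like (a sketch under stated assumptions, not a definition taken from the paper), per-case redundancy could be defined through leave-one-out influence on test accuracy:

```latex
% Illustrative sketch only. S: full training set, A: learning algorithm,
% acc_T(.): accuracy of the trained model on a fixed test set T, eps: tolerance.
\[
  \mathrm{Red}(x \mid S) \;=\; \mathrm{acc}_T\big(A(S)\big) \;-\; \mathrm{acc}_T\big(A(S \setminus \{x\})\big),
  \qquad x \text{ is redundant} \;\Longleftrightarrow\; \mathrm{Red}(x \mid S) \le \varepsilon .
\]
% Representativeness of a subset S' of S can then be defined analogously, e.g.
% Rep(S') = acc_T(A(S')) / acc_T(A(S)).
```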
Point c: No Definite Metrics for Measuring Redundancy
The authors’ response confuses outcome measurement with construct measurement.
- Tables show outcomes, not redundancy: the tables measure accuracy, training set size, and runtime. However, they do not measure redundancy, information content, case influence, or dataset diversity. Consequently, they do not quantify the claimed construct.
- Redundancy removal does not mean accuracy preservation, mainly because accuracy preservation is a side effect, not a measure. Two methods may remove 50% of the data and achieve 95% accuracy; however, they can remove entirely different information content. Hence, without a metric, redundancy is unobservable.
- “Noise is hard to define” is not a defense: most machine learning metrics are task-dependent (fairness, robustness, interpretability, etc.), yet they are still quantified. I actually gave examples for quantifying it in the previous review report, such as a noise score or information gain.
Hence, at least one quantitative redundancy-related metric is required.
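One simple way such a metric could be operationalized is sketched below (an illustrative example only; the digits dataset and the nearest-neighbor notion of coverage are assumptions, not what the paper reports).

```python
# Count a training case as "covered" (redundant in a weak sense) if its nearest
# other case already carries the same label; report the fraction of such cases.
from sklearn.datasets import load_digits
from sklearn.neighbors import NearestNeighbors

X, y = load_digits(return_X_y=True)
# Two neighbors per query: index 0 is the query point itself, index 1 its nearest other case.
_, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
covered = y[idx[:, 1]] == y
print(f"redundancy score: {covered.mean():.2%} of cases share the label of their nearest other case")
```

Reporting such a score before and after subset selection would quantify how much of the removed data was in fact redundant, independently of accuracy preservation.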
Point d: Outdated References (>25% before 2020)
- Recency is not about citation count, as indicated in the authors’ response; highly cited does not mean methodologically current. Recent references are mainly needed to demonstrate awareness of the state of the art, engagement with recent developments, and contextual positioning of the provided work.
- No attempt was made to balance the references. I did not ask to remove all older references; I only asked to reduce their dominance and add recent work. Unfortunately, the authors made no effort to do so.
Hence, it is necessary to add recent (2020–2025) references, to balance foundational and recent work, and to explicitly position model sureness relative to modern trends.
Point e: Non-Peer-Reviewed References
arXiv references are not peer-reviewed, regardless of author prominence or citation counts, and cannot be trusted as a reliable source of information. Additionally, the credibility issue concerns verifiability, not popularity. Hence, the non-peer-reviewed references must be replaced with peer-reviewed alternatives.
Comments on the Quality of English Language
The paper still needs full proofreading; the authors’ response on this point is contradicted by the manuscript, which still contains long, convoluted sentences in addition to the repetition and redundancy that are evident throughout.
Author Response
We are grateful to the reviewer for their time and comments. Please see the attached review answers document.
Author Response File:
Author Response.pdf
Round 4
Reviewer 1 Report (Previous Reviewer 1)
Comments and Suggestions for Authors
The problem of relying on non-peer-reviewed references still exists; these references are [23], [25], [26], [30], [40], and [51], and they should be replaced by peer-reviewed ones for proper academic rigor.
Comments on the Quality of English Language
The paper still needs full proofreading; the manuscript still contains long, convoluted sentences that should be broken up for better readability.
Author Response
We are grateful to the reviewer for their comments, please see the attached answers.
Author Response File:
Author Response.pdf
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- The literature review provided in the introduction is mostly centered on the authors' own work, even though there are many recent studies in the same direction that have been ignored and are worth citing.
- It is better to have a separate literature review section and to extend its scope to cover other recent related work.
- Fig. 3 needs to be revisited, since the "Select Algorithm" block should also be connected to the "Test Data" block.
- More discussion about the data reduction methods should be covered in the literature review with proper citations.
- The study lacks any benchmarking comparison against recent work in trust metrics, e.g., trustworthiness, conformal prediction, etc.
- In the abstract: “The method iteratively varies the training dataset and retrains models until a pre-defined efficiency criterion is met”; however, the paper does not say how to measure the efficiency or what the efficiency criteria are.
- The parameter selection process of the model should be detailed, for example, how the data to be included/excluded is determined, the convergence criteria (if the model converges), etc.
- It is important to provide the pseudo code of the proposed algorithms in the paper.
- The paper should provide details on how to adapt the approach provided with models that are sensitive to variation in data.
- The computational cost of the provided model should be analyzed, justified, and compared against the related benchmarks, for example, the computational complexity of the provided approach.
- There should be a concrete measure for the noise reduction or redundancy reduction in the data so that we can measure the benefit of the provided approach.
- The scalability of the model should be analyzed in detail.
- VKD is mainly qualitative in nature, and the paper does not show how it can affect the data selection process; meanwhile, it is not clear how this approach can deal with high-dimensional datasets.
- The quantitative mapping of visual knowledge should be provided in detail in the paper.
- The concepts of AI trust and sureness are not new and have been studied in many earlier research efforts that have been completely ignored by this paper.
- The captions of some figures are far too long, such as those of Fig. 1 and Fig. 2.
- I wonder how a model found on a blog, which is non-peer-reviewed (such as reference [30]), can be used; why not use models that have been scientifically studied and accepted by the research community? This is a major concern for the case study section and, honestly, for the paper itself.
- More than 50% of the references are dated before 2020, which can be considered a bit old. Also, some of them are really outdated.
- References [14] and [30] are not peer-reviewed references. Please kindly replace them with peer reviewed ones.
- More than 25% of the references are self-citations, while most of them have far better alternatives.
- More references should be used on AI trust models, sureness, VKD, and data reduction; above all, the scope should be widened by not relying mainly on the authors’ own previous work.
- “Fisher Iris” and “MNIST” datasets are used in the paper without providing their proper citations. Please provide proper references and citations for them.
- It is better to use the passive voice in the paper instead of "we, our", etc.
- "... opportunity to conduce model sureness exploration much ...", it should be "... opportunity to conduct model sureness exploration much ..."
- "... the entire training dataset S and excludes data iteratively to retrain a selected ML algorithm on. ..." the preposition "on" at the end of the sentence is not needed.
- "... is allowed difference ... " should be "... is the allowed difference ... " or "... is an allowed difference ... "
- “split into train and test subsets like a 70:30 split”: it is better not to use the word “like” in the context of scientific papers, since it tends to be too informal; better to use “similar to”, “such as”, “for example”, “e.g.”, etc.
- It is better to have full proofreading of the paper.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper explores an iterative supervised learning and visual knowledge discovery approach for evaluating the trustworthiness of AI models. Overall, the authors present a novel tool for quantifying trust in AI models. I have thoroughly read this manuscript and would like to give the following suggestions.
1) In Section 1, the explanations of Fig. 1 should be enriched a bit. The current illustration is only a simple case. Could you please provide more insight into the motivation of this paper?
2) The novelty of the proposed method is not clear in Section 1. I suggest that the authors highlight the primary differences between their methods and existing studies. Besides, the contributions of the proposed approaches should be summarized at the end of Section 1.
3) The proposed solution methods in Section 2 should be further detailed. I suggest that the authors provide the algorithm steps for the main framework of their methods in Section 2.3.
4) In Section 3.1, the results of Figs. 4 and 5 can be simplified further. Besides, the analysis of these two figures is not sufficient. Please also enrich the explanations of the results in other tables and figures of Section 3.
5) More insights benefited from the proposed iterative framework should be enriched based on the results of experiments reported in Section 3.

