A Standardized Validation Framework for Clinically Actionable Healthcare Machine Learning with Knee Osteoarthritis Grading as a Case Study
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- This paper proposes a methodology for the validation of machine learning models in healthcare. The paper could be enhanced with an overview of its organization. The goal of the article is not clear, nor is the authors' original contribution.
- The paper also lacks an overview of existing validation frameworks and a potential comparison to the authors' work.
- The current literature review on the analyzed issue lacks depth. It is recommended to conduct a more comprehensive examination of recent studies related to the topic. Incorporating some of these works into the literature review would strengthen the discussion. For example:
- Wierzbicki M.P, Jantos B.A, Tomaszewski M. A Review of Approaches to Standardizing Medical Descriptions for Clinical Entity Recognition: Implications for Artificial Intelligence Implementation. Applied Sciences. 2024; 14(21):9903. https://doi.org/10.3390/app14219903
- Arora, A., Alderman, J.E., Palmer, J. et al. The value of standards for health datasets in artificial intelligence-based applications. Nat Med 29, 2929–2938 (2023). https://doi.org/10.1038/s41591-023-02608-w
- Um T-W, Kim J, Lim S, Lee GM. Trust Management for Artificial Intelligence: A Standardization Perspective. Applied Sciences. 2022; 12(12):6022. https://doi.org/10.3390/app12126022
- Figure 2 – it is based on FDA guidance; however, it is not compared to any regulations from other countries, the European Union, or global institutions.
- Data description part of Figure 2 is not elaborated in text.
- In 2.3 – what were the reasons for choosing those datasets?
- In line 245 – some example for one of those „many healthcare applications” could be provided.
- Equation 25 - provide source, as well as for rest of equations from Table 4.
- In line 296 - lack of equation number and symbols description.
- In section 3 - a short description of the model used should be included to allow minimal reproducibility of the study's results.
- In line 395 - what concrete clinical requirements?
- Table 3 could benefit from sources of derived conclusions.
Author Response
We would like to sincerely thank you for your thorough and thoughtful review of our manuscript. Your constructive feedback and insightful suggestions have not only helped us identify important areas for clarification and improvement but have also genuinely motivated and excited us to strengthen our work. Engaging with your comments has been a rewarding process, and we greatly appreciate the time and expertise you devoted to advancing the quality and clarity of our study.
Comment 1: This paper proposes a methodology for the validation of machine learning models in healthcare. The paper could be enhanced with an overview of its organization. The goal of the article is not clear, nor is the authors' original contribution.
Response 1: We thank the reviewer for this thoughtful and constructive comment. In response, we have revised the end of the Introduction to explicitly clarify the article's main goal, highlight our original contributions, and provide an overview of the manuscript's organization. Specifically, we have added a new subsection entitled "Article Goals, Contributions, and Organization," which now details the overall aim, the specific novel aspects of our work (including the practical validation framework and composite clinical utility metrics), and a roadmap of the paper's structure.
Comment 2: The paper also lacks an overview of existing validation frameworks and a potential comparison to the authors' work.
Response 2: We thank the reviewer for this insightful suggestion. In response, we have added a dedicated subsection in the Methods section titled “Comparison to Existing Validation Frameworks.” Here, we have briefly reviewed major validation frameworks, including those by regulatory bodies and recent reporting guidelines, and clarified how our proposed framework both incorporates these best practices and extends them. Specifically, we highlight our unique contributions, such as the formalization of the validation process and the introduction of composite clinical utility metrics, which address limitations of existing approaches.
Comment 3: The current literature review on the analyzed issue lacks depth. It is recommended to conduct a more comprehensive examination of recent studies related to the topic. Incorporating some of these works into the literature review would strengthen the discussion. For example:
Wierzbicki M.P, Jantos B.A, Tomaszewski M. A Review of Approaches to Standardizing Medical Descriptions for Clinical Entity Recognition: Implications for Artificial Intelligence Implementation. Applied Sciences. 2024; 14(21):9903. https://doi.org/10.3390/app14219903
Arora, A., Alderman, J.E., Palmer, J. et al. The value of standards for health datasets in artificial intelligence-based applications. Nat Med 29, 2929–2938 (2023). https://doi.org/10.1038/s41591-023-02608-w
Um T-W, Kim J, Lim S, Lee GM. Trust Management for Artificial Intelligence: A Standardization Perspective. Applied Sciences. 2022; 12(12):6022. https://doi.org/10.3390/app12126022
Response 3: We thank the reviewer for this valuable suggestion. In response, we have expanded the literature review in the Introduction to include a more comprehensive discussion of recent studies relevant to standardization and validation in healthcare AI. Specifically, we have incorporated and discussed the works by Wierzbicki et al., Arora et al., and Um et al., as recommended. These additions help to contextualize our methodology within the broader landscape of current research and further clarify how our approach builds upon and extends recent advances in the field. Again, thank you for this valuable comment.
Comment 4: Figure 1 – it is based on FDA guidance; however, it is not compared to any regulations from other countries, the European Union, or global institutions.
Response 4: We thank the reviewer for this insightful comment. In response, we have revised the manuscript to address the broader international regulatory landscape for AI/ML in healthcare. Specifically, we have added a new section (titled, “Beyond FDA Guidance”) that directly compares our validation framework not only to recent FDA guidance, but also to key European and global standards. In this new section, we highlight how the core principles of our framework are consistent with foundational international standards including:
1) The International Council for Harmonisation (ICH) Quality Guidelines
2) FDA’s Office of Pharmaceutical Quality
3) The European Union’s Medical Device Regulation (EU MDR)
4) The EU Artificial Intelligence Act (AI Act), and
5) The World Health Organization (WHO) guidance on AI ethics and governance.
Comment 5: Data description part of Figure 1 is not elaborated in text.
Response 5: We thank the reviewer for this insightful comment. We agree that further elaboration of the “Data Description” domain in Figure 1 is important for clarity and completeness. Accordingly, we have added a dedicated paragraph under that Figure that explicitly details the requirements and best practices for data characterization, documentation, annotation, and bias mitigation as expected within this domain.
Comment 6: In 2.3 – what were the reasons for choosing those datasets?
Response 6: We thank the reviewer for this insightful comment. To address this, we have added a dedicated paragraph at the end of Section ’Datasets’ explicitly describing the rationale for our dataset selection. In summary, Dataset A (Kaggle) was selected for its prevalence in the literature and balanced KL grade distribution, providing a strong foundation for model training and comparison. Dataset B (Mendeley Data) was chosen as an independent, externally sourced dataset to rigorously evaluate generalizability and robustness to domain shift, which is essential for clinically relevant machine learning validation. This dual-dataset approach ensures our results are not limited to a single data source and better reflect real-world clinical scenarios. The newly added paragraph now clarifies this rationale in the manuscript.
Comment 7: In line 245 – some example for one of those „many healthcare applications” could be provided.
Response 7: We thank the reviewer for this excellent suggestion. To address this point, we have added specific examples as requested. The revised manuscript now highlights cancer screening (e.g., mammography for breast cancer detection) and diabetic retinopathy screening as healthcare applications where maximizing sensitivity is of paramount importance to avoid missing critical diagnoses. These examples illustrate why sensitivity is often assigned a higher weight in clinical evaluation metrics. We have also expanded the Discussion section to explicitly address the importance of selecting appropriate weighting coefficients in composite evaluation metrics. We now clarify that the optimal balance between sensitivity and specificity depends on the clinical context: for population-level screening, sensitivity should be prioritized to ensure all positive cases are detected, whereas for individual patient management, specificity becomes more important to minimize unnecessary interventions. Concrete examples have been added.
Comment 8: Equation 25 - provide source, as well as for rest of equations from Table 4.
Response 8: We thank the reviewer for this comment and the careful attention to detail regarding the provenance of the evaluation metrics. These composite metrics (namely the Overall Model Score, Clinical Utility Score, Weighted Endpoint Accuracy Score (WEAS), and the Composite Utility Metric) are original to this manuscript. While each draws upon established statistical measures (such as AUC, sensitivity, specificity, predictive values, and F1 score), the specific composite forms, weighting schemes, and their structured integration to address clinical priorities have been developed by us as part of this work.
Comment 9: In line 296 - lack of equation number and symbols description.
Response 9: We thank the reviewer for this excellent suggestion. In response, we have added a new subsection, “Summary of Notation,” at the end of Section 2 (Materials and Methods). This subsection provides clear definitions for all symbols and notation used in Equations (29)--(31) and throughout the manuscript, thereby improving clarity and accessibility for readers.
Comment 10: In section 3 - the short description of used model should be included or minimal reproducibility of the study's results.
Response 10: We thank the reviewer for this valuable suggestion. To enhance the reproducibility and transparency of our work, we have now included a concise description of the model architecture, implementation details, and training protocol at the beginning of Section 3 (Results). This new paragraph outlines the CNN structure, data preprocessing steps, training procedure, and evaluation metrics, thereby facilitating the minimal reproducibility of our study's results.
Comment 11: In line 395 - what concrete clinical requirements?
Response 11: We thank the reviewer for this insightful question. We have clarified the specific clinical requirements referenced by adding a new explanatory paragraph to that Section. In this addition, we explicitly define the clinical requirements as those performance criteria that are essential for safe and actionable use in patient care, such as high sensitivity and specificity, strong predictive values, and reliable discrimination of clinically actionable endpoints. The composite metrics and weighting schemes in our study have been designed to directly align with these requirements, ensuring that the model’s outputs translate into meaningful and practical decision support for clinicians. Please see the added paragraph for details.
Comment 12: Table 3 could benefit from sources of derived conclusions.
Response 12: We appreciate this suggestion. All authors are affiliated with medical institutions, and Table 3 summarizes problematic learning curve behaviors, their causes, and their clinical implications, based primarily on our own direct experience with developing and evaluating machine learning models in clinical imaging contexts. We could cite our own prior work here, but we prefer to avoid excessive self-citation.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper is generally interesting and within the scope of the journal. The structure of the paper is logical, and the paper is well written and has potential to be published. Still, there are some significant flaws that need to be addressed in order to reach the standard quality of a journal paper. The authors are invited to consider the following remarks:
- The objective of the paper reflected in the title, abstract, and contents seems overambitious, with a very wide formulation. This cannot be justified either by the research presented or by the authors' publication records. Still, a moderate shift of focus and limiting the aim to a narrower field would result in a much more suitable form of the paper. In its current form the paper aims at providing guidance to a huge and very active research field, which needs to be changed.
- In light of the previous remark, the novelty of the paper has to be clearly and directly expressed. The authors should consider adding a concluding subparagraph in the Introduction section clearly expressing the contributions of the paper in a direct and short manner. This would in turn help the authors resolve the previous and most important remark.
- It is also advisable to add a final subparagraph in the Introduction, after the one expressing novelty, that briefly introduces the paper structure. Something like: 'The rest of the paper is organized as follows: ...'.
- The Conclusions need improvement, in relation to the previous remarks. The Conclusions should be less general and more directly related to the presented results, with a moderate expansion of the Conclusions section.
- It is advisable to add future research directions as a concluding subparagraph in the Conclusions section. This would improve citations of the paper in the future and partially compensate for the wide scope of the paper.
- Abstract of the paper could be somewhat shortened for clarity.
- For the results adopted from [29] some justification is needed: why has this been chosen as an illustrative example for such wide conclusions? Also, in the list of references, [29] seems to have a double occurrence of the first author.
- Numbering of the equations stops in the last part of the paper, and final equations are missing numbers for no apparent reason.
- Captions of the figures are partially excessive. Some parts of the information in captions should be moved to the regular text with a reference to the figures. Avoid multi-sentence captions.
- Some parts of text seem not fully visible in figures (e.g. Validation in Figure 4).
- Some equations seem not properly centred ((26)-(28), aligned to =, appear inconsistent with the others).
If the authors are willing to consider improvements, especially regarding limiting the scope of the paper, the paper should be reconsidered for publishing.
Author Response
Comment 1: The objective of the paper reflected in the title, abstract, and contents seems overambitious, with a very wide formulation. This cannot be justified either by the research presented or by the authors' publication records. Still, a moderate shift of focus and limiting the aim to a narrower field would result in a much more suitable form of the paper. In its current form the paper aims at providing guidance to a huge and very active research field, which needs to be changed.
Response 1: We thank the reviewer for this insightful comment. We agree that the original title and abstract were overly broad relative to the specific research presented. In response, we have revised the title and abstract to clarify that, while our standardized training and validation framework is generally applicable to machine learning models for classifying medical conditions, our primary demonstration and case study focus on knee osteoarthritis grading. We have also revised the manuscript to ensure that general claims are supported by specific examples, and we have clearly stated the limitations of our work.
Comment 2: In light of the previous remark, the novelty of the paper has to be clearly and directly expressed. The authors should consider adding a concluding subparagraph in the Introduction section clearly expressing the contributions of the paper in a direct and short manner. This would in turn help the authors resolve the previous and most important remark.
Response 2: We thank the reviewer for highlighting the need to clearly and directly express the novelty and contributions of our work. In response, we have added a dedicated subparagraph at the end of the Introduction (now Section 1.1: "Aims, Contributions, and Organization") that explicitly enumerates the original contributions of our paper in a concise and direct manner. This section clarifies the novelty of our standardized, FDA-aligned validation framework, the introduction of composite clinical utility metrics, and the demonstration of these methods in a real-world case study.
Comment 3: It is also advisable to add a final subparagraph in the Introduction, after the one expressing novelty, that briefly introduces the paper structure. Something like: 'The rest of the paper is organized as follows: ...'.
Response 3: We thank the reviewer for this helpful suggestion. In response, we have added a final subparagraph at the end of the Introduction (now Section 1.1: "Aims, Contributions, and Organization") that briefly introduces the structure of the paper. This subparagraph follows the statement of novelty and explicitly outlines the organization of the manuscript, as recommended.
Comment 4: The Conclusions need improvement, in relation to the previous remarks. The Conclusions should be less general and more directly related to the presented results, with a moderate expansion of the Conclusions section.
Response 4: We thank the reviewer for this valuable comment. We have revised the Conclusions section to make it less general and more directly tied to our presented results. Specifically, we now explicitly summarize the key quantitative findings, such as the in-domain and cross-domain performance, the implications of learning curve behaviors, and the direct impact of our proposed composite and endpoint-weighted metrics. The revised Conclusions now clarify how our results demonstrate the necessity of rigorous, standardized training and validation protocols for achieving genuinely clinically reliable machine learning models.
Comment 5: It is advisable to add future research directions as a concluding subparagraph in the Conclusions section. This would improve citations of the paper in the future and partially compensate for the wide scope of the paper.
Response 5: We thank the reviewer for this insightful suggestion. We agree that outlining future research directions can help guide subsequent work. In response, we have added a dedicated “Future Research Directions” subparagraph at the end of the Discussion section. In this paragraph, we discuss promising avenues for further investigation, including the need for more robust model architectures, real-world clinical integration, standardized external benchmarks, and continued refinement of clinical utility metrics. As per journal policy and standard practice, we have not included additional citations in the Conclusions section; however, we have substantially increased the total number of citations throughout the manuscript to comprehensively reference related literature.
Comment 6: Abstract of the paper could be somewhat shortened for clarity.
Response 6: We thank the reviewer for the suggestion to shorten the abstract for clarity. In the revised manuscript, we have substantially condensed the abstract by removing redundant explanations and focusing on the essential findings and contributions.
Comment 7: For the results adopted from [29] some justification is needed: why has this been chosen as an illustrative example for such wide conclusions?
Response 7: Thank you for raising this important point. We chose to adopt results from [29, now 40] as an illustrative example because that study represents a well-documented, peer-reviewed application of deep learning to the clinically relevant task of automated Kellgren–Lawrence grading in knee osteoarthritis. The methodology, dataset characteristics, and performance metrics in [40] are representative of common practices and challenges in medical machine learning research, particularly regarding the risk of deceptively high in-domain accuracy and the need for clinically meaningful evaluation.
Comment 8: In the list of references, [29] seems to have a double occurrence of the first author.
Response 8: We thank the reviewer for the impressive attention to detail. The apparent duplication is not an error: the two authors are twin brothers studying together at our medical college, Daniel and Demarcus Nasef.
Comment 9: Numbering of the equations stops in the last part of the paper, and final equations are missing numbers for no apparent reason.
Response 9: Thank you for your careful reading and for noting the inconsistency in equation numbering in the latter part of the manuscript. We would like to clarify that our approach to equation numbering was intentional: all equations in the Methods section are numbered, as they are referenced and discussed throughout the text. In contrast, equations appearing in the Results section are presented solely to illustrate the calculation of specific metrics or scores for the reader’s convenience; these formulas are not referenced elsewhere in the manuscript. Following standard practice, we did not assign numbers to these illustrative equations since they are not cited in the text.
Comment 10: Captions of the figures are partially excessive. Some parts of the information in captions should be moved to the regular text with a reference to the figures. Avoid multi-sentence captions.
Response 10: Thank you for your comment regarding the length and detail of the figure captions. We appreciate the concern about excessive information in captions and the suggestion to move some content into the main text. We would like to clarify that the current style of multi-sentence, detailed captions is a deliberate choice based on the senior author’s experience in scientific publishing. In our field, it is common for many readers—especially those seeking a quick understanding of the results—to focus primarily on the figures and tables, sometimes without reading the full text. By providing comprehensive captions, we aim to ensure that the main findings and context of each figure are accessible and understandable even to those readers who do not read the manuscript in its entirety.
Comment 11: Some parts of text seem not fully visible in figures (e.g. Validation in Figure 4).
Response 11: We believe the reviewer is referring to the y-axis label of Figure 4(b). Panels (a) and (b) share the same y-axis label, so it is placed only once. All figures are generated by a series of LaTeX commands; no figure is an image placed in the document. We would be happy to revise further if we have misunderstood the concern. Thank you!
Comment 12: Some equations seem not properly centred ((26)-(28), aligned to =, appear inconsistent with the others).
Response 12: We again thank the reviewer for the careful attention to detail. The alignment has now been fixed.
Reviewer 3 Report
Comments and Suggestions for Authors
This work proposes a standardized validation framework for establishing and maintaining clinical credibility in healthcare machine learning models. The proposal presents an interesting topic; however, the following aspects were identified:
- It is important to review the state of the art of works related to this proposal, indicating the main differences of the related works with respect to the proposal in order to highlight its originality and, above all, its novelty. This review can be presented in a new section or incorporated into the introduction.
- It is suggested to expand the justification and explanation of why in the case of the study on knee osteoarthritis a greater weight (2) is assigned to classes 0 and 4, for a better understanding and comprehension of the results.
- It is important that the authors indicate the limitations of the proposal; this could be done in the discussion or in the conclusions.
- The Standardized validation framework for establishing and maintaining clinical credibility in healthcare machine learning models, could it be applied in other areas? If the answer is positive, indicate in which areas and if it is necessary to make some adjustments.
- Likewise, it is suggested that in the conclusions indicate the future work to be carried out based on the results obtained.
Author Response
We would like to sincerely thank the reviewer for their thoughtful and constructive feedback. Your comments and suggestions have been invaluable in helping us clarify key points, strengthen our methodology, and better communicate the clinical relevance and broader applicability of our work. We greatly appreciate the time and effort you dedicated to reviewing our manuscript.
Comment 1: It is important to review the state of the art of works related to this proposal, indicating the main differences of the related works with respect to the proposal in order to highlight its originality and, above all, its novelty. This review can be presented in a new section or incorporated into the introduction.
Response 1: Thank you for the insightful comment. The revised manuscript now includes a substantially expanded review of the state of the art in both the introduction and a dedicated comparison section. We discuss recent advances in standardization and trust in healthcare AI, referencing key works and regulatory frameworks. We explicitly compare our proposed methodology to existing validation frameworks, highlighting how our approach uniquely addresses the challenges of clinical utility and external validation through mathematically formalized protocols and composite clinical metrics.
Comment 2: It is suggested to expand the justification and explanation of why in the case of the study on knee osteoarthritis a greater weight (2) is assigned to classes 0 and 4, for a better understanding and comprehension of the results.
Response 2: We thank the reviewer for highlighting the need to better justify the weighting scheme in our evaluation metric for knee osteoarthritis grading. In the revised manuscript, we have expanded our explanation to clarify that a greater weight (w = 2) is assigned to classes 0 (healthy) and 4 (severe osteoarthritis) because these endpoints represent the most clinically significant categories. Accurate identification of healthy patients and those with severe disease is critical for appropriate clinical decision-making, as misclassifications at these extremes can lead to either unnecessary interventions or missed opportunities for timely treatment. By assigning higher weights to these classes in the Weighted Endpoint Accuracy Score (WEAS), we ensure that the evaluation metric reflects the real-world clinical importance of correctly classifying these cases.
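To make this weighting concrete, a minimal Python sketch follows. It assumes WEAS is computed as a weight-normalized average of per-class recalls taken from a test-set confusion matrix, with weight 2 assigned to KL grades 0 and 4; the function name, the example confusion matrix, and this exact normalization are illustrative assumptions rather than the manuscript's verbatim definition.

```python
import numpy as np

def weighted_endpoint_accuracy(conf_mat, class_weights):
    """Weight-normalized average of per-class recalls.

    conf_mat[i, j] = count of samples with true class i predicted as class j.
    class_weights[i] = clinical importance of class i (e.g., 2 for endpoints).
    """
    conf_mat = np.asarray(conf_mat, dtype=float)
    per_class_recall = np.diag(conf_mat) / conf_mat.sum(axis=1)
    w = np.asarray(class_weights, dtype=float)
    return float(np.sum(w * per_class_recall) / np.sum(w))

# KL grades 0..4; endpoints (0 = healthy, 4 = severe) weighted twice as heavily.
weights = [2, 1, 1, 1, 2]
# Hypothetical 5x5 confusion matrix, for illustration only.
cm = [[38, 2, 0, 0, 0],
      [ 5, 30, 5, 0, 0],
      [ 0, 6, 28, 6, 0],
      [ 0, 0, 4, 32, 4],
      [ 0, 0, 0, 3, 37]]
print(f"WEAS = {weighted_endpoint_accuracy(cm, weights):.3f}")
```

Under this weighting, errors on the healthy and severe endpoints reduce the score roughly twice as much as errors on the intermediate grades, which is the intended clinical emphasis.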
Comment 3: It is important that the authors indicate the limitations of the proposal; this could be done in the discussion or in the conclusions.
Response 3: We thank the reviewer for emphasizing the importance of clearly stating the limitations of our proposal. In the revised manuscript, we have addressed this by explicitly discussing the current limitations in both the Discussion and Conclusions sections. We acknowledge that, while our standardized validation framework and composite clinical utility metrics represent important steps forward, further work is needed to validate these approaches across more heterogeneous and multi-center datasets. Additionally, we recognize that the optimal weighting of composite metrics and their alignment with evolving clinical guidelines require ongoing refinement. We also note that real-world deployment and continuous monitoring of model performance are essential to fully establish clinical reliability.
Comment 4: The Standardized validation framework for establishing and maintaining clinical credibility in healthcare machine learning models, could it be applied in other areas? If the answer is positive, indicate in which areas and if it is necessary to make some adjustments.
Response 4: In the revised manuscript, we have clarified that our framework is intentionally designed for broad applicability and is consistent with both U.S. and international regulatory expectations, including those in the European Union and guidance from the World Health Organization. The core principles—such as transparency, rigorous evaluation, lifecycle management, and risk-based oversight—are relevant to any domain where trust, safety, and regulatory compliance are essential, such as medical devices, pharmaceuticals, and other high-stakes AI applications. While the framework’s structure is broadly transferable, we acknowledge that adjustments may be necessary to address the specific requirements and evaluation criteria of other fields.
Comment 5: Likewise, it is suggested that in the conclusions indicate the future work to be carried out based on the results obtained.
Response 5: We have added a dedicated section on "Future Research Directions" within the Discussion. Here, we outline several important next steps, including the development of more advanced model architectures to improve generalizability, the real-world deployment and integration of validated models into clinical workflows, and the creation of standardized benchmarks for external validation. We also highlight the need to further refine composite utility metrics in line with evolving clinical guidelines and regulatory standards.
Reviewer 4 Report
Comments and Suggestions for Authors
The manuscript presents a comprehensive framework for training and validating machine learning (ML) models in healthcare, emphasizing the need for standardized methodologies and clinically relevant evaluation. While the study is well-structured and addresses critical issues in healthcare ML, several sections require clarification, methodological rigor, or expansion to enhance reproducibility, clinical applicability, and readability. Below are detailed critiques and suggestions for improvement.
- Title and Abstract (Page 1)
Title: Consider adding specificity, e.g., "A Case Study in Knee Osteoarthritis Grading."
Abstract: The clinical implications of misclassification (e.g., harm to patients) should be emphasized earlier. Briefly define "composite clinical measures" for broader accessibility.
- Introduction (Pages 1–3)
I found an overemphasis on CNNs/imaging: Broaden discussion to include non-imaging ML applications (e.g., EHR-based models).
Moreover, provide concrete examples of how leakage occurs in medical datasets and expand on why traditional metrics (accuracy, F1) may mislead clinicians.
- Materials and Methods (Pages 3–9)
First, justify the choice of VGG16 over newer architectures (e.g., ResNet, Transformers). Clarify why ImageNet normalization was used for medical images.
Secondly, provide demographic/clinical details (e.g., patient age, imaging protocols) to assess bias. Also, discuss ethical approvals for Dataset A/B (missing in IRB statement).
Thirdly, justify weight selection (e.g., why α₁=α₂=α₃=1/3?). A sensitivity analysis for weights would strengthen claims.
Finally, explain how the penalty function (|i-j|) was clinically validated (e.g., via physician input).
- Results (Pages 10–14)
Explain why "higher validation than training accuracy" suggests leakage (counterintuitive to some readers). Also, add examples of corrective actions (e.g., data augmentation, regularization).
Regarding the composite scores, compare scores to baseline models or human performance.
Regarding the WEAS Evaluation, please discuss limitations (e.g., assumes linear misclassification cost; real-world costs may be nonlinear).
- Discussion (Pages 14–18)
Address potential trade-offs (e.g., high sensitivity vs. specificity in screening vs. diagnostics).
Moreover, small dataset sizes (n=200) may limit generalizability; please discuss this.
Finally, I found a lack of prospective validation in clinical workflows.
- Conclusions (Page 18)
Add actionable recommendations (e.g., "Adopt WEAS for endpoint-critical tasks"). Also, mention open challenges (e.g., model interpretability for clinicians).
Overall Recommendations
- Share code/data for reproducibility.
- Involve clinicians in metric design (e.g., WEAS weights) to ensure relevance.
- Discuss how the framework applies beyond osteoarthritis (e.g., cancer grading).
This manuscript provides a valuable contribution to healthcare ML validation but requires refinements in methodological transparency, clinical grounding, and limitations discussion. With revisions, it could serve as a benchmark for robust model evaluation in medicine.
Author Response
We would like to sincerely thank the reviewer for their thoughtful and constructive feedback. Your comments and suggestions have been invaluable in helping us clarify key points, strengthen our methodology, and better communicate the clinical relevance and broader applicability of our work. We greatly appreciate the time and effort you dedicated to reviewing our manuscript.
Comment 1: Title: Consider adding specificity, e.g., "A Case Study in Knee Osteoarthritis Grading."
Response 1: We thank the reviewer for their suggestion to add specificity to the title. In response, we revised the title to explicitly reference the case study, changing it to "Standardized Validation Framework for Clinically Actionable Healthcare Machine Learning with Knee Osteoarthritis Grading as a Case Study." We also clarified throughout the abstract and introduction that knee osteoarthritis grading serves as the primary application example for our proposed framework.
Comment 2: Abstract: The clinical implications of misclassification (e.g., harm to patients) should be emphasized earlier. Briefly define "composite clinical measures" for broader accessibility.
Response 2: We appreciate the reviewer’s suggestion to emphasize the clinical implications of misclassification earlier in the abstract and to clarify the meaning of "composite clinical measures." In response, we revised the abstract to highlight at the outset that high in-domain accuracy does not guarantee reliable clinical performance, especially when validation protocols are insufficient—thereby underscoring the potential for patient harm due to misclassification. We also clarified that our composite clinical measures are designed to better capture real-world clinical utility, and we further defined and illustrated these metrics in the main text to ensure broader accessibility for readers.
Comment 3: I found an overemphasis on CNNs/imaging: Broaden discussion to include non-imaging ML applications (e.g., EHR-based models).
Response 3: We thank the reviewer for highlighting the overemphasis on CNNs and imaging applications in our original manuscript. In response, we revised the introduction and discussion to more explicitly acknowledge the breadth of machine learning applications in healthcare, including those based on electronic health records (EHR) and other non-imaging data sources. We now reference recent advances in EHR-based modeling and clarify that our proposed validation framework and composite clinical utility metrics are designed to be broadly applicable across both imaging and non-imaging ML tasks.
Comment 4: Moreover, provide concrete examples of how leakage occurs in medical datasets and expand on why traditional metrics (accuracy, F1) may mislead clinicians.
Response 4: In the revised manuscript, we now explicitly describe how data leakage can occur in medical datasets, such as when there is excessive overlap between training and validation sets or when the validation set is not representative of real-world data. We illustrate this with a concrete example from our case study, where anomalous learning curves and inflated validation accuracy signaled potential leakage, ultimately resulting in poor generalization to external data. Additionally, we expanded our discussion of traditional metrics, emphasizing that high accuracy and F1 scores on in-domain data can mask clinically significant errors and do not necessarily translate to reliable clinical performance.
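As a concrete illustration of the overlap check implied here, the hypothetical Python sketch below hashes image files in two split directories and reports content-identical files that appear in both; the directory names and file extension are assumptions and do not reflect the authors' actual pipeline.

```python
import hashlib
from pathlib import Path

def file_hashes(folder):
    """Map content hash -> list of file paths for every image in a split directory."""
    hashes = {}
    for path in Path(folder).rglob("*.png"):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        hashes.setdefault(digest, []).append(path)
    return hashes

# Hypothetical split directories; adjust to the actual dataset layout.
train_hashes = file_hashes("data/train")
val_hashes = file_hashes("data/val")

# Any hash present in both splits indicates an exact duplicate image, a common
# source of leakage that inflates validation accuracy.
overlap = set(train_hashes) & set(val_hashes)
print(f"{len(overlap)} identical images appear in both splits")
for digest in list(overlap)[:5]:
    print(train_hashes[digest][0], "<->", val_hashes[digest][0])
```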
Comment 5: First, justify the choice of VGG16 over newer architectures (e.g., ResNet, Transformers). Clarify why ImageNet normalization was used for medical images.
Response 5: We thank the reviewer for this insightful comment. In the revised manuscript, we have added a dedicated paragraph at the end of Section 2.4.1 to clarify our rationale. We chose VGG16 as the backbone due to its well-established reliability and interpretability in medical image analysis, and because it serves as a robust, reproducible baseline for demonstrating our validation framework. While newer architectures such as ResNet and transformers may offer higher performance in some settings, our focus was on model transparency and a standardized demonstration of training pitfalls, for which VGG16 is well suited—particularly with limited data. We also clarified that ImageNet normalization was used to ensure compatibility with the pre-trained VGG16 weights, as this is critical for optimal transfer learning: the initial layers expect input distributions matching those seen during pre-training. This practice is widely adopted in medical imaging transfer learning to promote stable and effective feature extraction. We acknowledge that benchmarking alternative architectures within our framework is an important avenue for future work.
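For readers unfamiliar with this convention, the sketch below shows the standard Keras/TensorFlow pattern of pairing pre-trained VGG16 weights with the matching ImageNet preprocessing; the classifier head, dropout rate, optimizer, input size, and dummy batch are illustrative assumptions and not the authors' exact configuration.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

NUM_CLASSES = 5  # KL grades 0-4

# Pre-trained convolutional base; its ImageNet weights expect inputs normalized
# with the same per-channel statistics used during pre-training, which is what
# preprocess_input applies.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # feature extraction; fine-tuning would be a separate step

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),                       # illustrative regularization choice
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Example: a dummy batch of RGB images with pixel values in [0, 255].
batch = np.random.uniform(0, 255, size=(4, 224, 224, 3)).astype("float32")
probs = model.predict(preprocess_input(batch))  # normalize exactly as in pre-training
print(probs.shape)  # (4, 5)
```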
Comment 6: provide demographic/clinical details (e.g., patient age, imaging protocols) to assess bias. Also, discuss ethical approvals for Dataset A/B (missing in IRB statement).
Response 6: We thank the reviewer for this important comment. In the revised manuscript, we have added a paragraph at the end of Section 2.5 clarifying the demographic and clinical details available for Datasets A and B. Both datasets are publicly available, de-identified knee X-ray collections that have been widely used in the literature. While detailed patient-level demographic and imaging protocol information is limited or unavailable, using two independently collected datasets enables us to demonstrate how learning curve analysis can predict generalizability across different sources, which is central to our study’s aims. We also clarify that, as both datasets are fully de-identified and publicly released for research, no IRB approval was required for their use. This clarification is now explicitly stated in the manuscript.
Comment 7: justify weight selection (e.g., why α₁=α₂=α₃=1/3?). A sensitivity analysis for weights would strengthen claims.
Response 7: We thank the reviewer for this excellent suggestion. In the revised manuscript, we have clarified our rationale for selecting equal weights (α₁=α₂=α₃=1/3) in the composite clinical score. Our primary aim was to provide a transparent and reproducible demonstration of the standardized validation framework, and equal weighting offers a neutral baseline for comparing model performance across metrics. We now explicitly discuss in Sections 3.4 and 4.2 that, in real-world clinical applications, the optimal weighting of sensitivity, specificity, and AUC should be tailored to the specific clinical context—such as prioritizing sensitivity in screening or specificity in confirmatory diagnosis. We also acknowledge that a formal sensitivity analysis of the composite metric with respect to different weighting schemes would further strengthen the robustness of our claims, and we have highlighted this as an important direction for future research.
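The kind of weight sensitivity analysis mentioned as future work could look like the minimal sketch below, which sweeps (α₁, α₂, α₃) over a coarse grid of the probability simplex and reports weightings that change the model ranking; the metric values and model names are purely hypothetical.

```python
import itertools
import numpy as np

def composite_score(sens, spec, auc, alphas):
    """Weighted combination of sensitivity, specificity, and AUC (weights sum to 1)."""
    a1, a2, a3 = alphas
    return a1 * sens + a2 * spec + a3 * auc

# Hypothetical per-model summary metrics (sensitivity, specificity, AUC).
candidates = {"model_A": (0.92, 0.78, 0.90),
              "model_B": (0.84, 0.90, 0.88)}

# Sweep weight triples on a coarse grid of the probability simplex.
grid = [round(x, 2) for x in np.arange(0.0, 1.01, 0.1)]
for a1, a2 in itertools.product(grid, grid):
    a3 = round(1.0 - a1 - a2, 2)
    if a3 < 0:
        continue
    ranked = sorted(candidates,
                    key=lambda m: composite_score(*candidates[m], (a1, a2, a3)),
                    reverse=True)
    if ranked[0] != "model_A":  # report weightings that flip the ranking
        print(f"alphas=({a1}, {a2}, {a3}) -> best model: {ranked[0]}")
```

If the ranking stays stable across the simplex, the equal-weight baseline is unlikely to bias conclusions; if it flips, the choice of weights becomes a clinically meaningful decision.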
Comment 8: Finally, explain how the penalty function (|i-j|) was clinically validated (e.g., via physician input).
Response 8: We thank the reviewer for this insightful comment. In the revised manuscript, we clarified the clinical rationale for the penalty function used in our composite utility metric. This penalty structure is grounded in established clinical reasoning, where the severity of misclassification is proportional to the distance between true and predicted classes—misclassifying a healthy patient as severely ill (or vice versa) is far more consequential than confusing adjacent grades. While this approach aligns with clinical intuition and prior literature, we acknowledge that it was not formally validated through a dedicated physician survey or consensus process in this study. We have now explicitly stated this in the manuscript and highlighted that future work could incorporate direct physician input or Delphi consensus to further refine and validate the penalty function.
Comment 9: Explain why "higher validation than training accuracy" suggests leakage (counterintuitive to some readers). Also, add examples of corrective actions (e.g., data augmentation, regularization).
Response 9: We thank the reviewer for highlighting the need to clarify why higher validation than training accuracy suggests data leakage and to provide examples of corrective actions. In the revised manuscript, we now explicitly explain in Sections 3.3 and 4.1.1 that this pattern is a red flag because it often indicates that information from the validation set has inadvertently influenced the training process, or that the validation set is not representative of real-world data. This can create an artificial boost in validation performance, masking underlying issues such as overfitting or data leakage. We also added concrete examples of corrective actions—including data augmentation, regularization, cross-validation, and robust data preprocessing—that can help prevent or address these issues.
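A simple automated version of this red-flag check is sketched below; it assumes a Keras-style history dictionary with "accuracy" and "val_accuracy" keys, and the margin and epoch-fraction thresholds are illustrative choices rather than values taken from the manuscript.

```python
def flag_suspicious_learning_curve(history, margin=0.02, min_fraction=0.5):
    """Flag runs where validation accuracy persistently exceeds training accuracy.

    history: dict with "accuracy" and "val_accuracy" lists (e.g., keras History.history).
    margin: how much validation accuracy must exceed training accuracy to count.
    min_fraction: fraction of epochs that must be suspicious to raise the flag.
    """
    train, val = history["accuracy"], history["val_accuracy"]
    suspicious = [v - t > margin for t, v in zip(train, val)]
    fraction = sum(suspicious) / len(suspicious)
    return fraction >= min_fraction, fraction

# Hypothetical learning curve, for illustration only.
hist = {"accuracy":     [0.55, 0.63, 0.70, 0.74, 0.77],
        "val_accuracy": [0.66, 0.72, 0.78, 0.81, 0.83]}
flagged, frac = flag_suspicious_learning_curve(hist)
print(f"red flag: {flagged} (suspicious in {frac:.0%} of epochs)")
```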
Comment 10: Regarding the composite scores, compare scores to baseline models or human performance.
Response 10: We thank the reviewer for this valuable suggestion. In the revised manuscript, we report composite clinical scores for our model and discuss their clinical implications in detail. However, we acknowledge that a direct comparison to baseline models or human performance was not included in this version. Our primary focus was to demonstrate the utility of composite metrics and standardized validation protocols in revealing the true generalizability of machine learning models for knee osteoarthritis grading. We agree that benchmarking against baseline models or human readers would further contextualize the composite scores and strengthen the clinical relevance of our findings. We have highlighted this as an important direction for future work and appreciate the reviewer’s insight in guiding the continued development of our evaluation framework.
Comment 11: Regarding the WEAS Evaluation, please discuss limitations (e.g., assumes linear misclassification cost; real-world costs may be nonlinear).
Response 11: In the updated version, we explicitly discuss the limitation of the original WEAS formulation, which weights endpoint classes more heavily but assumes a linear aggregation of per-class accuracies. To address the reviewer’s point, we added an extension to the metric that incorporates a misclassification penalty based on the distance between true and predicted classes (i.e., |i–j|), thereby allowing for nonlinear penalization of errors that are further from the true class. We then define a composite utility function, U = λ·WEAS – (1–λ)·Penalty, where λ balances the importance of endpoint accuracy against the overall misclassification cost. This approach enables the evaluation metric to better reflect the nonlinear and context-dependent nature of clinical misclassification costs. By including this discussion and the extended formula, we now explicitly acknowledge the limitation raised by the reviewer and show how it can be mitigated.
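A minimal sketch of the extended metric described above follows; it assumes the penalty term is the confusion-matrix-weighted mean of |i − j| normalized by the maximum possible grade distance, and the example λ, WEAS value, and confusion matrix are illustrative assumptions rather than the manuscript's exact definitions.

```python
import numpy as np

def distance_penalty(conf_mat):
    """Mean |i - j| misclassification distance, normalized to [0, 1]."""
    cm = np.asarray(conf_mat, dtype=float)
    n = cm.shape[0]
    i, j = np.indices(cm.shape)
    mean_distance = np.sum(cm * np.abs(i - j)) / np.sum(cm)
    return mean_distance / (n - 1)  # worst case: every sample off by n-1 grades

def composite_utility(weas, penalty, lam=0.7):
    """U = lam * WEAS - (1 - lam) * Penalty, as in the response above; lam is illustrative."""
    return lam * weas - (1 - lam) * penalty

# Hypothetical confusion matrix (KL grades 0-4), for illustration only.
cm = [[38, 2, 0, 0, 0],
      [ 5, 30, 5, 0, 0],
      [ 0, 6, 28, 6, 0],
      [ 0, 0, 4, 32, 4],
      [ 0, 0, 0, 3, 37]]
weas = 0.85  # placeholder WEAS value for this example
p = distance_penalty(cm)
print(f"penalty = {p:.3f}, U = {composite_utility(weas, p):.3f}")
```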
Comment 12: Address potential trade-offs (e.g., high sensitivity vs. specificity in screening vs. diagnostics).
Response 12: We revised the manuscript to explicitly discuss these issues. In the revised version, we expanded the clinically oriented evaluation protocol to clarify how composite metrics can be tailored to reflect different clinical priorities. Specifically, we now explain that the weighting coefficients for sensitivity and specificity in our composite model score should be chosen based on the intended clinical application: for example, prioritizing sensitivity in population-level screening to minimize missed cases, or emphasizing specificity in individual diagnostics to avoid unnecessary interventions. We provide concrete examples, such as cancer screening (where high sensitivity is critical) versus confirmatory diagnostics (where specificity may be more important), and highlight that the optimal balance between these metrics is context-dependent. By making these trade-offs explicit and justifying the selection of metric weights according to clinical use case, we ensure that our evaluation framework aligns with real-world decision-making and addresses the reviewer’s concern about the nuanced balance between sensitivity and specificity in different healthcare scenarios.
Comment 13: Moreover, small dataset sizes (n=200) may limit generalizability; please discuss this.
Response 13: We appreciate the reviewer’s attention to the issue of dataset size and its potential impact on generalizability. We would like to clarify that the confusion matrix and associated metrics (such as WEAS) are computed using the independent test set, not the training data. Therefore, the number of instances in the confusion matrix reflects the size of the test set used for evaluation, which is separate from the training set. This distinction is important because it ensures that our reported metrics genuinely assess the model’s ability to generalize to unseen data, rather than its performance on the data it was trained on. We have explicitly clarified this point in the revised manuscript (see Section 3.5) to avoid any confusion and to reinforce the validity of our generalizability assessment.
Comment 14: Finally, I found a lack of prospective validation in clinical workflows.
Response 14: We thank the reviewer for highlighting the importance of prospective validation in clinical workflows. We agree that while our current study establishes a standardized validation framework and demonstrates robust external validation using independent datasets, true clinical utility ultimately requires prospective assessment within real-world clinical environments. In the revised manuscript, we have explicitly acknowledged this limitation and now discuss the need for future research focused on integrating validated models into clinical workflows and evaluating their ongoing impact on patient outcomes and care efficiency (see Section 4.3, Limitations and Future Research Directions). We view our work as a necessary foundation for such prospective studies and fully recognize that real-world deployment and continuous monitoring are essential for translating machine learning models into safe and effective clinical practice.
Comment 15: Add actionable recommendations (e.g., "Adopt WEAS for endpoint-critical tasks"). Also, mention open challenges (e.g., model interpretability for clinicians).
Response 15: We thank the reviewer for suggesting the inclusion of actionable recommendations and a discussion of open challenges. In the revised manuscript, we now explicitly recommend the adoption of the Weighted Endpoint Accuracy Score (WEAS) and composite clinical utility metrics for evaluating ML models in healthcare, particularly for tasks where accurate classification of clinically critical endpoints is essential. These metrics are designed to better align model evaluation with real-world clinical decision-making and patient safety priorities. Additionally, we have expanded the discussion of open challenges in Section 4.3, highlighting the need for further research on model interpretability for clinicians, integration of validated models into clinical workflows, and the development of standardized benchmarks for external validation. We also acknowledge that while our composite metrics were designed to reflect clinical reasoning, future work should incorporate structured clinician input to further enhance their relevance and utility in practice.
Comment 16: Share code/data for reproducibility.
Response 16: We appreciate the reviewer’s emphasis on reproducibility, which is a cornerstone of robust scientific research. In this manuscript, the example presented is intended as a demonstration of a standardized validation framework, using knee osteoarthritis grading as a case study. The specific model architecture, training protocol, and evaluation metrics closely follow our previously published work ([38] in the manuscript), which provides detailed methodological descriptions sufficient for readers to reproduce the results. As for the code, the implementation is based on standard deep learning frameworks (Keras/TensorFlow), and all model parameters, training procedures, and evaluation metrics are specified in that manuscript. The referenced prior publication ([39]) also provides further technical details. Given that the purpose of this work is to illustrate the validation framework rather than to introduce a novel algorithm, we believe that the level of detail provided, together with the use of public datasets and established methods, is sufficient for reproducibility. Regarding data availability, both datasets used in this study are publicly accessible and have been explicitly referenced in the manuscript (see references [40] and [41]). These datasets are widely used in the community and can be freely downloaded by any interested reader. The data preprocessing steps, model configuration (including all hyperparameters), and evaluation protocols are described in detail in the Materials and Methods section, ensuring that the workflow can be replicated.
Comment 17: Involve clinicians in metric design (e.g., WEAS weights) to ensure relevance.
Response 17: We appreciate the reviewer’s suggestion to involve clinicians in the design of evaluation metrics such as WEAS to ensure clinical relevance. In the revised manuscript, we have clarified the clinical rationale behind our weighting choices and composite metrics, explicitly aligning them with established clinical priorities for knee osteoarthritis grading—namely, the critical importance of accurately distinguishing healthy and severely affected cases. However, we also acknowledge as a limitation that our current formulas and weighting strategies were not yet formally validated through direct clinician input or structured expert consensus. We now explicitly state this in the manuscript (Section 4.3, Limitations) and outline plans to incorporate expert clinician feedback or a formal consensus process in future work to further optimize and validate these evaluation metrics for real-world clinical use.
Comment 18: Discuss how the framework applies beyond osteoarthritis (e.g., cancer grading).
Response 18: We thank the reviewer for highlighting the importance of generalizability beyond osteoarthritis. In the revised manuscript, we have clarified that our standardized validation framework is intentionally designed for broad applicability across a wide range of medical classification tasks, not just knee osteoarthritis. We now explicitly discuss how the framework and composite evaluation metrics can be adapted to other domains, such as cancer grading and diabetic retinopathy screening, by tailoring the weighting of sensitivity and specificity to the clinical context. For example, we illustrate how prioritizing sensitivity is critical in cancer screening to avoid missed malignancies, and we emphasize that the selection of metric weights should reflect the specific risks and priorities of each application. By providing these concrete examples and discussing the context-dependent nature of clinical evaluation, we demonstrate that our methodology is relevant and adaptable to diverse healthcare AI challenges beyond the case study presented.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The efforts that the authors have invested in the careful consideration and resolution of the raised concerns have to be acknowledged. In particular, the willingness to limit the scope of the paper to a narrower field and to define the scope and limitations more precisely has resulted in a much improved form of the paper. Only minor remarks remain:
- In the much improved subsection 1.1 regarding novelty, the first sentence still appears too wide and unfocused, more in line with the previous title than with the current one. Please reconsider.
- The Conclusions section is much improved and significantly more meaningful. The first subparagraph, which remained unchanged from the previous version, contains statements that are general and too strong. Please reconsider and try to soften some wide, general, and strict statements that undermine achievements in this very active research field. Just consider a slight rephrasing.
- The abstract and title are much improved. In line with this, improve the keywords by replacing/adding something that limits the previous scope, for example Knee Osteoarthritis Grading.
- I would still prefer numbering of all equations, but I leave the decision to the authors.
The response to these minor remaining remarks does not affect my positive recommendation. With pleasure, I recommend that the paper be published in its current form, after minor final (optional) edits.
Author Response
We thank the reviewer for their thoughtful and constructive feedback, as well as for recognizing our efforts to address the previous concerns and improve the manuscript. We greatly appreciate your positive recommendation and your helpful suggestions regarding scope, clarity, and precision throughout the paper. Please note that all subsequent edits made in this second revision can be found in the manuscript highlighted in blue color. Thank you again for your valuable input and support during the review process.
Comment 1: In the much improved subsection 1.1 regarding novelty, the first sentence still appears too wide and unfocused, more in line with the previous title than with the current one. Please reconsider.
Response 1: We thank the reviewer for this helpful suggestion. We agree that the original first sentence of subsection 1.1 was overly broad and did not sufficiently reflect the focus and novelty of our work. In response, we have revised the opening sentence to more clearly articulate our main contribution and its alignment with the manuscript title. The revised text now reads: “The primary goal of this article is to present a standardized validation framework for developing clinically actionable healthcare machine learning models, specifically demonstrated through the case study of knee osteoarthritis grading. By focusing on the risk of deceptively high in-domain accuracy, we emphasize the necessity of protocols and evaluation metrics that align model performance with true clinical utility and external validity.”
Comment 2: The Conclusions section is much improved and significantly more meaningful. The first subparagraph, which remained unchanged from the previous version, contains statements that are general and too strong. Please reconsider and try to soften some wide, general, and strict statements that undermine achievements in this very active research field. Just consider a slight rephrasing.
Response 2: We thank the reviewer for this thoughtful and constructive suggestion. We appreciate the importance of accurately reflecting the achievements and ongoing progress in the field, and we agree that our original wording could have been interpreted as overly broad. In response, we have carefully revised the first paragraph of the Conclusions section to clarify that the analysis of learning dynamics is intended as a complementary tool to established performance metrics, rather than as a replacement or as the sole criterion for model assessment. The new text also explicitly acknowledges that both learning dynamics and final performance metrics provide distinct insights into model reliability, and that a combined approach is recommended for clinical ML evaluation. The revised paragraph appears at the start of Section 5. Conclusions.
Comment 3: The abstract and title are much improved. In line with this, improve the keywords by replacing/adding something that limits the previous scope, for example Knee Osteoarthritis Grading.
Response 3: We thank the reviewer for this helpful suggestion. We agree that specifying “Knee Osteoarthritis Grading” in the keywords will better reflect the scope and focus of our manuscript. In response, we have revised the keywords to explicitly include “Knee Osteoarthritis Grading” and related terms, thereby making the topic and context of our case study more immediately apparent to readers. The updated keywords now appear immediately following the abstract.
Reviewer 4 Report
Comments and Suggestions for Authors
The authors have addressed my questions. I recommend approving the manuscript.
Author Response
We thank the reviewer for their thoughtful and constructive feedback, as well as for recognizing our efforts to address the previous concerns and improve the manuscript. We greatly appreciate your positive recommendation and your helpful suggestions regarding scope, clarity, and precision throughout the paper.