Modification and Validation of the System Causability Scale Using AI-Based Therapeutic Recommendations for Urological Cancer Patients: A Basis for the Development of a Prospective Comparative Study
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper evaluates the validity of AI-based treatment options for urologic cancers. The reviewer would like to offer the following critiques.
1. The primary endpoint of this study is unclear. Is it a preparatory study to lay the groundwork for a prospective study, or is the primary endpoint to establish a new validated scoring system (mSCS) to meet the requirements of this study and future similar studies?
2. The reviewer is not sure what “Ultimately” means in line 94.
3. The contents of the abstract are inconsistent. In particular, it is not clear what “this study” in line 27 refers to; “In summary,” is similarly unclear.
4. If it is a preliminary study, it should be reconstructed as an endpoint. The entire contents are difficult to understand.
Author Response
Manuscript ID: curroncol-3284670 – re.1
Prof. Dr. Shahid Ahmed
Editor-in-Chief, Current Oncology
Dear Prof. Dr. Ahmed,
We greatly appreciate the reviewers’ comments and suggestions concerning our manuscript “Modification and Validation of the System Causability Scale Using AI-Based Therapeutic Recommendations for Urological Cancer Patients: A Basis for the Development of a Prospective Comparative Study” by Emily Rinderknecht et al., which we would like to resubmit for publication in Current Oncology.
We have incorporated all points to the best of our ability in accordance with the reviewers’ proposals. All changes in the revised version of our manuscript have been marked as requested (in red). Please find attached a point-by-point reply to each remark, with a description of and reference to the amendments made in the revised version of the manuscript.
We wish to thank both reviewers for their helpful recommendations, which have improved our manuscript, and we thank you for your consideration.
Kind regards,
Matthias May* (on behalf of all authors)
* Corresponding Author:
Prof. Dr. Matthias May; Department of Urology, St. Elisabeth Hospital Straubing, Brothers of Mercy Hospital, Straubing, Germany; email: matthias.may@klinikum-straubing.de
Reviewers’ comments
Reviewer #1 (four points):
Point 1: The primary endpoint of this study is unclear. Is it a preparatory study to lay the groundwork for a prospective study, or is the primary endpoint to establish a new validated scoring system (mSCS) to meet the requirements of this study and future similar studies?
Response: We appreciate the reviewer’s insightful question regarding the study’s primary objectives. To clarify, our study does not have a singular primary endpoint but instead comprises multiple "study objectives" (SOs), each essential for establishing the methodological foundation required for the blinded, prospective CONCORDIA study. These SOs were carefully defined to ensure that all necessary preliminary work is completed before the larger study commences. Each SO serves a distinct purpose in this preparatory context:
- Adaptation of the System Causability Scale (SCS) for Clinical Oncology: We have modified the SCS to create the mSCS, a tool specifically designed to assess therapeutic recommendations within an oncology setting. This adjustment was pivotal for evaluating LLM recommendations in a way that is clinically relevant.
- Validation of the Modified Scale (mSCS): This study rigorously validates the mSCS against the original SCS, confirming that it meets high standards of reliability and applicability for use in future oncological studies, including the CONCORDIA trial.
- Selection and Delphi-Based Evaluation of Suitable LLMs: We employed a Delphi process to identify LLMs (ChatGPT-4 and Claude 3.5 Sonnet) that align with our clinical objectives and can reliably support oncological decision-making in the forthcoming trial.
- Determination of a Non-Inferiority Threshold: Through a consensus-based Delphi process, we established a clinically meaningful threshold of 0.15 in the mSCS to benchmark non-inferiority between MTB and LLM recommendations.
- Sample Size Determination for CONCORDIA: Based on mSCS scores from this study, we have calculated the appropriate sample size for CONCORDIA with 90% statistical power, ensuring that the forthcoming trial is both adequately powered and methodologically sound.
This multifaceted approach reflects the preparatory nature of our study and its focus on laying a comprehensive foundation for the larger CONCORDIA trial, consistent with the scope of Current Oncology.
Point 2: The reviewer is not sure what “Ultimately” means in line 94.
Response: Thank you for noting this ambiguity. We have clarified the phrasing in line 94, making it explicit that "Ultimately" refers to the cumulative goal of these preparatory steps: to create a validated foundation that ensures the efficacy of the forthcoming CONCORDIA study.
The following revision of the final paragraph of the introduction has been made in the revised manuscript (lines 93-96): “The study objectives include: (1) the adaptation of the SCS for specific oncological contexts, (2) the validation of this adapted scale, (3) the Delphi-based selection of LLMs, (4) determination of a non-inferiority threshold for recommendations, and (5) sample size calculation for the prospective CONCORDIA study.”
Point 3: The contents of abstracts are inconsistent. In particular, it is not clear what “this study” in line 27 refers to. “In summary,” is similar.
Response: We apologize for any lack of clarity in the abstract. We have revised the abstract to clearly specify that “this study” refers to our preparatory work for the CONCORDIA trial. The phrase “In summary” has also been adjusted to ensure a coherent and consistent summary that aligns with the study’s purpose.
The following addition has been made in the revised manuscript: “This preparatory study establishes a robust methodological foundation for the forthcoming CONCORDIA trial, including the validation of the System Causability Scale (SCS) and its modified version (mSCS), as well as the selection of LLMs for urological cancer treatment recommendations, based on recommendations from ChatGPT-4 and an MTB for 40 urological cancer scenarios.” (lines 26-30)
Point 4: If it is a preliminary study, it should be reconstructed as an endpoint. The entire contents are difficult to understand.
Response: We appreciate the constructive feedback and agree with the reviewer. The current study is indeed a preliminary study serving as an essential methodological prerequisite for the prospective CONCORDIA trial. This also explains why the study does not have a single endpoint, but rather various study objectives that must all be achieved to enable the successful execution of the planned prospective study. This is also stated in the introduction (see also our response to Point 2). We recognize that the paper incorporates complex methodological components to facilitate a thorough analysis of individual prerequisites (which have been defined in more detail under the study objectives in Point 1). Nevertheless, we embrace this level of rigor to establish optimal conditions for the successful implementation of the planned CONCORDIA study. Given Current Oncology’s mission to publish research that facilitates prospective studies in the field of oncology, we believe this manuscript is well-suited to the journal’s aims and audience.
Reviewer 2 Report
Comments and Suggestions for Authors
I noticed several inaccuracies in the manuscript and also missed responses to critically important questions.
1. The introduction mentions the various applications of artificial intelligence in medicine. However, due to the broad scope, it may be challenging to identify the main focus - specifically, the clear link between the study’s objective and the chosen research direction for LLM integration. I recommend more accurately emphasizing that the study is focused on mSCS reliability and LLM application in urological cancer treatment, rather than general LLM use in medicine.
2. Although important examples of LLM applications in medicine and their challenges are provided, there is a lack of a more consistent review of studies directly related to the treatment of urological cancers and the use of LLM in multidisciplinary teams.
3. The need for modifying SCS to mSCS specifically for urological patients and how this modification will enhance the application of LLM in treatment decisions is not sufficiently detailed. It is recommended to provide a broader explanation of SCS limitations in this area and clearly state the advantages of mSCS in current clinical practice.
4. Although the study objectives are mentioned in the introduction, these objectives could be formulated more concretely, highlighting the main hypotheses. For example, it would be helpful to clearly emphasize whether mSCS is expected to improve the reliability, accuracy, or both aspects of LLM recommendations.
5. While the criteria for choosing LLM (suitability for medical queries and response quality) are mentioned, it would be useful to provide a clearer rationale for how these criteria were quantitatively assessed or why these particular characteristics were selected. More detailed information would help avoid subjectivity and facilitate the study’s replication in other countries or institutions.
6. Currently, there is not enough detailed description of how the standardized recommendations matrix was developed to ensure blinded evaluation of LLM and MTB recommendations. It is recommended to supplement this information by describing each step of the structure creation process, including testing and refinement processes.
7. The mSCS modifications were evaluated by only two experts, which suggests limited objectivity. It is recommended to include a larger number of experts who could assess this new scale’s suitability for clinical practice, which would strengthen the reliability of the results.
8. To evaluate the study’s reliability, I suggest including more statistical methods that would help verify the repeatability and overall reliability of the results (e.g., intraclass correlation coefficient (ICC) as an alternative to Cohen’s kappa).
9. More detailed information should be provided on how the minimum sample size was determined to achieve the desired statistical power (90%). I recommend providing a more precise description, including the expected variance and other values used for the calculation.
10. Although the results are statistically validated, their presentation could be more straightforward and clearer. It may be worth interpreting some of the statistical data and indicators by explaining why, for example, a threshold difference of 0.15 was considered significant.
11. The section structure could be reviewed for easier readability. For instance, some details could be moved to tables or visualizations, making it easier to track statistical data and indicator differences.
12. It would be helpful to briefly discuss the potential significance of the study results for the future if LLM could aid in the decision-making process. This would not only include optimizing clinical decisions but also reducing resource requirements, which could be especially important for peripheral clinics or regions with limited resources.
13. The conclusions could benefit from a brief explanation of how mSCS can be used in practice, for example, what potential this tool has to assist clinical specialists in real-time decision-making or incorporating LLM recommendations into daily clinical practice.
14. Although the study is successful, it would be helpful to briefly mention any limitations of mSCS or challenges that may arise in future research or practical application. For example, it could be noted if the study revealed any areas where AI-generated recommendations were less reliable or more challenging to assess.
Author Response
Reviewers’ comments
Reviewer #2 (14 points):
Point 1: The introduction mentions the various applications of artificial intelligence in medicine. However, due to the broad scope, it may be challenging to identify the main focus - specifically, the clear link between the study’s objective and the chosen research direction for LLM integration. I recommend more accurately emphasizing that the study is focused on mSCS reliability and LLM application in urological cancer treatment, rather than general LLM use in medicine.
Response: Thank you for this valuable recommendation. Our intent was to provide a comprehensive AI framework specific to oncology to highlight the potential impact of LLMs on clinical decision-making. We agree, however, that further emphasis on the study’s specific focus on mSCS reliability and LLM applicability to urological cancer treatment would enhance clarity. We have revised the introduction accordingly to refine the scope and link it more directly to these central objectives (lines 71-82).
Point 2: Although important examples of LLM applications in medicine and their challenges are provided, there is a lack of a more consistent review of studies directly related to the treatment of urological cancers and the use of LLM in multidisciplinary teams.
Response: We appreciate this observation. However, we would like to kindly point out that, as noted earlier, the manuscript is somewhat complex owing to the various necessary study objectives. We have therefore focused the discussion on the core topic, namely the use of LLMs in decision-making within (urological) MTBs. In this context, we provide a comprehensive discussion of various studies on this topic across different specialties, addressing both the potential opportunities in the application of LLMs and their possible weaknesses, limitations, and pitfalls, while relating the discussed findings to the current study (lines 409-427). To the best of our knowledge, there are currently no studies on the use of LLMs in GUC MTBs or on direct comparisons of LLM recommendations with real MTB recommendations. This point has also been addressed in the discussion (lines 409-412).
Point 3: The need for modifying SCS to mSCS specifically for urological patients and how this modification will enhance the application of LLM in treatment decisions is not sufficiently detailed. It is recommended to provide a broader explanation of SCS limitations in this area and clearly state the advantages of mSCS in current clinical practice.
Response: Thank you for highlighting this aspect. The modification of the SCS was one of our study objectives, aimed at creating a more clinically applicable scoring system (mSCS) that aligns with the unique needs of urological cancer patients. This study has confirmed that the mSCS offers improved reliability and clinical relevance compared to the SCS, with independent experts noting its effectiveness in enhancing LLM-based recommendations. We will add a section explaining the specific limitations of the SCS and the corresponding advantages of the mSCS in urological oncology, clarifying its potential impact on clinical decision-making (lines 428-444).
The following addition has been made in the revised manuscript: “To assess the quality of AI-generated explanations, especially in the context of scientific model development, Holzinger et al. introduced the System Causability Scale (SCS) in 2020 [8]. The SCS quantifies explainability based on responses to 10 questions, each rated on a 5-point Likert scale. Its simplicity and status as a standardized tool make it highly useful for evaluating AI- and LLM-generated explanations [8]. A major limitation of this method, however, is that while the general nature of the questions allows the scale to be applied across various medical specialties, the individual items are not ideally suited to specifically assessing the quality of a therapeutic recommendation. Our goal, therefore, was to modify the individual items so that they are precisely tailored to evaluating therapeutic recommendations for GUC patients within the scope of an MTB, enabling reviewers to provide an assessment that is intuitive, accurate, and reproducible. Our results confirm strong validity, reliability (all aggregated Cohen’s K > 0.74), and internal consistency (all Cronbach’s Alpha > 0.9) for both scales. However, compared to the SCS, the mSCS demonstrated superior reliability, internal consistency, and clinical applicability (p < 0.01), leading us to conclude that this tool is highly suitable for assessing therapeutic recommendations within the framework of the planned CONCORDIA study.”
Point 4: Although the study objectives are mentioned in the introduction, these objectives could be formulated more concretely, highlighting the main hypotheses. For example, it would be helpful to clearly emphasize whether mSCS is expected to improve the reliability, accuracy, or both aspects of LLM recommendations.
Response: We appreciate this helpful suggestion. To provide greater clarity, we will refine the study objectives to emphasize how the mSCS aims to improve reliability and accuracy in LLM recommendations. By articulating these specific goals, we hope to make the anticipated contributions of our study more explicit for the reader.
The following revision of the final paragraph of the introduction has been made in the revised manuscript (lines 93-96): “The study objectives include: (1) the adaptation of the SCS for specific oncological contexts, (2) the validation of this adapted scale, (3) the Delphi-based selection of LLMs, (4) determination of a non-inferiority threshold for recommendations, and (5) sample size calculation for the prospective CONCORDIA study.”
With respect, it is, in our view, not strictly necessary to support the stated study objectives with specific hypotheses in this preparatory study.
Point 5: While the criteria for choosing LLM (suitability for medical queries and response quality) are mentioned, it would be useful to provide a clearer rationale for how these criteria were quantitatively assessed or why these particular characteristics were selected. More detailed information would help avoid subjectivity and facilitate the study’s replication in other countries or institutions.
Response: Thank you for your observation. Our selection of LLMs followed a structured, four-round Delphi process, detailed in both the Results and Supplementary Materials sections (lines 131-145 and lines 252-264). Allowing each rater to assign one or two points in an anonymous vote based on predefined selection criteria resulted in a transparent, clear, and comprehensible decision. Moreover, the manuscript already provides explicit and transparent reasons, discussed within the framework of the Delphi process, for the exclusion of the other LLMs (lines 259-264). Therefore, we do not see any further way to increase the transparency of the decision-making process.
Point 6: Currently, there is not enough detailed description of how the standardized recommendations matrix was developed to ensure blinded evaluation of LLM and MTB recommendations. It is recommended to supplement this information by describing each step of the structure creation process, including testing and refinement processes.
Response: We appreciate this request for additional detail regarding the development of the standardized recommendations matrix. In our study, the matrix was carefully designed through a multi-stage process to ensure that raters were genuinely blinded to the source of recommendations (MTB vs. LLM). We will expand the Methods section to outline each step of this process, including the initial formulation, pilot testing, iterative refinement, and finalization stages. This addition should clarify how the blinded evaluation was rigorously implemented.
The following addition has been made in the revised manuscript (lines 157-160): “The recommendation matrix was developed through iterative testing, involving pilot studies and refinements to ensure that evaluators remained blinded to the source of recommendations. Each step aimed to maintain an objective comparison between LLM and MTB outputs.”
And (lines 279-289): “The following matrix was created to enable a blinded rating of the recommendations:
1.) Preferred therapy recommendation (if available)
2.) Therapy alternatives
3.) Justification of the recommendations
4.) Supportive measures / supplementary therapies
5.) Further information / explanations
The content of the corresponding MTB or LLM recommendation was manually inserted into the matrix in bullet points. The bullet point approach ensures that possible recurring phrases or ways of formulating do not invalidate the blinding. The recommendations focused exclusively on tumor therapy, while guidance on other coexisting conditions unrelated to the tumor was excluded.”
Point 7: The mSCS modifications were evaluated by only two experts, which suggests limited objectivity. It is recommended to include a larger number of experts who could assess this new scale’s suitability for clinical practice, which would strengthen the reliability of the results.
Response: We appreciate this thoughtful remark. However, we would like to point out that the assessment of the mSCS in terms of clinical applicability was conducted not by two, but by four raters (ER, MH, DvW, and AK; lines 191-194). As outlined in the manuscript, all four raters independently assessed the mSCS using a 5-point Likert scale regarding its clinical usability and applicability compared to the SCS (which was assumed to be rated at 3). The mean of these four ratings was then calculated and statistically compared to the SCS, revealing that the mSCS was rated significantly higher (p=0.006) than the SCS (lines 191-196 and lines 374-377). It can generally be assumed that four independent ratings provide sufficient statistical power to draw this conclusion.
Point 8: To evaluate the study’s reliability, I suggest including more statistical methods that would help verify the repeatability and overall reliability of the results (e.g., intraclass correlation coefficient (ICC) as an alternative to Cohen’s kappa).
Response: We appreciate this suggestion and will include the intraclass correlation coefficient (ICC) as an additional measure of reliability, supplementing our use of Cohen’s kappa. This addition should provide a more comprehensive statistical assessment of reliability, and we will update both the Methods and Results sections accordingly.
The following addition has been made in the revised manuscript (Methods; lines 226-230 and lines 238-240): “The intraclass correlation coefficient (ICC) was calculated as an additional reliability measure, confirming substantial agreement among raters and complementing the results obtained from Cohen’s kappa [25]. ICC calculations were based on a mean rating (k=2), absolute-agreement, 2-way random-effects model.“
and
“The ICC was interpreted using the following classification of reliability proposed by Koo and Li [27]: <0.5 poor, 0.5-0.75 moderate, 0.75-0.9 good, >0.9 excellent.”
The following addition has been made in the revised manuscript and an additional Table 3b was created (Results; lines 331-352 and lines 354-355): “To assess the agreement between the two independent raters, Cohen’s Kappa and ICC were calculated. The kappa (Κ) values are shown in Table 3a. The ICC values are shown in Table 3b.
Regarding the SCS rating of the MTB recommendations, kappa values of 0.7 to 1.0 (p < 0.001) were obtained for the individual items. The pooled analysis resulted in Κ = 0.90 (p < 0.001). The corresponding ICC values were 0.83 to 1.0 (p < 0.001) for the individual items and 0.95 for the pooled analysis. With regard to the SCS rating of the LLM recommendations, kappa values of 0.65 to 0.90 (p < 0.001) were obtained for the individual items. The pooled analysis resulted in Κ = 0.74 (p < 0.001). The corresponding ICC values were 0.83 to 0.95 (p < 0.001) for the individual items and 0.85 for the pooled analysis. In summary, substantial to almost perfect interrater reliability was shown using Cohen’s Kappa for the SCS across all items. Good to excellent interrater reliability was shown using ICC for the SCS across all items.
For the mSCS ratings regarding the MTB recommendation, the kappa values for all items were at least Κ = 0.75, indicating at least substantial agreement. In line with this, the ICC values were at least 0.86, which corresponds to good interrater reliability. For the mSCS ratings regarding the LLM recommendation, slightly more dispersion was observed. The lowest kappa value obtained was Κ = 0.65, which still indicates substantial agreement. The lowest ICC value obtained was 0.79, which likewise indicates good reliability. In the pooled analysis of interrater reliability across all items of the mSCS, Κ = 0.95 (p < 0.001) and ICC 0.97 (p < 0.001) were obtained for the MTB recommendations and Κ = 0.81 (p < 0.001) and ICC 0.89 (p < 0.001) for the LLM recommendations (Table 3).”
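As an illustrative aside, the mean-rating (k=2), absolute-agreement, two-way random-effects ICC described in the Methods addition can be computed directly from an ANOVA decomposition. The sketch below uses hypothetical ratings, not the study's data, and applies the Koo and Li cutoffs for interpretation.

```python
# Illustrative sketch (hypothetical ratings, not the study's data) of the
# mean-rating (k=2), absolute-agreement, two-way random-effects ICC,
# interpreted with the Koo & Li (2016) cutoffs.

def icc_2way_absolute_avg(scores):
    """scores: one row per rated case, one column per rater."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)   # cases
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)   # raters
    sse = sum((scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                                # residual
    # ICC(A,k) in McGraw & Wong's terminology
    return (msr - mse) / (msr + (msc - mse) / n)

def koo_li_label(icc):
    # Koo & Li: <0.5 poor, 0.5-0.75 moderate, 0.75-0.9 good, >0.9 excellent
    return ("poor" if icc < 0.5 else "moderate" if icc < 0.75
            else "good" if icc < 0.9 else "excellent")
```

Note that the absolute-agreement variant penalizes a systematic offset between raters: two raters who always differ by one point score below 1.0 here, whereas a consistency-type ICC would ignore the offset.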
Point 9: More detailed information should be provided on how the minimum sample size was determined to achieve the desired statistical power (90%). I recommend providing a more precise description, including the expected variance and other values used for the calculation.
Response: Thank you for this recommendation. Our sample size calculation was indeed conducted to achieve a 90% power, and we will provide further details on the expected variance, alpha level, and other parameters. We have consulted with our biostatistician, Florian Zeman, to ensure that this description is precise and fully transparent.
The following addition has been made in the revised manuscript (Results; lines 320 – 324): “To show non-inferiority of the LLM compared to the MTB with an expected difference of Δ = 0.095 ± 0.1445 between both assessments (paired design) at a non-inferiority margin of 0.15 with a power of 90% (beta=0.1) at a one-sided 1.25% level of significance, a total of 87 cases are needed for statistical analyses (overall p-value: 0.025 (one-sided)).“
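As a rough cross-check of the figure quoted above, the standard normal-approximation formula for a paired non-inferiority design, n = ((z_{1-α} + z_{1-β}) · SD / (margin − Δ))², can be evaluated with the stated parameters. This is our own sketch, not the biostatistician's exact calculation, so the result may differ from the reported 87 by a case or two depending on the software's adjustments.

```python
# Normal-approximation sketch of the paired non-inferiority sample size
# quoted above (a cross-check, not the biostatistician's exact method).
from math import ceil
from statistics import NormalDist

def noninferiority_n_paired(delta, sd, margin, alpha, power):
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # one-sided significance level
    z_beta = z.inv_cdf(power)        # power = 1 - beta
    effect = margin - delta          # distance from expected difference to margin
    return ceil(((z_alpha + z_beta) * sd / effect) ** 2)

# Parameters stated in the manuscript: expected difference 0.095, SD 0.1445,
# non-inferiority margin 0.15, one-sided alpha 0.0125, power 0.90.
n = noninferiority_n_paired(0.095, 0.1445, 0.15, 0.0125, 0.90)
```

The approximation lands within one or two cases of the reported n = 87; exact or adjusted methods round slightly differently.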
Point 10: Although the results are statistically validated, their presentation could be more straightforward and clearer. It may be worth interpreting some of the statistical data and indicators by explaining why, for example, a threshold difference of 0.15 was considered significant.
Response: We appreciate the feedback on the presentation of our statistical data. The threshold difference of 0.15 was selected based on a consensus achieved through a Delphi process with our study team. As part of the Delphi process, various thresholds for when one recommendation should be considered inferior to another were discussed, based on five different patient cases. During this process, all items of the mSCS were evaluated for each patient scenario to discuss corresponding differences and to determine when a recommendation should be considered inferior. After comprehensive discussion, the candidate cutoffs were set at differences of 3, 5, 8, or 10 points, based on a maximum score of 50 (corresponding to thresholds of 0.05, 0.1, 0.15, and 0.2). Ultimately, the decision was made through an anonymous voting process, assigning either one or two points per rater, to ensure a comprehensible, clear, and transparent outcome. In this voting process, the cutoff of 8 points emerged as the clearly favored value (reflected by a total score of 9 points vs. 5, 3, and 1 point for the other cutoff values; lines 308-309). In practice, this means that if a rater strongly disagreed with the LLM's recommendation on 2 items but strongly agreed with the real MTB's recommendation on those same items, the LLM's recommendation would be considered inferior; the total of 8 points may, of course, also accumulate across other items. While large parts of the process are already represented in the manuscript (lines 212-223 and lines 299-312), we will add information in the revision to further clarify this threshold and provide interpretive context, referencing examples from clinical cases where this margin was deemed appropriate, as discussed during our Delphi rounds.
The following passages were added to the manuscript (lines 304-306 and lines 312-317): “This resulted in the considered non-inferiority cutoffs at differences of 3 points, 5 points, 8 points, or 10 points, based on a maximum score of 50 (corresponding to a threshold of 0.05, 0.1, 0.15, and 0.2).”
and
“To put this abstract number into a concrete context, the threshold of 0.15 corresponds to an absolute difference of 8 points on the SCS and mSCS scales, respectively. For example, if a rater strongly disagreed with the LLM's recommendation on 2 items but strongly agreed with the real MTB's recommendation on those same items, the LLM's recommendation would be considered inferior. Naturally, the total of 8 points may also result from differences across other items.”
Point 11: The section structure could be reviewed for easier readability. For instance, some details could be moved to tables or visualizations, making it easier to track statistical data and indicator differences.
Response: Thank you for the constructive suggestion. We will assess the structure and move some statistical details into tables or visual diagrams, where possible. This should facilitate easier navigation and enhance readability, as you recommended (for example the newly created Table 3b).
Point 12: It would be helpful to briefly discuss the potential significance of the study results for the future if LLM could aid in the decision-making process. This would not only include optimizing clinical decisions but also reducing resource requirements, which could be especially important for peripheral clinics or regions with limited resources.
Response: We appreciate this thoughtful recommendation and agree that the study’s potential implications for future clinical applications merit further discussion. Although this study is preparatory, we will add a brief paragraph in the Discussion section outlining potential benefits of LLM integration in clinical workflows, especially in settings with limited resources.
The following addition has been made in the revised manuscript (lines 465-468): “Our findings underscore the potential of LLMs to aid clinical decision-making in oncology, particularly in resource-limited settings where access to multidisciplinary expertise may be constrained. The mSCS validated in this study could serve as a framework for integrating AI-supported recommendations in real-time clinical practice.”
Point 13: The conclusions could benefit from a brief explanation of how mSCS can be used in practice, for example, what potential this tool has to assist clinical specialists in real-time decision-making or incorporating LLM recommendations into daily clinical practice.
Response: Thank you for this suggestion. We will expand the Conclusions to briefly explain how the validated mSCS could assist clinical specialists in real-time decision-making and in integrating LLM recommendations. This addition will clarify the practical applicability of our findings.
The following addition has been made in the revised manuscript (lines 486-488): “The validated mSCS offers clinicians a structured tool for assessing LLM recommendations in oncological practice, with potential applications in real-time decision-making and enhancing the reliability of AI-derived guidance.”
Point 14: Although the study is successful, it would be helpful to briefly mention any limitations of mSCS or challenges that may arise in future research or practical application. For example, it could be noted if the study revealed any areas where AI-generated recommendations were less reliable or more challenging to assess.
Response: We agree with the reviewer that a discussion of potential limitations is essential. However, we would like to draw the reviewer’s attention to the fact that we have already outlined the key limitations of our study in a separate section of the Discussion. We will add a short paragraph to the Discussion acknowledging possible challenges, such as areas where LLM recommendations may lack precision or require further validation. This addition will provide a balanced view of our study’s contributions and the ongoing need for refinement.
The following addition has been made in the revised manuscript (lines 479-482): “Despite its successful validation, the mSCS may encounter challenges in cases where LLM recommendations lack specificity or clinical nuance, especially in complex or ambiguous clinical scenarios. Further studies are warranted to refine these limitations in broader oncological contexts.”
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
None.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript has been revised.