Review
Peer-Review Record

Artificial Intelligence in Image Assisted Radiation Oncology

Cancers 2026, 18(11), 1715; https://doi.org/10.3390/cancers18111715
by He Wang 1,*, Yao Zhao 1, Xinru Chen 1, Brigid McDonald 1, Yunxiang Li 2, Jiacheng Xie 2, Dong Joo Rhee 1, Tze Yee Lim 1, Tucker J. Netherton 1, Jack Phan 3, Michael T. Spiotto 3 and Mu-Han Lin 2
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 15 April 2026 / Revised: 15 May 2026 / Accepted: 21 May 2026 / Published: 25 May 2026
(This article belongs to the Special Issue Image-Assisted High-Precision Radiation Oncology)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This is an important paper summarizing the current status of the use of AI and AI modeling in radiation oncology. This is important for all radiation oncologists and physicists to read. The age of AI in radiation oncology has arrived and a fundamental and working knowledge of AI is essential for all in our discipline. This is an exceptionally well written manuscript with extraordinary references.

Author Response

This is an important paper summarizing the current status of the use of AI and AI modeling in radiation oncology. This is important for all radiation oncologists and physicists to read. The age of AI in radiation oncology has arrived and a fundamental and working knowledge of AI is essential for all in our discipline. This is an exceptionally well written manuscript with extraordinary references.

Response: Thank you!

Reviewer 2 Report

Comments and Suggestions for Authors

The topic of this manuscript is timely and very relevant. It sets out to review applications of AI and radiomics throughout the radiotherapy workflow, including diagnosis and treatment planning, delivery, quality assurance, and response assessment and comprises a bibliometric analysis of the area based on PubMed publications over 2015–2025. The broad clinical ambition, workflow-focused framing, and the explicit discussion of practical challenges (heterogeneity of data, lack of interpretability, lack of generalizability, workflow integration, privacy, and regulation) are the main strengths of the paper. Figure 1 provides another valuable overview of the role of imaging in radiotherapy.

Major comments:

  1. Please specify the review design and ensure that methods are reproducible. It claims to have performed a PubMed search and a so-called systematic bibliometric analysis of the literature published since 2015 and up to 2025, although the description of the method is too brief at this point to allow reproducibility. Please state the complete Boolean search query, search date, inclusion criteria, search strategy, presence of duplicate records, manual or automated screening, bibliometric analysis, and network of authors and keywords in the software or pipeline used. In its written form, the reader is not clearly differentiated between the bibliometric corpus and the studies reviewed in the narrative review sections.
  2. The introduction could be made stronger by referring to recent research on generative AI in biomedical imaging research, specifically Hsieh, Shang-Chin, et al. "Generative AI and Scientific Authorship: A New Paradigm for Biomedical Imaging Research." International Journal of Software Science and Computational Intelligence (IJSSCI) 17.1 (2025): 1-25. This source is topical since the article presents AI as a fast-developing power of image-guided radiation oncology and presents issues connected to transparency, application, and responsible use. This citation would expand the background information by recognizing the new role of generative AI in the field of biomedical imaging studies and the research-related questions of authorship, accountability, and research governance.
  3. The scope of the manuscript ought to be narrowed down or justifiably explained. The title highlights image-assisted radiation oncology, but sections of the diagnostic section cover widespread AI screening in lung, breast, and pancreatic cancer that do not have a clear connection to radiotherapy processes. It is not clear to me whether the review is meant to capture all oncology imaging AI that can indirectly impact radiation oncology or AI applications directly involved in radiation oncology practice. If the latter is the case, we should reduce certain parts or establish more direct connections to radiotherapy decision-making.
  4. The bibliometric part needs to be reinforced or compressed to some degree. The bibliometric section is potentially useful, but because of the underreported approaches, its present worth is restricted. Please explain how the keywords were normalized, how author-name disambiguation works, how collaboration links were defined, and if the 2025 data are a full year. In the absence of such information, Figures 2-7 might end up being qualitative illustrations and not analytical products. If the authors wish to make the paper clinically oriented, they may transfer some of this information to the supplementary material.
  5. The section on agentic AI and LLMs ought to be less bold and distinctly divided by evidence-based review information. The promise of LLM-enabled orchestration and understandable coordination is duly mentioned in the manuscript, but this part of the review is more speculative as compared to the rest. Considering the known dangers of fabrication, automation bias, and the fact that the field of clinical integration is at an early stage of development, the authors ought to refer more to it as a future-facing viewpoint than as a direction that practice is actually going. A brief explanation of validation needs, safety guardrails, and realistic near-term and long-term roles of LLMs in radiotherapy would enhance equilibrium.
  6. Please add a special limitations section to the review itself. The manuscript addresses AI issues in general, such as data quality, domain shift, explainability, and workflow integration; however, it does not directly address the limitations of this review. Limitations should be mentioned at the review level and should recognize the search of PubMed only, the use of a broad keyword strategy, the potential of missing any study, the narrative nature of the study selection for topic coverage, and the absence of any formal quality appraisal of the included evidence. This step is very important for transparency.
Comments on the Quality of English Language

There are multiple grammatical or stylistic issues throughout the text. A full language edit would improve readability and professionalism.

Author Response

  1. Please specify the review design and ensure that methods are reproducible. It claims to have performed a PubMed search and a so-called systematic bibliometric analysis of the literature published since 2015 and up to 2025, although the description of the method is too brief at this point to allow reproducibility. Please state the complete Boolean search query, search date, inclusion criteria, search strategy, presence of duplicate records, manual or automated screening, bibliometric analysis, and network of authors and keywords in the software or pipeline used. In its written form, the reader is not clearly differentiated between the bibliometric corpus and the studies reviewed in the narrative review sections.

Response: We thank the reviewer for raising the reproducibility concern. The Methods description has been expanded to specify the full search strategy, and we now distinguish the bibliometric corpus (the systematic PubMed dataset analyzed in Section 2) from the narrative review corpus (the studies cited and discussed in Sections 3–10, identified through targeted searches and citation tracking).

The updated Methods description is as follows (beginning of Section 2):

We performed a systematic search of PubMed combining four radiotherapy synonyms (‘radiation therapy’, ‘radiotherapy’, ‘radiation treatment’, ‘radiation oncology’) with four AI synonyms (‘artificial intelligence’, ‘AI’, ‘machine learning’, ‘deep learning’) in Title/Abstract fields, yielding 16 Boolean queries of the form (RT term[Title/Abstract]) AND (AI term[Title/Abstract]). The search was performed via the NCBI E-utilities API, restricted to publication dates between 2015 and 2025, and finalized after the close of the 2025 calendar year. Records were retrieved across all 16 queries and deduplicated by PMID, yielding 2,064 unique publications that constitute the bibliometric corpus analyzed in this section. The 2025 publication count of 451 reflects a complete calendar year. Bibliometric processing—keyword frequency analysis, year-to-year Jaccard similarity, PELT change-point detection, author productivity and collaboration network analysis, and community detection—was performed in Python (NetworkX, ruptures, scipy).
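The query construction and PMID deduplication described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the exact quoting of terms, the function names `build_queries` and `deduplicate_pmids`, and the toy PMIDs are assumptions, and the actual E-utilities request and the 2015–2025 date restriction are omitted.

```python
from itertools import product

# Search terms as listed in the Methods text.
RT_TERMS = ["radiation therapy", "radiotherapy",
            "radiation treatment", "radiation oncology"]
AI_TERMS = ["artificial intelligence", "AI",
            "machine learning", "deep learning"]


def build_queries(rt_terms=RT_TERMS, ai_terms=AI_TERMS):
    """Return one Boolean query per (RT term, AI term) pair.

    The quoting style is an assumption; the text only gives the general
    form (RT term[Title/Abstract]) AND (AI term[Title/Abstract]).
    """
    return [
        f'("{rt}"[Title/Abstract]) AND ("{ai}"[Title/Abstract])'
        for rt, ai in product(rt_terms, ai_terms)
    ]


def deduplicate_pmids(results_per_query):
    """Merge per-query PMID lists, keeping the first occurrence of each PMID."""
    seen = set()
    unique = []
    for pmids in results_per_query:
        for pmid in pmids:
            if pmid not in seen:
                seen.add(pmid)
                unique.append(pmid)
    return unique


queries = build_queries()

# Hypothetical PMIDs, for illustration only (not real search results).
toy_results = [["101", "102"], ["102", "103"], ["101", "104"]]
unique_pmids = deduplicate_pmids(toy_results)
```

Four radiotherapy terms crossed with four AI terms give the 16 queries mentioned in the text. Overlapping hits across queries collapse to a single record per PMID.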

2. The introduction could be made stronger by referring to recent research on generative AI in biomedical imaging research, specifically Hsieh, Shang-Chin, et al. "Generative AI and Scientific Authorship: A New Paradigm for Biomedical Imaging Research." International Journal of Software Science and Computational Intelligence (IJSSCI) 17.1 (2025): 1-25. This source is topical since the article presents AI as a fast-developing power of image-guided radiation oncology and presents issues connected to transparency, application, and responsible use. This citation would expand the background information by recognizing the new role of generative AI in the field of biomedical imaging studies and the research-related questions of authorship, accountability, and research governance.

Response: We thank the reviewer for this helpful suggestion. We agree that the Introduction can be strengthened by acknowledging the emerging role of generative AI in biomedical imaging research and by explicitly linking this development to broader issues of transparency, accountability, reproducibility, and responsible research governance. Accordingly, we have revised the Introduction to include a new statement discussing generative AI in biomedical imaging and image-assisted radiation oncology, with citation to Hsieh et al. We have positioned this addition after the paragraph describing the rapid expansion of AI applications and commercial AI-enabled tools in radiation oncology, and before the paragraph discussing limitations and governance challenges. We believe this addition strengthens the framing of the Introduction, and we appreciate the reviewer's recommendation.

“A particularly important recent development is the emergence of generative AI, including generative adversarial networks, diffusion models, and related foundation-model approaches. These advances have broadened biomedical imaging research from image analysis toward image synthesis, reconstruction, modality translation, data augmentation, and workflow support. In this context, Hsieh et al. emphasized that the growing use of generative AI in biomedical imaging and scientific authorship raises critical questions regarding transparency, disclosure of AI assistance, accountability for generated content, reproducibility of AI-assisted workflows, and research governance. These considerations are essential to the responsible development and reporting of AI-enabled image-assisted radiation oncology research [17].”

3. The scope of the manuscript ought to be narrowed down or justifiably explained. The title highlights image-assisted radiation oncology, but sections of the diagnostic section cover widespread AI screening in lung, breast, and pancreatic cancer that do not have a clear connection to radiotherapy processes. It is not clear to me whether the review is meant to capture all oncology imaging AI that can indirectly impact radiation oncology or AI applications directly involved in radiation oncology practice. If the latter is the case, we should reduce certain parts or establish more direct connections to radiotherapy decision-making.

Response: We thank the reviewer for raising the scope question. We have retained the diagnostic-AI examples and added an explicit justification for their inclusion: diagnostic-stage AI shapes which patients are referred for radiotherapy and at what disease stage, both of which directly affect the technical and biological challenges of subsequent treatment planning. On this basis the topic is relevant to a review of image-assisted radiation oncology.

A new bridge paragraph has been inserted immediately preceding Section 3.1.1:

“Although cancer detection and screening occur upstream of radiation oncology decision-making, they critically influence which patients are referred for radiotherapy and at what disease stage. These factors, in turn, shape the technical and biological demands of subsequent treatment planning. The examples in this section illustrate how AI-driven diagnostic imaging affects the patient population entering radiotherapy workflows, with downstream implications for treatment volume, fractionation, and outcome prediction. Studies of diagnostic AI without direct relevance to radiotherapy are beyond the scope of this review.”

4. The bibliometric part needs to be reinforced or compressed to some degree. The bibliometric section is potentially useful, but because of the underreported approaches, its present worth is restricted. Please explain how the keywords were normalized, how author-name disambiguation works, how collaboration links were defined, and if the 2025 data are a full year. In the absence of such information, Figures 2-7 might end up being qualitative illustrations and not analytical products. If the authors wish to make the paper clinically oriented, they may transfer some of this information to the supplementary material.

Response: We thank the reviewer for the four specific sub-points; all have been addressed. The analytical content of Section 2 has also been strengthened in response to Reviewer 3’s related comments (see below), so that Figures 2–7 are now backed by quantitative network metrics rather than qualitative description.

The following methodological clarifications have been added:

“Author keywords from PubMed records were lowercased and filtered against a stop-list of generic demographic descriptors (humans, male, female, adult, aged, middle aged, young adult, adolescent, child, infant, animals). No further normalization (stemming, synonym merging, or controlled-vocabulary mapping) was applied; we acknowledge as a limitation that this leaves variant forms (e.g., ‘CNN’ / ‘convolutional neural network’, ‘AI’ / ‘artificial intelligence’, ‘auto-segmentation’ / ‘automatic segmentation’) as separate keywords in the analysis.”
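The keyword handling described above (lowercasing plus a stop-list filter, with no synonym merging) can be sketched as follows. This is an illustrative sketch only: the function name `normalize_keywords`, the whitespace stripping, and the toy records are assumptions that do not come from the manuscript.

```python
from collections import Counter

# Stop-list of generic demographic descriptors, as given in the text.
STOP_LIST = {
    "humans", "male", "female", "adult", "aged", "middle aged",
    "young adult", "adolescent", "child", "infant", "animals",
}


def normalize_keywords(keywords):
    """Lowercase each keyword and drop stop-listed or empty entries.

    Stripping surrounding whitespace is an added assumption; the text
    only mentions lowercasing and stop-list filtering.
    """
    cleaned = (kw.strip().lower() for kw in keywords)
    return [kw for kw in cleaned if kw and kw not in STOP_LIST]


# Made-up keyword lists for two hypothetical records.
records = [
    ["Deep Learning", "Humans", "CNN"],
    ["deep learning", "Convolutional Neural Network", "Male"],
]

counts = Counter(kw for rec in records for kw in normalize_keywords(rec))
```

In this toy example `cnn` and `convolutional neural network` remain separate keys, which is the limitation the authors acknowledge for variant forms.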

Authors were identified by the concatenation of first and last names recorded in PubMed, without explicit ORCID-based disambiguation. This may underestimate productivity for authors with name variants and may conflate distinct authors with identical names; we therefore present author productivity and collaboration analyses as descriptive bibliometric trends rather than precise individual rankings.

Collaboration ties in the network of Figure 7 were defined as co-authorship on the same publication, with edge weights equal to the number of jointly authored papers. The visualization is restricted to the 70 authors with ≥10 publications to keep the network legible. The 2025 portion of the corpus covers a complete calendar year (n = 451 publications), since the search was finalized after the close of 2025. Figures 2, 5, 6, and 7 are retained from the original submission, but the methodology and analytical context that support them are now described directly in Section 2.
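The co-authorship definition above (one tie per pair of co-authors, weighted by the number of joint papers, then restricted to productive authors) can be sketched with the standard library alone. The function names, the placeholder author names, and the toy threshold of 2 are assumptions for illustration; the text uses a threshold of 10 publications, and the authors report using NetworkX rather than this code.

```python
from collections import Counter
from itertools import combinations


def coauthor_edge_weights(papers):
    """Count jointly authored papers for every unordered author pair.

    `papers` is a list of author-name lists; names act as identifiers,
    mirroring the name-based (non-ORCID) identification in the text.
    """
    weights = Counter()
    for authors in papers:
        # set() counts each author once per paper; sorted() gives each
        # pair a canonical (a, b) order so (a, b) and (b, a) coincide.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


def restrict_to_productive(papers, weights, min_papers=10):
    """Keep only edges whose two endpoints each have >= min_papers papers."""
    paper_counts = Counter(a for authors in papers for a in set(authors))
    keep = {a for a, n in paper_counts.items() if n >= min_papers}
    return {pair: w for pair, w in weights.items()
            if pair[0] in keep and pair[1] in keep}


# Placeholder author names, not taken from the corpus.
papers = [
    ["A. One", "B. Two", "C. Three"],
    ["A. One", "B. Two"],
    ["B. Two", "D. Four"],
]

weights = coauthor_edge_weights(papers)
core = restrict_to_productive(papers, weights, min_papers=2)
```

Because names are the only identifier here, two distinct people with the same name would be merged into one node, which is the conflation risk noted in the response.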

5. The section on agentic AI and LLMs ought to be less bold and distinctly divided by evidence-based review information. The promise of LLM-enabled orchestration and understandable coordination is duly mentioned in the manuscript, but this part of the review is more speculative as compared to the rest. Considering the known dangers of fabrication, automation bias, and the fact that the field of clinical integration is at an early stage of development, the authors ought to refer more to it as a future-facing viewpoint than as a direction that practice is actually going. A brief explanation of validation needs, safety guardrails, and realistic near-term and long-term roles of LLMs in radiotherapy would enhance equilibrium.

Response: The authors respectfully disagree with the reviewer's comments on the agentic AI section. The section is neither overly bold nor does it promise any benefits of agentic AI; rather, we explicitly use the word "potential" to describe possible future enhancements built upon deep learning. Furthermore, these potential enhancements are directly supported by the citations provided: one is a direct example of agentic AI orchestration with MIM autocontouring and plan evaluation, and others describe the challenges still to come. However, we strongly agree that this section is better placed in the "future perspectives and conclusions" section, and we have moved it there. Thank you.

6. Please add a special limitations section to the review itself. The manuscript addresses AI issues in general, such as data quality, domain shift, explainability, and workflow integration; however, it does not directly address the limitations of this review. Limitations should be mentioned at the review level and should recognize the search of PubMed only, the use of a broad keyword strategy, the potential of missing any study, the narrative nature of the study selection for topic coverage, and the absence of any formal quality appraisal of the included evidence. This step is very important for transparency.

Response: Thank you for the suggestion. We have added a paragraph at the end of the review:

“This review focused specifically on AI and radiomics applications within radiation oncology where imaging plays a central role. Given the breadth and rapidly evolving nature of this field, several limitations should be acknowledged. First, the bibliometric analysis relied solely on PubMed, potentially omitting relevant studies indexed in other databases. Also, while we used the most relevant and popular keywords (as detailed in Section 2), this strategy may not have captured all pertinent studies. Second, the narrative review in Sections 3–10 prioritized conceptual clarity and representative examples (selected via expert-driven searches and citation tracking) rather than systematic exhaustiveness. Finally, no formal quality appraisal of the included studies was conducted. Despite these limitations, this review provides a comprehensive overview of AI applications in radiation oncology and aims to inform clinicians and researchers about key developments and future directions in the field.”

Comments on the Quality of English Language

There are multiple grammatical or stylistic issues throughout the text. A full language edit would improve readability and professionalism.

Response: We thank the reviewer for this helpful comment. The manuscript has been carefully revised to address grammatical and stylistic issues. We have also tried to improve sentence structure and flow to enhance readability and ensure a more consistent academic tone.

Reviewer 3 Report

Comments and Suggestions for Authors

The bibliometric analysis section provides a descriptive overview but lacks methodological transparency. The division into four temporal phases appears arbitrary and is not justified. It is unclear whether these cutoffs are data-driven or chosen for convenience. Providing a rationale, such as clustering analysis or publication inflection points, would strengthen the credibility of this segmentation. Additionally, the keyword evolution discussion remains superficial. The text lists dominant terms in each period but does not interpret why these shifts occurred or how they relate to technological or clinical milestones. For example, the rise of convolutional neural networks could be linked to improvements in GPU computing and large annotated datasets, which would provide useful context.

The author collaboration analysis is descriptive but does not extract meaningful insights. The statement that top authors contribute a small proportion of papers is expected in most scientific fields and does not add much value without comparison to other domains. Including network metrics such as centrality, clustering coefficients, or collaboration indices would provide a more rigorous interpretation. The discussion of collaboration clusters reads more like a narrative summary rather than an analysis. It would be beneficial to explain how these clusters influence innovation, knowledge dissemination, or clinical translation.

There is no clear link between the earlier keyword analysis and identified research hotspots section. Strengthening this connection would improve coherence. Additionally, the section lacks critical perspective. For example, automatic segmentation is described as highly successful, but no mention is made of known limitations such as generalizability across institutions, contouring variability, or regulatory barriers.

In the cancer detection subsection, the discussion is generally accurate but somewhat unbalanced. The cited studies are presented in a positive light without addressing limitations. For instance, the claim that deep learning models surpassed radiologists in lung cancer detection should be contextualized with known concerns such as dataset selection, retrospective evaluation, and lack of real-world validation. Similarly, performance metrics like reductions in false positives and negatives should be accompanied by confidence intervals or study conditions to avoid overgeneralization.

The radiomics section contains several technical inaccuracies and conceptual oversimplifications. The statement that traditional radiomics relies on “subjective” features is misleading, as most radiomic features are mathematically defined and reproducible, although sensitive to preprocessing. The distinction between traditional radiomics and deep learning should be clarified more carefully. Furthermore, reported accuracy ranges such as 80–90% for brain tumor diagnosis are presented without context regarding datasets, task definitions, or validation strategies, which reduces their interpretability.

The discussion of radiogenomics and prognostic modeling is promising but lacks critical depth. Associations between imaging features and gene expression are mentioned without addressing reproducibility issues, small cohort sizes, and lack of standardization, which are well-known challenges in the field. Similarly, claims that radiomics can outperform clinical factors should be supported with stronger evidence or tempered to avoid overstatement.

The statement that deep learning has “overcome” the limitation of filtered back projection is too strong. Deep learning methods mitigate artifacts but introduce new concerns such as hallucination, lack of interpretability, and dependence on training data. A more balanced discussion is needed. The explanation of reconstruction as shifting from inverse problems to restoration tasks is conceptually interesting but should be clarified, as most modern approaches still incorporate elements of both.

The 4D imaging section is generally coherent, but some claims require clarification. For example, reporting gamma pass rates without specifying evaluation protocols or comparison baselines limits interpretability. The discussion of transformer-based models is relevant, but the explanation of why they outperform CNNs in this context remains vague. More emphasis on the challenges of temporal consistency and motion modeling would improve depth.

Describing image synthesis as a paradigm shift may be justified, but the argument would be stronger with concrete clinical adoption examples. The MRI-to-CT synthesis subsection is informative, though the claim that MRI-guided radiotherapy “has been emerged” should be corrected for grammar and clarity. Reported error ranges from challenges should be contextualized with clinical thresholds for acceptability.

The CBCT-to-CT synthesis discussion appropriately highlights diffusion models, but the explanation assumes familiarity with these techniques. A brief clarification of why diffusion models outperform GANs in this context would improve accessibility. The limitation regarding single-institution datasets is well noted, though it could be expanded to include domain shift and scanner variability.

The truncated image synthesis subsection introduces several methods but lacks a unifying narrative. The progression from GAN-based to diffusion-based approaches is implied but not explicitly articulated. Additionally, references to challenges such as MICCAI should be described more clearly for readers unfamiliar with them.

The idea that radiomics or functional imaging can reliably guide daily adaptation is still largely investigational. There are significant challenges related to temporal stability of imaging biomarkers, noise in longitudinal measurements, and unclear thresholds for clinical decision-making. The text would benefit from clarifying that most current implementations are exploratory rather than standard clinical practice. Additionally, the description of linking imaging signals to “objective setting and gating of re-optimization” is vague and should be explained more concretely.

The table summarizing AI-powered image assistance is useful, but it simplifies complex processes to the point of being somewhat misleading. For example, MRI-to-sCT synthesis is described as producing “planning-quality” images and eliminating registration uncertainties, which may not hold across all anatomical sites or scanners. Similarly, CBCT-to-sCT approaches are described as enabling dose calculation without noting residual uncertainties in HU accuracy. It would be helpful to include limitations or confidence levels alongside each application.

The patient-specific QA section is one of the stronger parts of the manuscript, but it still tends to emphasize positive findings. The concept of “virtual IMRT QA” is introduced clearly, yet the limitations are somewhat underdeveloped. The discussion correctly mentions concerns about gamma criteria and interpretability, but it could go further by addressing issues such as dataset bias, sensitivity to rare failure modes, and medico-legal implications of reducing measurement-based QA. Some reported workload reduction percentages appear optimistic and should be contextualized with implementation details.

The treatment toxicity section is well structured and highlights an important application area. However, the critique of traditional NTCP models is somewhat oversimplified. While these models have limitations, they are still clinically validated and widely used. The comparison with AI-based approaches should be more balanced, particularly given that many AI models lack external validation and standardization.

The challenges and limitations section is an important component of the manuscript, but it remains relatively high-level and somewhat generic. Many points, such as data heterogeneity and lack of interpretability, are well known and could be discussed in more depth. For example, the discussion of domain shift could include specific mitigation strategies such as federated learning or domain adaptation. The statement that “there are no guidelines” for AI should be revised to reflect the evolving regulatory landscape.

The regulatory and ethical considerations section is appropriate but somewhat superficial. Issues such as algorithmic bias and automation bias are mentioned, yet there is little discussion of how these risks can be mitigated in practice. Including concrete examples or existing frameworks would strengthen this section.

The multi-omics paragraph is relevant but somewhat repetitive and loosely connected to the rest of the section. While the potential is clear, the practical challenges of integrating heterogeneous data, such as data harmonization and interpretability, are not discussed.

Author Response

  1. The bibliometric analysis section provides a descriptive overview but lacks methodological transparency. The division into four temporal phases appears arbitrary and is not justified. It is unclear whether these cutoffs are data-driven or chosen for convenience. Providing a rationale, such as clustering analysis or publication inflection points, would strengthen the credibility of this segmentation. Additionally, the keyword evolution discussion remains superficial. The text lists dominant terms in each period but does not interpret why these shifts occurred or how they relate to technological or clinical milestones. For example, the rise of convolutional neural networks could be linked to improvements in GPU computing and large annotated datasets, which would provide useful context.

Response: We appreciate this critique. We have retained the four phases and now ground each boundary explicitly in a combination of data-driven and narrative analyses, with an expanded discussion of why each transition occurred. Section 2.3.2 has been expanded to include the PELT change-point analysis, the year-to-year Jaccard similarity, the per-phase distinctive-keyword analysis, and the technical/clinical context for each transition. Figure 5 (keyword evolution wordclouds) is retained, now supported by the underlying methodology described directly in the text.
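The year-to-year Jaccard similarity mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: how the authors built each year's keyword set (all keywords or only the most frequent) is not specified here, the function names and toy keyword sets are hypothetical, and returning 1.0 for two empty sets is a convention chosen for this sketch. The PELT change-point step is not shown.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    # Convention for this sketch: two empty sets count as identical.
    return len(a & b) / len(union) if union else 1.0


def year_to_year_jaccard(keywords_by_year):
    """Return {(year, next_year): similarity} for consecutive years."""
    years = sorted(keywords_by_year)
    return {
        (y0, y1): jaccard(keywords_by_year[y0], keywords_by_year[y1])
        for y0, y1 in zip(years, years[1:])
    }


# Toy keyword sets, invented for illustration only.
toy = {
    2015: {"imrt", "radiomics", "svm"},
    2016: {"imrt", "radiomics", "deep learning"},
    2017: {"deep learning", "cnn", "segmentation"},
}

sims = year_to_year_jaccard(toy)
```

In the toy data the similarity drops from 0.5 to 0.2 between consecutive year pairs. A drop of this kind is the sort of vocabulary shift that could support placing a phase boundary.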

 

  2. The author collaboration analysis is descriptive but does not extract meaningful insights. The statement that top authors contribute a small proportion of papers is expected in most scientific fields and does not add much value without comparison to other domains. Including network metrics such as centrality, clustering coefficients, or collaboration indices would provide a more rigorous interpretation. The discussion of collaboration clusters reads more like a narrative summary rather than an analysis. It would be beneficial to explain how these clusters influence innovation, knowledge dissemination, or clinical translation.

Response: We thank the reviewer for these specific methodological suggestions. Section 2.4 has been rewritten to be evidence-driven, addressing each of the reviewer’s points in turn:

  • Productivity comparison via Lotka’s Law. The original “top 40 authors contribute a small proportion” framing has been replaced with a quantitative comparison against Lotka’s Law, the standard scientometric reference for productivity distributions. (Section 2.4.1)
  • Quantitative network metrics. We computed and report (network restricted to the 70 authors with ≥10 publications, after removing 2 isolates: 68 nodes, 225 edges):

Metric | Value | Interpretation
Density | 0.099 | Sparse network; mean degree 6.6
Average clustering coefficient | 0.593 | 6.0× the random-graph baseline; strong community organization
Global transitivity | 0.500 |
Modularity Q | 0.745 | Exceeds the conventional Q ≥ 0.7 threshold for “very strong” community structure

(Section 2.4.2)

  • Centrality-based interpretation of collaboration clusters. Three centrality measures (degree, betweenness, eigenvector) surface complementary roles. Eigenvector centrality is highest for Xiaofeng Yang (0.465), Tian Liu (0.446), and Yang Lei (0.429), the densely interconnected core of one of the largest communities. Betweenness centrality highlights bridge authors who connect otherwise separate communities—these broker roles are critical for cross-pollination of methods across clinical application areas.

Inspection of the 10 detected communities shows that they map cleanly onto recognized PI-led research groups. The eight largest communities are anchored by: an outcome-modeling cluster centered on Issam El Naqa, Andre Dekker, and Gilmer Valdes (n=14); the LMU Munich MR-guided RT group around Claus Belka, Luca Boldrini, and Stefanie Corradini (n=14); the MD Anderson head-and-neck cluster around Clifton Fuller, Kareem Wahid, and Laurence Court (n=9); the Emory deep-learning group around Xiaofeng Yang, Tian Liu, and Yang Lei (n=8); a Duke knowledge-based-planning cluster around Yang Sheng, Q. Jackie Wu, and Fang-Fang Yin (n=7); the UT Southwestern deep-learning cluster around Steve Jiang, Dan Nguyen, and Jing Wang (n=6); a Stanford methodology cluster around Lei Xing and Wei Zhao (n=3); and a Dana-Farber/MGH radiomics cluster around Hugo J. W. L. Aerts, Raymond H. Mak, and Danielle S. Bitterman (n=3). (Section 2.4.2)

  • Linking community structure to innovation/knowledge dissemination. We added the following interpretive paragraph at the end of Section 2.4.2:

“Together, these metrics characterize the field as highly clustered and strongly modular, with a small set of broker authors connecting otherwise locally tight research communities. The high modularity (Q = 0.745) raises a practical concern: while internal community cohesion supports rapid methodological iteration within groups, the relatively low cross-community density may slow the diffusion of new methods across institutions, potentially contributing to the well-documented difficulty of reproducing imaging-AI results across data domains.”
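
As a quick consistency check on the network metrics reported in this response, the density and mean degree can be recomputed from the stated node and edge counts (68 nodes, 225 edges). The sketch below is illustrative only. It assumes an undirected simple graph, and it assumes that the "random-graph baseline" is an Erdős–Rényi graph of the same density, whose expected clustering coefficient is approximately equal to that density. The response does not state which baseline was used, so that part is an assumption.

```python
def density(n_nodes: int, n_edges: int) -> float:
    """Fraction of possible undirected edges that are present."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def mean_degree(n_nodes: int, n_edges: int) -> float:
    """Average number of edges per node in an undirected graph."""
    return 2 * n_edges / n_nodes


# Values taken from the author response.
N_NODES, N_EDGES = 68, 225
AVG_CLUSTERING = 0.593

net_density = density(N_NODES, N_EDGES)
net_mean_degree = mean_degree(N_NODES, N_EDGES)

# Assumption: baseline = Erdos-Renyi graph with the same density, for which
# the expected clustering coefficient is approximately the density itself.
clustering_ratio = AVG_CLUSTERING / net_density

print(f"density          = {net_density:.3f}")
print(f"mean degree      = {net_mean_degree:.1f}")
print(f"clustering ratio = {clustering_ratio:.1f}x")
```

Under these assumptions the recomputed values agree with the reported density, mean degree, and clustering ratio to the stated precision. Modularity and centrality cannot be checked this way because they require the full edge list.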

 

  1. There is no clear link between the earlier keyword analysis and the identified research hotspots section. Strengthening this connection would improve coherence. Additionally, the section lacks critical perspective. For example, automatic segmentation is described as highly successful, but no mention is made of known limitations such as generalizability across institutions, contouring variability, or regulatory barriers.

Response: We agree on both counts. Section 2.5 has been rewritten so that each hotspot is anchored in the quantitative keyword analysis of Section 2.3, and a “remaining limitations” sentence has been added to each.

Changes in manuscript (Section 2.5). Each hotspot is now introduced with an explicit reference to its keyword-frequency rank, its Phase 2 vs. Phase 3 trajectory, and a paired limitations note. For example, the automatic-segmentation hotspot now reads:

“(a) Automatic Segmentation. Among the most widely applied areas, with combined keyword frequencies for ‘segmentation’, ‘auto-segmentation’, and ‘organ at risk’ growing from rank 14 in Phase 1 to rank 7 in Phase 3 (n=80 in Phase 3 alone). Deep learning models have achieved substantial progress in automated delineation of OARs and tumors. Clinical translation, however, remains uneven: cross-institutional generalization, contouring variability when target boundaries are ambiguous (e.g., GTV in head-and-neck), the absence of harmonized regulatory pathways for adaptive segmentation tools, and limited prospective dosimetric validation continue to limit routine adoption.”

Analogous limitations sentences have been added to the radiomics, dose prediction, image reconstruction, and treatment-response prediction hotspots.

 

  1. In the cancer detection subsection, the discussion is generally accurate but somewhat unbalanced. The cited studies are presented in a positive light without addressing limitations. For instance, the claim that deep learning models surpassed radiologists in lung cancer detection should be contextualized with known concerns such as dataset selection, retrospective evaluation, and lack of real-world validation. Similarly, performance metrics like reductions in false positives and negatives should be accompanied by confidence intervals or study conditions to avoid overgeneralization.

Response: We thank the reviewer for pushing for a more balanced presentation. The language describing each cited diagnostic-AI study has been revised to include dataset, validation strategy, and prospective-deployment caveats.

Section 3.1.1 has been revised as below:

  1. Ardila et al. (lung cancer). This reference has an associated correction. As suggested by MDPI, it has been replaced with a review paper by Thanoon et al., “A Review of Deep Learning Techniques for Lung Cancer Screening and Diagnosis Based on CT Images,” Diagnostics, 2023. 13(16): p. 2617. The potential and limitations in this field are now summarized as follows: “Thanoon et al. [20] reviewed the DL techniques for lung cancer screening and diagnosis based on CT images, demonstrating the capability of three-dimensional image analysis for automated nodule identification and malignancy risk stratification. This advancement from abnormality detection to risk assessment has important implications for optimizing subsequent diagnostic and therapeutic decisions. However, the authors also raised concerns regarding the use of heterogeneous datasets for model training, emphasizing that careful preprocessing – including image standardization, annotation consistency and cohort balancing – is essential to achieve precise and effective lung cancer screening.”
  2. McKinney et al. (breast cancer). The original text reporting “reductions in false-positive rates by 5.7% and false-negative rates by 9.4%” has been disaggregated by dataset: “reductions in false-positive rates of 5.7% (US dataset) and 1.2% (UK dataset), and in false-negative rates of 9.4% (US) and 2.7% (UK), in independent retrospective evaluations […] though the cross-cohort variation underscores the well-known generalization challenge of screening AI, and these gains have not yet been confirmed in prospective trials.”
  3. Placido et al. (pancreatic cancer). Added: “The model was trained on Danish national-registry data and showed reduced performance when transferred to US Veterans Affairs data, illustrating both the promise of trajectory-based AI for early detection and the substantial domain-shift challenge of cross-health-system deployment.”

 

  1. The radiomics section contains several technical inaccuracies and conceptual oversimplifications. The statement that traditional radiomics relies on “subjective” features is misleading, as most radiomic features are mathematically defined and reproducible, although sensitive to preprocessing. The distinction between traditional radiomics and deep learning should be clarified more carefully. Furthermore, reported accuracy ranges such as 80–90% for brain tumor diagnosis are presented without context regarding datasets, task definitions, or validation strategies, which reduces their interpretability.

Response: We thank the reviewer for catching this technical inaccuracy—“subjective features” is indeed misleading. The framing has been corrected, and the 80–90% accuracy claim has been re-contextualized.

Changes in manuscript (Section 3.1.2). The relevant paragraph has been rewritten as follows:

“Traditional radiomics relies on hand-crafted features that are mathematically defined: shape descriptors (e.g., compactness, sphericity), first-order intensity statistics, and second-order texture features computed from gray-level co-occurrence and run-length matrices. These features are reproducible in principle, but their values are sensitive to image acquisition parameters and preprocessing choices (resampling, intensity quantization, ROI delineation), which has motivated harmonization efforts such as the Image Biomarker Standardisation Initiative (IBSI) [Zwanenburg et al., Radiology 2020]. Deep learning, by contrast, learns task-specific representations from the images themselves, enabling discovery of patterns that hand-crafted descriptors may miss, at the cost of reduced interpretability and much higher data requirements.

A review by Kocher et al. summarized DL-enhanced radiomics studies for brain tumor diagnosis with reported accuracies generally in the 80-90% range across heterogeneous tasks (e.g., glioma grading, IDH mutation prediction, MGMT methylation status), datasets (single-center vs. multi-center, sample sizes ranging from tens to hundreds of patients), and validation strategies (k-fold cross-validation vs. external test set). Direct comparison across these reported accuracies is not meaningful, and the figures should be interpreted as a proxy for active research progress rather than head-to-head performance.”

 

  1. The discussion of radiogenomics and prognostic modeling is promising but lacks critical depth. Associations between imaging features and gene expression are mentioned without addressing reproducibility issues, small cohort sizes, and lack of standardization, which are well-known challenges in the field. Similarly, claims that radiomics can outperform clinical factors should be supported with stronger evidence or tempered to avoid overstatement.

Response: We accept both critiques. The radiomics-versus-clinical-factors language has been softened, and a sentence on reproducibility limitations has been added.

Changes in manuscript (Section 3.1.2):

“Studies have reported associations between radiomics features and tumor gene expression profiles [26-28] under the concept of ‘radiogenomics,’ highlighting the potential for non-invasive tumor molecular typing. Radiomic features have also demonstrated prognostic value, with performance comparable to and complementary to traditional clinical factors [29,30]; the most robust signals to date come from combined radiomics-clinical nomograms rather than radiomics alone. Radiogenomic and prognostic radiomics studies do, however, face well-documented reproducibility challenges—small single-institution cohorts, the lack of standardized feature-extraction pipelines, sensitivity to scanner and acquisition variability, and limited independent validation—which currently limit clinical translation [29]. These advances position radiomics as a promising research tool for precision oncology, but they underscore the need for harmonization, prospective validation, and transparent reporting.”

 

  1. The statement that deep learning has “overcome” the limitation of filtered back projection is too strong. Deep learning methods mitigate artifacts but introduce new concerns such as hallucination, lack of interpretability, and dependence on training data. A more balanced discussion is needed. The explanation of reconstruction as shifting from inverse problems to restoration tasks is conceptually interesting but should be clarified, as most modern approaches still incorporate elements of both.

Response: Both points are well taken. The “overcome” language has been softened, and caveats about hallucination, interpretability, and training-distribution dependence have been added. The inverse-problem-versus-restoration discussion has been reframed to acknowledge that modern DL reconstruction methods typically retain elements of both formulations.

Changes in manuscript (Section 3.2.1):

“Traditional filtered back projection (FBP) algorithms produce severe streak artifacts under sparse sampling conditions. The emergence of DL methods has substantially mitigated this limitation, although new concerns have arisen, including hallucination of plausible but non-existent anatomical structures, loss of interpretability of the reconstruction process, and dependence on training-distribution coverage. Distribution shifts at inference time (e.g., scanner change, anatomy outside the training distribution) can degrade reconstruction quality in ways that are not always evident from the output image alone [Antun et al., PNAS 2020]. […]

The DD-Net method proposed by Zhang et al. reframed sparse-view reconstruction as an image-domain restoration task applied to an initial FBP reconstruction [35]. This ‘reconstruct first, then restore’ strategy combines physical reconstruction with a data-driven prior. More broadly, modern DL-based CT reconstruction approaches retain elements of both formulations: pure image-domain restoration (post-processing of FBP), pure projection-domain learning, and hybrid model-based or unrolled approaches that explicitly enforce the imaging physics. The conceptual contribution of this line of work is therefore not to replace inverse-problem formulations with restoration, but to allow data-driven priors to complement physics-based forward models.”

We also identified and corrected a typographical inconsistency in this subsection: the original text described the method of Liu et al. [36] as “geometry-aware attention learning”, whereas the actual paper title (and ref [36]) is “Geometry-aware attenuation learning”. This has been corrected.

 

  1. The 4D imaging section is generally coherent, but some claims require clarification. For example, reporting gamma pass rates without specifying evaluation protocols or comparison baselines limits interpretability. The discussion of transformer-based models is relevant, but the explanation of why they outperform CNNs in this context remains vague. More emphasis on the challenges of temporal consistency and motion modeling would improve depth.

Response: We thank the reviewer for these clarifications. The gamma pass rate is now reported alongside the held-out test set context and the conventional clinical acceptance threshold, and the transformer-versus-CNN explanation has been made explicit.

Changes in manuscript (Section 3.2.2). For the Thummerer et al. study:

“Using deep CNNs to learn the mapping relationship between sparse CBCT and high-quality CT, they achieved mean absolute error within 50 HU and 3%/3mm gamma pass rates above 90% on a held-out test set, computed against high-quality 4D-CT reference plans. For clinical context, routine acceptance thresholds for proton dose calculation typically require ≥95% pass rate at 3%/3mm; the reported performance therefore demonstrates feasibility but suggests further refinement is needed before unsupervised clinical deployment, particularly for high-dose-gradient regions and motion-management–critical anatomy.”
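
To make the 3%/3mm criterion in the passage above concrete, the following is a minimal one-dimensional sketch of a global gamma pass-rate calculation. It is not the evaluation protocol of the cited study. It assumes global normalization to the reference maximum, a 10% low-dose cutoff, a uniform grid, and no sub-voxel interpolation, and the function name and parameters are hypothetical. Clinical tools operate on 3D dose grids with interpolation.

```python
import math


def gamma_pass_rate_1d(ref, ev, spacing_mm,
                       dose_frac=0.03, dist_mm=3.0, cutoff_frac=0.10):
    """Fraction of reference points with gamma <= 1 (global normalization).

    ref, ev    : dose profiles sampled on the same uniform 1D grid
    spacing_mm : grid spacing in millimetres
    """
    ref_max = max(ref)
    dose_tol = dose_frac * ref_max          # e.g. 3% of the maximum dose
    gammas = []
    for i, d_ref in enumerate(ref):
        if d_ref < cutoff_frac * ref_max:   # skip low-dose points
            continue
        # Minimum combined distance/dose discrepancy over evaluated points.
        best = min(
            ((i - j) * spacing_mm / dist_mm) ** 2
            + ((d_ev - d_ref) / dose_tol) ** 2
            for j, d_ev in enumerate(ev)
        )
        gammas.append(math.sqrt(best))
    return sum(g <= 1.0 for g in gammas) / len(gammas)


# Hypothetical Gaussian dose profile on a 1 mm grid (sigma = 10 mm).
xs = range(-40, 41)
reference = [math.exp(-x * x / 200.0) for x in xs]
shifted = [reference[-1]] + reference[:-1]   # same profile moved by 1 mm
scaled = [1.5 * d for d in reference]        # 50% overdose everywhere

rate_identical = gamma_pass_rate_1d(reference, reference, spacing_mm=1.0)
rate_shifted = gamma_pass_rate_1d(reference, shifted, spacing_mm=1.0)
rate_scaled = gamma_pass_rate_1d(reference, scaled, spacing_mm=1.0)
print(rate_identical, rate_shifted, rate_scaled)
```

In this toy setup a 1 mm shift stays within the 3 mm distance tolerance, while a 50% dose scaling fails around the peak. This shows why the choice of normalization, cutoff, and reference distribution needs to be reported alongside any pass rate.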

For the Cao et al. (MBST) discussion:

“To address the limited spatiotemporal modeling capacity of conventional CNNs—which rely on local convolutional kernels and have difficulty capturing long-range dependencies across the temporal dimension of the 4D volume— Cao et al. [42] introduced a Mask-Based Swin Transformer network (MBST). Window-based attention enables the model to relate distant respiratory phases without inflating receptive fields through deeper convolutional stacks, which is particularly relevant for reconstructing motion boundaries where information at one phase is informative for another. […] Open challenges remain in enforcing temporal consistency across phases (avoiding flickering artifacts) and in disentangling intra-fraction motion from underlying anatomical change.”

 

 

  1. Describing image synthesis as a paradigm shift may be justified, but the argument would be stronger with concrete clinical adoption examples. The MRI-to-CT synthesis subsection is informative, though the claim that MRI-guided radiotherapy “has been emerged” should be corrected for grammar and clarity. Reported error ranges from challenges should be contextualized with clinical thresholds for acceptability.

Response: We thank the reviewer for the three specific suggestions. All three have been addressed: concrete clinical-adoption examples have been added, the grammar corrected, and the SynthRAD HU MAE values placed in the context of clinical thresholds.

Changes in manuscript:

Section 3.3 opening paragraph:

“Image synthesis represents a meaningful methodological shift in radiation therapy workflow—from passively accepting limitations of available imaging to actively generating clinically relevant information. Concrete clinical-adoption examples include MR-only prostate radiotherapy on the Elekta Unity and ViewRay/MRIdian MR-Linac systems, and CBCT-based synthetic-CT workflows on the Varian Ethos system, both of which have entered routine clinical use at multiple institutions. From MRI-only workflows to on-line adaptive therapy, image synthesis is increasingly reshaping the boundaries of modern radiotherapy practice.”

Section 3.3.1 opening sentence (grammar):

“Over the past decade, MRI-guided radiotherapy has emerged as a clinically established approach to improve treatment precision and outcomes for selected anatomical sites.”

Section 3.3.1 SynthRAD context:

“According to the official challenge leaderboard, top-performing methods achieved approximately 16-28 HU Mean Absolute Error (MAE) for pelvis and brain regions respectively. For clinical context, dose calculation on synthetic CT is typically considered acceptable when the resulting dose distribution agrees with reference CT-based dose to within ~1% in target dose metrics and within clinical OAR tolerances, which corresponds approximately to soft-tissue MAE thresholds of 30-50 HU and tighter requirements in bone. The reported pelvis MAE therefore approaches clinically acceptable performance, while accuracy in regions with metal implants, air-tissue interfaces, and outside-FOV structures remains a known challenge.”
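
For readers unfamiliar with the HU MAE metric quoted above, a minimal sketch of a masked mean absolute error is given below. The flattened voxel lists, the body mask, and the function name are hypothetical. The challenge's own evaluation code may differ in masking and preprocessing.

```python
def masked_mae_hu(synthetic_ct, reference_ct, mask):
    """Mean absolute HU difference over voxels where mask is truthy."""
    errors = [
        abs(s - r)
        for s, r, m in zip(synthetic_ct, reference_ct, mask)
        if m
    ]
    if not errors:
        raise ValueError("mask selects no voxels")
    return sum(errors) / len(errors)


# Hypothetical flattened HU values; the third voxel (air) is outside the body mask.
sct = [0.0, 40.0, -1000.0, 100.0]
ct = [10.0, 20.0, -1000.0, 60.0]
body = [1, 1, 0, 1]

mae = masked_mae_hu(sct, ct, body)
print(f"MAE = {mae:.1f} HU")
```

Restricting the average to a body mask matters because large volumes of perfectly predicted air would otherwise pull the MAE down and mask soft-tissue or bone errors.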

 

  1. The CBCT-to-CT synthesis discussion appropriately highlights diffusion models, but the explanation assumes familiarity with these techniques. A brief clarification of why diffusion models outperform GANs in this context would improve accessibility. The limitation regarding single-institution datasets is well noted, though it could be expanded to include domain shift and scanner variability.

Response: We appreciate the request for clarity on both points. The diffusion-versus-GAN comparison now includes an explicit explanation of the underlying mechanism, and the domain-shift discussion has been expanded.

Changes in manuscript (Section 3.3.2):

“Recently, diffusion models, particularly Denoising Diffusion Probabilistic Models (DDPMs), have emerged as a promising alternative to GAN-based approaches for CBCT-to-sCT synthesis. Compared with adversarial training in GANs, the iterative denoising objective of diffusion models avoids mode collapse and provides more stable optimization, which in practice translates into better preservation of fine anatomical detail and more faithful modeling of tissue-specific HU distributions. The trade-off is significantly higher inference cost (multiple sampling steps per image), although recent latent-space and consistency-model variants have begun to close this gap. […]

Despite these advances, most current CBCT-to-sCT diffusion models are trained and validated on single-institution datasets, leaving domain shift across CBCT scanner vendors, kVp/mAs settings, scatter-correction algorithms, anatomical sites, and patient body-habitus distributions as a major open problem. Multi-institutional federated training and explicit domain-adaptation strategies represent active research directions but have not yet produced widely benchmarked solutions.”
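
The "iterative denoising" and "multiple sampling steps" mentioned in the quoted passage can be illustrated with the forward noising process of a DDPM. The sketch below assumes a linear noise schedule with 1000 steps and β ranging from 1e-4 to 0.02, which are commonly used defaults and are not taken from the reviewed manuscript. All function names are hypothetical, and no neural network is included.

```python
import math


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_image(x0, alpha_bar_t, eps):
    """Closed-form forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * e for x, e in zip(x0, eps)]


betas = linear_beta_schedule()
alpha_bar = cumulative_alpha_bar(betas)

# Remaining signal fraction at a few timesteps; it decays toward zero.
for t in (0, 249, 499, 999):
    print(t + 1, round(alpha_bar[t], 5))
```

During training, a network learns to predict the noise `eps` from `x_t`, which is a regression objective rather than an adversarial game. At inference the process is reversed step by step, which is where the extra cost relative to a single GAN forward pass comes from.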

 

  1. The truncated image synthesis subsection introduces several methods but lacks a unifying narrative. The progression from GAN-based to diffusion-based approaches is implied but not explicitly articulated. Additionally, references to challenges such as MICCAI should be described more clearly for readers unfamiliar with them.

Response: We thank the reviewer for asking us to make this progression explicit. The methodological progression is now stated directly, and the vague “MICCAI 2024” reference has been removed. In its place we use a more measured framing that acknowledges diffusion-based outpainting as an emerging direction without overclaiming an established literature in the truncation-correction setting.

Changes in manuscript (Section 3.3.3):

“Building upon these adversarial foundations, later approaches incorporated explicit anatomical guidance to improve boundary fidelity […]. The methodological evolution in this subfield has been dominated by adversarial frameworks (Comp-GAN, SC-GAN, CycleGAN-based approaches), which demonstrated proof of concept for joint synthesis-and-extrapolation but face training-stability and peripheral-fidelity limitations. More recently, diffusion-based outpainting has been explored in adjacent medical-image-synthesis settings and represents a natural extension, although peer-reviewed studies specifically applying diffusion outpainting to radiotherapy CBCT/MR truncation correction remain limited at the time of writing. For both GANs and diffusion models, key open challenges remain: validation on truly out-of-distribution truncation patterns, dosimetric impact characterization, and prospective clinical evaluation.”

 

  1. The idea that radiomics or functional imaging can reliably guide daily adaptation is still largely investigational. There are significant challenges related to temporal stability of imaging biomarkers, noise in longitudinal measurements, and unclear thresholds for clinical decision-making. The text would benefit from clarifying that most current implementations are exploratory rather than standard clinical practice. Additionally, the description of linking imaging signals to “objective setting and gating of re-optimization” is vague and should be explained more concretely.

Response: We thank the reviewer for the suggestion. We have clarified the text accordingly and added references to clinical trials that leverage biological information for adaptation.

“Practical near-term implementations focus on incorporating biology-informed signals to trigger adaptive therapy and guide target contour adjustments. Currently, biology-aware adaptation remains largely in the clinical trial and exploratory phase, rather than routine clinical practice, though broader adoption is expected in the near future. AI can further extract imaging signatures and tie them to re-optimization logic.”

 

  1. The table summarizing AI-powered image assistance is useful, but it simplifies complex processes to the point of being somewhat misleading. For example, MRI-to-sCT synthesis is described as producing “planning-quality” images and eliminating registration uncertainties, which may not hold across all anatomical sites or scanners. Similarly, CBCT-to-sCT approaches are described as enabling dose calculation without noting residual uncertainties in HU accuracy. It would be helpful to include limitations or confidence levels alongside each application.

Response: Thank you for this helpful comment. We have revised the table accordingly to include limitations and confidence levels for each application and to avoid overstatement of current capabilities.

  1. The patient-specific QA section is one of the stronger parts of the manuscript, but it still tends to emphasize positive findings. The concept of “virtual IMRT QA” is introduced clearly, yet the limitations are somewhat underdeveloped. The discussion correctly mentions concerns about gamma criteria and interpretability, but it could go further by addressing issues such as dataset bias, sensitivity to rare failure modes, and medico-legal implications of reducing measurement-based QA. Some reported workload reduction percentages appear optimistic and should be contextualized with implementation details.

Response: Thank you for pointing out the additional issues. We have added discussion of dataset bias and insensitivity to rare failure modes. As for the medico-legal implications of reducing measurement-based QA, we consider a nuanced discussion to be outside the scope of this document. We also agree that some of the workload-reduction percentages appear optimistic, and have clarified that these are time-saving estimates that depend on the clinical implementation and environment.

 

  1. The treatment toxicity section is well structured and highlights an important application area. However, the critique of traditional NTCP models is somewhat oversimplified. While these models have limitations, they are still clinically validated and widely used. The comparison with AI-based approaches should be more balanced, particularly given that many AI models lack external validation and standardization.

Response: We thank the reviewer for this important and constructive comment. We agree that our original description of NTCP models may have been overly simplified and did not sufficiently acknowledge their established clinical value. We have revised the paragraph (section 8.3) to explicitly recognize that DVH-based and NTCP models are clinically validated, interpretable, and widely used in routine practice, while adding that many AI models currently lack extensive external validation and standardization, and that further work is required before widespread clinical adoption.

 

  1. The challenges and limitations section is an important component of the manuscript, but it remains relatively high-level and somewhat generic. Many points, such as data heterogeneity and lack of interpretability, are well known and could be discussed in more depth. For example, the discussion of domain shift could include specific mitigation strategies such as federated learning or domain adaptation. The statement that “there are no guidelines” for AI should be revised to reflect the evolving regulatory landscape.

Response: Thank you for the comments. We have expanded the challenges and limitations section to include mitigation strategies for these limitations. We have also revised the statement that “there are no guidelines” in the QA section to reflect the evolving regulatory landscape.

 

  1. The regulatory and ethical considerations section is appropriate but somewhat superficial. Issues such as algorithmic bias and automation bias are mentioned, yet there is little discussion of how these risks can be mitigated in practice. Including concrete examples or existing frameworks would strengthen this section.

Response: We thank the reviewer for the helpful comment. We use algorithmic bias as an example and have added discussion of how the risk can be mitigated.

 

Regulatory and Ethical Considerations. The rapid evolution of AI models presents a unique challenge for traditional regulatory approval processes, which are designed for more static medical devices. Critical concerns include algorithmic bias [257], which occurs when AI models exhibit systematic disparities in performance due to unrepresentative training data or flawed model design, and automation bias, where clinicians may over-rely on machine-generated decision-making [257,258]. These risks may exacerbate existing healthcare disparities if unaddressed. Mitigating these risks requires proactive strategies as well as clear guidelines and standards. Using algorithmic bias as an example, mitigation approaches include data-level strategies (ensuring training data are representative and unbiased), model-level strategies (designing algorithms that explicitly prioritize fairness), validation and evaluation (rigorously assessing model performance across subgroups), and deployment and monitoring (ensuring fairness persists in real-world use). Frameworks such as the NIST AI Risk Management Framework (AI RMF) can support systematic auditing and governance, while all efforts should align with evolving legal and ethical standards, such as the FDA’s Predetermined Change Control Plan (PCCP). Guidance statements for AI ethics and good practices have been issued by multiple governing bodies and organizations and continue to evolve rapidly alongside technical advances [259].

 

  1. The multi-omics paragraph is relevant but somewhat repetitive and loosely connected to the rest of the section. While the potential is clear, the practical challenges of integrating heterogeneous data, such as data harmonization and interpretability, are not discussed.

Response: Thank you for pointing out the repetitive information. We have rewritten this paragraph to tie it more closely to this section, and we have included the challenges as directions for future efforts.

“While radiomics has demonstrated considerable promise in enhancing cancer detection and refining radiation outcome analysis, especially in scenarios involving extensive imaging data, recent advances in multi-omics research have further highlighted the significant potential of integrating heterogeneous data modalities. The combination of genomics, transcriptomics, proteomics, metabolomics, and radiomics offers a powerful, system-level framework to comprehensively characterize tumor biology, microenvironment, and treatment response mechanisms, enabling the development of more accurate prognostic and predictive models, and highly personalized and patient-specific radiotherapy strategies [217, 218, 264-266]. Future efforts will need to focus on overcoming technical and analytical challenges – such as data harmonization, model interpretability, and clinical validation – to achieve the goal of biologically guided radiation oncology.”

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

No comments
