Artificial Intelligence and Machine Learning in Pediatric Endocrine Tumors: Opportunities, Pitfalls, and a Roadmap for Trustworthy Clinical Translation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The article titled "Artificial Intelligence and Machine Learning in Pediatric Endocrine Tumors: Opportunities, Pitfalls, and a Roadmap for Trustworthy Clinical Translation" reviews the literature about the recent advances in AI/ML in pediatric endocrine tumors. The manuscript has been written and structured well, and the citations are timely. The work can be useful for readers. Also, the authors have clearly highlighted the age-impacted motivations for dividing pediatric and adult studies. However, there are a few concerns that should be addressed by the authors:
- Although focusing on pediatric studies is plausible, reutilizing the relevant methods is crucial. Typically, other fields of AI/ML are widely used in studies. I advise mentioning Gene Profile ML, Network-based ML (association and causality), and multi-omics ML. If the authors cannot find any studies, they can report the gap and provide a brief perspective.
- Figure 1 is a mixture of heterogeneous titles (Methods, Techniques, Domains). If the authors intend to present a taxonomic hierarchy, it is not accurate. Otherwise, the figure can be rearranged by separating element types with meaningful arrows.
Author Response
The article titled "Artificial Intelligence and Machine Learning in Pediatric Endocrine Tumors: Opportunities, Pitfalls, and a Roadmap for Trustworthy Clinical Translation" reviews the literature about the recent advances in AI/ML in pediatric endocrine tumors. The manuscript has been written and structured well, and the citations are timely. The work can be useful for readers. Also, the authors have clearly highlighted the age-impacted motivations for dividing pediatric and adult studies.
However, there are a few concerns that should be addressed by the authors:
- Although focusing on pediatric studies is plausible, reutilizing the relevant methods is crucial. Typically, other fields of AI/ML are widely used in studies. I advise mentioning Gene Profile ML, Network-based ML (association and causality), and multi-omics ML. If the authors cannot find any studies, they can report the gap and provide a brief perspective.
We agree and have added a new cross-entity subsection (Section 3.7 Multi-omics, gene-expression, and network-based ML) that briefly introduces these method families and summarizes representative adult endocrine applications, while clearly marking the pediatric evidence gap. Specifically, we cite: supervised gene-expression ML for subtype/metastasis prediction in small-intestinal NETs; transcriptomic classifiers distinguishing adrenocortical adenoma vs carcinoma and defining prognostic subgroups; WGCNA/radiogenomic work in papillary thyroid carcinoma (including links to lateral lymph-node metastasis); an endocrine epidemiology review on integrating causal inference with ML; and multi-omics applications in PTC lymph-node metastasis. We also added a Table 2 row (“Multi-omics integration & network methods”) to map pediatric guardrails (fusion strategy, batch correction, assay QC, feature stability, calibration) to reporting frameworks.
- Figure 1 is a mixture of heterogeneous titles (Methods, Techniques, Domains). If the authors intend to present a taxonomic hierarchy, it is not accurate. Otherwise, the figure can be rearranged by separating element types with meaningful arrows.
We redesigned Figure 1 into three columns (Methods, Clinical domains, and Explainability/XAI) and added a caption clarifying flow semantics.
Reviewer 2 Report
Comments and Suggestions for Authors
To enhance the clarity of the manuscript and strengthen its pediatric-specific focus, the authors are recommended to revise the text in accordance with the following points:
- Although the abstract aims to provide a broad overview of artificial intelligence and machine learning applications in pediatric endocrine oncology, the content remains general. The unique contribution of the study, how it differs from existing reviews in the literature, and which concrete new insights it offers to the reader should be clarified in the abstract.
- In the introduction, the clinical-epidemiological summaries for each tumor group are quite detailed. However, it is not clear how this information is directly linked to artificial intelligence/machine learning applications.
- In the “Current Evidence” subsection, while a general panorama of pediatric oncology is presented, it should be explained why the literature specific to pediatric endocrine tumors is particularly limited and how this review addresses that gap.
- Given the scope and ranking algorithms of Google Scholar, the methods section should more clearly explain how this database was used to ensure reproducibility.
- Although it is stated that no meta-analysis was performed, the reason why quantitative synthesis was not possible should be more explicitly justified. Some tumor groups potentially have comparable outcome measures. It should be explained why a structured comparison or at least a semi-quantitative summary was not conducted.
- Presenting the extensive Boolean “OR” terms in the text as a long list reduces readability. Organizing these terms into conceptual clusters in a table would improve transparency and reproducibility of the methods section.
- In the third section, performance metrics (AUROC, C-index, etc.) are reported in detail, but clinical effects remain largely hypothetical, and it is unclear how the models would be used in practice. Greater clarity is needed in this regard.
- Excessive reliance on adult data throughout the third section is also noticeable. Adult perspectives provide useful context, but in some subsections (especially MTC, PGL, and GEP-NEN), the near-complete absence of pediatric-specific content weakens the pediatric focus of this section. These subsections should be revised.
- The discussion of explainable AI (XAI) in the third section, while technically accurate, is repetitive and unnecessarily long. The limitations of SHAP, saliency maps, and radiomic features are repeatedly mentioned without offering new conceptual insights. A more integrated and synthetic XAI discussion is preferred.
- In the fourth section, the text largely focuses on listing reporting standards and explaining tables. The original methodological discussion or critical appraisal is limited, and the scope should be expanded.
- Also in the fourth section, numerous standards and guidelines are merely listed. Their applicability or limitations specific to pediatric endocrine tumors should be analyzed.
- In the fifth section, the text mostly summarizes institutional and policy frameworks. Field-applicable recommendations or illustrative case analyses should be added.
- In the sixth section, the existence of networks such as EXPeRT, ERN PaedCan, and PARTNER and the potential for pilot projects are described at length. However, concrete examples or evidence-based strategies for real-world implementation should be provided.
Author Response
To enhance the clarity of the manuscript and strengthen its pediatric-specific focus, the authors are recommended to revise the text in accordance with the following points:
- Although the abstract aims to provide a broad overview of artificial intelligence and machine learning applications in pediatric endocrine oncology, the content remains general. The unique contribution of the study, how it differs from existing reviews in the literature, and which concrete new insights it offers to the reader should be clarified in the abstract.
Thank you for this constructive suggestion. We have revised the abstract to (i) state our unique contribution—a pediatric-first, entity-structured synthesis paired with a guardrails checklist mapped to contemporary reporting/evaluation standards and an EU-anchored roadmap (EXPeRT, ERN PaedCan); and (ii) add concrete insights beyond generalities (promising pediatric signals in ACT survival and DTC early non-remission; adult-led areas for PGL/GEP-NEN; cross-cutting points on calibration, validation hierarchy, and XAI limits).
- In the introduction, the clinical-epidemiological summaries for each tumor group are quite detailed. However, it is not clear how this information is directly linked to artificial intelligence/machine learning applications.
We thank the reviewer for this helpful observation. We have condensed the clinical-epidemiological summaries for each tumor entity.
- In the “Current Evidence” subsection, while a general panorama of pediatric oncology is presented, it should be explained why the literature specific to pediatric endocrine tumors is particularly limited and how this review addresses that gap.
We agree that the “Current Evidence” subsection should explain why pediatric endocrine AI/ML is comparatively sparse and how our review addresses this. We have reworded this subsection to integrate a brief rationale—rarity, genotype heterogeneity, long-horizon outcomes, protocol/assay variability across centers, and cross-border consent/GDPR constraints—while avoiding any anticipation of results. We then state how the review responds: a pediatric-first, entity-structured synthesis that clearly separates pediatric from adult-only evidence, consolidates studies in Table 1, maps guardrails to reporting/evaluation standards in Table 2, and outlines an EU-anchored pathway for harmonized multi-site validation.
- Given the scope and ranking algorithms of Google Scholar, the methods section should more clearly explain how this database was used to ensure reproducibility.
We appreciate the concern about Google Scholar’s ranking and reproducibility. An audit of our search showed that Google Scholar did not yield any unique records beyond those already captured in PubMed/MEDLINE. To avoid retaining a non-contributory source, we removed Google Scholar from the Methods.
- Although it is stated that no meta-analysis was performed, the reason why quantitative synthesis was not possible should be more explicitly justified. Some tumor groups potentially have comparable outcome measures. It should be explained why a structured comparison or at least a semi-quantitative summary was not conducted.
We appreciate the suggestion to consider a meta-analysis or semi-quantitative summary. In this topic area, most entity–task pairs had very few pediatric studies (often k ≤ 1) with non-comparable endpoints, heterogeneous inputs/protocols, and incomplete reporting of calibration and action thresholds. Under these conditions, quantitative pooling or score aggregation would be misleading. We now state this explicitly in the Methods and retain a structured narrative.
- Presenting the extensive Boolean “OR” terms in the text as a long list reduces readability. Organizing these terms into conceptual clusters in a table would improve transparency and reproducibility of the methods section.
We have removed the long Boolean strings from the main text and now present the search terms as conceptual clusters in Appendix Table A1 alongside the verbatim PubMed query strings. The Methods section now references Table A1 for transparency and reproducibility.
- In the third section, performance metrics (AUROC, C-index, etc.) are reported in detail, but clinical effects remain largely hypothetical, and it is unclear how the models would be used in practice. Greater clarity is needed in this regard.
We rewrote Section 3 and added a brief reading guide that frames each subsection by the decision the model would inform if validated (triage, risk estimation, peri-operative safety, treatment response). Within each entity, we now describe how a model would be used in practice (e.g., FNA vs observation in DTC; extent of surgery/RAI and surveillance; alpha-blockade/anesthesia planning in PGL), while minimizing numeric detail (retaining only essential figures) and emphasizing calibration and validation. Conditions for use and decision thresholds are referenced to Section 4 and Table 2 rather than repeated.
- Excessive reliance on adult data throughout the third section is also noticeable. Adult perspectives provide useful context, but in some subsections (especially MTC, PGL, and GEP-NEN), the near-complete absence of pediatric-specific content weakens the pediatric focus of this section. These subsections should be revised.
We substantially condensed adult material in each subsection (MTC, PPGL, GEP-NEN, and for symmetry DTC, ACT, and patient-facing AI). Each entity now opens with the pediatric evidence (or explicit acknowledgment of its absence), and adult studies are cited only as methodological context in 1–3 sentences. This re-centers the section on pediatrics while preserving useful scaffolding where pediatric data are lacking.
- The discussion of explainable AI (XAI) in the third section, while technically accurate, is repetitive and unnecessarily long. The limitations of SHAP, saliency maps, and radiomic features are repeatedly mentioned without offering new conceptual insights. A more integrated and synthetic XAI discussion is preferred.
We removed repeated XAI explanations from the entity subsections and created a single, integrated interpretability paragraph in Section 4. Entity sections now point to Section 4 for XAI considerations, eliminating redundancy while keeping the practical implications (feature stability, IBSI alignment, calibration) visible.
- In the fourth section, the text largely focuses on listing reporting standards and explaining tables. The original methodological discussion or critical appraisal is limited, and the scope should be expanded.
We have rewritten Section 4 to provide a critical, pediatric ET–specific appraisal beyond checklists. The revised text addresses label quality and harmonization (including IBSI for radiomics and assay variability for thyroglobulin/calcitonin), analysis under small-N/p≫n, calibration and action thresholds with decision-curve analyses, integrated interpretability and appropriate-reliance endpoints, genotype-aware subgroup reporting, and lifecycle governance under the EU AI Act. Table 2 remains as a quick checklist but the narrative now explains how and why each guardrail matters in pediatric ETs.
- Also in the fourth section, numerous standards and guidelines are merely listed. Their applicability or limitations specific to pediatric endocrine tumors should be analyzed.
Addressed in Section “Where the standards fit—and where they don’t”, which explicitly maps TRIPOD-AI/PROBAST-AI, CLAIM 2024, STARD-AI, METRICS+IBSI, DECIDE-AI, SPIRIT-AI/CONSORT-AI to pediatric ET use and notes gaps (e.g., thresholds not specified, lesion- vs patient-level endpoints, ultrasound variability, stability requirements). The text clarifies how we would adapt each framework to the constraints of rare, genotype-diverse, small-sample pediatric ETs.
- In the fifth section, the text mostly summarizes institutional and policy frameworks. Field-applicable recommendations or illustrative case analyses should be added.
We appreciate this suggestion. In addition to summarizing institutional/policy frameworks, we have added a field-applicable subsection with illustrative cases (DTC ultrasound triage; PGL peri-operative risk; MEN2 patient-facing navigator). These short scenarios show how we would scope tools, couple outputs to pre-specified pediatric thresholds, ensure human oversight, monitor equity (age/genotype/language/vendor strata), and implement post-deployment safety logs and update procedures within ERN PaedCan/EXPeRT governance. We believe this strengthens Section 5 by moving from principles to concrete, pediatric endocrine–relevant practice.
- In the sixth section, the existence of networks such as EXPeRT, ERN PaedCan, and PARTNER and the potential for pilot projects are described at length. However, concrete examples or evidence-based strategies for real-world implementation should be provided.
We appreciate the request for concrete, evidence-informed implementation strategies. Section 6 has been condensed and reframed around four entity-specific pilot examples (ACT survival risk, DTC early non-remission, PGL peri-operative instability, GEP-NEN imaging via federated learning) embedded in CPMS workflows with prespecified thresholds, calibration checks, audit trails, and equity dashboards. We also specify quasi-experimental roll-outs, lifecycle safety (drift, recalibration, rollback), and “starter packs” (model cards, data dictionaries, monitoring templates) to support real-world replication within EXPeRT/ERN PaedCan.
Reviewer 3 Report
Comments and Suggestions for Authors
This review article provides a comprehensive synthesis of the current landscape, challenges, and future directions for AI and ML in this domain. The authors focus on five main tumor types, critically appraise the available evidence, discuss methodological and ethical guardrails, and propose a pragmatic roadmap for clinical translation. The manuscript is timely, well structured, and addresses a clear gap in the literature by focusing on the pediatric context, which is often underrepresented in AI/ML oncology research. The authors are to be commended for their thoughtful, balanced, and forward-looking synthesis of a rapidly evolving field. The review is accessible to clinicians, but some sections (e.g., on explainability methods, calibration, and model validation) could benefit from deeper technical detail or illustrative examples, especially for readers less familiar with AI/ML. The section called "patient-facing AI tools" is brief and mostly negative. I suggest expanding this section with more examples, even from adjacent pediatric fields, and discussing ongoing efforts to improve the reliability and safety of such tools. Table 1 is highly informative but dense. I would suggest splitting it by tumor type or highlighting key pediatric studies for clarity. Figure 1, while referenced, is not fully described in the text or footnotes; a more detailed caption or in-text explanation would help. The conclusion calls for multi-center collaboration and federated validation but could be more specific about the types of studies or pilots needed.
I would encourage the authors to provide concrete examples of high-priority research questions or pilot projects that could be initiated within the proposed networks.
Author Response
This review article provides a comprehensive synthesis of the current landscape, challenges, and future directions for AI and ML in this domain. The authors focus on five main tumor types, critically appraise the available evidence, discuss methodological and ethical guardrails, and propose a pragmatic roadmap for clinical translation. The manuscript is timely, well structured, and addresses a clear gap in the literature by focusing on the pediatric context, which is often underrepresented in AI/ML oncology research. The authors are to be commended for their thoughtful, balanced, and forward-looking synthesis of a rapidly evolving field.
We thank the reviewer for the encouraging assessment. Although no specific changes were requested in this comment, we have nonetheless strengthened the manuscript in line with the spirit of the feedback: (i) Section 3 has been rewritten into a fully narrative, pediatric-first synthesis with clearer links to clinical decisions and reduced numeric emphasis; (ii) adult material is retained only as brief methodological context; (iii) explainability content has been consolidated into a single integrated paragraph in Section 4 with corresponding items in Table 2; and (iv) the roadmap (Section 6) now includes concrete, network-based pilot examples and evaluation strategies. We appreciate the reviewer’s recognition of the manuscript’s timeliness and focus on pediatric needs.
The review is accessible to clinicians, but some sections (e.g., on explainability methods, calibration, and model validation) could benefit from deeper technical detail or illustrative examples, especially for readers less familiar with AI/ML.
We appreciate the request to make the review more concrete for clinicians. We have added short “Clinician’s note” callouts (calibration, small-N/uncertainty, harmonization, interpretability) and three Supplementary Boxes: S1 (calibration in practice with a pediatric DTC example), S2 (decision curves and threshold setting with a PGL screening example), and S3 (a concise validation hierarchy). An optional Box S4 summarizes how to read SHAP/saliency/radiomics explanations responsibly. These additions keep the manuscript readable while giving non-specialist readers actionable handles for appraisal and use.
The section called "patient-facing AI tools" is brief and mostly negative. I suggest expanding this section with more examples, even from adjacent pediatric fields, and discussing ongoing efforts to improve the reliability and safety of such tools.
We have expanded the section with examples from adjacent pediatric settings (symptom-triage assistants, pre-op/survivorship education, adherence and care-coordination support) and added a paragraph on ongoing reliability efforts (grounded content, safety classifiers/abstention, intent detection, audit logs, offline access). We also added a short, field-applicable paragraph in Section 5 describing how to operationalize reliability and equity (grounding, hard stops with escalation, appropriate-reliance metrics, governance artifacts). This preserves a cautious stance while offering concrete, pediatric-relevant paths to safer patient-facing tools.
Table 1 is highly informative but dense. I would suggest splitting it by tumor type or highlighting key pediatric studies for clarity.
We have split the original Table 1 into: Table 1 (main text): Pediatric studies only, grouped by entity and clinical question; and Appendix Table A2: Adult evidence (context only).
Figure 1, while referenced, is not fully described in the text or footnotes; a more detailed caption or in-text explanation would help.
We have revised Figure 1 and now provide a full descriptive caption and in-text explanation.
The conclusion calls for multi-center collaboration and federated validation but could be more specific about the types of studies or pilots needed. I would encourage the authors to provide concrete examples of high-priority research questions or pilot projects that could be initiated within the proposed networks.
We agree and have revised the manuscript accordingly. While keeping the Conclusion concise in line with journal conventions, we substantially revised Sections 4 to 6 to provide concrete, high-priority pilots and study designs, including examples such as a DTC early non-remission decision aid, external validation of a four-variable ACT survival tool, a PGL peri-operative instability pilot embedded in anesthesia checklists, federated imaging studies for adolescent GEP-NEN, and a genotype-aware MEN2/MTC aid. These revisions specify intended actions, thresholds, outcomes, calibration maintenance, and equity monitoring within EXPeRT/ERN PaedCan workflows.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors revised the manuscript by taking the recommendations into account. The paper is acceptable in its current form.
