Article
Peer-Review Record

IDH Mutation Assessment in Gliomas from Anatomical MRI Using Deep Learning: A Comparative Analysis of Centralized and Federated Learning Frameworks

Diagnostics 2026, 16(4), 623; https://doi.org/10.3390/diagnostics16040623
by Abdullah Bas 1,* and Esin Ozturk-Isik 1,2
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 4 December 2025 / Revised: 10 February 2026 / Accepted: 11 February 2026 / Published: 20 February 2026
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript presents a comparative study of centralized learning (CL) and federated learning (FL) frameworks for predicting IDH mutation status in diffuse gliomas using routine anatomical MRI (T1c, T2w, FLAIR). The authors implement a 2D U-Net–based classifier incorporating age and sex covariates and evaluate two tumor-focused preprocessing strategies. They report that CL-NSF achieves the highest performance, while FL performance is lower but sensitive to aggregation strategy, with the Trimmed Mean outperforming FedAvg. Below are my comments to improve the manuscript:

  1. The training/test imbalance in IDH-mutation prevalence (≈15% in training vs. ≈55% in the test set) requires deeper justification and evaluation, particularly with respect to calibration stability, class-conditional performance, and the possibility that the observed precision–recall profiles are influenced by this distributional mismatch.
  2. The interpretation of performance differences between Naive Soft Filtering and Gradient-Based Soft Filtering would benefit from supporting evidence, such as saliency/attribution maps or subregion-specific error analyses, to substantiate the claim that boundary-weighted filtering attenuates intratumoral signal relevant to IDH discrimination.
  3. Was the U-Net encoder pretrained on the same UCSF-PDGM dataset used for classification, and if so, how did the authors mitigate implicit target-leakage arising from feature transfer between segmentation and classification tasks; alternatively, if pretrained on an external dataset, what were the domain differences and how might they influence representation bias?
  4. Can the authors provide the full optimization schedule for all experiments, including local epoch counts, communication frequency, batch size, learning-rate schedule, weight-decay, and early-stopping criteria, and confirm that the effective optimization budget was harmonized between centralized and federated runs to avoid confounding the performance gap with differing convergence regimes?

  5. Were input intensities normalized globally or site-partition–specifically prior to pseudo-client creation, and did the authors evaluate whether per-client normalization pipelines influenced non-IID behavior and the relative performance of FedAvg versus Trimmed Mean aggregation?

  6. Clarify in Section 2.2 how skull-stripping and segmentations were quality-controlled.

Author Response

Comments 1: The training/test imbalance in IDH-mutation prevalence (≈15% in training vs. ≈55% in the test set) requires deeper justification and evaluation, particularly with respect to calibration stability, class-conditional performance, and the possibility that the observed precision–recall profiles are influenced by this distributional mismatch.

 

Response 1: We thank the reviewer for raising this important point regarding the discrepancy in IDH-mutation prevalence between the training and test sets. The test cohort was intentionally balanced to enable fair, unbiased estimates of accuracy and other threshold-dependent metrics, which can otherwise be dominated by the majority class under highly imbalanced conditions. As a result, the remaining training data showed a lower prevalence of IDH mutations.

We agree with the reviewer that this design introduces a prior shift between the training and test distributions and that precision–recall–based metrics are inherently sensitive to class prevalence. For this reason, we interpret precision and F1-score in conjunction with class-conditional metrics. Notably, the simultaneously high sensitivity and specificity observed across models indicate strong class-conditional discrimination rather than performance gains driven solely by test-set balancing.

Furthermore, we observe consistent performance trends between ROC-AUC, which is less sensitive to prevalence, and precision–recall–based metrics across centralized and federated settings. This consistency suggests that the reported performance differences primarily reflect genuine differences in discriminative capability rather than artifacts introduced by the balanced test distribution.

To avoid misinterpretation, we have explicitly noted in the revised Results section (at Table 3) that all reported metrics were obtained under a balanced test-set configuration and should be interpreted accordingly (page 7, second paragraph, lines 211-215):

“Test-set class proportions were intentionally balanced to avoid majority-class bias in threshold-dependent metrics; as a result, precision–recall values reflect performance under balanced prevalence and may differ under natural class distributions, whereas prevalence-insensitive metrics such as ROC-AUC, sensitivity, and specificity provide more generalizable assessments of discriminative ability.”
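The prevalence dependence described in this response can be illustrated with a short calculation. The sketch below applies Bayes' rule to a fixed sensitivity and specificity and shows how the predictive values change between a roughly balanced cohort (~55% IDH-mutant) and a cohort resembling the training prevalence (~15%). The operating point (0.90/0.90) and the function name `predictive_values` are hypothetical and chosen only for illustration; they are not results or code from the manuscript.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by Bayes' rule for a given class prevalence."""
    tp = sensitivity * prevalence                  # true-positive fraction
    fn = (1.0 - sensitivity) * prevalence          # false-negative fraction
    tn = specificity * (1.0 - prevalence)          # true-negative fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false-positive fraction
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical operating point, NOT a result reported in the manuscript.
SENS, SPEC = 0.90, 0.90

for label, prev in [("balanced test set (~55%)", 0.55),
                    ("training-like cohort (~15%)", 0.15)]:
    ppv, npv = predictive_values(SENS, SPEC, prev)
    print(f"{label}: PPV={ppv:.3f}, NPV={npv:.3f}")
```

With sensitivity and specificity held fixed, only the predictive values move with prevalence, which is why the quoted sentence treats sensitivity, specificity, and ROC-AUC as the more transferable quantities.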

Comments 2: The interpretation of performance differences between Naive Soft Filtering and Gradient-Based Soft Filtering would benefit from supporting evidence, such as saliency/attribution maps or subregion-specific error analyses, to substantiate the claim that boundary-weighted filtering attenuates intratumoral signal relevant to IDH discrimination.

Response 2: We agree with the reviewer. We do not have saliency maps for all the data or other quantitative evidence to support such strong claims. As a result, we have modified the corresponding sentences in the Discussion section as follows:

(page 10, first paragraph, lines 266-269):


“…suggesting that context-preserving attenuation around the lesion may be beneficial for discrimination under the evaluated conditions, without sacrificing specificity or recall.”


(page 10, first paragraph,  lines 272-276)


“This degradation may be related to differences in how tumor subregions are emphasized by the spatial weighting scheme; however, direct attribution analyses would be required to confirm the contribution of intratumoral signal to IDH mutation detection.”


Comments 3: Was the U-Net encoder pretrained on the same UCSF-PDGM dataset used for classification, and if so, how did the authors mitigate implicit target-leakage arising from feature transfer between segmentation and classification tasks; alternatively, if pretrained on an external dataset, what were the domain differences and how might they influence representation bias?


Response 3: We agree that this represents a critical methodological consideration. Our U-Net encoder was pretrained on T2-weighted MRI from an in-house glioma dataset to delineate the whole tumor volume, and none of the UCSF-PDGM data were used in pretraining. Additionally, the pretraining did not use IDH mutation labels, thereby mitigating the risk of implicit target leakage in the subsequent classification task. Potential domain differences include variations in MRI scanner protocols and patient demographics across institutions; however, both datasets underwent similar preprocessing pipelines (skull stripping, intensity normalization) following standard practices in glioma imaging studies, which reduces domain shift. We fine-tuned the 2D U-Net model (weights were not frozen) to adapt pretrained anatomical features to UCSF-PDGM characteristics while preserving beneficial inductive biases from segmentation pretraining, ensuring that learned representations capture glioma-relevant morphological features without introducing molecular classification biases. To explain the pretraining clearly, we added the following sentences (page 5, first paragraph, lines 170-175):


“The U-Net encoder was pretrained on an independent in-house glioma dataset, with no patient overlap and no molecular labels used during pretraining. Additionally, both datasets underwent standard preprocessing (skull-stripping, registration, intensity normalization), and encoder weights were fine-tuned rather than frozen during classification training to allow adaptation to potential domain differences while leveraging pretrained anatomical feature representations.” 



Comments 4: Can the authors provide the full optimization schedule for all experiments, including local epoch counts, communication frequency, batch size, learning-rate schedule, weight-decay, and early-stopping criteria, and confirm that the effective optimization budget was harmonized between centralized and federated runs to avoid confounding the performance gap with differing convergence regimes?


Response 4:

Thank you for this comment. 


It is important to note that in both centralized and federated models, the same architecture, optimizer, learning rate schedule, loss function, and preprocessing pipeline were employed.


We added a new paragraph to the methods section of the manuscript to provide these important details (page 6, first paragraph, lines 186-192).


“Each model was trained with RMSProp optimizer (learning rate=1×10⁻⁴) and a cosine annealing learning rate schedule (T_max=40, η_min=0). Centralized models were trained on aggregated data for 40 epochs. In Federated Learning experiments, a single local epoch was performed on each model per round, followed by server aggregation. The total number of rounds was chosen to ensure an equal optimization budget for federated and centralized experiments. Early stopping was not used.” 
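For readers who want to see what the quoted schedule implies numerically, the sketch below evaluates the standard closed-form cosine annealing rule with the stated values (initial learning rate 1×10⁻⁴, T_max=40, η_min=0). It assumes the schedule is stepped once per epoch in the centralized runs (or once per round in the federated runs); the record does not state the stepping granularity or the training framework, and the function name `cosine_annealing_lr` is illustrative.

```python
import math


def cosine_annealing_lr(step, base_lr=1e-4, t_max=40, eta_min=0.0):
    """Closed-form cosine annealing:
    eta_min + (base_lr - eta_min) * (1 + cos(pi * step / t_max)) / 2
    """
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


# Assumption: one scheduler step per epoch (centralized) or per round (federated).
schedule = [cosine_annealing_lr(step) for step in range(41)]

for step in (0, 10, 20, 30, 40):
    print(f"step {step:2d}: lr = {schedule[step]:.2e}")
```

Under this rule the learning rate starts at the base value, passes through half of it at the midpoint, and reaches η_min at step T_max.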


Comments 5: Were input intensities normalized globally or site-partition–specifically prior to pseudo-client creation, and did the authors evaluate whether per-client normalization pipelines influenced non-IID behavior and the relative performance of FedAvg versus Trimmed Mean aggregation?


Response 5:

Thank you for this comment. The input intensities were normalized per patient, with each MRI modality independently zero-mean and unit-variance normalized prior to pseudo-client creation. No global, site-specific, or client-specific intensity statistics were computed or shared between the pseudo-clients.
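A minimal sketch of the per-patient, per-modality normalization described above is given here. Each modality of a single patient is scaled to zero mean and unit variance using only that patient's own voxels, so no statistics cross patient or client boundaries. The flat-list representation, the toy intensities, and the function names are assumptions made for illustration; the record does not specify whether statistics were computed over the whole volume or over brain voxels only.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Scale one modality of one patient to zero mean and unit variance."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        # Constant input: return zeros rather than dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def normalize_patient(modalities):
    """Normalize every modality independently, using that patient's voxels only."""
    return {name: zscore(voxels) for name, voxels in modalities.items()}


# Toy intensities (hypothetical); real inputs would be full MRI volumes.
patient = {
    "T1c": [10.0, 20.0, 30.0, 40.0],
    "T2w": [100.0, 300.0, 500.0, 700.0],
    "FLAIR": [5.0, 5.5, 6.0, 6.5],
}
normalized = normalize_patient(patient)
```

Because the statistics are local to each scan, the same function can run before the data are partitioned into pseudo-clients without sharing anything between them.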

This uniform per-modality normalization reduced intensity variability across scanners, but did not impose full IID conditions across federated clients due to potential heterogeneity in patient sample composition and higher-order feature distributions. The resulting performance differences between FedAvg and Trimmed Mean did not stem from heterogeneous normalization pipelines, and the results are consistent with the known robustness of Trimmed Mean under heterogeneous client updates.
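The difference between the two aggregation rules can be sketched as follows. Federated Averaging takes a sample-size-weighted mean of client parameters, whereas a coordinate-wise trimmed mean discards the most extreme values at each coordinate before averaging. The number of clients, the trim count `trim_k`, and the toy parameter vectors are hypothetical; the record does not report the trimming fraction or the number of pseudo-clients.

```python
def fedavg(updates, n_samples):
    """Sample-size-weighted mean of client parameter vectors."""
    total = sum(n_samples)
    dim = len(updates[0])
    return [sum(u[i] * n for u, n in zip(updates, n_samples)) / total
            for i in range(dim)]


def trimmed_mean(updates, trim_k=1):
    """Coordinate-wise mean after dropping the trim_k smallest and largest values."""
    if 2 * trim_k >= len(updates):
        raise ValueError("trim_k would remove every client")
    dim = len(updates[0])
    aggregated = []
    for i in range(dim):
        column = sorted(u[i] for u in updates)
        kept = column[trim_k:len(column) - trim_k]
        aggregated.append(sum(kept) / len(kept))
    return aggregated


# Three similar clients and one divergent client (toy values, not study data).
clients = [[0.10, -0.20], [0.12, -0.18], [0.11, -0.22], [5.00, 3.00]]
sizes = [100, 100, 100, 100]

print("FedAvg      :", fedavg(clients, sizes))
print("Trimmed mean:", trimmed_mean(clients, trim_k=1))
```

In this toy case the divergent client pulls the FedAvg result far from the other three, while the trimmed mean stays close to them, which is the robustness property the response refers to.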

Comments 6: Clarify in Section 2.2 how skull-stripping and segmentations were quality-controlled.


Response 6:

Skull stripping and tumor segmentation were provided by the data publishers. All segmentations were subsequently visually inspected by the authors to identify gross errors prior to model training, and no manual corrections were performed.


We added the following sentence for this point (page 4, first paragraph,  lines 139-141).


“Skull-stripping and tumor segmentations were obtained from the data publishers and visually inspected by the authors prior to use.” 

 

Reviewer 2 Report

Comments and Suggestions for Authors

The presented manuscript (ms) "IDH Mutation Identification in Gliomas from Anatomical MRI Using Deep Learning: A Comparative Analysis of Centralized and Federated Learning Frameworks" is well written and discusses the results in detail. To strengthen the ms, the following concerns should be addressed in the ms.

Major Comments:

Soft Filtering

1. The proposed Naive Soft Filtering (NSF) approach is strongly impactful in the study; but the choice of the fixed attenuation value (0.3) in NSF appears arbitrary; Need to explain this choice in ms.

2. Clarify whether soft filtering was applied only during training or also during inference or testing.

3. Federated Learning Design and Generalizability: The test set shows a substantially higher IDH mutation rate (~55%) than the training data. Clarify the rationale and discuss its potential impact on reported metrics.

Minor Comments:

1. Authors should include additional training details like learning rate, optimizer, batch size.

2. Figure 7 is dense and difficult to interpret; consider simplifying or summarizing key trends.

3. FedAvg & FA for Federated Averaging, the terminology should be consistent and explained.

Author Response

Comments 1:  The proposed Naive Soft Filtering (NSF) approach is strongly impactful in the study; but the choice of the fixed attenuation value (0.3) in NSF appears arbitrary; Need to explain this choice in ms.

 

Response 1: The attenuation factor in the Naive Soft Filtering (NSF) method was chosen empirically to attenuate background intensity while preserving contextual information around the tumor. The value of 0.3 was selected on the basis of preliminary tests, with the aim of avoiding complete removal of the tissue beyond the tumor borders.

 

This parameter was fixed throughout our experiments in both the centralized and federated settings and was not varied across scenarios or learning methods, so that relative performance differences could be compared on equal terms.

 

We have revised the related section as (page 4, second paragraph, lines 150-155), 



“In the first approach, we applied a fixed-weight soft-filtering scheme (Naive Soft Filtering; NSF) in which non-tumorous regions outside the tumor borders were assigned a heuristically selected weight of 0.3 during both training and testing. This weighting ensured that these regions contributed partially to the loss function (Figure 1). The 0.3 factor was kept constant across all experiments to maintain consistency and comparability.”
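The fixed-weight scheme in the quoted passage can be sketched as a simple masking operation. Pixels inside the tumor mask keep their intensity, and pixels outside it are multiplied by 0.3. This reading (scaling of input intensities, applied per 2D slice) follows the authors' description of attenuating background intensity, but the exact implementation is not given in the record; the function name, the nested-list image representation, and the toy values are assumptions.

```python
def naive_soft_filter(image, tumor_mask, outside_weight=0.3):
    """Keep tumor pixels unchanged and attenuate all other pixels by outside_weight."""
    if len(image) != len(tumor_mask):
        raise ValueError("image and mask must have the same number of rows")
    filtered = []
    for img_row, mask_row in zip(image, tumor_mask):
        if len(img_row) != len(mask_row):
            raise ValueError("image and mask rows must have the same length")
        filtered.append([
            px if inside else outside_weight * px
            for px, inside in zip(img_row, mask_row)
        ])
    return filtered


# Toy 2x3 slice with a single tumor pixel (hypothetical values).
image = [[100.0, 100.0, 100.0],
         [100.0, 200.0, 100.0]]
mask = [[0, 0, 0],
        [0, 1, 0]]

filtered = naive_soft_filter(image, mask)
```

Because the surrounding tissue is scaled rather than zeroed, peritumoral context remains visible to the classifier at reduced contrast, which is the stated motivation for not removing it entirely.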




Comments 2: Clarify whether soft filtering was applied only during training or also during inference or testing.

Response 2: Soft filtering was applied to both training and test datasets. We have updated the corresponding sentence as follows (page 4, second paragraph, lines 150-155):


“In the first approach, we applied a fixed-weight soft-filtering scheme (Naive Soft Filtering; NSF) in which non-tumorous regions outside the tumor borders were assigned a heuristically selected weight of 0.3 during both training and testing. This weighting ensured that these regions contributed partially to the loss function (Figure 1). The 0.3 factor was kept constant across all experiments to maintain consistency and comparability.”


Comments 3: Federated Learning Design and Generalizability. The test set shows a substantially higher IDH mutation rate (~55%) than the training data. Clarify the rationale and discuss its potential impact on reported metrics.


Response 3: Thank you for this comment. The test set was intentionally constructed to have an approximately balanced IDH-mut/IDH-wt ratio (yielding an IDH mutation rate of ~55%) so that performance could be assessed on a class-balanced cohort and not be dominated by the majority class. The remaining training cohort had an IDH-mutant rate of approximately 15%. Moreover, the difference in class proportions can influence reported metrics. Prevalence-dependent metrics such as accuracy, precision (PPV), NPV, and F1 score may differ when evaluated on a balanced test set compared with a cohort in which IDH-mutant cases are less frequent. By contrast, prevalence-insensitive metrics such as ROC-AUC, and sensitivity and specificity at a fixed decision threshold, are less directly affected. We therefore interpret predictive values from the balanced test set with caution and note that they may not directly translate to settings with lower IDH mutation rates.


We have clarified these points by adding the following sentences to the manuscript (page 7, second paragraph, lines 211-215):


“Test-set class proportions were intentionally balanced to avoid majority-class bias in threshold-dependent metrics; as a result, precision–recall values reflect performance under balanced prevalence and may differ under natural class distributions, whereas prevalence-insensitive metrics such as ROC-AUC, sensitivity, and specificity provide more generalizable assessments of discriminative ability.”


Comments 4: Authors should include additional training details, like learning rate, optimizer, and batch size.


Response 4:

Thank you for this comment. We added the following paragraph to the manuscript (page 6, first paragraph, lines 186-192).



“Each model was trained with RMSProp optimizer (learning rate=1×10⁻⁴) and a cosine annealing learning rate schedule (T_max=40, η_min=0). Centralized models were trained on aggregated data for 40 epochs. In Federated Learning experiments, a single local epoch was performed on each model per round, followed by server aggregation. The total number of rounds was chosen to ensure an equal optimization budget for federated and centralized experiments. Early stopping was not used.” 


Comments 5:  Figure 7 is dense and difficult to interpret; consider simplifying or summarizing key trends.


Response 5:

We thank the reviewer for this comment and agree that Figure 7 contains multiple overlaid metrics and is not intended for detailed, metric-by-metric inspection. The purpose of this figure was to provide a qualitative visualization of training dynamics, convergence behavior, and relative stability across centralized and federated learning strategies. We have simplified the figure by removing precision and recall and adding smoothing.

We have added a new paragraph to the manuscript to address this comment (page 9, second paragraph, lines 253-260):

“Figure 7 is intended to provide qualitative insight into the training dynamics rather than to serve as a direct quantitative comparison of metric performance. Consistently across experiments, CL exhibits more stable convergence and superior overall performance compared to FL, which shows higher volatility. Within federated methods, the FTM strategy demonstrates improved stability compared to FA. It is important to note that these plots have been smoothed to enhance the visibility of general convergence trends; consequently, they should be interpreted as identifying behavioral patterns rather than recording exact performance values at specific steps.”

Comments 6: FedAvg & FA for Federated Averaging, the terminology should be consistent and explained.


Response 6:

We thank the reviewer for this comment. We have modified the manuscript so that FA is used consistently throughout the document.



 

Reviewer 3 Report

Comments and Suggestions for Authors

In their manuscript "IDH Mutation Identification in Gliomas from Anatomical MRI Using Deep Learning: A Comparative Analysis of Centralized and Federated Learning Framework", Bas and Ozturk-Isik analysed 501 datasets of glioma images from the UCSF-PDGM image database (this data is missing in the abstract). Although interesting, the study design is flawed. The authors claim to identify IDH mutations; however, this is not accurate. What they detect are different tumor types. IDH-mutant and IDH-wildtype gliomas are at least two different tumor types with different tumor biology (and different MR images, they are able to identify glioblastomas vs. other gliomas). Moreover, the molecular data the authors present is incomplete. What is the 1p/19q status of the IDH-mutant gliomas? What is the diagnosis of the "WHO grade II" and "WHO grade III" gliomas IDH-wildtype? Minor comment: The WHO classification is outdated; Roman numbers are no longer used in grading, it should read "WHO CNS grade 3". The study design should be completely revised, the datasets should be ordered by diagnoses, complete molecular information must be presented, the title must be changed, and all information of the data sources used in the study must be presented in the abstract.

Comments on the Quality of English Language

ok

Author Response

Comments 1: In their manuscript "IDH Mutation Identification in Gliomas from Anatomical MRI Using Deep Learning: A Comparative Analysis of Centralized and Federated Learning Framework", Bas and Ozturk-Isik analysed 501 datasets of glioma images from the UCSF-PDGM image database (this data is missing in the abstract).

Response 1:

We thank the reviewer for pointing out that the abstract did not mention the dataset used in this study. We revised the abstract to name the UCSF-PDGM dataset (first page, abstract, lines 21-24):

“Anatomical MRI from 501 diffuse glioma patients in the UCSF Preoperative Diffuse Glioma MRI (UCSF-PDGM) dataset was analyzed using a deep learning classifier built on a 2D U-Net encoder, with age and sex included as covariates.”

Comments 2: Although interesting, the study design is flawed. The authors claim to identify IDH mutations; however, this is not accurate. What they detect are different tumor types. IDH-mutant and IDH-wildtype gliomas are at least two different tumor types with different tumor biology (and different MR images, they are able to identify glioblastomas vs. other gliomas).

Response 2: 

We thank the reviewer for this important observation and respectfully clarify the objective and clinical rationale of our study.

We fully acknowledge that IDH-mutant and IDH-wildtype gliomas represent biologically distinct tumor entities with characteristic imaging features. However, IDH mutation status itself remains a critical biomarker with independent clinical utility that extends beyond tumor taxonomy. IDH mutation is one of the strongest prognostic factors in gliomas and directly influences treatment decisions, including the extent of resection, selection of adjuvant therapy, and eligibility for targeted therapies. Non-invasive preoperative IDH prediction from anatomical MRI provides immediate clinical value in surgical planning and therapeutic decision-making, particularly in settings where advanced MRI sequences or rapid molecular testing may be unavailable or delayed.

Recent clinical advances further underscore the independent value of IDH mutation prediction. The landmark INDIGO trial demonstrated that Vorasidenib, an IDH-targeted therapy, significantly improved progression-free survival (median 27.7 vs. 11.1 months) and delayed the need for chemotherapy/radiation in IDH-mutant gliomas, leading to FDA approval in 2024 [1], the first new treatment for these tumors in over 20 years. Moreover, clinical trials have shown that chemotherapy benefit in gliomas depends on IDH mutation status, independent of 1p/19q codeletion status, confirming that IDH status has a standalone predictive value for treatment selection beyond integrated tumor classification. 

This study focuses on developing and validating deep learning approaches for non-invasive IDH mutation prediction from anatomical MRI. As a secondary objective, we compare centralized and federated learning frameworks to assess whether multi-institutional collaboration under privacy constraints can achieve performance comparable to that of pooled-data approaches.

Our model is trained exclusively on binary IDH mutation labels using anatomical MRI and clinical covariates (age, sex). No tumor type, grade, or integrated diagnostic classification labels are provided to or inferred by the model. While IDH status correlates with tumor biology and imaging phenotype, this correlation does not diminish the validity or clinical utility of IDH prediction as an independent objective.

We have added some clarification about the importance of the study to the Introduction (page 2, paragraph 2, lines 55-61) as, 

“Although IDH-mut and IDH-wt gliomas represent biologically distinct tumor entities with characteristic imaging features, IDH mutation status remains an independent prognostic and therapeutic marker. This has been demonstrated in a recent clinical trial, INDIGO, showing that vorasidenib, a small-molecule inhibitor targeting the IDH1 and IDH2 enzymes, significantly improved progression-free survival in IDH-mut gliomas [16], and was approved by the FDA in 2024. Anatomical MRI-based noninvasive estimation of IDH status thus meets a significant clinical need in this regard, especially with the advent of targeted treatments for IDH-mut gliomas.”

References:

[1] Mellinghoff, I.K.; van den Bent, M.J.; Blumenthal, D.T.; Touat, M.; Peters, K.B.; Clarke, J.; Mendez, J.; Yust-Katz, S.; Welsh, L.; Mason, W.P.; Ducray, F.; Umemura, Y.; Nabors, B.; Holdhoff, M.; Hottinger, A.F.; Arakawa, Y.; Sepulveda, J.M.; Wick, W.; Soffietti, R.; Perry, J.R.; Giglio, P.; de la Fuente, M.; Maher, E.A.; Schoenfeld, S.; Zhao, D.; Pandya, S.S.; Steelman, L.; Hassan, I.; Wen, P.Y.; Cloughesy, T.F. Vorasidenib in IDH1- or IDH2-Mutant Low-Grade Glioma. New England Journal of Medicine 2023, 389, 589–601.

 

Comments 3: Moreover, the molecular data the authors present is incomplete. What is the 1p/19q status of the IDH-mutant gliomas?

 

Response 3: Although 1p/19q codeletion status is indeed one of the critical molecular markers in integrated glioma classification, it was not included in the analysis due to data limitations. The UCSF-PDGM dataset contains only 13 1p/19q-codeleted oligodendrogliomas out of 501 total cases, representing severe class imbalance (only 2.6% 1p/19q codeleted) that would preclude reliable statistical analysis or model training for this molecular marker. The dataset composition is: 374 glioblastomas (IDH-wt), 90 astrocytomas (IDH-mut), 13 oligodendrogliomas (IDH-mut with 1p/19q codeletion), and 24 astrocytomas (IDH-wt).



We have added this limitation to the Discussion section (page 12, first paragraph, lines 344-351) as,

"Although 1p/19q codeletion status is a critical molecular marker for integrated glioma classification, it was not included in this study due to data limitations. The UCSF-PDGM dataset contains only 13 1p/19q codeleted oligodendrogliomas out of 501 total cases, representing severe class imbalance (only 2.6% 1p/19q codeleted) that would preclude reliable statistical analysis or model training for including this molecular marker. Access to larger datasets with complete molecular annotation would enable broader application of federated learning for integrated glioma classification, including 1p/19q status and other biomarkers.”



Comments 4: What is the diagnosis of the "WHO grade II" and "WHO grade III" gliomas IDH-wildtype? 

Response 4: We thank the reviewer for raising this point. In the UCSF PDGM dataset, adult diffuse gliomas labeled as WHO grade 2 or WHO grade 3 and IDH wildtype are categorized as “Astrocytoma, IDH wildtype” under the dataset’s simplified labeling scheme. Under the 2021 WHO CNS5 classification, however, an adult diffuse astrocytic glioma that is IDH wildtype and shows histologic grade 2 or 3 features may be diagnosed as “Glioblastoma, IDH wildtype (CNS WHO grade 4)” if specific molecular criteria are met, including TERT promoter mutation, EGFR amplification, or the combined whole chromosome 7 gain and whole chromosome 10 loss signature. Because the publicly available UCSF PDGM dataset does not include comprehensive genomic profiling for all cases, we cannot systematically apply these WHO CNS5 molecular criteria and therefore retained the dataset-provided diagnostic labels in our analyses. 

Comments 5: Minor comment: The WHO classification is outdated; Roman numbers are no longer used in grading, it should read "WHO CNS grade 3".

 

Response 5: We thank the reviewer for pointing out this mistake.  

We have updated Table 1 on page 1 to use Arabic numerals 2-4 instead of Roman numerals.

 

Comments 6: The proposed Naive Soft Filtering (NSF) approach is strongly impactful in the study, but the choice of the fixed attenuation value (0.3) in NSF appears arbitrary; need to explain this choice in the manuscript.

 

Response 6: The attenuation factor in the Naive Soft Filtering (NSF) method was chosen empirically to attenuate background intensity while preserving contextual information around the tumor. The value of 0.3 was selected on the basis of preliminary tests, with the aim of avoiding complete removal of the tissue beyond the tumor borders.

 

This parameter was fixed throughout our experiments in both the centralized and federated settings and was not varied across scenarios or learning methods, so that relative performance differences could be compared on equal terms.

 

We have revised the related section as (page 4, second paragraph, lines 139-142), 



“In the first approach, we applied a fixed-weight soft-filtering scheme (Naive Soft Filtering; NSF) in which non-tumorous regions outside the tumor borders were assigned a heuristically selected weight of 0.3 during both training and testing. This weighting ensured that these regions contributed partially to the loss function (Figure 1). The 0.3 factor was kept constant across all experiments to maintain consistency and comparability.”

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

The authors' response to the reviewer's comments is not convincing. 

Response 2:

The authors' response is not correct. IDH mutations are not prognostic; they are diagnostic.  IDH-mutant and IDH-wildtype gliomas are at least two distinct tumor types with distinct biology (and different MR images).

The clinical study INDIGO compared the IDH inhibitor vorasidenib vs. placebo. All included patients were IDH mutant. Therefore, the argument of the authors is not appropriate.

Moreover, an IDH mutation can only be diagnosed using molecular methods, e.g., sequencing technologies, but not MRI.

Therefore, even the title is misleading, since the authors cannot detect IDH mutations.

Response 3 and 4:

The 1p/19q co-deleted oligodendrogliomas and the IDH-wildtype astrocytomas should be excluded from the study, since no diagnostic molecular data are available for these datasets.

Comments on the Quality of English Language

ok

Author Response

Comments 1: The authors' response is not correct. IDH mutations are not prognostic; they are diagnostic.  IDH-mutant and IDH-wildtype gliomas are at least two distinct tumor types with distinct biology (and different MR images).

Response 1: We respectfully disagree with this characterization. While IDH mutation status is primarily a diagnostic classifier under the WHO 2021 classification, it is also one of the strongest survival-associated molecular features in diffuse gliomas, consistently used for prognostic stratification in clinical studies. Multiple studies have demonstrated that IDH-mut gliomas are associated with significantly longer overall survival compared to IDH-wt tumors. Therefore, IDH mutation status actually serves both diagnostic and prognostic purposes.

Comments 2: The clinical study INDIGO compared the IDH inhibitor vorasidenib vs. placebo. All included patients were IDH mutant. Therefore, the argument of the authors is not appropriate.

Response 2: The INDIGO trial demonstrated significant clinical benefit of vorasidenib in IDH-mut glioma patients, leading to FDA approval. Critically, all patients in INDIGO were required to have confirmed IDH mutations as an inclusion criterion. This underscores that IDH mutation detection, as a standalone biomarker, is clinically actionable and serves as a gateway to an FDA-approved therapy. One aim of our work is to provide a non-invasive method to identify patients who may benefit from such targeted treatments, which is precisely why accurate IDH status prediction is clinically valuable.

Comments 3: Moreover, an IDH mutation can only be diagnosed using molecular methods, e.g., sequencing technologies, but not MRI. 

Response 3: The reviewer correctly states that a definitive diagnosis requires molecular confirmation. However, our study does not claim to replace molecular testing. Rather, we propose a non-invasive screening tool that can:

  1. Prioritize patients for molecular testing
  2. Provide preliminary information when tissue is unavailable
  3. Guide surgical planning and treatment decisions in time-sensitive situations
  4. Serve as an adjunct to molecular diagnosis

Furthermore, the reviewer acknowledges that "IDH-mutant and IDH-wildtype gliomas have different MR images." Imaging phenotypic differences associated with molecular subtypes have been repeatedly reported. Our approach leverages these population-level imaging correlates without making direct molecular inferences.

Comments 4: Therefore, even the title is misleading, since the authors cannot detect IDH mutations.

Response 4: We acknowledge this concern and are willing to revise the title to reflect our methodology more accurately: "IDH Mutation Assessment in Gliomas from Anatomical MRI Using Deep Learning: A Comparative Analysis of Centralized and Federated Learning Frameworks". The revised wording clarifies that our method assesses, rather than definitively detects, IDH mutations.

Comments 5: The 1p/19q co-deleted oligodendrogliomas and the IDH-wildtype astrocytomas should be excluded from the study, since no diagnostic molecular data are available for these datasets.

Response 5: We have excluded both oligodendrogliomas and IDH-wildtype astrocytomas from our study, as detailed in our Materials section and illustrated in Figure 1.

Specifically:

  1. Oligodendrogliomas (IDH-mutant, 1p/19q co-deleted): The UCSF-PDGM dataset contains only 13 oligodendrogliomas compared to 398 glioblastomas (IDH-wildtype). Given this extreme class imbalance, oligodendrogliomas were excluded to avoid introducing bias and instability into model training and evaluation.
  2. IDH-wildtype astrocytomas: The dataset includes 24 astrocytomas labeled as IDH-wildtype, without accompanying molecular markers. Due to incomplete molecular characterization and the resulting diagnostic ambiguity per the WHO 2021 criteria, these cases were excluded to ensure a more reliable, biologically coherent cohort.

The complete exclusion criteria and patient flow are detailed in Figure 1 of our manuscript. After these exclusions, we retrained our model on the refined cohort and updated the Results section accordingly. The revised results reflect a more homogeneous, molecularly well-characterized study population, thereby strengthening the validity and reproducibility of our findings. We have added: (page 4, paragraph 2, lines 146-153)

The UCSF-PDGM dataset is dominated by glioblastomas (IDH-wt; n = 398) and contains a highly imbalanced number of oligodendrogliomas (IDH-mut, 1p/19q co-deleted, n = 13). Given this extreme class imbalance, oligodendrogliomas were excluded to avoid introducing bias and instability into the reported results.
In addition, the dataset includes 24 astrocytomas labeled as IDH-wt without any accompanying molecular markers. Due to incomplete molecular characterization and the resulting diagnostic ambiguity, IDH-wt astrocytomas were excluded to comply with the WHO 2021 classification requirements. The patient cohort distribution and the exclusion criteria are shown in Figure 1.

 

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

Response 1:
I respectfully disagree with the authors' response:
IDH mutation has a prognostic impact, because it defines 2 different entities (glioblastoma, astrocytoma) with different prognosis. It is a diagnostic factor! In lung cancer, small cell lung cancers and adenocarcinomas of the lung also have different prognoses. Would you then argue that RB1 inactivation is a prognostic marker in lung cancer? 

Response 2:
In a clinical trial like the INDIGO study, it is clear that IDH mutations must be present. For this, a molecular diagnosis is essential (in INDIGO, a special assay based on the Oncomine Dx Target Test). In your text, you argued: "Although IDH-mut and IDH-wt gliomas represent biologically distinct tumor entities with characteristic imaging features (sic!), IDH mutation status remains an independent prognostic and therapeutic marker." "Prognostic": see Response 1, "therapeutic": the treatment of glioblastoma and astrocytoma is different.

Response 3:
I did not state: "The reviewer correctly states that a definitive diagnosis requires molecular confirmation." but "an IDH mutation can only be diagnosed using molecular methods". The diagnosis does not require a molecular confirmation, the diagnosis is an integrated diagnosis based on molecular data.
The authors further state: "However, our study does not claim to replace molecular testing" but the title is "IDH Mutation Assessment in Gliomas from Anatomical MRI Using Deep Learning: A Comparative Analysis of Centralized and Federated Learning Frameworks".

 

Author Response

Comments 1:
I respectfully disagree with the authors' response:
IDH mutation has a prognostic impact, because it defines 2 different entities (glioblastoma, astrocytoma) with different prognosis. It is a diagnostic factor! In lung cancer, small cell lung cancers and adenocarcinomas of the lung also have different prognoses. Would you then argue that RB1 inactivation is a prognostic marker in lung cancer? 

Response 1: We appreciate the reviewer's perspective and, in accordance with their recommendation, have removed the sentence: 

“Although IDH-mut and IDH-wt gliomas represent biologically distinct tumor entities with characteristic imaging features, IDH mutation status remains an independent diagnostic and prognostic marker.” (page 2, paragraph 2, lines 53-56)

 

Comments 2:
In a clinical trial like the INDIGO study, it is clear that IDH mutations must be present. For this, a molecular diagnosis is essential (in INDIGO, a special assay based on the Oncomine Dx Target Test). In your text, you argued: "Although IDH-mut and IDH-wt gliomas represent biologically distinct tumor entities with characteristic imaging features (sic!), IDH mutation status remains an independent prognostic and therapeutic marker." "Prognostic": see Response 1, "therapeutic": the treatment of glioblastoma and astrocytoma is different.

Response 2: We acknowledge the reviewer's concern. As noted in Response 1, the disputed sentence has been removed from the revised manuscript.

“Although IDH-mut and IDH-wt gliomas represent biologically distinct tumor entities with characteristic imaging features, IDH mutation status remains an independent diagnostic and prognostic marker.” (page 2, paragraph 2, lines 53-56)

Comments 3:
I did not state: "The reviewer correctly states that a definitive diagnosis requires molecular confirmation." but "an IDH mutation can only be diagnosed using molecular methods". The diagnosis does not require a molecular confirmation, the diagnosis is an integrated diagnosis based on molecular data.
The authors further state: "However, our study does not claim to replace molecular testing" but the title is "IDH Mutation Assessment in Gliomas from Anatomical MRI Using Deep Learning: A Comparative Analysis of Centralized and Federated Learning Frameworks".

Response 3: We appreciate the reviewer's clarification and apologize for any misinterpretation of their earlier comment.

Regarding the title, we respectfully maintain that the term "assessment" accurately reflects the scope and intent of our study. Our work aims to provide a non-invasive predictive tool for IDH mutation status, not to replace integrated molecular diagnosis. The term "assessment" is deliberately chosen to convey prediction or evaluation, which is consistent with established terminology in the radiogenomics literature. We believe the title does not imply any claim to replace molecular testing.
