Open Access Review

Peer-Review Record

Machine Learning in MRI Brain Imaging: A Review of Methods, Challenges, and Future Directions

Diagnostics 2025, 15(21), 2692; https://doi.org/10.3390/diagnostics15212692

by Martyna Ottoni^1,2,*, Anna Kasperczuk^1 and Luis M. N. Tavora^2,3

Reviewer 1: Anonymous

Reviewer 2: Paul-Andrei Stefan

Submission received: 25 September 2025 / Revised: 15 October 2025 / Accepted: 21 October 2025 / Published: 24 October 2025

(This article belongs to the Special Issue Brain/Neuroimaging 2025–2026)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Author,

This paper is a review article that aims to survey studies conducted in the field of using machine learning and deep learning methods for human central nervous system imaging. With the advancement of artificial intelligence, a review study is highly beneficial for providing a fresh perspective on recent progress, particularly Table 2, which offers a comprehensive overview of previous studies. However, the manuscript requires several revisions, and a list of my comments is provided below.

1. Numerous review studies have already been conducted on artificial intelligence and machine learning in neuroimaging. What, precisely, are the specific advantages and key differentiators of your study compared to these previous works?

2. The number of articles excluded from the study should be reported in greater detail.

3. Table 1 includes studies that exclusively utilized datasets from T1-weighted, T2-weighted, FLAIR, and post-contrast T1-weighted MRI images. Please add studies to this table that have employed other MRI sequences or weightings, such as DWI and PWI.

4. In the section on machine learning studies, it is recommended to incorporate recent studies that utilize AutoML (Automated Machine Learning) methods. https://doi.org/10.3390/jimaging11100336

5. It is suggested that Section 5 address the issue of data imbalance in AI studies, briefly outlining the methods used to mitigate it.

6. It is recommended that the conclusion identify the most effective network architectures for both segmentation and classification tasks.

Best Regards.

Author Response

Comment 1: Numerous review studies have already been conducted on artificial intelligence and machine learning in neuroimaging. What, precisely, are the specific advantages and key differentiators of your study compared to these previous works?

Response 1: We agree that the distinct features of our review should be clearly articulated. Accordingly, we have revised the Introduction section by adding a new paragraph at the end (page 3, lines 137–145 in the revised manuscript). This addition highlights the novelty of our work compared to previous reviews.
Specifically, the text emphasizes that our study:

focuses on works published between 2020 and 2025,
integrates both classical ML and modern deep learning approaches, including hybrid and transformer-based models,
provides a structured overview of key studies and methodologies, and
discusses emerging challenges such as data imbalance, reproducibility, and the role of AutoML.

These points clarify how our review provides a more updated and methodologically diverse synthesis of recent research in the field.

Comment 2: The number of articles excluded from the study should be reported in greater detail.

Response 2: We have expanded the Materials and Methods section (page 4, lines 181–185) to include a detailed numerical summary of the selection process. The revised text now reports the total number of records retrieved, duplicates removed, records screened, and full-text articles assessed and included.

Comment 3: Table 1 includes studies that exclusively utilized datasets from T1-weighted, T2-weighted, FLAIR, and post-contrast T1-weighted MRI images. Please add studies to this table that have employed other MRI sequences or weightings such as DWI, PWI etc.

Response 3: We appreciate the reviewer’s insightful suggestion. In response, we have expanded Table 1 by adding two additional studies (Park et al. [47] and Gates et al. [48]) that employ advanced MRI sequences such as Diffusion-Weighted Imaging (DWI) and Perfusion-Weighted Imaging (PWI). These additions ensure broader coverage of multimodal MRI datasets beyond conventional sequences (T1, T2, FLAIR, T1c).

To reflect these updates in the text, we have also added a short paragraph at the end of Section 3.2 (lines 243–251) discussing the importance of incorporating functional MRI modalities (DWI, PWI) and their contribution to improved prognostic and grading performance.

Comment 4: In the section on machine learning studies, it is recommended to incorporate recent studies that utilize AutoML (Automated Machine Learning) methods. https://doi.org/10.3390/jimaging11100336

Response 4: In response, we have added a new paragraph at the end of Section 4.4 (lines 829–841) introducing AutoML as an emerging approach in neuroimaging. This paragraph describes the recent study by Khorasani et al. [56]. This addition highlights AutoML’s growing role in automating model optimization, improving reproducibility, and accelerating clinical translation.

Comment 5: It is suggested that Section 5 addresses the issue of Data Imbalance in AI studies, briefly outlining the methods used to mitigate it.

Response 5: We expanded Section 5.1 (paragraph 4, lines 868-888) to include a new paragraph discussing the issue of data imbalance as a major methodological challenge in AI-based neuroimaging. The added text briefly outlines commonly applied mitigation strategies, including data augmentation, oversampling techniques (e.g., SMOTE), GAN-based sample synthesis, and algorithm-level approaches such as cost-sensitive learning, class weighting, and specialized loss functions (Focal and Dice Loss).
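
As an illustration of two of the algorithm-level strategies named in this response (class weighting and Focal Loss), the following is a minimal sketch, not code from the manuscript. The function names, the inverse-frequency weighting rule, the default `gamma` and `alpha` values, and the toy label counts are all assumptions chosen for the example.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed weighting rule: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given the predicted probability of its true class.

    The factor (1 - p)^gamma shrinks the loss of well-classified samples,
    so hard (often minority-class) samples dominate the total.
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


# Hypothetical imbalanced label set: 90 majority vs 10 minority samples.
labels = ["healthy"] * 90 + ["tumor"] * 10
weights = inverse_frequency_weights(labels)

# The minority class receives the larger weight.
print(weights)
# A confident correct prediction is penalized far less than an uncertain one.
print(focal_loss(0.9), focal_loss(0.1))
```

With `gamma=0` and `alpha=1` the focal loss reduces to ordinary cross-entropy, which makes it easy to compare the two settings on the same data.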

Comment 6: It is recommended that the conclusion identifies the most effective network architectures for both segmentation and classification tasks.

Response 6: The Conclusions section has been expanded to explicitly identify the most effective architectures reported in the reviewed studies for both classification and segmentation tasks. Specifically, new paragraphs (paragraphs 2-5, lines 1010-1041) now summarize top-performing models such as the Swin Transformer, CNN-based hybrid ensembles (DBFS-EC, ResNet18 with CART-ANOVA), and U-Net–derived architectures (Multi-Scale Attention U-Net, MUNet, and 3D CNN–U-Net ensembles). These additions provide a concise synthesis of the highest-performing models across both categories, directly addressing the reviewer’s recommendation.

Reviewer 2 Report

Comments and Suggestions for Authors

dear authors,

thank you very much for allowing me to express my opinions related to your work. as a researcher myself, i admire and respect the effort you put into constructing your study and building this manuscript.

below, you can find my comments regarding certain issues. i hope these comments will help you improve both your current and future work.

abstract

the abstract says it’s a “comprehensive review” but it doesn’t really explain how. there’s no info about the number of papers or what databases were used properly. should be clarified if this is systematic or narrative.
they wrote they searched Mendeley, which is not actually a database. probably they meant they used it just to store papers. better to replace with something real like Embase or Web of Science.
claims like “cnn dominate the field” are vague and repetitive. it would help to show a few actual numbers — for example typical accuracy or dice scores — instead of general adjectives like “promising”.

introduction

the introduction feels too general. it repeats what’s already well known (ai helps radiology, manual analysis is slow) but doesn’t say what the review adds new.
some sentences are very textbook-like, e.g. “automatic detection remains a challenge”. needs more specific info, maybe examples or references about segmentation errors or interobserver variation.

method

search strategy is not well described. it lists only PubMed and Scopus (and again Mendeley...) but not the search strings or time frame. this makes the review hard to reproduce.
the authors mention “validation” as inclusion criteria but they don’t explain what that means exactly. internal cv? external test set? this must be defined.
the exclusion of “inaccessible full texts” is a bit problematic. could introduce bias if not handled correctly. they should mention if they tried to contact authors or request the papers.
they started from 2178 records, then 1595 after duplicates, and stop there. we never find out how many were finally included. a prisma flowchart would solve this easily.
the chatgpt mention is fine, but they should clearly state that no part of the scientific content was written by ai, only grammar or formatting.

results / main content

dataset description table is inconsistent. some rows incomplete or misaligned. should include dataset name, modalities, sample size and data splits in a uniform way.
reported accuracies like “99.62%” seem way too high and probably due to overfitting. better to provide context (dataset size, type of validation).
parts about svm, random forest, cnn etc. are mostly descriptive, just list the methods. no comparative insight, no mention of which performs best in which scenario.
claiming “100% accuracy” without specifying dataset or sample size looks suspicious. the authors must explain these cases carefully and warn readers about overfitting.

discussion

the discussion just repeats results without deeper critique. should include the main limitations found in the reviewed studies like small samples or data leakage.
interpretability is only mentioned in passing. could add short examples like Grad-CAM or LIME, and why they are not yet reliable for clinical decision making.
ethics and bias part is too brief. they could discuss dataset imbalance, fairness, and transparency issues — that would make the paper more complete.

the conclusions sound generic, restating that “deep learning achieved high performance” without any real synthesis. it would be better to include actual numbers and a short comparative message, like “most cnn achieve dice around 0.9 in BraTS dataset”.

the list of challenges misses reproducibility and reporting standards. they should recommend using CLAIM or CONSORT-AI guidelines and encourage sharing code and dataset splits.

some references appear truncated or inconsistent, maybe due to formatting errors. double check that all entries are complete, have DOIs, and match in-text citations.

thank you very much for allowing me to express my opinions.

sincerely,

Author Response

Comment 1: the abstract says it’s a “comprehensive review” but it doesn’t really explain how. there’s no info about the number of papers or what databases were used properly. should be clarified if this is systematic or narrative.

Response 1: The Abstract has been revised to clarify the review type and to provide specific methodological details. The phrase “comprehensive review” was replaced with “narrative review” to reflect the study design accurately. Additionally, the sentence describing the literature search was expanded to specify that PubMed and Scopus were used as databases, while the Mendeley Catalog was identified as a publicly accessible bibliographic catalog linked to Elsevier’s Scopus indexing system. The total number of included studies (108) was also added. These changes can be found in the Abstract – Background/Objectives (lines 15–16) and the Abstract – Methods section (lines 17–20 and lines 25–26).

Comment 2: they wrote they searched Mendeley, which is not actually a database. probably they meant they used it just to store papers. better to replace with something real like Embase or Web of Science.

Response 2: As clarified in the revised Abstract and Materials and Methods section, PubMed and Scopus were used as the main databases, while the Mendeley Catalog was described as a publicly accessible bibliographic catalog linked to Elsevier’s Scopus indexing system, not as an independent database. This clarification was also addressed in Response 1.

Comment 3: claims like “cnn dominate the field” are vague and repetitive. it would help to show a few actual numbers — for example typical accuracy or dice scores — instead of general adjectives like “promising”

Response 3: We thank the reviewer for this valuable suggestion. Quantitative performance results have been added to the Results section of the Abstract to replace general statements with specific numerical ranges. These changes can be found in lines 28–33.

Comment 4: the introduction feels too general. it repeats what’s already well known (ai helps radiology, manual analysis is slow) but doesn’t say what the review adds new.

Response 4: We appreciate this observation. The Introduction has been revised to include a new paragraph clarifying the novelty and specific contributions of this review compared to previous studies. This addition, located at the end of the Introduction (page 2, lines 138–146 in the revised manuscript), highlights the focus on recent works (2020–2025), integration of classical ML and deep learning methods (including hybrid and transformer-based models), and discussion of emerging challenges such as data imbalance, reproducibility, and AutoML. This revision addresses the reviewer’s concern and aligns with a similar comment provided by Reviewer 1.

Comment 5: some sentences are very textbook-like, e.g. “automatic detection remains a challenge”. needs more specific info, maybe examples or references about segmentation errors or interobserver variation.

Response 5: The second paragraph of the Introduction (lines 51–63 in the revised manuscript) has been substantially revised to replace general statements with specific information and supporting references. The updated text now includes details regarding inter-observer variability, tumor heterogeneity, irregular segmentation boundaries, data scarcity, and scanner-related artifacts that affect model performance.

Comment 6: search strategy is not well described. it lists only PubMed and Scopus (and again Mendeley...) but not the search strings or time frame. this makes the review hard to reproduce.

Response 6: The Materials and Methods section has been revised to provide a clearer and more reproducible description of the search strategy (page 4, lines 162–170). The role of the Mendeley Catalog (MC) was clarified as a public, read-only bibliographic catalog linked to Elsevier’s Scopus indexing system. In addition, Boolean search strings were added, and the time frame was refined to specify the exact period from January 2020 to April 2025.

Comment 7: the authors mention “validation” as inclusion criteria but they don’t explain what that means exactly. internal cv? external test set? this must be defined.

Response 7: This comment has been addressed in the Materials and Methods section (lines 173–175). A short explanation was added in parentheses to clarify that “validation” refers to model evaluation, including internal (e.g., k-fold CV, train/test split) or external validation.

Comment 8: the exclusion of “inaccessible full texts” is a bit problematic. could introduce bias if not handled correctly. they should mention if they tried to contact authors or request the papers.

Response 8: This issue was clarified in lines 179–180. A sentence was added explaining that, when full texts were inaccessible, the authors attempted to obtain them. If access was not granted, comparable accessible papers were included to minimize selection bias.

Comment 9: they started from 2178 records, then 1595 after duplicates, and stop there. we never find out how many were finally included. a prisma flowchart would solve this easily.

Response 9: This comment overlaps with a previous observation from Reviewer 1. The Materials and Methods section was already expanded to include a detailed numerical summary of the selection process (page 4, lines 181–185). The revised text now reports the total number of records retrieved, duplicates removed, records screened, full-texts assessed, and the final 108 studies included in the review.

Initially, we planned to include a PRISMA flowchart to visualize the study selection process. However, since the manuscript has now been explicitly defined as a narrative review, the PRISMA diagram was not incorporated at this stage, as it is typically associated with systematic reviews. Nevertheless, we remain open to including the flowchart if the editorial team considers that it would enhance the transparency and clarity of the selection process.

Comment 10: the chatgpt mention is fine, but they should clearly state that no part of the scientific content was written by ai, only grammar or formatting.

Response 10: This point was already addressed in the previous version of the manuscript, where it was clearly stated that ChatGPT was used exclusively for grammar and style suggestions. To further emphasize this, one additional sentence was added in the Materials and Methods section (lines 187–189) stating that no part of the scientific content, data analysis, or interpretation was generated by AI.

Comment 11: dataset description table is inconsistent. some rows incomplete or misaligned. should include dataset name, modalities, sample size and data splits in a uniform way.

Response 11: The table has now been fully revised to ensure a uniform and consistent structure. All rows have been completed and aligned. Each entry now includes the dataset name, imaging modalities, total sample size, and data split or validation method in a standardized format. Where the original publications did not report specific details (e.g., image resolution or data size), the corresponding cells were intentionally left blank to avoid introducing unverified information.

Comment 12: reported accuracies like “99.62%” seem way too high and probably due to overfitting. better to provide context (dataset size, type of validation).

Response 12: Additional context regarding the interpretation of very high reported accuracies has been added at the end of Section 4.4 (lines 816-828). The new paragraph explains that such results are often linked to small datasets, slice-wise rather than patient-wise validation, or extensive data augmentation, which may lead to overfitting and limited clinical generalizability. The revision clarifies that dataset size and validation strategy are critical factors for assessing the robustness of machine learning models.
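
The difference between slice-wise and patient-wise validation noted in this response can be sketched as follows. This is a hedged illustration, not the procedure used in any reviewed study: the function name `patient_wise_split`, the 20% test fraction, the fixed seed, and the toy patient identifiers are assumptions. The point is only that splitting by patient, rather than by individual slice, keeps all slices of one patient on the same side of the split.

```python
import random


def patient_wise_split(patient_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no patient appears in both sets.

    patient_ids[i] is the patient that slice i belongs to.
    """
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


# Hypothetical toy data: 5 patients with 4 MRI slices each.
slice_patient_ids = [f"P{n}" for n in range(5) for _ in range(4)]
train_idx, test_idx = patient_wise_split(slice_patient_ids)

train_patients = {slice_patient_ids[i] for i in train_idx}
test_patients = {slice_patient_ids[i] for i in test_idx}
# An empty intersection means no patient leaks from training into testing.
print(train_patients & test_patients)
```

A slice-wise random split of the same data would typically place neighbouring slices of one patient in both sets, which is one way the inflated accuracies described above can arise.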

Comment 13: parts about svm, random forest, cnn etc. are mostly descriptive, just list the methods. no comparative insight, no mention of which performs best in which scenario.

Response 13: Section 4.4 (Comparative Overview of ML Methods) has been substantially expanded to include a detailed comparative analysis of the main algorithmic groups. Specifically, paragraphs 1–6 (lines 747–793) were added to critically discuss the performance of SVM, Random Forest, CNN, and hybrid architectures under various data and task conditions. This new content emphasizes which methods perform best in specific scenarios (e.g., small datasets, multi-class classification, multimodal MRI, limited computational resources). The section now provides explicit comparative insight supported by quantitative results and literature references.

Comment 14: claiming “100% accuracy” without specifying the dataset or sample size looks suspicious. the authors must explain these cases carefully and warn readers about overfitting.

Response 14: This issue has already been addressed in the newly added discussion at the end of Section 4.4 (lines 816–828), as noted in Response 12. The added paragraph explains that reported 100% accuracies were obtained on very small datasets and clarifies the associated risk of overfitting. The revision explicitly warns readers about the limited clinical generalizability of such results.

Comment 15: the discussion just repeats results without deeper critique. should include the main limitations found in the reviewed studies like small samples or data leakage.

Response 15: The Discussion section has been expanded (lines 944–955) to provide a more critical analysis of the reviewed studies. The revised text now highlights key methodological limitations such as small dataset sizes, excessive data augmentation, and potential data leakage between training and testing subsets. It also includes examples illustrating the difference between image-level and patient-level validation, emphasizing their impact on model generalization and reliability.

Comment 16: interpretability is only mentioned in passing. could add short examples like Grad-CAM or LIME, and why they are not yet reliable for clinical decision making.

Response 16: The Discussion section has been revised (paragraph 6, lines 973–984) to include a detailed paragraph on model interpretability. The updated text introduces commonly used explainability techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-agnostic Explanations (LIME). It explains their principles, illustrates their application in MRI-based tumor analysis, and discusses current limitations that hinder their reliability for clinical diagnostic decision-making.

Comment 17: ethics and bias part is too brief. they could discuss dataset imbalance, fairness, and transparency issues — that would make the paper more complete.

Response 17: The Discussion section has been expanded (paragraph 7, lines 985-999) to address ethical and bias-related aspects in greater depth. The revised text discusses key issues such as dataset imbalance, fairness, and transparency in AI-driven diagnostics. It explains how class imbalance contributes to model bias toward majority classes and outlines mitigation strategies including data augmentation, GAN-based sample generation, class weighting, and Dice Loss. Additionally, the text highlights the impact of non-standardized imaging protocols and scanner-dependent variability on fairness and generalization across institutions.

Comment 18: the conclusions sound generic, restating that “deep learning achieved high performance” without any real synthesis. it would be better to include actual numbers and a short comparative message, like “most cnn achieve dice around 0.9 in BraTS dataset”.

Response 18: We appreciate this valuable suggestion. This point has already been addressed through the previous revision made in response to Reviewer 1 (paragraphs 2-5, lines 1010-1041). The Conclusions section was expanded to provide a concise synthesis of the reviewed findings, specifying the most effective architectures for both classification and segmentation tasks. The updated version includes explicit references to top-performing models such as the Swin Transformer, CNN-based hybrid ensembles (DBFS-EC, ResNet18 with CART-ANOVA), and U-Net–derived architectures (Multi-Scale Attention U-Net, MUNet, and 3D CNN–U-Net ensembles), along with their reported accuracies and Dice coefficients.
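
For readers less familiar with the Dice coefficient referred to in this exchange, the sketch below computes it for two binary masks as 2|A∩B| / (|A| + |B|). It is a minimal illustration with hypothetical masks, not code from the manuscript; returning 1.0 when both masks are empty is an assumed convention, and implementations differ on that edge case.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical flattened masks: 1 marks a tumor voxel, 0 marks background.
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [1, 1, 0, 0, 0, 1]

# Two shared voxels, three foreground voxels in each mask: 2 * 2 / (3 + 3).
print(dice_coefficient(pred_mask, true_mask))
```

Because the score depends only on foreground overlap, it is less affected by large background regions than plain voxel accuracy, which is why it is commonly reported for segmentation.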

Comment 19: the list of challenges misses reproducibility and reporting standards. they should recommend using CLAIM or CONSORT-AI guidelines and encourage sharing code and dataset splits.

Response 19: The Conclusions section has been expanded (lines 1045–1051) to address reproducibility and reporting standards. The revision introduces a new paragraph emphasizing the importance of adhering to established frameworks such as CLAIM (Checklist for Artificial Intelligence in Medical Imaging) and CONSORT-AI guidelines. It also recommends open sharing of code, dataset partitions, and preprocessing protocols to enhance transparency, comparability, and reproducibility in future studies.

Comment 20: some references appear truncated or inconsistent, maybe due to formatting errors. double check that all entries are complete, have DOIs, and match in-text citations.

Response 20: We audited the reference list for completeness and consistency. Journal titles and capitalization were standardized. DOIs were verified and added where available; stable URLs with access dates were inserted when DOIs were not assigned. Formatting breaks were corrected and conference entries harmonized. In-text citations now align with the final numbering.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for providing the revised manuscript. I think the revised manuscript has improved significantly based on the previous comments. You have answered all of my comments.

Best regards.

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

I would like to thank you for carefully considering my comments and suggestions during the revision process. I appreciate the effort you invested in improving the manuscript, and I am glad to see the revisions have strengthened the paper.

Wishing you success with the publication.
