Peer-Review Record

A Robust ConvNeXt-Based Framework for Efficient, Generalizable, and Explainable Brain Tumor Classification on MRI

Bioengineering 2026, 13(2), 157; https://doi.org/10.3390/bioengineering13020157

by Kirti Pant¹, Pijush Kanti Dutta Pramanik²,* and Zhongming Zhao³,*

Reviewer 1: Abdussalam Elhanashi

Reviewer 2: Anonymous

Submission received: 8 January 2026 / Revised: 23 January 2026 / Accepted: 24 January 2026 / Published: 28 January 2026

(This article belongs to the Section Biosignal Processing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This article presents a ConvNeXt-based framework for brain tumor classification from MRI images, addressing challenges of variability, generalization, and interpretability in existing deep learning models. The study emphasizes cross-dataset evaluation, systematic benchmarking, and explainable AI to support robust and clinically reliable automated diagnostic systems.

The manuscript requires MAJOR revision based on the following comments:

How would the proposed model generalize to real-world clinical MRI data acquired from different hospitals and scanners? Please explain this in the introduction.
The framework is limited to 2D MRI slices instead of leveraging 3D volumetric or multimodal MRI information. Please explain this further.
What is the research gap addressed by this study?
How sensitive is the model’s performance to variations in preprocessing, normalization, and augmentation strategies?
Could the high reported accuracy indicate potential data leakage or hidden dataset bias despite cross-dataset testing?
Why are hyperparameter choices and training configurations not justified through systematic optimization or sensitivity analysis? Please explain the purpose of each parameter in the methodology.
How does the model perform when only a small dataset is available? Please explain this.
The authors should reduce repetition and improve structural clarity, particularly between the introduction and related work sections.
How feasible is real-time deployment given computational requirements in low-resource clinical environments?
What future extensions could integrate segmentation, tumor grading, or radiogenomic prediction into the proposed framework?

Author Response

Response to Reviewer #1:

The manuscript requires MAJOR revision based on the following comments:

Comment 1: How would the proposed model generalize to real-world clinical MRI data acquired from different hospitals and scanners? Please explain this in the introduction.

Response: We appreciate the reviewer’s emphasis on real-world clinical generalization, which is indeed a critical requirement for medical imaging systems. In the revised manuscript, we have clarified that while this study does not yet include prospective multi-institutional clinical data, generalization has been explicitly addressed through cross-dataset evaluation using three independently curated MRI datasets with differing acquisition characteristics. The consistent performance observed across these datasets suggests resilience to dataset shift and scanner variability within the limits of publicly available data.

We further acknowledge that true clinical generalization requires validation on multi-center, multi-vendor MRI cohorts and prospective deployment scenarios. This aspect is explicitly stated in the Discussion (Section 5) and in the Limitations and Future Work (Section 6) to outline concrete steps toward real-world clinical validation, including external institutional testing and analysis of protocol heterogeneity.

Comment 2: The framework is limited to 2D MRI slices instead of leveraging 3D volumetric or multimodal MRI information. Please explain this further.

Response: Thank you for raising this important point. The current framework intentionally adopts a 2D slice-based formulation rather than a full 3D or multimodal MRI setup. This choice was driven by practical and methodological considerations rather than technical limitations.

First, a 2D approach keeps the model computationally efficient and scalable, aligning with the study’s focus on deployability in real-world clinical environments, including settings with limited computational resources. Full 3D CNNs or transformer-based volumetric models substantially increase memory and training requirements and often require larger, consistently annotated datasets, which are not always available in routine clinical practice.

Second, the publicly available datasets used in this study differ in acquisition protocols, slice thickness, and completeness of volumetric coverage. A slice-wise formulation enables uniform preprocessing and fair cross-dataset evaluation without introducing dataset-specific interpolation or resampling biases that could confound generalization analysis.

Third, prior studies have shown that well-designed 2D models can achieve performance comparable to 3D approaches for tumor classification tasks, particularly when the objective is category-level diagnosis rather than precise volumetric delineation. In this work, the emphasis is on robust classification, statistical validation, and interpretability across heterogeneous datasets, for which a 2D formulation is well suited.

That said, we fully acknowledge that 3D and multimodal MRI inputs (e.g., T1, T1c, T2, FLAIR) can provide richer spatial and contextual information. Extending the proposed framework to volumetric or multimodal learning represents a natural and important direction for future work, particularly for tumor grading, progression analysis, and radiogenomic modeling. We have clarified this rationale in the Introduction (Section 1) and noted the corresponding limitation in the Limitations and Future Work section (Section 6).

Comment 3: What is the research gap addressed by this study?

Response: Thank you for raising this point. We would like to clarify that the research gap addressed by this study is already explicitly identified in both the Related Work section and the latter part of the Introduction.

In the Related Work section, we highlight that most existing brain tumor classification studies (i) rely on single-dataset evaluation, (ii) emphasize accuracy without statistical validation, (iii) overlook computational efficiency and deployment feasibility, and (iv) provide limited or single-method interpretability analysis. These gaps are discussed in the context of recent CNN, transformer, and hybrid approaches.

Similarly, in the Introduction, the research gap is framed around the limited exploration of ConvNeXt-based architectures for brain tumor MRI analysis, particularly from a clinically grounded perspective that jointly considers cross-dataset generalization, statistical robustness, efficiency, and explainability.

To avoid any ambiguity for readers, we have slightly refined the wording in the Introduction to make the research gap more immediately visible, without adding new claims or technical content.

Comment 4: How sensitive is the model’s performance to variations in preprocessing, normalization, and augmentation strategies?

Response: This is an important question, and we appreciate the reviewer’s focus on robustness with respect to data handling choices. In the present study, sensitivity to preprocessing, normalization, and augmentation was examined indirectly through ablation experiments and training stability analysis, rather than via an exhaustive sweep over all possible preprocessing configurations.

Across all experiments, a fixed and standard ImageNet-based normalization scheme was used, and ConvNeXt Base maintained consistently high performance when evaluated across multiple independent datasets. Specifically, the model achieved accuracy values of 99.83% (D2), 99.69% (D3), and 99.86% (D4) under the same preprocessing and augmentation pipeline, indicating strong cross-dataset stability despite differences in data source and acquisition characteristics.
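To make the fixed normalization concrete, here is a minimal sketch in plain Python. The per-channel mean and standard deviation are the values commonly published for ImageNet-pretrained models. The function name `normalize_pixel`, the scaling of 8-bit intensities to [0, 1], and the replication of a grayscale slice across three channels are assumptions made for illustration; the manuscript's own pipeline is not reproduced here.

```python
# Commonly used ImageNet channel statistics (RGB order).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(gray_value: int) -> tuple:
    """Normalize one 8-bit grayscale intensity into three channels.

    Assumption: the grayscale MRI slice is replicated to three channels
    so that RGB-pretrained weights can be reused.
    """
    if not 0 <= gray_value <= 255:
        raise ValueError("expected an 8-bit intensity in [0, 255]")
    scaled = gray_value / 255.0  # map to [0, 1]
    # Standardize each channel with the ImageNet statistics.
    return tuple((scaled - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD))


if __name__ == "__main__":
    print(normalize_pixel(128))
```

Because the same constants are applied to every dataset, any difference in accuracy across D2, D3, and D4 cannot be attributed to dataset-specific normalization.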

The ablation analysis further supports this robustness. Removing data augmentation resulted in a modest and consistent performance reduction (approximately 0.9–1.5% across datasets), while changes in input resolution led to minimal degradation (below 0.3%). These variations remain well within acceptable bounds for reliable deployment and suggest that the model does not depend on aggressively tuned or dataset-specific preprocessing strategies. In addition, the close alignment of training and validation curves across datasets indicates that the chosen normalization and preprocessing pipeline does not induce overfitting or brittle learning behavior.

While alternative normalization schemes or more extensive augmentation policies could introduce minor metric variations, the observed stability across datasets, ablation settings, and repeated training runs suggests that ConvNeXt Base is not overly sensitive to these design choices. We have clarified this interpretation in the Discussion (Section 5.3) and explicitly identified a systematic preprocessing sensitivity analysis as part of future work (Section 6).

Comment 5: Could the high reported accuracy indicate potential data leakage or hidden dataset bias despite cross-dataset testing?

Response: We agree that unusually high accuracy warrants careful scrutiny, and we thank the reviewer for raising this point. In the present study, all datasets were treated as independent sources, with no shared images or overlapping splits across D2, D3, and D4. Training and evaluation were conducted using dataset-level separation rather than random pooling, which substantially reduces the risk of conventional train–test leakage.
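One simple way to check the "no shared images" claim is to hash every image and look for digests that occur in more than one dataset. The sketch below is illustrative only: the function names and the in-memory `datasets` layout are hypothetical, and it detects byte-identical files only, so resized or re-encoded copies of the same slice would go unnoticed.

```python
import hashlib


def image_digest(pixel_bytes: bytes) -> str:
    """Return a SHA-256 digest of the raw image bytes."""
    return hashlib.sha256(pixel_bytes).hexdigest()


def find_shared_images(datasets: dict) -> dict:
    """Map each digest found in more than one dataset to its occurrences.

    `datasets` is a hypothetical layout: {dataset_name: {image_id: raw_bytes}}.
    """
    seen = {}
    for ds_name, images in datasets.items():
        for image_id, data in images.items():
            seen.setdefault(image_digest(data), []).append((ds_name, image_id))
    # Keep only digests that appear in at least two different datasets.
    return {
        digest: occurrences
        for digest, occurrences in seen.items()
        if len({ds for ds, _ in occurrences}) > 1
    }


if __name__ == "__main__":
    toy = {
        "D1": {"a.png": b"\x00\x01", "b.png": b"\x02\x03"},
        "D2": {"c.png": b"\x02\x03"},  # byte-identical to D1/b.png
    }
    for digest, occurrences in find_shared_images(toy).items():
        print(digest[:12], occurrences)
```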

That said, we acknowledge that publicly curated MRI datasets may still exhibit implicit biases, such as consistent acquisition protocols, image quality filtering, or slice-level correlations within subjects. While cross-dataset evaluation provides a stronger safeguard than single-dataset validation, it does not fully eliminate the possibility of subtle dataset bias. We have therefore explicitly acknowledged this risk in the Limitations section and clarified that further patient-level verification and multi-institutional validation are necessary steps toward clinical translation.
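The patient-level verification mentioned above can be sketched as a grouped split, in which all slices from one subject land in the same partition. This is a minimal illustration, assuming a mapping from slice ID to patient ID is available; public slice-level datasets often lack such identifiers, and the function name and parameters here are hypothetical.

```python
import random
from collections import defaultdict


def patient_level_split(slice_to_patient: dict, test_fraction: float = 0.2, seed: int = 0):
    """Split slice IDs so that no patient contributes to both partitions."""
    by_patient = defaultdict(list)
    for slice_id, patient_id in slice_to_patient.items():
        by_patient[patient_id].append(slice_id)

    # Shuffle patients (not slices) with a fixed seed for reproducibility.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [s for p in patients if p not in test_patients for s in by_patient[p]]
    test = [s for p in patients if p in test_patients for s in by_patient[p]]
    return train, test


if __name__ == "__main__":
    mapping = {f"p{p}_s{s}": f"p{p}" for p in range(5) for s in range(3)}
    train_ids, test_ids = patient_level_split(mapping)
    print(len(train_ids), len(test_ids))
```

Splitting by patient rather than by slice removes the within-subject slice correlation that can inflate accuracy under a random slice-level split.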

Comment 6: Why are hyperparameter choices and training configurations not justified through systematic optimization or sensitivity analysis? Please explain the purpose of each parameter in the methodology.

Response: Thank you for this observation. The purpose and rationale of the training hyperparameters and preprocessing choices are explicitly documented in Table 2 (Normalization Parameters) and Table 4 (Hyperparameter Settings of ConvNeXt Base). Our design choice was to prioritize stability, reproducibility, and cross-dataset comparability rather than dataset-specific hyperparameter optimization.

The normalization parameters in Table 2 were selected to align MRI inputs with the statistics of ImageNet-pretrained ConvNeXt weights, ensuring compatibility during transfer learning and stable gradient propagation under heterogeneous MRI intensity distributions. Similarly, the training parameters in Table 4 follow established best practices for fine-tuning pretrained convolutional architectures in medical imaging. The learning rate and optimizer were chosen to support gradual adaptation of pretrained representations without disrupting learned feature hierarchies, the batch size balances gradient stability and GPU memory constraints, and the number of epochs was guided by early convergence behavior observed during pilot experiments.

Rather than performing exhaustive hyperparameter sensitivity analysis, robustness to training and preprocessing choices was evaluated indirectly through repeated runs, ablation experiments, and consistent cross-dataset performance. This approach was adopted deliberately to avoid overfitting hyperparameters to any single dataset and to focus on architectural robustness and generalization. The revised tables clarify the role of each parameter to make this intent explicit.

Comment 7: How does the model perform when only a small dataset is available? Please explain this.

Response: This is an important and fair question. In our study, dataset D2 (Table 1) represents a comparatively small-scale dataset relative to D3 and D4 and therefore serves as a practical proxy for low-data clinical scenarios. Despite its smaller size, ConvNeXt Base achieved consistently high performance on D2, with an overall accuracy of 99.83%, AUC of 1.0, and near-perfect class-wise precision and recall.

This behavior indicates that the model does not rely on large sample volumes to achieve stable performance. The strong results on D2 can be attributed primarily to the use of transfer learning, which enables effective feature reuse from pretrained representations, and to the regularization introduced by data augmentation. Importantly, no performance instability or overfitting behavior was observed on D2, as confirmed by aligned training–validation curves and confusion matrix analysis.

These findings suggest that ConvNeXt Base remains reliable even in limited-data settings commonly encountered in clinical practice, such as rare tumor cohorts or institution-specific datasets. We have clarified this aspect explicitly in the Discussion (Section 5.2).

Comment 8: The authors should reduce repetition and improve structural clarity, particularly between the introduction and related work sections.

Response: We appreciate the reviewer’s observation regarding potential repetition and structural clarity between the Introduction and Related Work sections. We carefully re-examined both sections and made targeted refinements to strengthen their separation in purpose and narrative flow.

Specifically, the Introduction has been retained as the primary location for clinical motivation, problem framing, and high-level positioning of the study, while the Related Work section has been sharpened to function strictly as a structured synthesis of existing literature. In the revised Related Work, we explicitly organize prior studies by families of deep learning approaches applied to MRI-based brain tumor classification. The discussion begins with traditional CNN-based models, progresses through transfer learning–based architectures, advances to transformer and hybrid CNN–transformer methods, and finally includes the limited but emerging body of work involving ConvNeXt-style architectures.

This progression is intentional and reflects the chronological and methodological evolution of deep learning techniques in brain tumor classification, allowing readers to clearly understand how architectural innovations have incrementally addressed earlier limitations. In addition, the Related Work now includes a dedicated discussion of explainable AI (XAI) techniques—such as Grad-CAM–based and attribution-based methods—used in prior brain tumor studies, highlighting how interpretability has been handled in the literature and where gaps remain.

To further reduce overlap, we removed general motivational statements from the Related Work and added a concise transition at the end of the Introduction to clearly signal the shift from problem context to literature analysis. These revisions improve structural clarity and readability while preserving the completeness and technical integrity of the manuscript.

Comment 9: How feasible is real-time deployment given computational requirements in low-resource clinical environments?

Response: This is a valid concern. ConvNeXt Base is not a lightweight architecture, and its computational footprint is higher than mobile-optimized models such as MobileNetV3. However, its selection in this study was guided by clinical reliability and diagnostic robustness rather than extreme resource minimization.

As reported in Sections 4.5 and 5.5, ConvNeXt Base achieves high inference throughput (≈370 FPS) while maintaining moderate GPU memory usage and stable power consumption. These characteristics indicate that real-time or near–real-time inference is feasible on standard clinical GPU workstations. In current clinical practice, GPU-enabled systems are increasingly available in radiology departments, academic hospitals, and centralized imaging facilities, which substantially reduces the practical constraints associated with deploying moderately complex deep learning models.

It is also important to note that brain tumor classification is not a latency-critical task in the same sense as edge or IoT applications such as autonomous driving or continuous physiological monitoring. MRI-based tumor classification is typically performed in offline or semi-offline workflows, where inference time on the order of milliseconds does not constitute a bottleneck. Consequently, the use of ultra-lightweight architectures optimized for edge deployment is not a strict requirement in this clinical context.
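For scale, a throughput of about 370 FPS corresponds to an average of roughly 2.7 ms per image. The sketch below shows that conversion and a generic way to time any prediction callable. The helper names are hypothetical, and the measured figure is an average over batched calls rather than a single-image latency; on a GPU the device would also need to be synchronized before reading the clock.

```python
import time


def fps_to_latency_ms(fps: float) -> float:
    """Convert throughput in frames per second to average ms per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def measure_throughput(predict, batch, n_runs: int = 50, warmup: int = 5) -> float:
    """Return items processed per second by `predict` on `batch`.

    `predict` is any callable taking the batch; warm-up runs are excluded
    from timing so one-off setup cost does not skew the result.
    """
    for _ in range(warmup):
        predict(batch)
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(batch)
    elapsed = time.perf_counter() - start
    if elapsed == 0:
        return float("inf")  # clock resolution too coarse for this workload
    return n_runs * len(batch) / elapsed


if __name__ == "__main__":
    print(f"{fps_to_latency_ms(370):.2f} ms per image")  # 2.70 ms per image
```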

More importantly, brain tumor classification represents a life-critical diagnostic scenario. In such settings, prioritizing classification accuracy, robustness, and interpretability is clinically more appropriate than optimizing solely for computational efficiency. The design choice of ConvNeXt Base reflects this priority, as it provides superior diagnostic consistency and generalization while remaining computationally tractable for routine clinical infrastructure.

For genuinely low-resource environments, this study does not position ConvNeXt Base as a final edge-deployable solution. Instead, it serves as a high-performance reference model. These points have been added to the Discussion (Section 5.5).

Comment 10: What future extensions could integrate segmentation, tumor grading, or radiogenomic prediction into the proposed framework?

Response: This is an important and constructive point. The current study deliberately focuses on robust and generalizable tumor classification as a foundational step, since reliable class discrimination is a prerequisite for downstream tasks. That said, the proposed ConvNeXt-based framework is naturally extensible to more advanced clinical objectives.

First, segmentation can be integrated by coupling ConvNeXt Base with encoder–decoder architectures (e.g., U-Net–style decoders or hybrid CNN–attention decoders), where the pretrained ConvNeXt encoder provides strong multi-scale feature representations for tumor boundary delineation (in fact, we are working on such a study). Second, tumor grading and subtype prediction can be approached through multi-task learning, where shared ConvNeXt features support parallel classification heads for grade or molecular risk stratification. Third, radiogenomic prediction represents a logical extension by fusing ConvNeXt-derived imaging embeddings with clinical or genomic features, enabling non-invasive inference of molecular markers.

We have clarified these extension pathways explicitly in the revised Limitations and Future Scope section to position them as natural progressions of the current framework, rather than remedies for methodological gaps.

We thank Reviewer #1 for all the insightful comments above, which will motivate us to carefully develop more generalizable and robust algorithms. We have acknowledged these valuable comments in the Acknowledgements section of the revised manuscript.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors proposed a brain tumor classification framework based on the ConvNeXt Base structure. The paper conducted very detailed research in the area. The authors used four different datasets to show the efficiency of ConvNeXt Base. The authors explained all technical details.

The paper is well written and designed. The authors examined each possibility during classification and statistical analysis process.

I have some minor issues.

Section 2.2 and part of Section 2.4 were written in a different font. Please revise the sections.
Please make the font size smaller in Table 5 for readers to read the numbers easily.
The captions for figures and tables are in different fonts. Please revise.
Table 3 is in two different places. The first is at Line 547, and the second is at Line 561. Please revise.
Can we consider the input size as one of the ablation steps? The first three ablation steps are okay to implement, but I could not understand why changing the input size is considered an ablation. Please add an explanation to the related section.
The paper is already too long, but please add at least three different images for each dataset to Figure 6. It is better to make the images smaller, and make the text bigger to read easily.
The authors chose D1 for training. D2, D3 and D4 were chosen for testing. If you train the model with D3, and test with the rest, what would be the result? Please explain briefly in a related section.
The paper is too long. It would be better to remove some general information such as Table 2 to increase the readability.
Table 15 could be shortened by deleting some of the comparison studies. I leave the decision to the authors.

Author Response

Response to Reviewer #2:

The paper is well written and designed. The authors examined each possibility during classification and statistical analysis process.

I have some minor issues.

Comment 1: Section 2.2 and part of Section 2.4 were written in a different font. Please revise the sections.

Response: Thank you for carefully reading our manuscript and pointing this out. The font inconsistency was an oversight on our part and has now been corrected.

Comment 2: Please make the font size smaller in Table 5 for readers to read the numbers easily.

Response: We have revised Table 5 by adjusting the font size of the numerical entries to improve readability. We note that such formatting is often finalized during proofing, but we have addressed it at this stage as suggested.

Comment 3: The captions for figures and tables are in different fonts. Please revise.

Response: Thank you for highlighting this formatting inconsistency. The captions have now been standardized throughout the manuscript.

Comment 4: Table 3 is in two different places. The first is at Line 547, and the second is at Line 561. Please revise.

Response: We appreciate the reviewer bringing this to our attention. The duplication occurred due to an unintended cross-referencing error and has now been corrected.

Comment 5: Can we consider the input size as one of the ablation steps? The first three ablation steps are okay to implement, but I could not understand why changing the input size is considered an ablation. Please add an explanation to the related section.

Response: This is a valid observation, and we appreciate the reviewer’s point. In this study, input size was included as an ablation factor not to tune resolution for performance gains, but to examine the model’s robustness to scale variation, which is a practical concern in MRI-based deployment. In real clinical settings, MRI scans differ in native resolution due to scanner vendors, acquisition protocols, and resampling pipelines. By modifying the input size while keeping the architecture and training configuration unchanged, we assessed whether ConvNeXt Base remains stable under such realistic variations. The resulting performance change was minimal (accuracy variation below 0.3% across D2, D3, and D4), indicating that the model is not tightly coupled to a specific spatial resolution. We have clarified this motivation and its implications in Section 4.6, with further interpretation provided in the ablation discussion (Section 5.6).
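An input-size ablation of this kind can be sketched as a small harness that holds everything fixed except resolution and reports the change in accuracy relative to a baseline. The `evaluate` callback is hypothetical: it stands in for whatever routine resizes the inputs and returns accuracy in percent. The resolutions and accuracies in the usage example are placeholders, not values from the manuscript.

```python
from typing import Callable, Dict, Iterable


def resolution_ablation(
    evaluate: Callable[[int], float],
    sizes: Iterable[int],
    baseline_size: int,
) -> Dict[int, float]:
    """Return the accuracy change (percentage points) of each size vs. the baseline."""
    baseline_acc = evaluate(baseline_size)
    return {
        size: evaluate(size) - baseline_acc
        for size in sizes
        if size != baseline_size
    }


if __name__ == "__main__":
    # Placeholder accuracies standing in for a real evaluation run.
    fake_results = {192: 98.8, 224: 99.0, 256: 99.1}
    deltas = resolution_ablation(fake_results.__getitem__, [192, 224, 256], 224)
    for size, delta in sorted(deltas.items()):
        print(f"{size}px: {delta:+.2f} pp")
```

Reporting the largest absolute delta across sizes yields a single robustness figure comparable to the "below 0.3%" variation quoted in the response.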

Comment 6: The paper is already too long, but please add at least three different images for each dataset to Figure 6. It is better to make the images smaller, and make the text bigger to read easily.

Response: We thank the reviewer for this valuable suggestion. To strengthen the interpretability analysis while maintaining clarity, we have revised Figure 6 to include representative examples for each tumor class (Glioma, Meningioma, Pituitary, and No Tumor) from each dataset. For every representative case, we now present the original MRI slice alongside the corresponding Grad-CAM++ and Gradient SHAP visualizations. This structured layout allows direct visual comparison between the input image and the two complementary explanation methods.

To improve readability, the figure layout has been reorganized with larger text labels and clearer spacing, while the individual images have been resized to preserve visual fidelity without overcrowding the figure. We believe this revised presentation provides broader qualitative coverage across classes and datasets, while keeping the figure interpretable and consistent with the explanatory purpose of these methods.

Comment 7: The authors chose D1 for training. D2, D3 and D4 were chosen for testing. If you train the model with D3, and test with the rest, what would be the result? Please explain briefly in a related section.

Response: We thank the reviewer for this insightful question. In this study, the training–testing configuration was intentionally designed to assess generalization under realistic data availability constraints rather than to exhaustively explore all dataset permutations. D1 was selected for training because it is the largest and most diverse dataset among those considered, providing the richest representation of tumor morphology and imaging variability for learning stable feature representations.

Training on a smaller dataset such as D3 and testing on larger or more heterogeneous datasets would likely result in reduced generalization performance, particularly for challenging classes such as Meningioma and Glioma, due to limited exposure during training. This behavior is well documented in medical imaging literature, where model robustness scales strongly with training data diversity and volume. By contrast, our chosen setup—training on the largest dataset and testing on smaller, independent datasets—represents a stricter and more clinically relevant validation protocol, as it evaluates whether a model trained under favorable conditions can reliably transfer to more constrained or institution-specific data.

We have clarified this rationale in the manuscript (Section 3.1) to explicitly state that the objective of the experimental design was to test cross-dataset robustness under data scarcity, rather than to report exhaustive cross-training combinations.
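For reference, the full set of train-on-one, test-on-the-rest configurations that the reviewer alludes to can be enumerated in a few lines. This is only a sketch of the bookkeeping; the response reports the D1-trained configuration alone, and the function name is hypothetical.

```python
def cross_dataset_protocols(dataset_names):
    """Yield (train_dataset, test_datasets) for every train-on-one configuration."""
    for train in dataset_names:
        yield train, [name for name in dataset_names if name != train]


if __name__ == "__main__":
    for train, tests in cross_dataset_protocols(["D1", "D2", "D3", "D4"]):
        print(f"train on {train}, test on {', '.join(tests)}")
```

Running every configuration would multiply the training cost by the number of datasets, which is one practical reason for reporting a single configuration.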

Comment 8: The paper is too long. It would be better to remove some general information such as Table 2 to increase the readability.

Response: We appreciate the reviewer’s concern about the manuscript's overall length and agree that readability is an important consideration. Table 2 (now Table 3) was intentionally included to clarify the clinical interpretation of the evaluation metrics, specifically in the context of brain tumor classification. While these metrics are standard in machine learning, their diagnostic implications—particularly for sensitivity, false negatives, and agreement measures such as Kappa—may not be immediately clear to all readers, especially those from biomedical or clinical backgrounds.

Our aim was to provide a concise, self-contained reference that helps interdisciplinary readers understand why each metric was selected and how its values should be interpreted in a neuro-oncology setting, without requiring repeated explanation in the main text. This was particularly relevant given the clinical orientation of the journal.

That said, we have carefully reviewed the table to ensure it is succinct and directly relevant, and we are open to further streamlining its presentation if the editor feels it would improve readability while preserving its explanatory value.
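As a concrete reminder of how two of these metrics are derived, the sketch below computes per-class sensitivity (recall) and Cohen's Kappa from a confusion matrix whose rows are true classes and whose columns are predicted classes. The function names and the two-class toy matrix are invented for illustration and do not come from the manuscript.

```python
def per_class_recall(cm):
    """Sensitivity per class: correct predictions divided by true cases in each row."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def cohen_kappa(cm):
    """Cohen's Kappa: agreement beyond what chance alone would produce."""
    total = sum(sum(row) for row in cm)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    observed = sum(cm[i][i] for i in range(len(cm))) / total
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(col) for col in zip(*cm)]
    # Chance agreement from the marginal distributions of true and predicted labels.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    toy_cm = [[45, 5], [10, 40]]  # made-up counts: rows = true, columns = predicted
    print(per_class_recall(toy_cm))       # [0.9, 0.8]
    print(round(cohen_kappa(toy_cm), 3))  # 0.7
```

In the toy matrix, the ten missed cases in the second row are false negatives, which is why that class's sensitivity drops to 0.8 even though overall accuracy is 0.85.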

Comment 9: Table 15 could be shortened by deleting some of the comparison studies. I leave the decision to the authors.

Response: We thank the reviewer for this valuable suggestion. We carefully reassessed Table 16 with respect to length, redundancy, and relevance. The intention behind retaining a comparatively comprehensive set of studies was to provide a transparent and balanced state-of-the-art comparison across datasets, model families, evaluation metrics, and validation practices.

Unlike many summary tables that focus only on peak accuracy, Table 16 was designed to highlight methodological completeness—reporting not only performance metrics but also whether prior studies included cross-dataset validation, statistical analysis, efficiency considerations, or explainability. Several of the included works differ substantially in these aspects, even when their reported accuracies are numerically close. Removing them would risk obscuring these distinctions and could give a misleading impression of equivalence across approaches.

We also note that ConvNeXt Base is evaluated across three datasets, whereas many prior studies report results on a single dataset only. Retaining a broader set of comparisons allows readers to contextualize performance claims alongside differences in experimental rigor and validation scope, which is particularly important for a clinically oriented audience.

We have reviewed the table to ensure that all included studies contribute meaningfully to the comparison and that no redundant entries remain. We believe the current version strikes a reasonable balance between completeness and readability, while supporting a fair and well-contextualized assessment of the proposed method.

We thank Reviewer #2 for the valuable comments above, which helped us improve the manuscript. We have acknowledged these valuable comments in the Acknowledgements section of the revised manuscript.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thanks to the authors for revising the manuscript.
