4.1. Overall Performance
The cross-validation learning dynamics of the proposed MA-MSCNet are presented in Figure 6, which shows the mean ± standard deviation of accuracy and loss across all five folds. As shown in Figure 6a, training and validation accuracy increase steadily and remain closely aligned throughout training, with narrow variance bands indicating consistent convergence across folds. Correspondingly, Figure 6b shows a smooth, progressive reduction in training and validation loss without noticeable divergence or oscillation. The small gap between the training and validation curves, together with the limited inter-fold variability, indicates stable optimization and good generalization of the proposed framework.
On the held-out test set, the final MA-MSCNet model achieves an overall accuracy of 99.31% (95% confidence interval: [98.86%, 99.70%]). Macro-averaged precision, recall, and F1-score are 99.30%, 99.28%, and 99.29%, respectively, indicating balanced performance across classes. The Matthews Correlation Coefficient (MCC) is 99.08% (95% confidence interval: [98.47%, 99.69%]). Weighted and micro-averaged precision, recall, and F1-score are all 99.31%; their close agreement with the macro averages indicates that performance is not driven by the class distribution.
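For readers who wish to reproduce the aggregate metrics from a confusion matrix, the following is a minimal sketch rather than the evaluation code used in this study. The function name `summarize` and the toy matrix are illustrative assumptions. The MCC is computed with the standard multi-class generalization, and the sketch also shows why micro-averaged F1 coincides with accuracy in single-label classification.

```python
from math import sqrt


def summarize(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    true_tot = [sum(cm[i]) for i in range(k)]                        # per-class support
    pred_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]   # per-class predictions
    correct = sum(cm[i][i] for i in range(k))

    f1_per_class = []
    for c in range(k):
        tp = cm[c][c]
        prec = tp / pred_tot[c] if pred_tot[c] else 0.0
        rec = tp / true_tot[c] if true_tot[c] else 0.0
        f1_per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)

    # Micro averaging pools TP, FP and FN over classes before computing the score.
    pooled_fp = sum(pred_tot[c] - cm[c][c] for c in range(k))
    pooled_fn = sum(true_tot[c] - cm[c][c] for c in range(k))
    micro_p = correct / (correct + pooled_fp)
    micro_r = correct / (correct + pooled_fn)

    # Multi-class Matthews correlation coefficient.
    num = correct * total - sum(t * p for t, p in zip(true_tot, pred_tot))
    den = sqrt(total ** 2 - sum(p * p for p in pred_tot)) * sqrt(
        total ** 2 - sum(t * t for t in true_tot)
    )
    return {
        "accuracy": correct / total,
        "macro_f1": sum(f1_per_class) / k,
        "weighted_f1": sum(f * n for f, n in zip(f1_per_class, true_tot)) / total,
        "micro_f1": 2 * micro_p * micro_r / (micro_p + micro_r),
        "mcc": num / den if den else 0.0,
    }


# Toy two-class matrix (illustrative values only, not the study's data).
toy_summary = summarize([[8, 2], [1, 9]])
print(toy_summary)
```

Because every misclassified sample counts once as a false positive and once as a false negative, the pooled precision and recall are equal, so micro-averaged F1 equals accuracy.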
To assess sensitivity to stochastic initialization, MA-MSCNet was trained using five fixed random seeds (11, 22, 42, 55, and 77) under identical conditions. As shown in Table 3, performance remains consistently high (accuracy: 98.63–99.24%, mean 99.02% ± 0.23%), with minimal variation in macro F1-score and MCC. This indicates that the reported results are stable and do not depend on a specific random seed.
Bootstrap resampling (1000 iterations) was employed to compute 95% confidence intervals for key evaluation metrics on the test set. The narrow intervals (e.g., accuracy: 99.31% [98.86%, 99.70%]; MCC: 99.08% [98.47%, 99.69%]) indicate that the test-set estimates are stable under resampling of the test samples. These results show that the reported performance does not hinge on a small subset of test samples and remains consistent across repeated resampling.
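The bootstrap procedure can be sketched as follows. This is an illustrative sketch, not the study's code: it assumes a percentile interval (the interval type is not stated above), and the names `bootstrap_ci` and `accuracy`, the seed, and the toy labels are hypothetical. Any metric with the signature `metric(y_true, y_pred)`, such as MCC, could be passed in place of accuracy.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (assumed method) over test samples."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample test indices with replacement and re-evaluate the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data: 200 samples over four classes with 10 deliberate errors.
y_true = [0, 1, 2, 3] * 50
y_pred = list(y_true)
for i in range(10):
    y_pred[i] = (y_true[i] + 1) % 4

point = accuracy(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy = {point:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

Note that this procedure resamples the fixed test set only; it quantifies sampling uncertainty of the test estimate and does not involve retraining.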
The normalized confusion matrix shown in Figure 7 further confirms the strong class-wise performance of MA-MSCNet. The model achieves high sensitivity for glioma (99.00%), meningioma (98.37%), no-tumor (99.75%), and pituitary tumors (100.00%). Corresponding specificity values exceed 99.70% for all classes, and the per-class misclassification rate (1 − sensitivity) does not exceed 1.63%, demonstrating reliable discrimination across all tumor categories.
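As a brief illustration of how the class-wise sensitivity and specificity values follow from a confusion matrix in a one-vs-rest sense, consider the sketch below. The function name `per_class_rates` and the toy matrix are assumptions for illustration and do not reproduce the matrix in Figure 7.

```python
def per_class_rates(cm):
    """One-vs-rest sensitivity and specificity; cm[i][j] = true i, predicted j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    rates = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                          # class c samples predicted as another class
        fp = sum(cm[r][c] for r in range(k)) - tp     # other classes predicted as class c
        tn = total - tp - fn - fp
        rates.append({
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        })
    return rates


# Toy three-class matrix (illustrative values only).
toy_rates = per_class_rates([[9, 1, 0], [0, 10, 0], [1, 0, 9]])
for c, r in enumerate(toy_rates):
    print(c, round(r["sensitivity"], 4), round(r["specificity"], 4))
```

The per-class misclassification rate quoted alongside sensitivity is simply one minus the sensitivity of that class.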
To further quantify the model’s discriminative performance, receiver operating characteristic (ROC) analysis was conducted. The proposed MA-MSCNet achieves consistently high area under the curve (AUC) values across all tumor classes, with a macro-averaged AUC of 0.9986 and a micro-averaged AUC of 0.9983. Class-wise evaluation yields AUCs of 0.9976 for glioma, 0.9993 for meningioma, 0.9975 for no-tumor, and 1.0000 for pituitary, indicating excellent class separability. These results further confirm the robustness and reliability of the proposed approach under varying decision thresholds.
To assess the reliability of predicted probabilities, calibration performance was evaluated using the Brier score and Expected Calibration Error (ECE). The proposed model achieved low Brier scores (0.0048–0.0090) and low ECE values (0.0329–0.0404) across all classes, indicating that predicted probabilities are well aligned with observed outcomes.
Moreover, to evaluate clinical utility beyond conventional performance metrics, a decision-theoretic analysis was conducted using the Entechne framework. The model demonstrated consistently high standardized net benefit (SNB = 0.975–0.994) across all classes, substantially outperforming treat-all and treat-none strategies. Notably, the highest net benefit was observed in the no-tumor vs. rest setting (NB = 0.307), highlighting the model’s effectiveness in screening and ruling out disease. For tumor detection tasks, near-optimal decision performance was achieved (e.g., glioma SNB = 0.994), confirming strong clinical applicability.
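The calibration and decision-analytic quantities above can be illustrated with a short sketch for a single one-vs-rest task. This is not the study's implementation and does not reproduce the framework named above: it assumes equal-width probability bins for the ECE (the bin count of 10 is an assumption) and the standard decision-curve definition of net benefit at a threshold probability. All function names, the threshold, and the toy data are hypothetical.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins: weighted gap between mean probability and observed rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for items in bins:
        if items:
            mean_prob = sum(p for p, _ in items) / len(items)
            observed = sum(y for _, y in items) / len(items)
            ece += len(items) / len(labels) * abs(observed - mean_prob)
    return ece


def net_benefit(probs, labels, threshold):
    """Decision-curve net benefit: TP/N - FP/N * pt / (1 - pt)."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def standardized_net_benefit(probs, labels, threshold):
    """Net benefit divided by prevalence, so a perfect classifier scores 1."""
    prevalence = sum(labels) / len(labels)
    return net_benefit(probs, labels, threshold) / prevalence


def treat_all_net_benefit(labels, threshold):
    """Net benefit of labelling every case positive (treat-none is always 0)."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy one-vs-rest example (illustrative values only).
probs = [0.9, 0.8, 0.7, 0.6, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(brier_score(probs, labels))
print(expected_calibration_error(probs, labels))
print(net_benefit(probs, labels, 0.5), treat_all_net_benefit(labels, 0.5))
```

Since the net benefit of a perfect classifier equals the class prevalence, the unstandardized value is bounded by prevalence, whereas the standardized value is comparable across classes of different size.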
Overall, these results demonstrate that MA-MSCNet achieves robust and consistent performance across cross-validation folds and the independent test set, supporting its effectiveness for reliable multi-class brain tumor classification.
4.2. Explainability and Visual Interpretation Using Grad-CAM
To enhance the transparency of the proposed MA-MSCNet and verify that its decisions rely on meaningful anatomical cues, we employ Gradient-weighted Class Activation Mapping (Grad-CAM) to generate class-discriminative visual explanations. Grad-CAM produces a coarse localization map by backpropagating gradients from the target class score to the final convolutional feature maps and combining these gradients to estimate the spatial contribution of each region to the prediction. In our experiments, Grad-CAM was computed using the final convolutional layer and then upsampled to the input resolution for visualization. For improved clarity and to suppress spurious responses outside the brain region, the resulting heatmaps were masked using a simple brain-region mask derived from intensity thresholding followed by morphological closing, and then overlaid on the original grayscale MRI slices.
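The Grad-CAM computation described above can be sketched in a framework-agnostic way, assuming the final-layer feature maps and the gradients of the target class score with respect to them have already been extracted. The sketch is illustrative only: the function names, the integer-factor nearest-neighbor upsampling, and the threshold value are assumptions, and the morphological closing step of the brain mask is omitted for brevity.

```python
import numpy as np


def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: arrays of shape (K, H, W) for one image.

    gradients holds d(class score) / d(feature_maps).
    """
    weights = gradients.mean(axis=(1, 2))                 # one weight per channel
    cam = np.tensordot(weights, feature_maps, axes=1)     # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0.0)                            # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                             # scale to [0, 1] for display
    return cam


def upsample_nearest(cam, factor):
    """Nearest-neighbor upsampling by an integer factor (a simplification)."""
    return np.repeat(np.repeat(cam, factor, axis=0), factor, axis=1)


def apply_brain_mask(cam, image, threshold=0.1):
    """Suppress responses where image intensity is below an assumed threshold."""
    return cam * (image > threshold)


# Toy example with two 2 x 2 feature maps (illustrative values only).
fmaps = np.zeros((2, 2, 2))
fmaps[0, 0, 0] = 1.0
fmaps[1, 1, 1] = 1.0
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])

cam = grad_cam(fmaps, grads)
cam_up = upsample_nearest(cam, 2)
image = np.zeros((4, 4))
image[:2, :] = 1.0                                        # "brain" occupies the top half
masked = apply_brain_mask(cam_up, image)
print(cam)
print(masked)
```

In this toy case the second channel has a negative weight, so its activation is removed by the rectification step and only the region supported by the first channel remains highlighted.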
Figure 8 illustrates Grad-CAM visualizations for correctly classified test samples from all four classes. For each class, two representative samples are displayed, showing the input grayscale slice (left) and the corresponding Grad-CAM overlay (right). Overall, the activation patterns indicate that MA-MSCNet consistently focuses on discriminative regions associated with tumor presence and morphology. In glioma and meningioma cases, the highlighted areas are concentrated around the lesion regions, whereas pituitary tumor samples show strong activation near the sellar region. For the No Tumor class, the responses are distributed within normal brain tissue without a focal pathological hotspot, which is consistent with the absence of tumor-related structures. These observations suggest that the proposed architecture learns clinically relevant cues rather than relying on background artifacts.
4.4. Ablation Study
An extensive ablation study was conducted to systematically evaluate the contribution of each architectural component in the proposed MA-MSCNet. The results, summarized in Table 4, demonstrate that the full model configuration (A0) consistently outperforms all ablated variants. In particular, the complete MA-MSCNet achieves the highest accuracy (99.31%), macro-averaged F1-score (99.29%), and macro-averaged one-vs-rest AUC (99.86%), confirming the effectiveness of jointly integrating multi-scale feature extraction, in-block morphological operations, and morphology-aware pooling.
4.4.1. Effect of Morphology Design (Group A)
Removing all morphological operations (variant A1: Multi-scale CNN without morphological operators, parameter-matched) leads to a substantial performance degradation, with accuracy and F1-score decreasing by approximately 2.7% and 2.8%, respectively, compared with the full MA-MSCNet configuration. This drop directly quantifies the contribution of the proposed morphology-aware inductive bias, since all other architectural components (multi-scale convolutions, residual connections, and parameter count) remain identical. Introducing morphology either within the MA-MSC blocks (A2) or exclusively in the downsampling stage (A3) partially recovers performance; however, neither configuration matches the full model. These findings confirm that jointly integrating morphology during both feature extraction and downsampling is essential for achieving optimal performance.
4.4.2. Effect of Multi-Scale Feature Extraction (Group B)
Single-scale configurations (B1 and B2) consistently underperform compared with the full multi-scale design. In particular, relying solely on larger receptive fields (B2) yields the weakest performance. Combining two convolution scales (B3) improves robustness over the single-scale variants, validating the effectiveness of multi-scale feature fusion in capturing complementary spatial information.
4.4.3. Effect of Morphological Operations (Group C)
Using only dilation (C1) or only erosion (C2) results in reduced performance compared with the full morphology configuration. In contrast, combining dilation and erosion (C3) yields clear improvements, demonstrating that these operations provide complementary structural representations that are jointly required for effective morphology-aware learning.
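To illustrate why dilation and erosion carry complementary structural information, the following sketch applies both with a fixed, flat 3 × 3 structuring element. This is a simplification made for illustration: the operations in MA-MSCNet are trainable, whereas the sketch uses plain local maximum and minimum filters, and the function names and edge padding are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _windows(x, k):
    """All k x k neighborhoods of a 2-D array, using edge padding (k odd)."""
    padded = np.pad(x, k // 2, mode="edge")
    return sliding_window_view(padded, (k, k))    # shape (H, W, k, k)


def dilate(x, k=3):
    """Flat grayscale dilation: local maximum, which expands bright structures."""
    return _windows(x, k).max(axis=(2, 3))


def erode(x, k=3):
    """Flat grayscale erosion: local minimum, which shrinks bright structures."""
    return _windows(x, k).min(axis=(2, 3))


# Toy image with a single bright pixel (illustrative values only).
img = np.zeros((5, 5))
img[2, 2] = 1.0

dilated = dilate(img)
eroded = erode(img)
boundary = dilated - eroded    # morphological gradient: responds at structure boundaries
print(dilated)
print(eroded)
```

Dilation grows the bright pixel into a 3 × 3 patch while erosion removes it entirely, so each operation alone emphasizes only one side of a structure's extent; their difference localizes its boundary.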
4.4.4. Effect of Downsampling Strategy (Group D)
Replacing the proposed morphology-aware pooling with conventional AvgPool (D2) or MaxPool (D3) leads to inferior performance. The morphology-aware pooling strategy (D1) consistently outperforms standard pooling operations, confirming that morphology-informed downsampling better preserves discriminative structural cues than conventional approaches.
Overall, the ablation results indicate that the performance gains of MA-MSCNet arise from the synergistic integration of multi-scale convolution, trainable morphological operations, and morphology-aware pooling, rather than from any single component in isolation.
4.4.5. Effect of Data Augmentation
To further investigate the impact of data augmentation on model performance, an additional experiment was conducted by training the model without augmentation under the same settings, as summarized in Table 5.
The results indicate that data augmentation improves generalization, leading to higher accuracy and MCC values, and contributes to more stable classification performance.
4.4.6. Effect of Input Resolution
To evaluate the impact of input resolution, additional experiments were conducted using 96 × 96, 160 × 160, and 224 × 224 inputs under identical training settings. As shown in Table 6, performance improves from 96 × 96 to 160 × 160, but does not further improve at 224 × 224, while computational cost increases substantially. Notably, the proposed 125 × 125 configuration achieves the best overall performance (99.31% accuracy), indicating that classification accuracy does not monotonically increase with resolution. Instead, an intermediate resolution provides a better balance between spatial detail and generalization. These results suggest that the proposed morphology-aware architecture effectively captures discriminative structural features without requiring high-resolution inputs.
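The growth in computational cost with resolution can be approximated by a simple pixel-count ratio. The sketch below assumes that the cost of the convolutional layers scales roughly in proportion to the number of input pixels, which is a first-order approximation and not a measurement from this study; the function name is hypothetical.

```python
def relative_pixel_cost(side, reference_side=125):
    """Pixel count of a side x side input relative to the reference resolution."""
    return (side * side) / (reference_side * reference_side)


for side in (96, 125, 160, 224):
    print(f"{side} x {side}: {relative_pixel_cost(side):.2f}x the pixels of 125 x 125")
```

Under this approximation, a 224 × 224 input carries about 3.2 times as many pixels as the 125 × 125 configuration, and 160 × 160 about 1.6 times as many.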
4.5. Per-Class Performance Analysis
The class-wise evaluation (Table 7) demonstrates consistently high and well-balanced performance across all tumor categories. Sensitivity is at least 98.37% for all classes and reaches 100% for Class 3 (pituitary), while specificity remains above 99.70% across all categories, indicating effective suppression of false-positive predictions. Precision and F1-score values are uniformly high, with particularly strong performance observed for Class 2. In addition, the one-vs-rest AUC values approach or reach 100% for all classes, confirming excellent discriminative capability and robust class separation.
All reported per-class metrics are accompanied by 95% confidence intervals computed using bootstrap resampling, which exhibit narrow uncertainty ranges and confirm the statistical reliability and stability of the proposed MA-MSCNet.
The strong agreement between the overall evaluation metrics, per-class performance results, and the normalized confusion matrix confirms the robustness and reliability of the proposed MA-MSCNet across all tumor categories. Misclassifications are rare and primarily occur between visually similar tumor classes, particularly in cases with ambiguous boundaries or overlapping structural characteristics. Importantly, the misclassification rate does not exceed 1.63% for any class, and no systematic bias toward any specific category is observed, indicating effective control of both false-negative and false-positive errors.
4.7. Contextual Comparison with Existing Methods
To further assess the effectiveness of the proposed MA-MSCNet, its performance was compared with several recent representative methods for multi-class brain tumor classification, as summarized in Table 9. The comparison includes both conventional deep learning models and transfer learning-based approaches reported in the literature.
It is important to emphasize that the comparisons presented in Table 9 are based on reported results from the literature and were not reproduced under a unified experimental setting. Differences in dataset splits, preprocessing strategies, input resolutions, augmentation pipelines, and training protocols may therefore exist. As such, these comparisons are intended to provide contextual insight rather than direct, controlled performance evaluation.
As shown in the table, earlier CNN-based and hybrid methods generally achieve accuracies in the range of 97.00–97.84%, while deeper or fine-tuned architectures report higher accuracies in the range of 98.90–99.66%. The proposed MA-MSCNet achieves an overall accuracy of 99.31%, along with macro-averaged precision, recall, and F1-score values of 99.30%, 99.28%, and 99.29%, respectively, placing it within the upper range of reported results.
To ensure clarity and avoid misleading comparisons, a rigorous and fair evaluation under identical experimental conditions is presented separately in Table 10, where all models are trained and evaluated using the same dataset split, preprocessing pipeline, augmentation strategy, and comparable training protocol.
Unlike approaches that rely primarily on deeper backbones or transfer learning, MA-MSCNet explicitly incorporates multi-scale feature extraction and trainable morphological operations, enabling enhanced structural representation learning. This design contributes to stable and balanced performance across evaluation metrics, as demonstrated in the controlled experiments.
To provide a more comprehensive evaluation, the controlled comparison includes a diverse set of contemporary baseline architectures spanning multiple design paradigms: conventional CNNs (ResNet50, DenseNet201), modern convolutional architectures (ConvNeXt-Tiny, EfficientNetV2), transformer-based models (ViT-B16, Swin-Tiny), and hybrid architectures (CvT).
All models were trained and evaluated under identical experimental conditions: the same dataset split, 125 × 125 grayscale input, preprocessing pipeline, augmentation strategy, and a unified training protocol (same optimizer, learning rate schedule, batch size, and number of epochs). This ensures a fair comparison across different architectural families. The results are summarized in Table 10.
The expanded benchmarking results demonstrate that, while modern transformer-based and hybrid architectures achieve competitive performance, they do not consistently outperform convolutional models in this setting. This may be attributable to the relatively limited dataset size and the domain-specific characteristics of medical imaging, where convolutional inductive biases remain advantageous.
Notably, the proposed MA-MSCNet achieves the highest performance across all reported metrics, indicating its effectiveness in capturing both local structural details and multi-scale contextual information. These findings highlight that the proposed architecture provides a favorable balance between accuracy and robustness when compared with diverse contemporary models.