2.1. Traditional and Deep Learning Approaches for Brain Tumor Classification
Traditional methods often rely on handcrafted features extracted from images, such as texture, intensity, and shape descriptors, which are then fed into classical classifiers. While these methods can achieve reasonable accuracy, they are limited by the quality of the handcrafted features and may not generalize well to diverse datasets. Padmavathy et al. (2024) [18] proposed a classical framework combining Gray-Level Co-occurrence Matrix (GLCM) texture features with an RBF-kernel SVM for benign and malignant tumor discrimination. After preprocessing and region-of-interest extraction, they computed GLCM descriptors and achieved 95.56% accuracy, outperforming wavelet-based feature baselines. Ghazvini et al. (2024) [19] integrated morphological segmentation with discrete wavelet transform (DWT) features, followed by PCA reduction and SVM classification. This model achieved an accuracy of 95% and a precision of 88% on a dataset of MRI images, highlighting the value of feature-driven SVM pipelines for clinical decision support. Adamu et al. (2024) [20] proposed a MobileNetV2–SVM model for four-class tumor classification. They extracted compact high-level embeddings from a 7023-image MRI dataset using MobileNetV2 and replaced the dense head with an SVM to improve non-linear separation, achieving very high AUC values that indicate both accuracy and efficiency for resource-constrained settings. Alqhtani et al. (2024) [21] combined Wiener filtering, fuzzy C-means segmentation, and SVM classification to identify meningioma, glioma, and pituitary tumors. Evaluated on the CE-MRI dataset, the method achieved 98.2% accuracy and a 96.1% Dice score, indicating strong segmentation–classification synergy. Aggarwal (2022) [22] proposed a handcrafted texture-based approach for binary brain tumor classification from T1-weighted MRI images. The method extracts second-order statistical descriptors from GLCMs and feeds them into a Random Forest classifier while analyzing different GLCM parameter settings. Evaluated on 245 MR images, the optimized configuration achieved 83.3% accuracy on the test set, indicating that well-tuned GLCM features provide a computationally efficient baseline for tumor and non-tumor discrimination.
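To make the GLCM descriptors used by several of the studies above concrete, the following is a minimal pure-Python sketch of a normalized co-occurrence matrix and three common second-order statistics. It is an illustration only, not the implementation of any cited work: the function names `glcm` and `texture_features`, the single horizontal offset, and the 4-level toy image are assumptions made for this example.

```python
from math import sqrt


def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Normalized gray-level co-occurrence matrix for one pixel offset (dx, dy)."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:  # also count the pair in the reverse direction
                    counts[j][i] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Second-order statistics commonly derived from a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    asm = sum(v * v for _, _, v in cells)  # angular second moment
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "energy": sqrt(asm),
    }


# Toy 4x4 image quantized to 4 gray levels (hypothetical data).
toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
features = texture_features(glcm(toy_image, levels=4))
print(features)
```

In a full pipeline, such feature vectors would typically be computed for several offsets and angles over the tumor region of interest and then passed to a classifier such as an SVM or Random Forest.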
More recent hybrids replace handcrafted descriptors with transformer or CNN embeddings and then apply classical classifiers to improve robustness. Allahem et al. (2025) [23] proposed a ViT-PCA-RF pipeline in which vision transformer features were compressed via PCA and classified with Random Forest. On the BTM dataset, the approach reached 99% accuracy with balanced sensitivity and specificity, demonstrating the effectiveness of pairing transformer representations with lightweight ML heads. Abdulla (2025) [24] similarly built an automated computer-aided diagnosis system using Wiener denoising, HOG features, PCA reduction, and Bayesian-optimized kNN/SVM. Tested on the public Figshare MRI dataset, the model achieved 99.2% accuracy with the optimized kNN classifier, reported to surpass existing state-of-the-art methods. Tiwary et al. (2025) [25] proposed a hybrid framework that automatically extracts features using a custom CNN and feeds them into multiple machine learning classifiers, with Random Forest performing best against SVM, kNN, Decision Tree, and Naïve Bayes. Experiments on the Kaggle Brain Tumor MRI Dataset reported 99.61% training accuracy, 92.16% validation accuracy, and 71.2% accuracy on a held-out CSV testing split. These results suggest that CNN feature fusion with Random Forest can be effective, although the drop from validation to test accuracy points to limited generalization.
Deep learning approaches, particularly CNNs, have revolutionized the field by automatically learning relevant features from raw MRI scans. Early supervised CNN pipelines showed that stacked convolution–pooling hierarchies can capture tumor-specific appearance cues and outperform many handcrafted baselines when trained end-to-end. Nurtay et al. (2025) [26] presented a comparative deep CNN study on the four-class Kaggle MRI benchmark, evaluating a custom CNN against common transfer learning backbones (ResNet50, VGG-16, and Xception). Their separable-convolution custom CNN achieved about 93–94% accuracy and the best ROC-AUC, exceeding the pretrained alternatives and demonstrating that task-specific CNN design can rival heavier ImageNet-initialized models on brain MRI classification tasks. Building on this direction, customized CNN architectures have been proposed to better match the structural characteristics of tumor MRI. Albalawi et al. (2024) [27] introduced four progressively refined CNN variants trained on multiple public MRI datasets and reported that their best task-tailored architecture reached 99.76% test accuracy, outperforming standard transfer learning baselines.
Because training large CNNs from scratch can be computationally expensive and label-hungry, Alemayehu (2025) [28] presented a compact CNN optimized with Keras-Tuner hyperparameter search and contour-based cropping to suppress background noise. Evaluated via five-fold cross-validation on a four-class 7023-image public MRI set, the model achieved 98.78% test accuracy while remaining parameter-efficient, supporting the practicality of low-complexity CNNs in resource-constrained clinical settings. Ilgün et al. (2025) [11] conducted experiments with various fine-tuned convolutional neural network backbones using a combined public brain tumor MRI dataset (Figshare, SARTAJ, and Br35H). Their approach demonstrated the efficacy of CNN-based transfer learning for brain tumor classification, achieving up to 98.47% test accuracy with ResNet50. Prayogo et al. (2025) [29] proposed a hybrid CNN-based transfer learning framework that fine-tunes multiple lightweight pretrained backbones and fuses their deep embeddings. Their best hybrid (ResNet50V2 + MobileNetV2 + DenseNet121) reached 98.75% accuracy with equally strong precision and recall, outperforming single-backbone variants and suggesting that feature fusion across pretrained CNNs provides richer tumor representations than any single model alone.
2.2. Self-Supervised Learning (SSL) and Vision Transformers (ViTs)
Self-supervised learning (SSL) enables effective representation learning without the need for labeled data. Building on this paradigm, DINOv3 provides transferable and generalizable representations that can be efficiently applied in medical imaging applications with a limited number of annotations. Mughal et al. (2024) [30] surveyed current SSL methods applicable to medical imaging and showed that pretraining through contrastive, clustering, and reconstruction objectives can significantly improve tumor classification performance when labeled MRI data are limited. They also discussed the ability of SSL to mitigate domain shift and enhance model robustness across imaging protocols.
Beyond general SSL, several works combine SSL with transformer-based backbones to model MRI more effectively. Karagoz et al. (2024) [31] introduced ResViT, a hybrid residual CNN–ViT model pretrained through a generative SSL objective before fine-tuning on tumor classification. The model showed notable gains over ImageNet initialization, achieving accuracies of 90.6% on BraTS 2023 and 98.5% on Figshare. Rudro et al. (2025) [32] introduced an SSL method for brain tumor segmentation and classification using SimCLR and an EfficientNetB3 backbone. They pretrained the model with SSL on extensive unlabeled datasets to learn meaningful feature representations and then performed supervised fine-tuning with a dedicated classifier head. The proposed model achieved 98.32% test accuracy on a four-class Kaggle dataset. In a similar direction, Nunes et al. (2025) [33] employed masked autoencoding pretraining for a 3D ViT on unlabeled brain MRI volumes and reported an improved F1 score of 91% under five-fold cross-validation after fine-tuning on BraTS tumor classes, particularly in low-label settings. Safwan et al. (2025) [34] introduced T3SSLNet, a triple-strategy SSL framework for MRI tumor classification that evaluates SimCLR, MoCo, and BYOL with a ResNet-50 backbone. The framework yielded accuracies of around 96–97% after fine-tuning, further supporting the value of contrastive SSL when labeled tumor MRI data are limited.
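The contrastive objective behind SimCLR-style pretraining, as used in several of the works above, can be sketched as the normalized temperature-scaled cross-entropy (NT-Xent) loss. The pure-Python version below is a minimal illustration under stated assumptions, not code from any cited paper: the function name `nt_xent_loss`, the temperature of 0.5, and the toy 2D embeddings are all hypothetical, and a real pipeline would compute the loss on encoder outputs inside a deep learning framework.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss; view_a[k] and view_b[k] embed two augmentations of image k."""
    z = [_normalize(v) for v in view_a + view_b]
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        sims = [
            sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
        ]
        # Denominator covers every sample except the anchor itself.
        denom = sum(math.exp(s) for j, s in enumerate(sims) if j != i)
        total += -math.log(math.exp(sims[pos]) / denom)
    return total / (2 * n)


# Toy embeddings (hypothetical): matching views agree, mismatched views do not.
aligned = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, mismatched)
```

The loss is lower when the two views of each image map to nearby embeddings, which is the property the pretraining stage optimizes before supervised fine-tuning.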
Parallel to SSL advances, vision transformers have also been improved directly for brain tumor classification. Khaniki et al. (2024) [35] proposed a ViT enhanced with selective cross-attention and feature calibration to fuse multi-scale tumor cues, achieving high accuracies of 98.9% and 99.2% with stochastic depth on public brain MRI benchmarks. Wang et al. (2024) [36] presented RanMerFormer, which introduces randomized token merging to reduce redundancy and computational cost while maintaining competitive classification performance with an accuracy of 98.86%.
2.3. Attention Mechanisms and Saliency Mapping
Attention mechanisms have been incorporated into deep learning models to improve feature extraction by focusing on relevant regions of the image. For instance, Zarenia et al. (2025) [2] proposed a deformable attention module for brain tumor classification and segmentation, which captures irregular and complex tumor patterns. The model achieved around 96.6% multi-class accuracy on a 15-class MRI dataset, outperforming conventional CNN and ViT baselines. Saliency mapping is a complementary technique used to visualize and interpret the regions of the image that contribute most to the model’s decision, enhancing transparency and trust in automated systems.
More recent deep models incorporate explicit attention and saliency mechanisms to sharpen tumor-focused representations and improve interpretability. Masoudi et al. (2024) [37] proposed an optimized dual-attention network that uses a ResNet50 backbone followed by a depth-separable channel-attention module and a multi-head spatial-attention block to refine discriminative MRI features. On the Figshare dataset, the network achieved 99.32% accuracy with consistently high per-class performance, indicating that joint channel–spatial attention can substantially boost multi-class tumor recognition. Srivastava et al. (2025) [38] similarly leveraged transformer attention but in a multi-scale relational setting, introducing an Automated Classification and Grading Diagnosis Model (ACGDM) that combines a Multi-Scale Graph Neural Network with a Spatio-Temporal Transformer Attention Mechanism (MSGNN-STTAM) to capture hierarchical spatial dependencies and cross-frame MRI evolution. Tested on BraTS 2018/2019/2020 and Br35H multimodal MRI datasets, the model reported up to 99.8% accuracy for tumor type detection, highlighting the benefit of attention-guided graph reasoning for robust grading and classification.
Tomar et al. (2024) [39] proposed a visual attention-based detection pipeline that builds an on-center saliency map to highlight tumor-relevant regions and then applies superpixel segmentation to preserve boundary structure before extracting the final lesion mask. Their model achieved 99.63% accuracy with strong Jaccard and Dice overlap scores, outperforming prior detection baselines. MA Khan et al. (2023) [40] advanced saliency usage further by introducing an automated multimodal framework that first enhances tumor visibility using deep saliency maps, then fuses deep features, and finally selects an optimal feature subset via an improved dragonfly optimization strategy before classification. Evaluated on three available BraTS datasets, the model achieved accuracies of 95.14%, 94.89%, and 95.94%, respectively. R Khan et al. (2025) [41] proposed X-SCSANet, an explainable stack convolutional self-attention network that improves both discrimination and interpretability by stacking the outputs of parallel CNN and self-attention branches and applying a customized Grad-CAM procedure. Evaluated on a four-class Kaggle MRI dataset, the model achieved 96.44% accuracy with 96.5% precision and 98.83% specificity, while producing saliency heatmaps that highlight tumor-relevant regions to justify predictions.
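Several of the segmentation-oriented results above are reported as Dice and Jaccard overlap scores. As a reminder of how these two metrics relate, the sketch below computes both for a pair of binary masks. The function name `dice_and_jaccard`, the convention of returning 1.0 for two empty masks, and the toy masks are assumptions made for illustration.

```python
def dice_and_jaccard(pred, truth):
    """Overlap scores for two binary masks given as equally sized nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in truth for v in row]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    p_sum = sum(1 for a in p if a)
    t_sum = sum(1 for b in t if b)
    union = p_sum + t_sum - inter
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0, 1.0
    dice = 2 * inter / (p_sum + t_sum)
    jaccard = inter / union
    return dice, jaccard


# Toy masks (hypothetical): the prediction covers the lesion plus one extra pixel.
predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
dice, jaccard = dice_and_jaccard(predicted, reference)
print(dice, jaccard)
```

The two scores are monotonically related by Dice = 2J / (1 + J), so they rank segmentations identically, although Dice is always at least as large as Jaccard.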
Beyond performance gains, saliency mapping is increasingly positioned as an explainability tool for validating deep predictors. Keles et al. (2023) [42] presented a focused case study showing how post hoc gradient-based saliency maps on brain MRI reveal that classifiers rely primarily on the tumor core and its surrounding context, especially shape-related cues. Their analysis argues that such visual explanations are essential for identifying model biases and improving trustworthiness in clinical decision support.
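The idea behind gradient-based saliency can be illustrated with a small sketch: the saliency of a pixel is the magnitude of the change in the model's score when that pixel is perturbed. The example below approximates this gradient with central finite differences on a stand-in scoring function. This is an assumption-laden simplification: practical systems obtain the gradient by backpropagation through the trained network, and `saliency_map`, `toy_score`, and the 4x4 input are hypothetical names and data used only here.

```python
import math


def saliency_map(score_fn, image, eps=1e-4):
    """Approximate |d score / d pixel| for every pixel with central differences."""
    sal = []
    for r, row in enumerate(image):
        sal_row = []
        for c, _ in enumerate(row):
            plus = [list(x) for x in image]
            minus = [list(x) for x in image]
            plus[r][c] += eps
            minus[r][c] -= eps
            grad = (score_fn(plus) - score_fn(minus)) / (2 * eps)
            sal_row.append(abs(grad))
        sal.append(sal_row)
    return sal


def toy_score(image):
    """Stand-in classifier score that depends only on the central 2x2 patch."""
    centre = image[1][1] + image[1][2] + image[2][1] + image[2][2]
    return math.tanh(centre)


# Uniform toy input (hypothetical): only the central pixels should be salient.
toy_image = [[0.1] * 4 for _ in range(4)]
for row in saliency_map(toy_score, toy_image):
    print([round(v, 3) for v in row])
```

Here the map is nonzero only on the central patch, mirroring how a saliency heatmap highlights the image regions a classifier actually relies on.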