1. Introduction
The growing volume and complexity of medical imaging data present a significant challenge for healthcare professionals who must analyze and interpret these images accurately and efficiently. Traditional diagnostic workflows are highly dependent on the expertise and experience of radiologists, making them susceptible to human error, variability between observers, and long analysis times. These limitations can delay diagnosis and compromise early detection of critical conditions, such as brain tumors, which are among the most aggressive and life-threatening forms of cancer.
In recent years, Artificial Intelligence (AI), particularly Machine Learning (ML) and Deep Learning (DL), has emerged as a transformative solution for medical image analysis. Convolutional Neural Networks (CNNs) have shown remarkable performance in tasks such as segmentation, classification, and localization of pathological structures in modalities such as Magnetic Resonance Imaging (MRI), Computed Tomography (CT), and X-rays. MRI is the preferred method for brain image analysis due to its non-invasive nature and ability to produce high-resolution, detailed images of brain anatomy [
1]. Its superior soft-tissue contrast enables precise localization of tumors and accurate assessment of their size, shape, and extent, which are critical for diagnosis and treatment planning. MRI also provides valuable insight into the impact of tumors on surrounding tissues and supports longitudinal monitoring of tumor progression over time. MRI therefore remains a clinically indispensable tool for comprehensive brain tumor assessment.
Brain tumor MRI analysis presents significant challenges, including subtle lesion boundaries, high intra- and inter-tumor variability, overlapping visual characteristics between tumor types, and slice-level ambiguity. In [
2], the authors propose an automatic brain MRI tumor segmentation method based on a deep fusion of Weak Edge and Context features (AS-WEC). The method emphasizes lesion edges through adaptive weak edge detection, a GRU-based edge branching structure, and a maximum-index fusion mechanism for multi-layer feature maps. The experimental results show that AS-WEC outperforms other models in Dice, MAE, and mIoU, demonstrating superior feature extraction and fusion capabilities. The work by [
3] provides a comprehensive review of DL methods for brain tumor analysis, organizing approaches by task and model taxonomy. The review summarizes commonly used public datasets, MRI preprocessing techniques, and evaluation metrics. It also discusses key challenges and future directions for improving the clinical applicability and reliability of AI-based brain tumor analysis systems. The study by [
4] proposes a 3D brain tumor segmentation framework that addresses multimodality, spatial information loss, and boundary inaccuracies. The model integrates three modules, MIE, SIE, and BSC, applied to the input, backbone, and loss functions of deep convolutional networks. Experiments on BraTS2017–2019 datasets demonstrate that the proposed approach outperforms the state-of-the-art methods. The framework effectively improves modality utilization, spatial feature representation, and boundary segmentation accuracy. The work of [
5] highlights the challenges posed by the wide variation in tumor morphology, shape, size, location, and texture on MRI across patients and tumor types. The study presents a comprehensive review of brain tumor imaging, covering analysis tasks, models, features, evaluation metrics, datasets, challenges, and future research directions. The survey [
6] provides an in-depth analysis of common brain tumor segmentation techniques, highlighting methods that automatically learn complex features from MRI images for both healthy and tumor tissues. It reviews recent segmentation architectures, discusses evaluation metrics, and identifies directions for improving segmentation and accurate tumor diagnosis. The authors in [
7] present a method to characterize variability in segmenting complex tumor shapes with heterogeneous textures and its impact on volumetric assessments across multiple time points. Probabilistic error maps are generated to quantify edge segmentation variability across algorithms, and equivalence testing determines whether observed volume changes are significant.
A key limitation of current AI-based diagnostic systems is their reliance on large, well-annotated datasets, which are often difficult to obtain in medical contexts due to patient privacy, cost, and variability in imaging protocols. Moreover, the black-box nature of DL models hinders clinical adoption, as clinicians require interpretable explanations to trust and validate automated predictions. Finally, variability in image acquisition protocols and patient populations further challenges the generalization and robustness of these models.
To address these gaps, this work proposes an AI-based solution for four-class, slice-based classification of brain tumors from MRI images, combining state-of-the-art CNN architectures (VGG-19, ResNet-50, EfficientNetB3, Xception, MobileNetV2, DenseNet201, and InceptionV3), an ensemble model, and a Vision Transformer with XAI techniques, including Grad-CAM, LIME, and Occlusion Sensitivity. This approach aims to achieve high diagnostic performance while improving model interpretability and clinical trust. The proposed solution is trained and evaluated on publicly available MRI datasets and demonstrates strong performance on key evaluation metrics, including Accuracy, Recall, and AUROC.
The main contributions of this work are as follows:
Development and evaluation of multiple DL architectures tailored for brain tumor classification from MRI data.
Integration of XAI methods to enhance interpretability and clinical validation of AI-driven decisions.
Comprehensive comparison of model performance using standard metrics and visualization-based explainability tools.
Exploration of ensemble strategies and Vision Transformers to further improve diagnostic accuracy and robustness.
The remainder of this article is organized as follows.
Section 2 presents a state-of-the-art overview of the application of AI and DL techniques in medical imaging, with a particular focus on the detection and classification of brain tumors.
Section 3 describes the main DL architectures, the evaluation metrics used, and the XAI techniques adopted in this work.
Section 4 gives details on dataset preparation, image preprocessing, model configuration, and training strategy.
Section 5 and
Section 6 present and discuss the simulation results, including a comparative analysis of the models and the interpretation of their decisions using XAI techniques. Finally,
Section 7 summarizes the main findings of the work, highlights its contributions and limitations, and suggests possible directions for future research.
2. Literature Review
The literature review conducted within the scope of this work analyzed several recent studies on the application of AI in medical image analysis, including DL and ML techniques aimed at tasks such as segmentation, classification, and disease detection. These studies demonstrate the growing impact of AI in precision medicine, highlighting significant improvements in accuracy, sensitivity, and interpretability of results.
The work of [
8] proposed a hybrid segmentation method based on the detection of the morphological edge and the watershed algorithm to identify cancerous cells. The model also integrated the U-Net architecture, achieving high accuracy and sensitivity values. The combination of classical morphological segmentation and DL demonstrated significant improvements in the precision of anomaly detection in medical images. Another study [
9] conducted a comparative analysis of different semantic segmentation methods applied in the medical field. Three main approaches were categorized: region-based methods, Fully Convolutional Networks (FCNs), and weakly supervised methods. The U-Net, V-Net, and DS-TransUNet architectures stood out for their performance on datasets such as BRATS, LiTS, and DRIVE, demonstrating that model selection strongly depends on the complexity and scale of the clinical images used.
In [
10], the authors presented an explainable approach for medical image segmentation, introducing the concept of double-dilated convolution and the Tversky Loss function, which improved model performance in breast tumor segmentation. The study applied XAI techniques such as Grad-CAM and Occlusion Sensitivity, which allow a visual understanding of the regions most relevant to the model’s predictions and reinforce the transparency of clinical decision-making. In [
11], the authors provide a systematic review of state-of-the-art XAI methods in medical image analysis, highlighting how they improve transparency and trust in ML and DL decisions. The review discusses current challenges, evaluation metrics, and future research directions to enhance XAI adoption in clinical settings. The work by [
12] reviews the performance of DL–based classification and detection techniques for brain tumor diagnostics, including CNNs, transfer learning, ViTs, hybrid methods, and explainable AI. It summarizes standard datasets, preprocessing techniques, and recent methodological trends. A comprehensive analysis highlights the growing adoption of ViTs and interpretability-focused models in recent years.
In [
13], a comprehensive review of the use of DL in medical imaging is presented, covering architectures such as Le-U-Net, EfficientNetB4, DeepLung, Inception-ResNetV2, ResNet34, and SCDNet. The study analyzed the application of these networks for the detection of brain tumors, skin cancer, and lung and renal diseases, highlighting the role of transfer learning, data augmentation, LSTM, and GANs in improving model generalization when working with limited datasets.
The work of [
14] proposed an optimized model based on the VGG-19 architecture combined with an SVM classifier for AI-assisted medical diagnosis. The model incorporated a two-layer optimizer that reduced dimensionality and enhanced feature selection, achieving superior results in skin lesion detection compared to other conventional DL approaches.
In [
15], the authors presented a unified approach based on MobileNet, combined with ConvLSTM layers and XAI techniques, applied to various pathologies, including pneumonia, gliomas, and COVID-19. The proposed architecture achieved high levels of accuracy and recall, demonstrating the ability of the network to generalize between different medical modalities and provide interpretable results through Grad-CAM activation maps.
The authors in [
16] analyzed the performance of multiple CNN architectures, including EfficientNetB4, VGG16, ResNet-50, DenseNet-169, InceptionResNetV2, ResNet34, and SCDNet. The results showed that transfer learning and data augmentation are essential strategies to overcome medical data scarcity, allowing high accuracy and F1-score values in disease classification tasks.
In [
17], the authors conducted a comparative analysis of the main ML techniques applied to medical imaging. Both supervised algorithms—K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Decision Trees—and unsupervised ones—K-Means and PCA—were studied in the detection of brain tumors. The KNN model achieved 100% accuracy, while Decision Trees reached 99.4%, demonstrating the potential of classical algorithms in diagnostic contexts with small datasets.
The article by [
18] presented a detailed review of the use of advanced ML techniques for the diagnosis of COVID-19 using X-ray and computed tomography images. Architectures such as U-Net, DeepLabV3, and COPLE-Net, as well as hybrid techniques that combine SVM, PCA, and GANs, were explored. The study described a complete processing workflow from acquisition to classification, demonstrating the crucial role of AI in automated screening and clinical decision support during pandemics.
The authors of [
19] examined the use of ViT and multimodal models in digital pathology. The study highlighted the application of GANs for virtual staining of histological slides, molecular alteration prediction, and automatic generation of interpretable clinical reports. The authors concluded that transformer-based models represent a new trend in medical AI, combining high performance with strong clinical interpretability. The review by [
20] compares ViTs and CNNs in medical imaging, focusing on robustness, efficiency, scalability, and accuracy. It highlights the growing potential of ViTs, which often outperform conventional CNNs, especially when pre-trained. The findings aim to guide researchers and practitioners in selecting the most suitable DL models for medical image analysis. The study by [
21] evaluates several lightweight CNNs and ViTs for multi-class brain tumor classification on MRI images. Tiny-ViT-5M and EfficientNet-b0 achieved top performance with high accuracy, precision, recall, and F1-scores while maintaining low parameter counts, making them suitable for resource-limited clinical settings. The findings highlight the balance between predictive accuracy and computational efficiency, providing a reference for future autonomous diagnostic systems.
In summary, the reviewed studies converge on the conclusion that DL models, particularly convolutional networks and their variants, outperform traditional ML methods in both segmentation and classification tasks for medical images.
Moreover, there is a growing adoption of XAI techniques, which enable the interpretation of results and improve the trust of healthcare professionals in the clinical use of these technologies. Therefore, the literature highlights a consolidated trend towards the integration of explainable, efficient, and multimodal models that combine images, text, and clinical data, paving the way for a new generation of AI-based diagnostic support systems.
Table 1 presents a comparative analysis of selected articles reviewed in the literature that employ AI techniques in the processing and analysis of medical images, with applications in disease segmentation, classification, and diagnosis.
2.1. Application Examples
The application of AI in healthcare has evolved significantly, standing out in various commercial and scientific solutions that employ ML and DL algorithms for medical image analysis. These approaches have contributed to increased diagnostic accuracy, reduced analysis time, and optimized clinical workflow, becoming essential tools for medical decision support. Among the most relevant examples of AI applications are platforms widely recognized in the field of medical imaging.
Aidoc [
22] is an AI platform focused on radiology, capable of analyzing medical images in real time and helping diagnoses in critical contexts such as emergency departments and intensive care units. It integrates with hospital systems (PACS) to automatically detect severe clinical conditions, including intracranial hemorrhages, ischemic strokes, pulmonary embolisms, vertebral fractures, and aneurysms. Google Health [
23], a Google division focused on innovation in digital health, developed an AI system for mammography interpretation to support early detection of breast cancer. The system outperforms average radiologists in sensitivity and specificity while reducing false positives, and provides heatmaps highlighting suspicious regions to assist radiologists in understanding and using its diagnostic outputs. PathAI [
24] develops applied AI for digital pathology, analyzing histological biopsy images to identify cellular and morphological patterns associated with cancer. Its software supports the diagnosis of diseases such as liver, prostate, and breast cancers by transforming unstructured pathology images into structured insights that can be integrated with clinical and genomic data. Viz.ai [
25] is an AI platform for medical image analysis and clinical workflow optimization, widely used for the detection of emergency strokes. It integrates with PACS and EHR systems and applies DL to identify pathologies in real time, rapidly notifying medical teams. Arterys [
26] is a cloud-based AI platform founded in 2011 and acquired by Tempus in 2022, becoming part of a precision medicine strategy that integrates clinical, molecular and imaging data. It offers solutions in multiple specialties, including Cardio, Lung, Neuro, and Breast AI. Compatible with PACS and EHR systems, the platform enables secure remote analysis and collaboration while supporting personalized data-driven medical decision-making. Qure.ai [
27] is a health technology company founded in 2016 and based in Mumbai, India, that develops AI-based solutions for automated medical image interpretation. Its products integrate with hospital systems (PACS/RIS) and include tools for lung cancer detection, stroke analysis, and tuberculosis diagnosis, supporting faster screening, clinical follow-up, and improved care coordination through X-ray and CT analysis. Finally, Lunit [
28] is a South Korean company founded in 2013 that develops AI solutions for medical imaging and oncology. Its main products include Lunit INSIGHT CXR for chest X-ray analysis and Lunit INSIGHT MMG for mammography. The company also offers solutions for 3D mammography and digital pathology, including prediction of immunotherapy response. In 2025, Lunit partnered with the German Starvision network, deploying its technology in 79 healthcare facilities, and has published over 125 peer-reviewed studies.
2.2. Overview of Databases
The use of a database in this work is essential, as it enables the training, validation, and testing of the models with a substantial volume of medical images. The use of real data provides conditions that closely resemble clinical practice, facilitating the evaluation of the performance of the models in real diagnostic scenarios. This ensures that the models not only learn from the provided examples, but are also capable of correctly identifying diseases in previously unseen images.
Table 2 presents a diverse set of databases used in research in the field of AI applied to medical image analysis. Each database has specific characteristics and is designed for different clinical purposes. Several types of medical imaging modalities are represented, including MRI, CT, mammography, chest X-rays, ultrasound, and dermatoscopic images. The objectives include skin lesion classification, pneumonia detection, and COVID-19 diagnosis, as well as brain, kidney, and liver tumor segmentation. These databases enable the development and validation of ML and DL models, covering a wide range of pathologies and clinically relevant contexts.
5. Results
This section discusses the experimental results obtained from the training and evaluation of the implemented models. It includes an analysis of classification performance based on several metrics as well as the interpretability of the models using XAI techniques.
5.1. Setup
The work was developed using the Python programming language, version 3.12.4, employing the Keras framework together with TensorFlow, which served as the execution engine (backend) [
45]. Both libraries are open-source and widely used in the development of DL models. For model development and testing, the Jupyter Notebook (version 7.3.2) environment was used as an interactive tool that allows the combination of executable code, visualizations, documentation, and notes in a single document. The combination of these tools proved to be well-suited to the complexity of the problem under study, allowing the training and validation of models with satisfactory performance and generalization capability on the medical image analysis dataset.
Table 8 presents the main specifications of the computer used in this work, including processor, memory, GPU, and storage.
5.2. Performance of the Models
This section presents a summary of the performance of the classification models used in this work.
Table 9 shows the per-class metrics, including Precision, Recall, F1-Score, Specificity, and AUROC, as well as the global metrics of each model, such as Accuracy and Loss. The per-class metrics allow for the evaluation of the individual performance of each model for each tumor type, while the global metrics provide an overall view of the model’s effectiveness throughout the dataset. Precision indicates the proportion of correctly predicted positive samples compared to all positive predictions, while Recall measures the model’s ability to correctly identify all positive samples. The F1-Score combines Precision and Recall into a single metric, providing a balance between precision and sensitivity. AUROC values provide a measure of the model’s ability to discriminate between positive and negative classes. Specificity evaluates the ability of the model to correctly identify negative samples. Accuracy indicates the total percentage of correct predictions and Loss measures the average error, reflecting the model’s ability to adapt to the data. A joint analysis of per-class and global metrics is essential to understand the strengths and limitations of the models, identify potential challenges in specific classes, and ensure a comprehensive evaluation of overall performance.
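As a concrete illustration, all of the per-class metrics listed above can be derived from a single multi-class confusion matrix. The sketch below (pure NumPy, using a hypothetical four-class confusion matrix rather than the actual experimental results) computes them in a one-vs-rest fashion:

```python
import numpy as np

def per_class_metrics(cm):
    """Compute one-vs-rest metrics from a multi-class confusion matrix.
    cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                      # true positives per class
    fp = cm.sum(axis=0) - tp              # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp              # missed samples of the class
    tn = cm.sum() - (tp + fp + fn)        # everything else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)               # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    accuracy = tp.sum() / cm.sum()        # global accuracy
    return precision, recall, f1, specificity, accuracy

# Hypothetical counts for the four classes (glioma, meningioma, no tumor, pituitary)
cm = [[90,  5,  2,  3],
      [ 4, 88,  5,  3],
      [ 1,  2, 96,  1],
      [ 2,  3,  1, 94]]
prec, rec, f1, spec, acc = per_class_metrics(cm)
```

Because Precision depends on column margins and Recall on row margins, reading both together reveals classes that the model systematically confuses, which a global Accuracy figure alone would hide.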
Figure 7 shows the Accuracy achieved by different models in brain tumor classification. It can be observed that all models achieved relatively high Accuracy values, ranging from 89.47% to 98.17%, demonstrating the overall strong performance of the tested architectures. The Vision Transformer model achieved the highest Accuracy, reaching 98.17% and clearly outperforming all other architectures. Next are EfficientNetB3 (94.28%), Xception (93.90%), Ensemble (93.82%), ResNet-50 (93.52%), DenseNet201 (93.21%) and InceptionV3 (92.37%), all of which showed comparable performance. Meanwhile, MobileNetV2 (90.69%) and VGG19 (89.47%) achieved slightly lower results. These results demonstrate that more recent and optimized architectures, such as Vision Transformer and EfficientNetB3, are capable of extracting more discriminative features, leading to significant improvements in performance for medical image classification tasks.
Table 10 presents the training times observed for each model implemented in this work. These values reflect the computational performance of the hardware used and allow a comparison of the relative efficiency of each architecture during model training. The inference times of the models, i.e., the average time required by each architecture to process a single image, were also evaluated. This metric is particularly relevant as it is directly related to the efficiency of the model in practical applications and real-time scenarios.
To enable statistically meaningful interpretation and robust comparison between models, confidence intervals were calculated to quantify the uncertainty in classification performance estimates.
Table 11 presents the 95% confidence intervals for the accuracy of each model, estimated using the percentile bootstrap method [
46]. This approach enables us to quantify the uncertainty associated with the accuracy metric, providing a more reliable and informed performance comparison amongst the models. The bootstrap-based confidence intervals reveal clear performance differences among the evaluated models. ViT achieves the highest and most stable accuracy, indicating superior predictive performance and low variability. VGG19 and MobileNetV2 exhibit lower accuracy ranges and greater uncertainty. EfficientNetB3, Xception, DenseNet201, and the Ensemble model show competitive and statistically similar performance. ResNet50 and InceptionV3 achieve intermediate accuracy levels. Overall, these results demonstrate that ViT consistently outperforms the remaining models.
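The percentile bootstrap procedure can be sketched in a few lines (with synthetic per-sample correctness indicators standing in for the models' actual test predictions):

```python
import numpy as np

def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.
    correct: 1D array of 0/1 indicators (1 = prediction was right)."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    n = correct.size
    # Resample the test set with replacement and recompute accuracy each time
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_acc = correct[idx].mean(axis=1)
    lo, hi = np.percentile(boot_acc, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return correct.mean(), (lo, hi)

# Synthetic example: 1000 test samples, ~95% classified correctly
rng = np.random.default_rng(42)
correct = (rng.random(1000) < 0.95).astype(int)
acc, (lo, hi) = bootstrap_accuracy_ci(correct)
```

The 2.5th and 97.5th percentiles of the resampled accuracies form the reported 95% interval; narrower intervals indicate lower sensitivity to the particular test samples drawn.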
5.3. Interpretation of Applied XAI Techniques
In addition to the quantitative analysis of the performance of the implemented models, it was essential to apply XAI techniques with the aim of understanding and interpreting the decisions made by the neural networks. Explainability is a fundamental requirement in clinical contexts, as it not only allows validation of the obtained results but also strengthens the trust of healthcare professionals in the use of AI models as decision-support tools.
In this work, several XAI techniques were employed, namely Grad-CAM (Gradient-weighted Class Activation Mapping), Occlusion Sensitivity, and LIME (Local Interpretable Model-Agnostic Explanations). Each of these techniques provides a distinct perspective on how the model processes the image and makes decisions, allowing a more comprehensive analysis of its internal behavior.
The Grad-CAM technique generates activation maps that highlight the regions of the image that contributed the most to the final prediction. By analyzing the heat maps (
Figure 8), it can be observed that the models mainly focus their attention on the anatomical regions corresponding to the brain tumor. The areas highlighted in red indicate higher relevance in the decision-making process, suggesting that neural networks effectively focus on clinically significant regions. This characteristic is particularly important, as it ensures that predictions are not based on noise or irrelevant image artifacts.
The Grad-CAM maps highlight regions that the models consider most important for classification. In most cases, these regions correspond to the approximate location of the tumor, capturing the general area where abnormal tissue is present. For example, in the VGG-19 and MobileNetV2 outputs, the highlighted areas roughly align with the visible tumor mass in the original MRI slice. The DenseNet201 and Ensemble maps also show activations concentrated in the central or upper portions of the tumor, consistent with the lesion’s main location. Overall, the Grad-CAM maps provide a qualitative understanding of model focus, showing that the networks largely attend to tumor regions, though some false positives are present. This emphasizes the need for careful clinical interpretation and potential integration with more precise segmentation methods for clinical use.
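The computation behind these maps is compact: the activations of the last convolutional layer are weighted by the spatially averaged gradients of the target class score and passed through a ReLU. A framework-agnostic sketch of that step follows, with random arrays standing in for activations and gradients that would be extracted from a real network:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap for one sample.
    feature_maps: (H, W, C) activations of the last conv layer.
    gradients:    (H, W, C) d(class score)/d(activations)."""
    # Global-average-pool the gradients -> one importance weight per channel
    weights = gradients.mean(axis=(0, 1))                       # shape (C,)
    # Channel-weighted sum of the feature maps
    cam = np.tensordot(feature_maps, weights, axes=([2], [0]))  # (H, W)
    cam = np.maximum(cam, 0)                                    # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                                   # normalize to [0, 1] for display
    return cam

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 7, 512))   # placeholder activations
dA = rng.standard_normal((7, 7, 512))  # placeholder gradients
heatmap = grad_cam(A, dA)
```

In practice the low-resolution map (here 7x7) is upsampled to the input size and overlaid on the MRI slice as the red-to-blue heat map.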
The Occlusion Sensitivity technique complements this analysis by measuring the variation in prediction confidence when different parts of the image are occluded. As shown in the figures associated with this technique (
Figure 9), a significant reduction in the probability of classification was observed when the tumor region was partially hidden. This behavior indicates that the model strongly relies on that area for its decision, which supports consistency between the reasoning of the network and the relevant clinical anatomy.
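The procedure itself is model-agnostic and easy to sketch: slide an occluding patch over the image and record how much the class score drops at each position. The example below uses a toy scoring function (mean intensity of a central region) as a stand-in for a network's class probability:

```python
import numpy as np

def occlusion_map(image, score_fn, patch=16, stride=16, fill=0.0):
    """Slide an occluding patch over the image and record the drop in the
    model's score at each position. A larger drop = a more important region."""
    H, W = image.shape[:2]
    base = score_fn(image)                       # confidence on the intact image
    heat = np.zeros(((H - patch) // stride + 1, (W - patch) // stride + 1))
    for i, y in enumerate(range(0, H - patch + 1, stride)):
        for j, x in enumerate(range(0, W - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill   # blank out one window
            heat[i, j] = base - score_fn(occluded)      # score drop
    return heat

# Toy "model": score = mean intensity of the central 32x32 region,
# standing in for a network's class probability.
def toy_score(img):
    return img[48:80, 48:80].mean()

img = np.zeros((128, 128)); img[48:80, 48:80] = 1.0   # bright "lesion" in the center
heat = occlusion_map(img, toy_score)
```

Only the windows overlapping the bright central region produce a score drop, mirroring the behavior described above when the tumor area is hidden.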
The LIME technique, in turn, provides local explanations by identifying which regions of the image contributed positively or negatively to the classification (
Figure 10). This approach enabled a more detailed analysis of the influence of each image segment on the decision-making process. In general, the regions highlighted by LIME showed a high correspondence with those identified by Grad-CAM, reinforcing the consistency of the explanations produced by the different methods.
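LIME's core idea, fitting an interpretable linear surrogate to the model's behavior on perturbed versions of the input, can be illustrated without the library itself. The sketch below substitutes a regular grid of segments for LIME's superpixels and a toy scoring function for the network, so it is a simplification of what the actual `lime` package does:

```python
import numpy as np

def lime_like_weights(image, score_fn, grid=4, n_samples=200, seed=0):
    """Fit a linear surrogate over random on/off maskings of grid segments.
    Returns one weight per segment: positive = segment supports the prediction."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    gh, gw = H // grid, W // grid
    n_seg = grid * grid
    masks = rng.integers(0, 2, size=(n_samples, n_seg))   # random segment on/off
    ys = np.empty(n_samples)
    for k, m in enumerate(masks):
        perturbed = image.copy()
        for s in range(n_seg):
            if m[s] == 0:                                  # switch this segment off
                r, c = divmod(s, grid)
                perturbed[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw] = 0.0
        ys[k] = score_fn(perturbed)
    # Least-squares linear surrogate: score ~ intercept + sum(w_s * mask_s)
    X = np.hstack([np.ones((n_samples, 1)), masks])
    coef, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return coef[1:]                                        # per-segment weights

def toy_score(img):                    # stand-in for a class probability
    return img[32:64, 32:64].mean()    # the "tumor" lives in one quadrant

img = np.zeros((64, 64)); img[32:64, 32:64] = 1.0
w = lime_like_weights(img, toy_score)
```

The surrogate assigns large positive weights only to the segments covering the bright quadrant, which is exactly the kind of region-level attribution shown in the LIME figures.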
Qualitative analysis of the results obtained using the XAI techniques demonstrated that the models developed not only achieved high quantitative performance but also made decisions based on clinically relevant features. This interpretation is crucial in the medical context, as it provides visual and objective evidence that the models focus on tumor regions rather than irrelevant areas of the image.
In summary, the use of XAI techniques proved indispensable in this study by enabling a qualitative assessment of model behavior, offering insight into the consistency and robustness of the implemented models, and reinforcing their potential usefulness as diagnostic support tools. Thus, explainability contributes to the safe and responsible integration of AI systems in clinical environments, promoting more transparent, trustworthy, and patient-centered medicine.
5.4. Vision Transformer
Figure 11 presents the results obtained with the Vision Transformer model applied to the test dataset. In each brain MRI image, it is possible to observe the indication of the true label (True) and the prediction of the model (Pred). The same figure shows correctly classified cases across all categories, where the model was able to identify both images of brains without tumors (no tumor) and different types of brain tumors, such as glioma, meningioma, and pituitary. This qualitative result demonstrates the ability of the ViT model to distinguish complex visual patterns in images, even when the differences between classes may be subtle. These observations complement the quantitative metrics obtained during model evaluation. The results show that the model can generalize effectively and correctly identify different conditions present in the images. Thus, qualitative analysis supports the quantitative results, reinforcing the robustness of the ViT model in the task of classifying brain tumors in magnetic resonance imaging.
For the Vision Transformer model, the HiResCAM technique was chosen because it is more suitable for architectures of this type.
Figure 12 shows the application of the HiResCAM method [
47], adapted for Transformers, on a correctly classified glioma image. It can be observed that the region highlighted by the model corresponds to the tumor location: the red and yellow areas indicate regions of higher relevance assigned by the model during the tumor identification process, whereas the blue areas represent regions of lower influence.
6. Discussion of the Results
The comparative evaluation of the different models implemented allowed for a comprehensive analysis of the impact of DL architectures on the performance of brain tumor classification on MRI. The results obtained demonstrate that, although all CNNs exhibited satisfactory performance, there were significant differences in terms of the accuracy, robustness, and interpretability of the models.
Comparative Analysis of Model Performance
According to the metrics presented in
Section 3, it was observed that the EfficientNetB3, Ensemble, and Vision Transformer models achieved the best balance between accuracy, recall, and AUROC, demonstrating a high capacity for generalization and effective classification of tumor patterns.
The EfficientNetB3, in particular, stood out due to its optimized architecture that balances depth, width, and resolution, resulting in accuracy values greater than 94%, with high recall and a low number of false negatives. This outcome confirms the efficiency of the model in recognizing subtle structural variations in brain tissue.
The ResNet-50 model also showed consistent performance, with an AUROC close to 0.98, a result that can be attributed to the use of residual connections, which mitigate the vanishing gradient problem and allow for more stable training in deep networks. Although slightly inferior to EfficientNetB3 in terms of accuracy, ResNet-50 demonstrated high sensitivity, making it particularly useful in clinical contexts where the priority is to detect all positive cases, even at the cost of some false positives.
The DenseNet201 model achieved results comparable to the previous two networks. Its dense layer connections promote feature reuse and facilitate information propagation, contributing to stable and efficient training performance.
However, models such as VGG-19 and MobileNetV2 showed lower accuracy and recall, ranging between 89% and 90%. The VGG-19, despite its architectural simplicity, exhibited reduced efficiency in detecting tumors with diffuse boundaries, possibly due to the absence of internal normalization mechanisms and skip connections. The MobileNetV2, designed for environments with limited computational resources, offered a good trade-off between performance and inference speed, making it a suitable alternative for mobile applications or hospital systems with hardware constraints.
The Xception model demonstrated performance similar to EfficientNetB3 in terms of precision, indicating the relevance of depthwise separable convolutions in extracting complex textural patterns. However, it exhibited greater variability between classes, suggesting lower robustness in distinguishing tumors with similar morphologies.
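The depthwise separable convolutions used by Xception (and MobileNetV2) factor a standard convolution into a per-channel spatial filter followed by a 1×1 pointwise mixing step, cutting parameters from k²·c_in·c_out to k²·c_in + c_in·c_out. A naive numpy sketch of the forward pass ('valid' padding, stride 1; loops kept explicit for clarity):

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise step: each input channel is filtered independently
    by its own k x k kernel. Pointwise step: a 1x1 convolution
    (matrix multiply over channels) mixes the filtered channels.
    x: (c_in, H, W), dw_kernels: (c_in, k, k), pw_weights: (c_out, c_in)."""
    c_in, h, w = x.shape
    k = dw_kernels.shape[-1]
    oh, ow = h - k + 1, w - k + 1
    dw = np.zeros((c_in, oh, ow))
    for c in range(c_in):                       # one kernel per channel
        for i in range(oh):
            for j in range(ow):
                dw[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * dw_kernels[c])
    # pointwise 1x1 convolution: mix channels at each spatial location
    return np.einsum('chw,oc->ohw', dw, pw_weights)

x = np.ones((3, 5, 5))
out = depthwise_separable_conv(x, np.ones((3, 3, 3)), np.ones((4, 3)))
```

For this 3-channel, 4-output, 3×3 example, the factored form uses 3·9 + 3·4 = 39 weights instead of 9·3·4 = 108 for a standard convolution, which is why these layers extract rich textural patterns at a fraction of the cost.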
The InceptionV3 model maintained satisfactory performance, with accuracy around 92%, but showed a higher sensitivity to overfitting, possibly due to the high dimensionality of the data and the large number of trainable parameters.
The Ensemble model, which combined ResNet18, EfficientNetB0, and MobileNetV3, showed a slight overall improvement in performance, achieving accuracy close to 93%. The use of ensembling proved advantageous by integrating the individual strengths of different architectures, reducing variance, and increasing the reliability of the prediction. This behavior is consistent with recent literature, where ensemble models have been successfully applied in complex medical diagnosis tasks.
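A common way to combine such architectures is soft voting: each model's logits are converted to class probabilities, the probabilities are averaged, and the argmax is taken. A minimal numpy sketch (the logit values are invented for illustration and do not come from the trained models):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def soft_vote(logits_per_model):
    """Average the class probabilities of several models and take the
    argmax. Averaging reduces the variance of any single model's
    prediction, which is the main benefit of ensembling."""
    probs = np.mean([softmax(l) for l in logits_per_model], axis=0)
    return probs, probs.argmax(axis=-1)

# Three hypothetical models scoring one sample over four classes:
m1 = np.array([[2.0, 1.0, 0.0, 0.0]])  # alone, predicts class 0
m2 = np.array([[0.0, 2.5, 0.0, 0.0]])  # predicts class 1
m3 = np.array([[0.0, 2.0, 0.5, 0.0]])  # predicts class 1
probs, pred = soft_vote([m1, m2, m3])
```

Here the ensemble overrules the first model's outlier vote, illustrating how combining models with different inductive biases increases prediction reliability.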
The ViT achieved highly promising results, with an accuracy greater than 98% and a good balance between precision and recall. It demonstrated strong generalization and robust performance in identifying diffuse tumor regions. This effectiveness is due to its attention-based architecture, which captures long-range spatial dependencies in medical images, addressing a typical limitation of traditional CNNs.
The superior performance of ViT can be attributed to several factors. Unlike convolutional neural networks, which capture local patterns through receptive fields, ViTs leverage self-attention mechanisms that allow modeling long-range dependencies and global relationships across the entire image. This is particularly advantageous for brain MRI images, where tumor characteristics may span large spatial regions. By dividing the image into patches and embedding them linearly, the ViT can capture structural and textural patterns at multiple scales, improving its ability to discriminate between tumor classes. The ViT model benefits from deep transformer blocks with residual connections and layer normalization, which enable efficient feature extraction and representation learning without losing information across layers. With attention-based global aggregation and dropout regularization, ViTs can generalize well even with limited training data, which is often the case in medical imaging datasets. These architectural characteristics collectively explain why the ViT outperforms classical CNN architectures on this dataset.
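The patch-embedding step described above can be sketched in a few lines of numpy: the image is cut into non-overlapping p×p patches, each patch is flattened, and a shared linear projection maps it into the model dimension (the projection matrix here is random and purely illustrative):

```python
import numpy as np

def patchify(img, p):
    """Split an H x W image into non-overlapping p x p patches and
    flatten each one, as in the ViT input pipeline. Rows of the
    result are the token sequence fed to the transformer."""
    h, w = img.shape
    patches = img.reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

def embed(patches, proj):
    """Shared linear projection of flattened patches into the
    model dimension (before adding positional embeddings)."""
    return patches @ proj

# A 224x224 single-channel slice with 16x16 patches yields a
# sequence of 196 tokens, each a flattened 256-value patch.
img = np.arange(224 * 224, dtype=float).reshape(224, 224)
patches = patchify(img, 16)
tokens = embed(patches, np.random.default_rng(0).standard_normal((256, 64)))
```

Self-attention then relates every one of these 196 tokens to every other in a single layer, which is how the ViT models dependencies spanning the whole slice rather than a local receptive field.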
The application of XAI techniques, namely Grad-CAM, LIME, and Occlusion Sensitivity, supported the interpretability of the models. The heatmaps generated by Grad-CAM showed that the networks consistently focused on relevant tumor regions, demonstrating clinical coherence.
LIME reinforced this evidence by identifying the areas of the image that contributed the most to the final classification, facilitating expert validation. The Occlusion Sensitivity, in turn, confirmed the robustness of the models by showing that the occlusion of tumor regions caused significant drops in prediction confidence, demonstrating the focus of the models on regions of interest.
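The occlusion-sensitivity procedure is simple enough to sketch directly: slide a baseline-valued patch over the image, re-run the model, and record the confidence drop at each position. A minimal numpy version with a toy `predict_fn` standing in for the trained network (the stand-in simply scores one corner of the image, so the map highlights exactly that region):

```python
import numpy as np

def occlusion_map(img, predict_fn, patch=16, baseline=0.0):
    """Slide a baseline-valued patch over the image and record how
    much the model's confidence drops when each region is hidden.
    Large drops mark regions the model relies on for its prediction."""
    h, w = img.shape
    base_prob = predict_fn(img)
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = img.copy()
            occluded[i:i + patch, j:j + patch] = baseline
            heat[i // patch, j // patch] = base_prob - predict_fn(occluded)
    return heat

# Toy stand-in for a classifier: "confidence" is the mean intensity
# of the top-left 16x16 region, so only occluding that region matters.
img = np.ones((32, 32))
heat = occlusion_map(img, lambda im: im[:16, :16].mean(), patch=16)
```

In the real pipeline `predict_fn` would return the softmax probability of the originally predicted class, and the resulting heatmap is upsampled and overlaid on the MRI slice.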
Integration of these techniques improves the transparency of the system and its feasibility for clinical use, as it enables radiologists to understand the reasoning behind the model’s decisions, minimizing the perception of opacity often associated with DL systems.
The regions highlighted by the models were evaluated in comparison with expert knowledge and the available MRI annotations. In most cases, the heatmaps align well with the tumor boundaries, capturing the main regions of interest for classification. Occasional activations were also observed in regions outside the tumor, which likely reflect spurious correlations or limitations in slice-level labeling. These false highlights are discussed as a limitation, emphasizing the importance of careful interpretation in a clinical context.
7. Conclusions
This work demonstrated the potential of AI in automatic multi-class classification of brain tumors from magnetic resonance imaging. The implementation and evaluation of different architectures, namely VGG-19, ResNet-50, EfficientNetB3, Xception, MobileNetV2, DenseNet201, InceptionV3, the Ensemble, and the Vision Transformer, allowed performance comparison and identification of the most suitable solutions for the proposed problem.
The results obtained, with accuracy values exceeding 98% and high precision, recall, and AUROC metrics, confirm the effectiveness of the models in distinguishing between different types of tumors and healthy cases. Furthermore, the use of XAI techniques such as Grad-CAM, LIME, and Occlusion Sensitivity contributed to improving model transparency, providing qualitative insights into model behavior, and strengthening clinical confidence in diagnostic support. In the present work, the primary focus was on a qualitative interpretability analysis aimed at providing intuitive and visual insights into model behavior. Future developments will include quantitative evaluations, such as ablation studies, to further validate the interpretability results.
Among the main contributions of this work are the comparative exploration of various advanced DL architectures, the application of data augmentation and hyperparameter tuning techniques to improve robustness and generalization, and the integration of interpretability methods that bring AI closer to clinical practice.
Despite the encouraging results, some limitations were identified, such as the reliance on public datasets with potential class imbalances and the need for validation in real clinical environments. Nevertheless, this work shows that AI-based solutions can significantly reduce the time required for medical image analysis, support early diagnosis, and alleviate the workload of healthcare professionals.
Therefore, it can be concluded that the integration of AI into medical practice represents a promising path towards more efficient, preventive, and patient-centered healthcare, which should be continuously improved through future research aimed at expanding data diversity, exploring new architectures, and consolidating the clinical application of the solutions proposed here.