1. Introduction
Brain tumors develop when abnormal tissue structures grow in the brain [
1,
2]. The broad mortality range of these tumors makes them some of the most serious and dangerous conditions encountered in clinical practice [
3]. In neuro-oncology, brain tumors are commonly grouped into meningiomas, glioblastomas, and gliomas according to their location, morphology, and size [
4]. Gliomas are more common than any other major brain tumor, accounting for approximately 26.5% of all malignant tumors of the brain and nervous system. Currently, many systems classify these tumors according to their site of origin and extent of spread [
2,
5]. However, reliable identification is difficult because tumor contours differ considerably, their dimensions vary from case to case, and tissues behave unpredictably. Their location in the body can add to this confusion. Furthermore, there are variations in scanning protocols and imaging techniques [
3].
According to the World Health Organization (WHO), gliomas are classified by tissue type into groups I to IV. Those in groups I and II are considered low-grade gliomas (LGGs), and those in groups III and IV are considered high-grade gliomas (HGGs) [
6,
7]. LGGs typically grow slowly, taking months or years to develop, whereas HGGs spread rapidly, behave in an extremely malignant manner, and worsen quickly. It is generally agreed that LGG patients can have a good prognosis after the tumor is removed. In contrast, HGG patients face significant challenges to their long-term survival after surgery, as they must continue to receive prolonged radiation and chemotherapy treatments. Accurate classification is therefore important for planning treatment and predicting disease progression [
8].
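The grade grouping described above can be written as a small lookup. This is only an illustrative sketch: the function name `glioma_group` is hypothetical and does not come from the study.

```python
def glioma_group(who_grade: int) -> str:
    """Map a WHO glioma grade (1-4, i.e. I-IV) to the LGG/HGG grouping."""
    if who_grade in (1, 2):
        return "LGG"  # low-grade glioma: grades I and II
    if who_grade in (3, 4):
        return "HGG"  # high-grade glioma: grades III and IV
    raise ValueError(f"WHO grade must be between 1 and 4, got {who_grade}")


for grade in (1, 2, 3, 4):
    print(grade, glioma_group(grade))
```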
The primary diagnostic tool in brain cancer diagnosis is imaging, which provides a detailed overview of where the tumor is located in the brain, as well as its size and volume. In recent years, imaging technology has evolved into one of the most important diagnostic tools available to oncologists and radiologists [
9]. As a result of technological advancements, radiologists have been able to utilize a wide range of imaging techniques, including X-rays, computed tomography (CT), electroencephalography (EEG), ultrasound, positron emission tomography (PET), single-photon emission computed tomography, and magnetic resonance imaging (MRI). Each of these has contributed to more accurate diagnoses of brain tumors and improved the basis for future treatment decisions [
10,
11].
Of the imaging methods mentioned above, CT and MRI are standard techniques for detecting brain tumors and for mapping the regions of the brain that are involved [
5,
12,
13,
14]. In particular, MRI has become the preferred non-invasive modality for glioma detection because it produces high-resolution, multi-sequence images that provide complementary views of damaged tissue [
1,
2,
8,
15,
16]. In comparison with CT, MRI offers more detailed anatomical information and markedly better soft-tissue contrast, which helps clinicians distinguish normal from diseased structures with greater confidence in practice [
2,
15,
17].
The workload placed on radiologists is a significant problem for healthcare systems [
18]. In clinical practice, manual segmentation of gliomas is extremely time-consuming but contributes significantly to patient health outcomes. Progress on this problem involves developing the modules of an automated workflow (pre-processing, segmentation, feature extraction, and classification) before clinical validation [
19]. Early diagnosis and classification with improved prediction accuracy are the most crucial steps in identifying and treating brain tumors to save a patient’s life [
14]. The literature addresses at least five domains that have benefited from the validation of DL and artificial intelligence, including reducing scan times, automating segmentation, optimizing workflows, decreasing reading times, and achieving general time savings or workload reductions [
20]. A comprehensive understanding of brain illnesses, including the classification of brain tumors, is necessary to assess the pathological tissue and assist patients in receiving the proper treatment based on their classification [
21]. There are several health-related and cognitive symptoms associated with brain tumors, and from a clinical perspective, it is difficult and time-consuming to diagnose and treat the wide variety of possible consequences of these tumors precisely. A multimodal assessment in clinical services is crucial for an accurate prediction of future health outcomes [
22], and assumptions about health and behavioral correlations with tumor classification increase reliability.
The diagnosis and classification of brain tumors often require MRI scans in three planes, axial, sagittal, and coronal, along with neurological evaluations and, if feasible, biopsy. Because symmetry in the axial and coronal planes is a hallmark of a healthy brain, MRI is a diagnostic tool for brain tumors, epilepsy, neurological disorders, and other conditions. Asymmetries in pixel intensity in axial MR brain images indicate a pathological state of the brain [
5,
23].
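To make the idea of intensity asymmetry concrete, the sketch below compares an axial slice with its left-right mirror image. It is a deliberately simplified illustration, not the method of the cited works: it assumes the slice is already aligned so that the midline falls in the center of the image, and the function name `asymmetry_score` and the toy arrays are hypothetical.

```python
def asymmetry_score(image):
    """Mean absolute difference between a 2D slice and its left-right mirror.

    `image` is a list of rows of pixel intensities. A score of 0 means the
    slice is perfectly symmetric about its vertical midline.
    """
    total = 0.0
    count = 0
    for row in image:
        mirrored = row[::-1]  # flip the row left to right
        for original, flipped in zip(row, mirrored):
            total += abs(original - flipped)
            count += 1
    return total / count if count else 0.0


# Toy 2x4 "slices" (hypothetical intensities).
symmetric_slice = [[0, 5, 5, 0], [1, 7, 7, 1]]
lesion_slice = [[0, 5, 5, 0], [1, 7, 9, 4]]  # brighter region on one side

print(asymmetry_score(symmetric_slice))
print(asymmetry_score(lesion_slice))
```

In practice, a raw score like this is sensitive to head tilt and to normal anatomical variation, so it would only be a coarse indicator.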
Multiple sequences, such as T1-weighted (T1), contrast-enhanced T1-weighted (T1ce), T2-weighted (T2), and fluid-attenuated inversion recovery (FLAIR), may reveal distinct tissues around the lesion site [
6,
13].
In their daily work, radiologists routinely view multiple MRI sequences simultaneously to try to diagnose the type of tumor. This practice is necessary but overwhelming, especially because gliomas present a wide range of morphological patterns that complicate visual interpretation and slow the diagnostic process. Furthermore, the final decision is heavily influenced by the radiologist’s training and experience, so personal judgment and other factors affect diagnostic accuracy. For these reasons, there is a strong motivation to reduce the workload of radiologists while simultaneously improving the reliability of diagnostic results [
23,
24,
25]. Taking into account the functional structure of the brain and the particularity of its tissues, it becomes clear why it is not easy to distinguish between the various types of tumors, such as gliomas, meningiomas and glioblastomas, especially when there are conditions that present neurological and neuropsychological symptoms that are intertwined [
26].
These challenges have resulted in significant interest in computer-aided diagnostic systems (CADs), since they allow for automatic and timely detection of diseases. By combining medical knowledge and expertise, these systems facilitate a quicker and often more accurate diagnosis of pathological abnormalities. They typically capture medical images, analyze them using various image processing and DL processes, and evaluate whether a specific disease is present or not [
27].
Recent advances in DL have been made in areas such as computer vision and human language understanding. This growth has been extended to imaging analysis in medicine in a significant way [
3,
16,
28,
29]. DL models excel at managing unstructured data and progressively identifying various elements within the analyzed information. In order to achieve this, they construct a multi-layered feature hierarchy that evolves from basic patterns to abstract and semantically meaningful representations [
30].
In the broader landscape of DL, CNNs are distinguished for their ability to solve challenging problems in computer vision [
31,
32,
33]. CNN-based solutions are commonly used as the first line of computation when the quality of MRI is suboptimal and may compromise clinical judgment [
34,
35]. In addition to combining feature extraction with classification, CNNs are able to learn the image features that are most relevant for discrimination automatically [
36,
37]. As a consequence, transfer learning (TL) has emerged as one of the most effective methods for detecting and classifying brain tumors at an early stage. Since TL makes use of parameters learned from large, related datasets, it has proven especially useful in medical imaging applications, such as the evaluation of brain lesions [
38].
Recent CNN architectures differ not only in the depth of the network and the number of filters but also in the ways they fuse information coming from different spatial scales or contexts. At the same time, advances in hardware—particularly in GPU and CPU performance—have mitigated some of the substantial computational costs associated with training and deploying such models, encouraging their use in both research and clinical environments [
39].
Due to the steady increase in the number of patients undergoing brain MRI scans, it has become increasingly difficult to rely solely on manual image interpretation, which can be slow, subject to fatigue-related errors, and difficult to standardize. Consequently, there is a strong need for novel CAD systems that support and streamline the interpretation of brain MRIs [
40]. However, many current CNN-based methods do not exploit all of the complementary information that MRI sequences can provide. Many of these approaches also rely on predefined regions of interest (ROIs) to assist in classification, which adds another layer of complexity to the pipeline [
16].
The present study examines a DL-based approach for discriminating high-grade from low-grade tumors within this context. Six Deep Neural Networks (DNNs) dedicated to binary tumor classification were trained and evaluated.
A custom dataset was constructed, and the following models were trained:
deit3_base_patch16_224, inception_v4, xception41, efficientnet_b0, convnextv2_tiny, and swin_tiny_patch4_window7_224. Each model was configured with two neurons in the output layer for binary classification.
The novelty of this contribution lies in enhancing tumor classification accuracy in medical imaging, particularly in detecting malignant cells, by evaluating state-of-the-art DL algorithms and leveraging transfer learning (TL) techniques. This study examines the challenges of clinically classifying high-grade and low-grade gliomas. The purpose is to enhance the accuracy and efficacy of a classification task in complicated clinical settings with many medical images. The main contributions of this work are as follows:
Development of a novel approach for tumor classification in medical imaging.
Benchmarking state-of-the-art DL models through hyperparameter optimization.
Preparation of a bespoke dataset for tumor classification in medical imaging.
Optimization of six DNNs to enhance classification performance.
Exploration of the impact of applying DL and TL techniques in medical diagnostics for tumor classification.
The structure of this paper is as follows:
Section 2 provides an overview of related works in brain tumor classification.
Section 3 presents a detailed explanation of the model training process.
Section 4 demonstrates the performance of the proposed method and offers a comparative analysis with other binary classification approaches. This section also describes a real-world use scenario, addresses the limitations and future work of the study and provides a structure for future research. Finally,
Section 5 summarizes the conclusions drawn from this study.
2. Related Works
A number of studies have addressed the relevance of these techniques for feature extraction in neuroimaging. The accuracy reported varies across the classifiers developed or applied, indicating a dependence on the model and methodology used. In [
1], using a dual-CNN approach, internal features were extracted from brain images, and a second CNN module was used to classify these features. As a result of using the BRATS-IXI dataset for training, the authors achieved an accuracy of 98.85%. In [
41], the authors developed a novel multimodal classification approach leveraging DL for brain tumor type categorization. For BraTS2015, BraTS2017, and BraTS2018, the methodology achieved accuracies of 97.8%, 96.9%, and 92.5%, respectively.
In a related study, [
6], a discrepancy-sensitive autodistillation technique for glioma classification was presented. Using the BRATS2018 and BraTS2019 datasets, four imaging modalities (T1, T2, T1ce, and FLAIR) were combined for the binary classification of LGGs and HGGs. The evaluated models included ResNet-101, DenseNet-169, and ConvNeXt-S, with the highest accuracy recorded at 93.6%. Similarly, [
8] employed various versions of the BRATS datasets. The authors developed an innovative automated system named the Spatial Adaptive Dart Optimized Network (SADO-Net). The system was tested using the BRATS 2018, BRATS 2019, BRATS 2020, and Figshare datasets, achieving an impressive average accuracy of 99.2% in tumor detection, as well as the following results in other datasets: BRATS 2015 CNN-SVM, 94.11%; BraTS 2017 Hybrid-DANet, 97.23%; REMBRANDT CRNN, 98.49%; BRATS 2015 DNN, 93.10%; BRATS 2015, U-Net architecture, 83.23%; BRATS 2015, VGG19, 96.71% [
24].
CNNs can be customized for multi-class classification to more accurately develop an ideal architecture for brain tumor classification. In [
13], a CNN architecture was modified through domain knowledge, and an evolutionary optimization approach was proposed to select hyperparameters. Tests conducted on the BraTS2020 and BraTS2021 datasets demonstrated an enhanced average accuracy of 98% and a maximum single-classifier accuracy of 99.80%. In addition, in [
42], four designs were constructed, each with unique layers and hyperparameters. The images are passed through convolutional layers for feature extraction and then through a softmax function for classification. According to the authors' experimental analysis, the proposed CNN-based strategy achieves accuracy, precision, recall, and F1-Score of 99.76%, 99.64%, 99.62%, and 99.64%, respectively, together with a micro-averaged Matthews correlation coefficient (MCC) of 0.929. In [
43], the authors proposed a spatial residual module (SRM) for volumetric glioma complexity representation, utilizing a 3D CNN design. The authors integrated Swin UNETR, a pre-trained segmentation model, to enhance the network without additional training. ResMT was tested on the BRATS 2019 dataset, achieving a maximum prediction accuracy of 97.01%. This work underscores the potential of hybrid CNN–Transformer models for classifying 3D magnetic resonance images.
According to the literature, binary image classification has been widely pursued in order to improve accuracy. A custom dataset of MRI scans was used in [
16] to evaluate the performance of ResNet18, which achieved an accuracy of 95.54%. The study in [
24] provides another example, using the BRATS 2020 dataset. The authors proposed a ConvNet-ResNeXt101 model for the classification of tumors, which achieved an accuracy of 99.27%.
Meanwhile, [
28] compared ResNet50, EfficientNetB3, and VGG-19 models on an MRI dataset from Kaggle. EfficientNetB3 achieved a training accuracy of 99.44% but a validation accuracy of 89.47%, highlighting the overfitting problem. In [
44], authors proposed a binary classification and detection approach to address this problem and to detect brain cancers earlier. This was made possible by the TL approach with pre-trained ResNet50, VGG19, and InceptionV3 DL models. The pre-trained methods InceptionV3, ResNet50, and VGG19 obtained accuracy rates of 99.72%, 98.84%, and 94.65%, respectively. In [
29], the BRATS 2021 dataset was utilized for a binary classification task. Based on deeplabV3+, the authors developed the Neuro-XAI model, an explainable DL framework. In comparison to previous strategies, the approach demonstrated an improved performance, achieving 97% classification accuracy and 98% overall accuracy.
As reported in [
45], the authors used the BRATS 2018 dataset to evaluate five pre-trained CNN models: AlexNet, VGG16, GoogleNet, ResNet18, and ResNet50, which were originally trained on the ImageNet database. The purpose of this study was to improve the accuracy and reliability of tumor classification by exploiting architectural diversity through DL ensemble techniques. The authors reported an accuracy of 97.47%, which they described as highly successful relative to the state of the art.
A similar framework was developed by the authors of [
46] for detecting and classifying HGGs and LGGs in brain MRI scans. In the second stage, tumor segmentation, skip connections and residual units were used to segment the tumor. The detection and classification stage achieved an accuracy of 99.6% on 1800 images from the BraTS 2017 dataset. The Dice score, specificity, and sensitivity were used as evaluation metrics.
The research presented in [
47] used the Preet Viradiya Brain Tumor dataset. Different CNN models and a proposed CNN architecture were trained to optimize brain cancer detection. The proposed CNN model ranked higher than others, achieving an accuracy of 97.5%. Also, [
17] presented an advanced Dual DCNN model for the purpose of successfully classifying malignant and non-malignant MRI scanned images. The model was assessed using the Br35H dataset from Kaggle. Using the method, impressive results were achieved: 99% accuracy, 99% precision, 98% recall, and 99% F1-Score.
The authors of [
48] used the BRATS 2017 dataset to train a model named IMPA-Net, which was designed to enhance the interpretability and reliability of brain tumor classification results. Model performance demonstrated a classification accuracy of 92.12%.
In [
49], the authors proposed a hybrid system that could help in the early detection and classification of brain tumors. On the public Figshare dataset, the system achieved an accuracy of 98.89% for two classes. A number of state-of-the-art models, including AlexNet, VGG-16, DenseNet-201, VGG-19, GoogleNet, and ResNet-50, were significantly improved upon by this model.
The authors of [
50] investigated Machine Learning (ML) techniques for classifying brain tumors, including gliomas and meningiomas. According to this study, the SVM model in combination with LBP and HOG achieved an accuracy of 97%, whereas a deep CNN model achieved 98%.
In [
51], three models were trained to classify brain tumors: VGG19, Inception V3, and MobileNetV2. The researchers used the Kaggle Brain X-ray image dataset, which includes images from two sources, including BRATS 2020. Based on the results, VGG19 was the most accurate model, with an accuracy of 98.58%.
In addition, a new attention-based glioma grading network (AGGN) was proposed in [
52]. According to the authors, the AGGN model was evaluated on the BRATS 2018 and BRATS 2019 datasets and demonstrated an accuracy of 96.12%.
The authors of [
53] developed a Coarse-to-Fine Feature Fusion Network (CFNet) to integrate multimodal visual information through modal interaction, semantic perception, and feature fusion. In order to evaluate the proposed CFNet, two publicly available datasets were used: BraTS2019 and BraTS2020.
Generally, the reviewed literature indicates that glioma classification performance is often reported under heterogeneous experimental protocols, which makes comparisons difficult and can overstate applicability in the real world. Common shortcomings include (i) gaps in transparency regarding data partitioning and the risk of patient-level leakage, (ii) evaluation that often remains confined to one benchmark without external testing under a domain shift, (iii) an emphasis on accuracy with insufficient attention to computational cost, memory/energy demands, and feasibility in resource-constrained settings, and (iv) a lack of clinically grounded interpretability analyses that would help establish trust and support decision-making. Based on these limitations, the present study adopted a clearly defined training–testing procedure, benchmarked multiple modern architectures under consistent conditions, reported both predictive performance and computational complexity, and utilized Grad-CAM representations as a first step toward more accurate and clinically interpretable brain tumor classification.
4. Results and Discussion
4.1. Test-Set Performance Across Hyperparameter Configurations
All hyperparameter exploration was performed on the training partition (using the internal k-fold validation procedure), and neither model selection nor hyperparameter adjustment was ever guided by the test set. To determine the final generalization performance of each configuration/model, the test set was kept fixed and examined only after training. This ensured that the reported test accuracy was independent of the optimization process. In
Figure 7, the parallel coordinate plot displays the optimization process across the multiple deep learning architectures used in this study with the hyperparameters specified in
Table 2, with training loss as the primary objective function and test accuracy to report the post hoc evaluation of each configuration using the selected hyperparameters. The configuration described in
Table 2 was modified iteratively across the six proposed architectures.
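The iteration over configurations and architectures described above can be sketched as a simple grid enumeration. This is an illustrative outline only: the actual search space is the one in Table 2, which is not reproduced here, so the candidate values below are placeholders (only the batch size of 64 and the 25% training fraction appear in the text), and the helper names `enumerate_configs` and `select_best` are hypothetical.

```python
from itertools import product

ARCHITECTURES = [
    "deit3_base_patch16_224",
    "inception_v4",
    "xception41",
    "efficientnet_b0",
    "convnextv2_tiny",
    "swin_tiny_patch4_window7_224",
]

# Placeholder search space (assumption); see Table 2 for the real values.
SEARCH_SPACE = {
    "learning_rate": [1e-3, 1e-4],   # hypothetical candidates
    "batch_size": [32, 64],          # 64 is mentioned in the text; 32 is hypothetical
    "train_fraction": [0.25, 0.50],  # 0.25 is mentioned in the text; 0.50 is hypothetical
}


def enumerate_configs(architectures, space):
    """Yield one dict per (architecture, hyperparameter combination)."""
    keys = list(space)
    for arch in architectures:
        for values in product(*(space[key] for key in keys)):
            yield {"architecture": arch, **dict(zip(keys, values))}


def select_best(configs, objective):
    """Return the configuration with the lowest objective value.

    `objective` stands in for the training-side loss; it must never be
    computed on the held-out test set.
    """
    return min(configs, key=objective)


configs = list(enumerate_configs(ARCHITECTURES, SEARCH_SPACE))
print(len(configs), "configurations to evaluate")
```

Test accuracy would then be computed once per selected configuration, after the search has finished.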
A large share of the architectures converge toward high accuracy, with performance on the independent test set of unseen images consistently approaching 99%, which suggests that the optimization procedure is reliable.
Figure 8 further illustrates the success of the hyperparameter approach, demonstrating tighter convergence of the evaluated models. As shown in the figure, the results were obtained when 25% of the dataset was used, with the initial learning rate set at
. The test accuracy is included in the diagram only as a post hoc performance indicator to show that the selected configuration generalizes well on unseen data; importantly, the test set was not used during hyperparameter optimization or model selection, and it is reported solely for independent evaluation.
Across all architectures, the test accuracy is 99%, with minimal loss, once batch sizes and learning rates are optimized. Because the test set is strictly separated from the training and validation sets under patient-wise data separation, these results indicate that the optimization pipeline generalizes across all of the networks evaluated.
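A patient-wise separation of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the test fraction, the number of folds, the identifier format, and the function name `patient_wise_split` are all hypothetical. The key property is that every image of a given patient ends up in exactly one partition.

```python
import random
from collections import defaultdict


def patient_wise_split(samples, test_fraction=0.2, k=5, seed=0):
    """Split (patient_id, image_id) pairs by patient.

    Returns (folds, test_patients, images_by_patient), where `folds` is a
    list of k disjoint lists of training patients used for internal
    k-fold validation and `test_patients` is held out entirely.
    """
    images_by_patient = defaultdict(list)
    for patient_id, image_id in samples:
        images_by_patient[patient_id].append(image_id)

    patients = sorted(images_by_patient)
    random.Random(seed).shuffle(patients)  # reproducible shuffle

    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = patients[:n_test]
    train_patients = patients[n_test:]

    # Round-robin assignment of the remaining patients to k folds.
    folds = [train_patients[i::k] for i in range(k)]
    return folds, test_patients, images_by_patient


# Hypothetical data: 20 patients with 3 slices each.
samples = [(f"p{p:02d}", f"p{p:02d}_slice{s}") for p in range(20) for s in range(3)]
folds, test_patients, images_by_patient = patient_wise_split(samples)
print(len(test_patients), [len(fold) for fold in folds])
```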
4.2. Statistical Summary of Test Performance
Performance variability on an independent test set was examined to enhance the statistical rigor of the given point estimates. This section provides the mean and standard deviation (Std Dev) of the test metrics for each design and a distributional visualization (boxplots) to study consistency over multiple measurements in order to show dispersion and outliers. Together, these descriptions provide a more detailed overview of the reliability of the models than would be possible with the test accuracy scores alone.
As a clear indicator of performance variability between runs,
Table 3 presents the Std Dev of the primary test metrics (accuracy, F1-Score, precision, recall, and MCC) for each tested model. Higher Std Dev values imply that the model’s test performance is more susceptible to the specific training settings (e.g., fold assignment and stochastic optimization effects), whereas lower Std Dev values show more consistent generalization.
This comparison shows that architectures such as Xception41 and Inception_v4 exhibit comparatively smaller Std Devs across multiple metrics, indicating more stable behavior under the evaluated protocol. Conversely, swin_tiny_patch4_window7_224 and convnextv2_tiny show larger dispersions, particularly in F1-Score, recall, and MCC, indicating higher variability in the balance between sensitivity and specificity across runs. Overall, this analysis contextualizes the high performance values reported earlier. It shows not only how well the models perform but also how reliably they sustain performance under repeated evaluations.
The Avg columns report the mean test performance across repeated runs, providing an estimate of each model’s typical generalization level under the evaluated protocol. When interpreted together with the corresponding Std Dev, these averages enable a more rigorous assessment of robustness: higher Avg reflects stronger expected performance, while lower Std Dev indicates greater stability and reduced sensitivity to training stochasticity or fold variability. Based on this joint interpretation, Xception41 and Inception_v4 emerge as the most robust models, combining consistently high mean test performance with the lowest dispersion across metrics, while deit3_base_patch16_224 and efficientNet_B0 follow with strong averages and moderate variability.
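The Avg and Std Dev summaries discussed above can be computed per metric as in the short sketch below. The run values are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption, since the text does not state which was used.

```python
import statistics


def summarize_runs(values):
    """Return the mean and sample standard deviation of repeated-run scores."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample Std Dev (n - 1 denominator)
    }


# Hypothetical test accuracies from four repeated runs of one model.
run_accuracies = [0.990, 0.992, 0.988, 0.991]
summary = summarize_runs(run_accuracies)
print(f"Avg = {summary['mean']:.5f}, Std Dev = {summary['std']:.5f}")
```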
In addition to the Std Dev summary,
Figure 9 shows the whole distribution of test-set performance across runs for each architecture. Boxplots show the median and interquartile range (IQR), while dispersed dots and whiskers draw attention to outliers and dispersion. A more thorough examination of model stability that goes beyond single-point accuracy ratings is supported by this perspective.
Greater consistency among models is shown by tighter boxes and shorter whiskers, whereas sensitivity to training circumstances is indicated by broad IQRs and a large number of outliers. The concentrated distribution of Inception-v4 and the continuous strong central tendency across measures demonstrate stable generalization. Additionally, Xception41 shows a compact distribution with a relatively small spread for accuracy, precision, and recall, demonstrating dependable behavior under repeated assessment. However, EfficientNet-B0 retains a high median performance across measures despite having more outliers and a wider dispersion than most compact models. This indicates generally good results with sporadic fluctuations. In terms of central tendency, deit3_base_patch16_224 likewise performs well on tests; nevertheless, in comparison to the most compact distributions, its boxplots have a larger spread and more outliers in various metrics (most notably F1 and MCC).
By contrast, swin_tiny_patch4_window7_224 and convnextv2_tiny display wider distributions, especially in F1-Score, recall, and MCC, revealing higher variability in clinically relevant trade-offs between sensitivity and specificity across runs. Generally, the boxplot analysis confirms that high peak performance should be interpreted alongside stability: models with consistently high medians and low dispersion provide more reliable performance over time.
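The quantities read off a boxplot (median, IQR, and outliers) can also be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule for flagging outliers, which the text does not confirm was used for Figure 9; the function name `boxplot_stats` and the sample data are hypothetical.

```python
import statistics


def boxplot_stats(values, whisker=1.5):
    """Compute the median, quartiles, IQR, and outliers of a sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical scores with one extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats)
```

A narrow `iqr` and an empty `outliers` list correspond to the tight boxes and short whiskers described for the most stable models.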
Based on robustness (narrow IQR and fewer severe outliers) and central tendency (high medians), Inception-v4, Xception41, and EfficientNet-B0 appear to be the three most statistically reliable architectures in our investigation. These models are favored when consistency is prioritized in addition to peak accuracy because they combine good test performance with relatively narrower distributions across the evaluated parameters.
In summary, the combined Std Dev table and boxplot provide distribution-aware evidence of performance robustness, supporting a more statistically grounded interpretation of the reported test results.
4.3. Training Dynamics and Convergence Analysis
Figure 10 illustrates the performance metrics during the training stage, including
Figure 10a training accuracy,
Figure 10b training loss, and
Figure 10c validation accuracy. After the training process, it is evident that the architectures inception_v4, convnextv2_tiny, and deit3_base_patch16_224 nearly reached an accuracy of 100%, while efficientnet_b0, xception41, and swin_tiny_patch4_window7_224 achieved close to 99% accuracy. This behavior is also reflected in the training loss, as shown in
Figure 10b. Furthermore,
Figure 10c highlights that xception41 is the only architecture that fails to achieve high validation accuracy. These curves in
Figure 10 indicate that the selected configuration achieves stable performance within the 10-epoch budget, supporting the fixed-epoch choice used throughout the benchmarking. The scikit-learn package supplies the metrics used to evaluate the performance of these different architectures. These include recall, accuracy, precision, F1-Score, and Matthews Correlation Coefficient. The various DNNs’ performance is evaluated using 219 test instances.
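For reference, the five metrics listed above can be derived from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions in plain Python rather than the scikit-learn calls used in the study. The counts are hypothetical and chosen only so that they sum to the 219 test instances mentioned in the text.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, F1-Score, and MCC from counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical confusion counts (108 + 1 + 109 + 1 = 219 test instances).
metrics = binary_metrics(tp=108, fp=1, tn=109, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside F1-Score for imbalanced test sets.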
Figure 11 demonstrates the accuracy metric for each model in a bubble plot that correlates accuracy with the number of parameters in the model. The specific conditions are an initial learning rate of
, a batch size of 64, and a training dataset size of 25%. The accuracy of four of the models employed in this study reaches 99%, with the position of each model on the plot determined by its parameter count, measured in millions. First on the left is efficientnet_b0 with 5.3 million parameters, followed by xception41 with 22.9 million, swin_tiny_patch4_window7_224 with 28.1 million, and convnextv2_tiny with 28.6 million parameters. Continuing to the right, inception_v4 exhibits 42.7 million parameters, followed by deit3_base_patch16_224 with 86.1 million parameters.
While all models are highly accurate, choosing among them on accuracy alone is difficult because their parameter counts range from 5.3 to 86.1 million. Initially, efficientnet_b0 seems to be the most appropriate option due to its lower parameter count. Nonetheless, a more comprehensive evaluation is required to establish whether other models would be more efficient for this binary classification task.
In
Table 4, the performance results for the models evaluated in this study are reported for the following settings: a learning rate of
, a batch size of 64, and 25% of the dataset in the training stage. The models deit3_base_patch16_224, inception_v4, xception41, swin_tiny_patch4_window7_224, convnextv2_tiny, and efficientnet_b0 were tested. Accuracy, precision, recall, F1-Score, and MCC were the performance metrics considered in this study.
Based on all the measured metrics, the architecture deit3_base_patch16_224 is the best-performing model. An accuracy of 0.99401, precision of 0.99201, recall of 0.99610, F1-Score of 0.99403, and MCC of 0.98807 indicate exceptional performance in terms of accuracy and balance across classification metrics. However, it has the highest number of parameters, resulting in a higher computational cost, as shown in
Figure 11. As a result, it may not be the most economical choice in terms of computational costs.
InceptionV4 has an accuracy of 0.99212, with notable values in other metrics: a precision of 0.99126, a recall of 0.99325, an F1-Score of 0.99222, and an MCC of 0.98431. While it is inferior to deit3_base_patch16_224, the results produced by this model are very balanced and remain highly competitive. As shown in
Figure 11, inception_v4 has fewer parameters, which reduces computational cost and makes it a more economical option for classification.
With an accuracy of 0.99175, Xception41 exhibits remarkable performance. Similarly, with an accuracy of 0.99139, swin_tiny_patch4_window7_224 is close to the models previously mentioned. Convnextv2_tiny demonstrates moderate performance, with an accuracy of 0.98125. Despite a commendable precision of 0.98252, its recall of 0.97922 suggests a slight tendency to misclassify some positive instances.
A final point to consider is that efficientnet_b0 displays the lowest accuracy of 0.95274 among the evaluated models, according to the values observed in
Table 4.
The metrics in
Table 4 were obtained using Equations (
1)–(
5).
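As an illustration of how such metrics follow from a confusion matrix, the sketch below computes accuracy, precision, recall, F1-Score, and MCC from binary confusion counts. It assumes the standard textbook definitions, which may differ in notation from Equations (1)–(5); the function name `binary_metrics` and the example counts are hypothetical and are not taken from Table 4.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return standard binary classification metrics from confusion counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative and
    true-negative counts. Standard definitions are assumed here.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Illustrative counts only (not results from this study).
example = binary_metrics(tp=90, fp=10, fn=5, tn=95)
```

Because MCC uses all four cells of the confusion matrix, it stays informative when the classes are unbalanced, which is one reason to report it alongside accuracy.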
4.4. ML-Model Explorer
This study implemented the ML-Model Explorer tool, as suggested in [
54,
74]. It allows users to evaluate and select multi-class classifiers based on their confusion matrices, with particular attention to class imbalance.
Convnextv2_tiny was identified as a weak classifier based on the evaluation. Meanwhile, efficientnet_b0, swin_tiny_patch4_window7_224, and xception41 were categorized as moderate classifiers. Inception_v4 and deit3_base_patch16_224 were ranked as strong classifiers in descending order in
Figure 12.
Figure 13 provides a graphic representation of the Std Dev of the recall of each model.
The X-axis in
Figure 13 represents the Std Dev of the recall metrics achieved by each model across classes. Having a lower Std Dev indicates improved consistency between predictions. The Y-axis indicates the overall accuracy of the model based on the dataset. A higher level of accuracy indicates that the model correctly classifies a greater number of samples.
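The two axes described above can be derived from a confusion matrix, as in the following sketch. It is a minimal illustration and not the ML-Model Explorer implementation: the function name `recall_spread` is hypothetical, and the use of the population standard deviation is an assumption, since the tool's exact convention is not stated here.

```python
import statistics


def recall_spread(confusion):
    """Return (accuracy, recall_std, per_class_recalls).

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j.
    """
    recalls = []
    for i, row in enumerate(confusion):
        support = sum(row)  # number of true samples of class i
        recalls.append(row[i] / support if support else 0.0)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    accuracy = correct / total
    # Assumption: population standard deviation across classes.
    return accuracy, statistics.pstdev(recalls), recalls


# Illustrative two-class matrix (rows: true LGG, true HGG); not study data.
acc, recall_std, per_class = recall_spread([[98, 2], [4, 96]])
```

A model that scores well on one class and poorly on the other can still have high accuracy on an unbalanced dataset, but its recall Std Dev will be large, which is what the X-axis exposes.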
For this study, swin_tiny_patch4_window7_224 (indicated by a pink dot) achieved the highest accuracy (≈0.9995) and a very low recall Std Dev (≈0.001). Among the models compared, this model offers the best balance between overall performance and stability across all classes.
The model deit3_base_patch16_224 (symbolized by a blue dot) achieved an accuracy of ≈0.999, with a very low recall Std Dev ≈ 0.0015. Compared to the Swin model, its performance is approximately the same, although its overall accuracy is slightly lower.
EfficientNet_B0, indicated by the brown dot, achieved an accuracy of ≈0.999, similar to deit3_base_patch16_224, but with a higher recall Std Dev ≈ 0.003, indicating less consistent performance across classes.
Xception41 (dark purple dot) achieved an accuracy of ≈0.9995, comparable to the Swin model, but with a slightly higher recall Std Dev ≈ 0.002. Despite its high accuracy, this model is therefore slightly less consistent across classes.
Convnextv2_tiny, symbolized by a black dot, achieved an accuracy of ≈0.997 with a recall Std Dev of ≈0.002. In spite of its consistency, its accuracy is significantly lower when compared to the other models mentioned so far.
Inception_V4, marked with a green dot, showed the lowest accuracy ≈ 0.9965 and the highest recall Std Dev ≈ 0.005, making it the least effective and consistent model among those compared.
The models demonstrating the highest accuracy were swin_tiny_patch4_window7_224 and Xception41, while the most reliable designs were swin_tiny_patch4_window7_224 and deit3_base_patch16_224, both exhibiting lower Std Devs, suggesting effective handling of class balancing. Conversely, the least effective model was Inception_V4, which showed the lowest performance on both measures.
For applications requiring high accuracy and consistency, the swin_tiny_patch4_window7_224 model appears to be the optimal choice. However, if inter-class stability is less critical, deit3_base_patch16_224 may be a viable alternative.
In
Figure 14, a boxplot illustrates the distribution of three essential metrics: F1-Score, precision, and recall. Measurements are shown on the Y-axis, while models are labeled on the X-axis. Each model’s mean performance and variability across these metrics will be evaluated.
Swin_tiny_patch4_window7_224 has an F1-Score that consistently approaches 1, with a compact box and short whiskers indicating low variability. Precision and recall are almost identical (≈1) with minimal variation, making it the most consistent and accurate model, with near-perfect metrics overall.
For deit3_base_patch16_224, the F1-Score is ≈0.999, with a compact box and limited range, signifying stability. As with swin_tiny_patch4_window7_224, its precision and recall remain consistently high.
The third-place model, EfficientNet_b0, shows solid metrics but demonstrates greater performance variability than the leading models swin_tiny_patch4_window7_224 and deit3_base_patch16_224.
Figure 15 provides error metrics for LGG and HGG classes, enabling comparisons of model errors between the two classes. The y-axis represents the magnitude of error for each model; a greater error indicates a lower degree of effectiveness.
The swin_tiny_patch4_window7_224 model exhibits the lowest error rates, with nearly negligible values (≈0.000) for both classifications (Low Grade and High Grade). This makes it the most robust and effective model for categorization. Deit3_base_patch16_224 demonstrates minimal errors of approximately 0.001 for Low Grade and 0.002 for High Grade, indicating excellent reliability and performance.
Efficientnet_b0 shows slightly higher errors, with approximately 0.002 for Low Grade and approximately 0.003 for High Grade. Although it does not achieve the same level of accuracy as the top-performing models, it maintains satisfactory consistency across both categories. Compared with EfficientNet at ≈0.003, Xception41 displays higher errors, with approximately 0.004 for Low Grade and 0.003 for High Grade. This makes it a less accurate model in general.
Both swin_tiny_patch4_window7_224 and deit3_base_patch16_224 are the most balanced models, exhibiting low error discrepancies across classes.
4.5. DNN Training Runtime, Computational Complexity, and GPU Power Usage
Figure 16a illustrates the percentage of GPU use. The X-axis represents the duration of model training, while the Y-axis displays the percentage of GPU consumption.
The models deit3_base_patch16_224 (pink) and convnextv2_tiny (green) show significant variations in GPU consumption, occasionally reaching 100%. Efficientnet_b0 (yellow) and Xception41 (orange) demonstrate lower average consumption, though with considerable fluctuations.
Inception_v4 (purple) demonstrates relatively stable GPU consumption, consistently remaining below 40%. The swin_tiny_patch4_window7_224 (light green) model exhibits moderate consumption with variations but is less unstable than its counterparts.
Figure 16b illustrates the CPU consumption of the procedure (%). The X-axis represents the elapsed training time, while the Y-axis indicates the proportion of CPU consumption attributed to the training process.
The CPU usage rates of all models are relatively low and stable. Efficientnet_b0 (yellow line) has the highest CPU consumption, approaching 40%. In addition, other models such as deit3_base_patch16_224, convnextv2_tiny, and inception_v4 exhibit usage rates in the range of 20% to 30%.
A majority of models are dependent on GPUs, with oscillations possibly related to computationally intensive tasks. A significant reduction in CPU demand can be observed for parallel processing tasks that are dominated by GPUs.
More consistent GPU utilization, as seen for inception_v4 and Efficientnet_b0, may suggest efficient resource usage or a decreased reliance on sporadic high-intensity operations. More GPU variability is seen in models like deit3_base_patch16_224, which may be due to their resource-intensive training needs.
It is generally recognized that models with stable usage patterns are more predictable in production settings. In contrast, models with greater variability in GPU utilization might be more susceptible to hardware availability issues. Low CPU utilization indicates that these training procedures are not CPU-intensive, since most of the computation is offloaded to the GPU.
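One way to make the notion of a "stable" utilization pattern quantitative is to summarize each logged trace by its mean, standard deviation, peak, and coefficient of variation. The sketch below is an illustration under stated assumptions: the function `summarize_trace` is hypothetical, and both traces are invented values rather than readings from Figure 16.

```python
import statistics


def summarize_trace(samples):
    """Summarize a utilization trace given as percentages sampled over time."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    return {
        "mean": mean,
        "std": std,
        "peak": max(samples),
        # Coefficient of variation: spread relative to the average load.
        "cv": std / mean if mean else 0.0,
    }


# Invented traces for illustration only.
steady_trace = [35, 36, 34, 35, 36, 35]
bursty_trace = [20, 100, 30, 95, 25, 100]

steady = summarize_trace(steady_trace)
bursty = summarize_trace(bursty_trace)
```

A lower coefficient of variation corresponds to the flatter curves described above, while a high peak combined with a high coefficient of variation corresponds to the bursty behavior.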
Figure 17 displays GPU memory allocation (%). The X-axis represents the model training time, while the Y-axis indicates the percentage of GPU memory utilized by each model during training.
The model with the lowest GPU memory utilization among the assessed architectures is Inception_v4 (purple line), which continuously utilizes less than 20% of GPU RAM. Efficientnet_b0 (orange line) exhibits consistent behavior during training, maintaining a steady memory allocation of about 40%. With consistent performance comparable to Efficientnet_b0, Convnextv2_tiny (dark green line) uses between 30% and 40% of GPU RAM.
With steady usage, Xception41 (yellow line) uses the most GPU memory, roughly 40%. Deit3_base_patch16_224 (pink line) behaves consistently over time and consumes about 20% of GPU RAM. Likewise, swin_tiny_patch4_window7_224 (bright green line) uses about 20% of GPU memory with little variation.
Models such as Inception_v4, deit3_base_patch16_224, and swin_tiny_patch4_window7_224 require less GPU memory (about 20% or less), making them more suitable for memory-constrained environments.
On the other hand, models such as Efficientnet_b0, Convnextv2_tiny, and Xception41 require about 40% of memory allocation, due either to their bigger batch sizes or more complex designs. Because they use less memory, Inception_v4, deit3_base_patch16_224, and swin_tiny_patch4_window7_224 are superior choices when it comes to memory efficiency.
Figure 18 illustrates the GPU Power Usage (W) analysis graph, with the X-axis (Time) denoting the duration of model training and the Y-axis (GPU Power Usage in Watts) indicating the power consumption of the GPU for each model during the training process. For inception_v4 (purple), the model exhibits the lowest power usage among all evaluated models, consistently remaining below 100 W. This behavior highlights its superior energy efficiency; however, this efficiency may come at the expense of performance compared to more advanced architectures.
Similar to inception_v4, efficientnet_b0 (orange line) uses very little power. It almost always stays below 100 W. This architecture may be ideal for tasks requiring a balance between performance and energy efficiency.
Convnextv2_tiny (dark green line) shows significant variations in power usage, along with a maximum power consumption of about 300 W. As a result of this pattern, it can be inferred that GPU utilization is more dependent on data or key training moments.
Xception41 (yellow line) shows moderate consumption, between 100 W and 200 W, and does not reach the peak levels seen in other models.
Deit3_base_patch16_224 (pink line) has the highest power consumption, occasionally coming close to 300 W. This architecture prioritizes performance despite its high energy cost.
Finally, swin_tiny_patch4_window7_224 (bright green) exhibits significant power consumption peaks (≈300 W), though they occur less often than with deit3_base_patch16_224. It is a powerful but less energy-efficient variant.
Low and steady energy usage makes inception_v4 and efficientnet_b0 the most efficient.
The complexity and processing demands of newer models such as deit3_base_patch16_224 and swin_tiny_patch4_window7_224 may explain their higher power usage.
In this investigation, the most energy-efficient model is inception_v4, with low power usage (<100 W) and stability, as shown in
Figure 18.
deit3_base_patch16_224 and swin_tiny_patch4_window7_224 are the most powerful solutions for optimal performance independent of energy cost, but they need more resources.
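Instantaneous power (W) becomes directly comparable across models once it is integrated over the training run to obtain energy. The sketch below applies the trapezoidal rule to a sampled power trace; the function name `energy_wh` and the sample values are hypothetical and do not come from Figure 18.

```python
def energy_wh(timestamps_s, power_w):
    """Integrate a sampled power trace (watts) over time (seconds).

    Uses the trapezoidal rule and returns the energy in watt-hours.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    joules = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        joules += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return joules / 3600.0  # 1 Wh = 3600 J


# Illustrative only: a constant 90 W draw for one hour is 90 Wh.
constant_run = energy_wh([0, 1800, 3600], [90.0, 90.0, 90.0])
```

Under this view, a model with a higher peak draw but a much shorter training time can still consume less total energy, so power traces are best read together with runtime.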
4.6. Grad-CAM
While the current data effectively illustrate the usefulness of a specific model, an alternative method of graphically representing the performance of such models is available [
54]. The heatmap produced by the Grad-CAM approach defines the focal area and aids in visualizing the regions the classification model examines for specific predictions [
48]. This approach has been used in other works, including [
14,
38,
54,
72,
75]. Grad-CAM helps users gain insight into how a model behaves in general and how to improve it. Predictions based on convolutional layers are highly sensitive to the computed gradients; these layers preserve spatial information about the features extracted from a region of interest, which provides a basis for error-rate calculation and supports precision when the final result is incorporated into diagnostic interpretation. For an individual patient's brain tissue analysis, integrated gradients are computed with respect to a given baseline image: the gradients at different points along the path from the baseline to the input are averaged and multiplied by the difference between the input and the baseline to obtain the integrated gradient for each image feature. Summed over all features, these attributions approximate the difference between the model's output for the input and its output for the baseline.
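The integrated-gradient computation described above can be sketched on a toy differentiable function. This is a minimal illustration under stated assumptions, not the pipeline used in this study: `toy_model`, `toy_grad`, and `WEIGHTS` are hypothetical stand-ins for a network and its gradient, and the path integral is approximated with a midpoint Riemann sum.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients along the straight path baseline -> x.

    grad_fn(point) must return the gradient of the model output with
    respect to each input feature at `point`.
    """
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Hypothetical toy model: f(x) = sum_i w_i * x_i^2 (stand-in for a network).
WEIGHTS = [0.5, -1.0, 2.0]


def toy_model(x):
    return sum(w * v * v for w, v in zip(WEIGHTS, x))


def toy_grad(x):
    return [2.0 * w * v for w, v in zip(WEIGHTS, x)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
# Completeness check: attributions sum to f(x) - f(baseline).
gap = sum(attributions) - (toy_model(x) - toy_model(baseline))
```

The completeness check at the end is a convenient sanity test: if the summed attributions differ noticeably from the change in model output, the number of integration steps is too small.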
Figure 19 illustrates the Grad-CAM for each of the six models in the last run. These six representations emphasize the areas of the highest importance for predicting the LGG class. Clinical interpretation was made according to intensity, density, space, volume, and tissue diffusion, all of which are routine considerations in neuroradiology visual analysis procedures [
76].
The inception_v4 model highlights some brain regions, as seen in
Figure 19a, with the top left region showing noticeably high intensity. By focusing on specific areas, it can identify important LGG-related characteristics. In this image, intensity shows a well-defined and distinguishable area of interest; space offers little scope for precise localization but suggests a possible cortical lobe approach; and volume and tissue diffusion indicate a possible low-grade brain tumor, given the large diffusion and low spatial accuracy.
The xception41 model’s heatmap is more dispersed in
Figure 19b, with its focal points mainly centered in the top center and diffused throughout several locations. Accordingly, the model may concentrate on fewer specific characteristics and employ a more comprehensive detection strategy. Intensity facilitates a clear depiction of the region of interest, while density and space support localization accuracy, volume suggests the plausible clinical condition of the gray matter according to temporal properties, and tissue diffusion remains well-defined in the outcome.
The convnextv2_tiny model is shown in
Figure 19c, where the highlighted regions are moved laterally and bright colors emerge close to the inferior border of the brain. The emphasis on ancillary characteristics suggests a unique but imprecise method of detecting the target. Intensity and density remain high in the areas of interest, showing a cortical detection regarding space, and a large volume of identified tissue in the gray matter, with possible low-grade diffusion between both cerebral hemispheres.
Figure 19d shows a high-intensity activation that is mostly concentrated in a central location within the swin_tiny_patch4_window7_224 model. Based on the heatmap’s clear resolution, this model is able to accurately locate relevant locations. This run presents the most pronounced reduction in intensity and space, identifying large tissue diffusion, which may suggest white matter malignant tissue formation.
Compared to other models,
Figure 19e, which corresponds to efficientnet_b0, displays scattered points of focus with less intensity. This could be a sign of a larger, more generalized, classification strategy accompanied by a clear density spectrum of notable tissue volume located in the medial interhemispheric and well delimited left fronto-temporal regions.
The deit3_base_patch16_224 model illustrates regions of consistent attention throughout the brain (
Figure 19f). As a result, the model examines several crucial areas, suggesting a more exact but somewhat troublesome approach. There could be confounding factors for this detection strategy, resulting in regions showing several diffuse, large areas of high density in the brain, with notable high volume dimensions distributed in both hemispheres.
The models Inception_v4 (Figure 19a) and swin_tiny_patch4_window7_224 (Figure 19d) are found to be more accurate in areas related to LGGs. This is consistent with earlier assessments that emphasized them as high-performing and energy-efficient choices. Despite requiring more resources, deit3_base_patch16_224 (Figure 19f) is still a viable option.
The Grad-CAM for the inception_v4 model is displayed in
Figure 20a,b, demonstrating its capacity to detect LGG- and HGG-relevant features in specific regions. The model shows a focal point for the low-grade class, indicating that it is successful in classifying low-grade tumors based on unique and localized characteristics. Due to HGG cancers’ aggressive and diffuse nature, the model finds more global patterns in the brain core for the HGG class. This finding might be indicative of how well the model adapts to the complex and wide-ranging patterns typical of high-grade malignancies.
The peripheral and constrained emphasis of the LGG heatmap is consistent with the defined, less invasive character of LGG tumors. The HGG heatmap, on the other hand, is larger and more centralized, demonstrating the widespread and invasive nature of HGG tumors.
The InceptionV4 model effectively differentiates between the two classes by generating attention maps that align with the expected anatomical and clinical characteristics of LGG and HGG malignancies. Its ability to focus on specific regions for LGGs and cover broader areas for HGGs underscores its suitability for this classification task.
As previously noted, transformer-based models can supplement and even outperform CNNs in this classification task because self-attention aggregates information across the whole slice, capturing global contextual cues and long-range spatial relationships that predominantly local convolutional receptive fields might overlook. This is consistent with our Grad-CAM data, which show that informative responses may span wider, dispersed patterns rather than being limited to a single compact location. In this regard, CNNs remain useful for local, texture-based characteristics, but attention-based architectures, mainly Vision Transformers (e.g., Swin Transformer, DeiT), are better suited to incorporating such global evidence.
4.7. Comparison with Other Similar Studies
The classification performance, energy efficiency, and resource consumption of InceptionV4, the best deep neural network identified in this study, were evaluated in comparison to other methods. The methods include specialized models such as SADO-Net, BrainNet, and Neuro-XAI, as well as popular architectures such as ResNet, VGG, and EfficientNet.
A comparative study of recent research is given in
Table 5, which mainly makes use of BRATS datasets (2017–2023) and specialty datasets like Kempanna, Br35H, and Kaggle. Most of the studies assessed focus on binary classification problems, which is the main objective of the comparison.
4.8. Real-Time Inference Benchmarking of the DL Models in Cloud and Edge Environments
This subsection evaluates real-time feasibility by benchmarking per-image inference latency for all trained architectures in both cloud and edge environments. To reflect a clinically realistic usage pattern, all measurements were performed with a batch size , reporting the mean and Std Dev of the inference time across repeated runs.
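The measurement protocol described above (repeated timed runs, reported as mean and Std Dev) can be sketched as follows. This is a generic timing harness, not the benchmarking code used in this study: `benchmark` and `dummy_infer` are hypothetical names, the warm-up and run counts are assumptions, and the dummy callable stands in for a real model.

```python
import statistics
import time


def benchmark(infer, sample, warmup=5, runs=30):
    """Return (mean_ms, std_ms) of the wall-clock latency of infer(sample).

    Warm-up calls are discarded so one-time costs (caching, lazy
    initialization) do not distort the statistics.
    """
    for _ in range(warmup):
        infer(sample)
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        # Note: with a GPU model, the device must be synchronized before
        # reading the clock (e.g. torch.cuda.synchronize() in PyTorch),
        # because kernel launches are asynchronous.
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.fmean(times_ms), statistics.stdev(times_ms)


def dummy_infer(sample):
    """Hypothetical stand-in for a model forward pass."""
    return sum(v * v for v in sample)


mean_ms, std_ms = benchmark(dummy_infer, list(range(1000)), warmup=2, runs=10)
```

Reporting the Std Dev alongside the mean matters on embedded hardware in particular, where power modes and thermal throttling can make individual runs vary more than on a cloud GPU.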
For the cloud setting, inference was executed on Google Colab using an NVIDIA H100 (GH100, Hopper; 80 GB HBM3) under PyTorch FP16. For the edge setting, inference was executed on an NVIDIA Jetson Orin Nano (Ampere, 16 GB LPDDR5) running Ubuntu 20.04 with NVIDIA JetPack in the 7 W power mode. On the Jetson platform, five architectures (xception41, inception_v4, efficientNet_B0, swin_tiny_patch4_window7_224, and deit3_base_patch16_224) were benchmarked using a TensorRT-optimized FP16 engine (TensorRT v10.3.0) within our in-house containerized runtime to reflect deployment-oriented embedded inference. In contrast, convnextv2_tiny was benchmarked exclusively under PyTorch 2.4.0 FP16 due to runtime constraints preventing TensorRT execution for this specific model; for this case, the Jetson measurement was obtained using the Jetson Containers software environment [
78].
Table 6 summarizes the resulting latencies. In the cloud benchmark, the lowest mean inference time was obtained by deit3_base_patch16_224 (2.98 ms), followed by convnextv2_tiny (4.11 ms), EfficientNet_B0 (4.38 ms), and Xception41 (4.81 ms), whereas Inception_v4 exhibits the highest latency (10.04 ms). On the edge computing device (embedded hardware), EfficientNet_B0 yields the lowest mean latency among the reported models (8.66 ms), while Inception_v4 (13.32 ms) and deit3_base_patch16_224 (20.36 ms) show higher inference times, highlighting the expected trade-off between architectural complexity and embedded execution constraints.
convnextv2_tiny was benchmarked on the Jetson Orin Nano using PyTorch 2.4.0 FP16 (rather than TensorRT) because a TensorRT FP16 engine could not be generated for this architecture with our setup. Therefore, its reported edge latency (27.96 ms) should be interpreted as a PyTorch-based embedded reference, while the remaining five models reflect TensorRT-optimized FP16 inference. Even so, this measurement remains useful as a practical baseline of how long the model takes to run without TensorRT acceleration on an edge computing device. Overall, deit3_base_patch16_224 is the fastest model in the cloud setting, whereas EfficientNet_B0 achieves the best latency on the edge device.
4.9. Scenario of Real-World Usage
An end-to-end, cloud-native clinical workflow for AI-assisted brain MRI study analysis is shown in
Figure 21. It is divided into five successive stages that are connected by directing arrows. Initially, imaging data are obtained during standard MRI scanning at the point of care. To demonstrate the kind of radiological input coming into the pipeline, representative brain pictures are displayed. Second, the collected data are moved to a cloud storage layer (shown as a cloud database), which allows for remote access across clinical sites and serves as the central repository for further processing.
Third, an API-driven orchestration layer initiates automatic processing when data are saved. An API icon, a webhook/event sign, and gears are used to graphically symbolize this step, emphasizing that the process may be started programmatically (for example, upon upload) and managed by modular services. To highlight that this phase is carried out in a controlled setting that upholds access restrictions and facilitates the safe management of clinical data, a security icon is inserted. Preprocessing is specifically anticipated to function on several slices in this approach, matching real-world pipelines where studies are broken down into 2D inputs or selected slice sets before inference.
As a fourth step, the processed inputs are routed to a cloud GPU inference module (GPU server icon), representing elastic computing resources that are typical of commercial cloud providers (e.g., AWS, GCP, or comparable platforms). As part of this stage, AI outputs are generated, as are interpretability artifacts (such as heatmap overlays), which are intended to facilitate model transparency during clinical review. It is important to emphasize that this component is positioned as a computational infrastructure for the generation of inference and explanation rather than an autonomous clinical component.
Fifth, regular clinical workstations or mobile devices can obtain the data via a web-based visualization interface (monitor and clinician review images). In this case, the AI system is presented as a clinical decision support tool, which speeds up the review by providing visual evidence and summarizing model results. The radiologist is still in charge of interpretation and making the ultimate decision. This positioning matches real-world deployment constraints, where AI is integrated into standard clinical reporting pathways through a cloud-accessible interface and increases workflow speed and consistency without taking the place of expert judgment. This cloud-based implementation opens the possibility of multi-user systems with different patient databases for each hospital.
4.10. Limitations of the Study
In this study, a key limitation was the lack of sufficiently large and clinically diverse datasets that matched the target classification task. Although BraTS 2019 provides a well-established benchmark, access to current, multi-center imaging data remains challenging due to privacy constraints, acquisition heterogeneity, and curation and annotation costs. In this context, open-source platforms (e.g., Kaggle and similar repositories) have become valuable for facilitating data sharing and reproducibility; however, they may still provide a limited representation of real-world clinical settings. Our assessment is predicated on a 2D best-slice approach, which minimizes computational load but may exclude clinically significant inter-slice context, including 3D spatial continuity, tumor size, and heterogeneity between slices. As a result, the claimed performance may not immediately translate to multi-slice or complete volumetric MRI analysis and should instead be considered proof of viability in a reduced scenario.
Another significant limitation concerns computational complexity. Training and evaluating modern deep learning architectures can be resource-intensive, which may restrict reproducibility in low-resource environments. Cloud-based solutions such as the free edition of Google Colab can partially address this barrier for prototyping and educational use, but large-scale experimentation and practical deployment still typically require more consistent access to high-performance hardware (GPU-enabled workstations or dedicated edge devices) to meet time and efficiency constraints in applied scenarios.
In the case of Grad-CAM, it supports interpretability but cannot provide clinical validation.
4.11. Future Work
Enhancing generalizability by increasing the size and variety of the data will be the main focus of future research. In order to better represent real-world diversity in collection techniques, scanners, and patient demographics, future work will include additional public glioma MRI cohorts and, where practical, multi-center clinical data, even if BraTS 2019 offers a well-established multi-institutional benchmark. The pipeline will be enhanced to include multi-slice and 3D volumetric learning (such as automated slice/volume selection and multi-modal fusion).
Future research will also focus on reducing computational complexity to facilitate practical deployment. To achieve lower memory and energy requirements while maintaining accuracy, we will investigate model compression strategies and optimize inference via efficient runtimes and hardware-aware tuning. With this approach, experimentation will be more accessible in low-resource settings, and devices will be usable in real time or near-real time.
Quantitative overlap with tumor masks/segmentation and expert-driven validation of the Grad-CAM will be included to expand the interpretability of the study. To assess the clinical plausibility and consistency of the highlighted regions with tumor-related patterns, we specifically plan to collaborate with neuro-radiologists and neuro-oncology specialists. We will also use structured protocols (such as region-of-interest review and inter-rater reliability) to measure agreement. In addition, future work will incorporate a clinically oriented error analysis on new, unseen patients, where HGG-LGG misclassifications will be reviewed by specialists to identify failure modes and improve robustness.
Neurodevelopment and neuropathologies can be influenced by cultural and biosocial factors, such as ethnicity/race, gender, and nutrition, among others. Future work will improve the incorporation of demographic diversity among patients to reduce biases and performance gaps between different groups. To achieve this, it will be crucial to expand the dataset with information from more medical institutions and different regions of the world.
5. Conclusions
This paper proposed a novel DL-based method that utilizes six distinct DL models trained with parameter optimization: inception_v4, xception41, convnextv2_tiny, swin_tiny_patch4_window7_224, efficientnet_b0, and deit3_base_patch16_224. This work demonstrated that DL approaches can be used to classify high-grade and low-grade glioma tumors in images. It can be seen in
Table 4 that these models demonstrated superior classification ability, even when the dataset was relatively limited, with 25% of the total, and the distribution of classes was unbalanced.
With an accuracy of 99.40% and an F1-Score of 99.40%, the deit3_base_patch16_224 model is the most accurate, exhibiting a remarkable balance between precision and recall. It is followed by inception_v4, which has an accuracy of 99.21% and an F1-Score of 99.22%, and xception41, which offers dependable performance with 99.18% and 99.16%, respectively.
The models swin_tiny_patch4_window7_224 and convnextv2_tiny achieved competitive F1-Score values of 99.09% and 98.08% and accuracies of 99.14% and 98.12%, respectively. In this analysis, efficientnet_b0 has the lowest accuracy, at 95.27%, with an F1-Score of 95.36%; however, due to its computational efficiency, it is still a good option.
Based on this proposed approach, inception_v4 displayed distinct, focused, and cohesive attention mappings for both LGG and HGG classes in Grad-CAM. Its comparatively low power consumption (<100 W) makes it suitable for deployment in energy-constrained locations.
Due to their robust and evenly distributed attention maps, deit3_base_patch16_224 and swin_tiny_patch4_window7_224 are the best choices for applications that prioritize accuracy and performance. However, this choice must take into account the need for more modern hardware or a significant increase in computing resources.
Grad-CAM maps for LGG reveal confined attention patterns focusing on specific brain regions, consistent with the less invasive characteristics of low-grade tumors. Models such as inception_v4 and deit3_base_patch16_224 effectively capture these patterns. In contrast, HGG attention maps display a more global and centralized focus, demonstrating the models’ ability to recognize the aggressive and diffuse nature of high-grade tumors. Models like inception_v4 and swin_tiny_patch4_window7_224 exemplify these characteristics.
The optimal model selection was aided by the suggested approach. Inception_v4 would be the best option in clinical settings when quick deployment and energy efficiency are crucial. For advanced studies requiring peak accuracy and unlimited computing resources, deit3_base_patch16_224 would be a suitable option.
DNN research holds considerable promise for solving challenging classification problems in the future. Specifically, this study focused on resource-efficient solutions and transfer learning approaches for the classification of brain tumors in magnetic resonance images. For brain cancer specialists, the suggested work can be a useful clinical support tool that lessens their effort and increases diagnostic precision in urgent medical situations.
In conclusion, the suggested approach offers a variety of models with superior performance and resource consumption profiles, demonstrating a reliable, flexible solution for the categorization of brain tumors. This study enabled the choice of the appropriate model based on particular requirements: while deit3_base_patch16_224 is the best option for advanced studies that prioritize peak accuracy in scenarios with sufficient computing resources, Inception_v4 is the best option for clinical scenarios with limited computing resources.