Enhancing Glioma Classification in Magnetic Resonance Imaging Using Vision Transformers and Convolutional Neural Networks

Gómez-Guzmán, Marco Antonio; Esqueda-Elizondo, José Jaime; Jiménez-Beristain, Laura; Galindo-Aldana, Gilberto Manuel; Aguirre-Castro, Oscar Adrian; Ramos-Acosta, Edgar Rene; Torres-Gonzalez, Cynthia; García-Guerrero, Enrique Efren; Inzunza-Gonzalez, Everardo

doi:10.3390/electronics15020434


Marco Antonio Gómez-Guzmán ¹, José Jaime Esqueda-Elizondo ², Laura Jiménez-Beristain ², Gilberto Manuel Galindo-Aldana ³, Oscar Adrian Aguirre-Castro ¹, Edgar Rene Ramos-Acosta ¹, Cynthia Torres-Gonzalez ³, Enrique Efren García-Guerrero ¹,* and Everardo Inzunza-Gonzalez ¹,*

¹ Facultad de Ingeniería, Arquitectura y Diseño, Universidad Autónoma de Baja California, Carretera Transpeninsular Ensenada-Tijuana No. 3917, Ensenada 22860, Baja California, Mexico

² Facultad de Ciencias Químicas e Ingeniería, Universidad Autónoma de Baja California, Calzada Universidad No. 14418, Parque Industrial Internacional, Tijuana 22424, Baja California, Mexico

³ Laboratory of Neuroscience and Cognition, Facultad de Ciencias Administrativas, Sociales e Ingeniería, Universidad Autónoma de Baja California, Carr. Est. No. 3 s/n Col. Gutierrez, Mexicali 21700, Baja California, Mexico

* Authors to whom correspondence should be addressed.

Electronics 2026, 15(2), 434; https://doi.org/10.3390/electronics15020434

Submission received: 1 December 2025 / Revised: 11 January 2026 / Accepted: 14 January 2026 / Published: 19 January 2026

(This article belongs to the Special Issue Applications of Artificial Intelligence in Image and Video Processing, 2nd Edition)


Abstract

Brain tumors, encompassing subtypes with distinct progression and risk profiles, are a serious public health concern. Magnetic resonance imaging (MRI) is the primary imaging modality for non-invasive assessment, providing the contrast and detail necessary for diagnosis, subtype classification, and individualized care planning. In this paper, we evaluate the capability of modern deep learning models to classify gliomas as high-grade (HGG) or low-grade (LGG) using reduced training data from MRI scans. Utilizing the BraTS 2019 best-slice dataset (2185 images in two classes, HGG and LGG), divided into two folders, training and testing, with images obtained from different patients, we created subsets including 10%, 25%, 50%, 75%, and 100% of the dataset. Six deep learning architectures, DeiT3_base_patch16_224, Inception_v4, Xception41, ConvNextV2_tiny, swin_tiny_patch4_window7_224, and EfficientNet_B0, were evaluated utilizing three-fold cross-validation (k = 3) and increasingly large training datasets. Explainability was assessed using Grad-CAM. With 25% of the training data, DeiT3_base_patch16_224 achieved an accuracy of 99.401% and an F1-Score of 99.403%. Under the same conditions, Inception_v4 achieved an accuracy of 99.212% and an F1-Score of 99.222%. Considering how the models performed across the data subsets and their compute demands, Inception_v4 struck the best balance for MRI-based glioma classification. Both convolutional networks and vision transformers achieved strong discrimination between HGGs and LGGs, even under data-limited conditions. Architectural disparities became increasingly apparent as training data diminished, highlighting unique inductive biases and efficiency characteristics. Even with a relatively limited amount of training data, current deep learning (DL) methods can achieve reliable performance in classifying gliomas from MRI scans. Among the architectures evaluated, Inception_v4 offered the most consistent balance between accuracy, F1-Score, and computational cost, making it a strong candidate for integration into MRI-based clinical workflows.

Keywords:

glioma classification; brain tumor diagnosis; MRI; medical image analysis; CNN; ViT; deep learning; artificial intelligence

1. Introduction

Brain tumors develop when abnormal tissue structures grow in the brain [1,2]. The broad mortality range of these tumors makes them some of the most serious and dangerous conditions encountered in clinical practice [3]. In neuro-oncology, brain tumors are commonly categorized as meningiomas, glioblastomas, or gliomas based on their location, morphology, and size [4]. Gliomas are more common than any other major brain tumor, accounting for approximately 26.5% of all malignant tumors of the brain and nervous system. Currently, many systems classify these tumors according to their site of origin and extent of spread [2,5]. However, reliable identification is difficult because tumor contours differ considerably, their dimensions vary from case to case, and tissues behave unpredictably. Tumor location can add to this difficulty, as can variations in scanning protocols and imaging techniques [3].

According to the World Health Organization (WHO), gliomas are classified by tissue type into grades I to IV. Those of grades I and II are considered low-grade gliomas (LGGs), and those of grades III and IV are considered high-grade gliomas (HGGs) [6,7]. LGGs are typically characterized by a slow growth pattern, taking months or years to develop, while HGGs spread rapidly, behave in an extremely malignant manner, and worsen quickly. It is generally agreed that LGG patients can have a good prognosis after the tumor is removed. In contrast, HGG patients face significant challenges to their long-term survival after surgery, as they must continue to receive prolonged radiation and chemotherapy treatments. Accurate classification is therefore essential for planning treatments and predicting disease progression [8].

The primary diagnostic tool in brain cancer diagnosis is imaging, which provides a detailed overview of where the tumor is located in the brain, as well as its size and volume. In recent years, imaging technology has evolved into one of the most important diagnostic tools available to oncologists and radiologists [9]. As a result of technological advancements, radiologists have been able to utilize a wide range of imaging techniques, including X-rays, computed tomography (CT), electroencephalography (EEG), ultrasound, positron emission tomography (PET), single-photon emission computed tomography, and magnetic resonance imaging (MRI). Each of these has contributed to more accurate diagnoses of brain tumors and improved the basis for future treatment decisions [10,11].

Of the imaging methods mentioned above, CT and MRI are standard techniques for detecting brain tumors and for mapping the regions of the brain that are involved [5,12,13,14]. Specifically, MRI has become the preferred non-invasive modality for glioma detection because it produces high-resolution, multi-sequence images that provide complementary views of damaged tissue [1,2,8,15,16]. In comparison with CT, MRI offers more detailed anatomical information and markedly better soft-tissue contrast, which helps clinicians distinguish normal from diseased structures with greater confidence in practice [2,15,17].

The burden on clinicians in radiology within healthcare systems represents a significant workload problem [18]. In clinical practice, manual segmentation of gliomas is extremely time-consuming but contributes significantly to patient health outcomes. Progress in solving this problem involves the development of modules in a workflow before clinical validation, involving pre-processing, segmentation, feature extraction, and classification [19]. Early diagnosis and classification with improved prediction accuracy are the most crucial steps in identifying and treating brain tumors to save a patient’s life [14]. The literature addresses at least five domains that have benefited from the validation of DL and artificial intelligence, including reducing scan times, automating segmentation, optimizing workflows, decreasing reading times, and achieving general time savings or workload reductions [20]. A comprehensive understanding of brain illnesses, including the classification of brain tumors, is necessary to assess the pathological tissue and assist patients in receiving the proper treatment based on their classification [21]. There are several health-related and cognitive symptoms associated with brain tumors, and from a clinical perspective, it is difficult and time-consuming to diagnose and treat the wide variety of possible consequences of these tumors precisely. A multimodal assessment in clinical services is crucial for an accurate prediction of future health outcomes [22], and assumptions about health and behavioral correlations with tumor classification increase reliability.

The diagnosis and classification of brain tumors often require MRI scans in three planes, axial, sagittal, and coronal, along with neurological evaluations and, if feasible, biopsy. Because symmetry in the axial and coronal planes is a hallmark of a healthy brain, MRI is a diagnostic tool for brain tumors, epilepsy, neurological disorders, and other conditions. Asymmetries in pixel intensity in axial MR brain images indicate a pathological state of the brain [5,23].

Multiple sequences, such as T1-weighted (T1), contrast-enhanced T1-weighted (T1ce), T2-weighted (T2), and fluid-attenuated inversion recovery (FLAIR), may reveal distinct tissues around the lesion site [6,13].

In their daily work, radiologists routinely view multiple MRI sequences simultaneously to try to diagnose the type of tumor. This practice is necessary but overwhelming, especially because gliomas present a wide range of morphological patterns that complicate visual interpretation and slow the diagnostic process. Furthermore, the final decision is heavily influenced by the radiologist’s training and experience, so personal judgment and other factors affect diagnostic accuracy. For these reasons, there is a strong motivation to reduce the workload of radiologists while simultaneously improving the reliability of diagnostic results [23,24,25]. Taking into account the functional structure of the brain and the particularity of its tissues, it becomes clear why it is not easy to distinguish between the various types of tumors, such as gliomas, meningiomas and glioblastomas, especially when there are conditions that present neurological and neuropsychological symptoms that are intertwined [26].

These challenges have resulted in significant interest in computer-aided diagnostic systems (CADs), since they allow for automatic and timely detection of diseases. By combining medical knowledge and expertise, these systems facilitate a quicker and often more accurate diagnosis of pathological abnormalities. They typically capture medical images, analyze them using various image processing and DL processes, and evaluate whether a specific disease is present or not [27].

Recent advances in DL have been made in areas such as computer vision and human language understanding. This growth has been extended to imaging analysis in medicine in a significant way [3,16,28,29]. DL models excel at managing unstructured data and progressively identifying various elements within the analyzed information. In order to achieve this, they construct a multi-layered feature hierarchy that evolves from basic patterns to abstract and semantically meaningful representations [30].

In the broader landscape of DL, CNNs are distinguished for their ability to solve challenging problems in computer vision [31,32,33]. CNN-based solutions are commonly used as the first line of computation when the quality of MRI is suboptimal and may compromise clinical judgment [34,35]. In addition to combining feature extraction with classification, CNNs are able to learn the image features that are most relevant for discrimination automatically [36,37]. As a consequence, transfer learning (TL) has emerged as one of the most effective methods for detecting and classifying brain tumors at an early stage. Since TL makes use of parameters learned from large, related datasets, it has proven especially useful in medical imaging applications, such as the evaluation of brain lesions [38].

Recent CNN architectures differ not only in the depth of the network and the number of filters but also in the ways they fuse information coming from different spatial scales or contexts. At the same time, advances in hardware—particularly in GPU and CPU performance—have mitigated some of the substantial computational costs associated with training and deploying such models, encouraging their use in both research and clinical environments [39].

Due to the steady increase in the number of patients undergoing brain MRI scans, it has become increasingly difficult to rely solely on manual image interpretation, which can be slow, subject to fatigue-related errors, and difficult to standardize. As a result, there is a strong need for novel CAD systems that support and streamline the interpretation of brain MRIs [40]. Despite this, many current CNN-based methods do not consider all of the complementary information that MRI sequences can provide, leaving valuable information unused. Many of these approaches also rely on predefined regions of interest (ROIs), which adds another layer of complexity to classification [16].

The present study examines a DL-based approach for discriminating high-grade from low-grade tumors within this context. As part of the study, six Deep Neural Networks (DNNs) dedicated to binary tumor classification will be trained and evaluated.

A custom dataset was constructed, and the following models were trained:

DeiT3_base_patch16_224, Inception_v4, Xception41, EfficientNet_B0, ConvNextV2_tiny, and swin_tiny_patch4_window7_224. Each model was configured with two output neurons for binary classification.

The novelty of this contribution lies in enhancing tumor classification accuracy in medical imaging, particularly in detecting malignant cells, by evaluating state-of-the-art DL algorithms and leveraging transfer learning (TL) techniques. This study examines the challenges of clinically classifying high-grade and low-grade gliomas. The purpose is to enhance the accuracy and efficacy of a classification task in complicated clinical settings with many medical images. The main contributions of this work are as follows:

Development of a novel approach for tumor classification in medical imaging.
Benchmarking state-of-the-art DL models through hyperparameter optimization.
Preparation of a bespoke dataset for tumor classification in medical imaging.
Optimization of six DNNs to enhance classification performance.
Exploration of the impact of applying DL and TL techniques in medical diagnostics for tumor classification.

The structure of this paper is as follows: Section 2 provides an overview of related works in brain tumor classification. Section 3 presents a detailed explanation of the model training process. Section 4 demonstrates the performance of the proposed method and offers a comparative analysis with other binary classification approaches. This section also describes a real-world use scenario, addresses the limitations and future work of the study and provides a structure for future research. Finally, Section 5 summarizes the conclusions drawn from this study.

2. Related Works

A number of studies in the literature have addressed the relevance of these techniques in different fields of neuroimaging feature extraction. In the case of different developed or applied classifiers, the accuracy achieved varies, indicating a dependence on the model and methodology used. In [1], using a dual-CNN approach, internal features were extracted from brain images, and a second CNN module was used to classify these features. As a result of using the BRATS-IXI dataset for training, the authors achieved an accuracy of 98.85%. In [41], the authors developed a novel multimodal classification approach leveraging DL for brain tumor type categorization. For BraTS2015, BraTS2017, and BraTS2018, the methodology achieved accuracies of 97.8%, 96.9%, and 92.5%, respectively.

In a related study, [6], an autodistillation technique for glioma classification was presented that was discrepancy-sensitive. Based on the BRATS2018 and BraTS2019 datasets, four imaging modalities (T1, T2, T1ce, and FLAIR) were used to achieve binary classification of LGGs and HGGs. The evaluated models include ResNet-101, DenseNet-169, and ConvNeXt-S, with the highest accuracy recorded at 93.6%. Similarly, [8] employed various versions of the BRATS datasets. The authors developed an innovative automated system named the Spatial Adaptive Dart Optimized Network (SADO-Net). The system was tested using the BRATS 2018, BRATS 2019, BRATS 2020, and Figshare datasets, achieving an impressive average accuracy of 99.2% in tumor detection, as well as the following results in other datasets: BRATS 2015 CNN-SVM, 94.11%; BraTS 2017 Hybrid-DANet, 97.23%; REMBRANDT CRNN, 98.49%; BRATS 2015 DNN, 93.10%; BRATS 2015, U-Net architecture, 83.23%; BRATS 2015, VGG19, 96.71% [24].

CNNs can be customized for multi-class classification to more accurately develop an ideal architecture for brain tumor classification. In [13], a CNN architecture modified through domain knowledge was proposed, with an evolutionary optimization approach used to select hyperparameters. Tests conducted on the BraTS2020 and BraTS2021 datasets demonstrated an enhanced average accuracy of 98% and a maximum single-classifier accuracy of 99.80%. In addition, in [42], four designs are constructed, each with unique layers and hyperparameters. Before classification, the images are passed through convolutional layers for feature extraction and a softmax function. The CNN-based classification strategy proposed there reportedly achieves accuracy, precision, recall, and F1-Score values of 99.76%, 99.64%, 99.62%, and 99.64%, respectively, as demonstrated by a comprehensive experimental analysis, together with a micro-averaged Matthews correlation coefficient (MCC) of 0.929. In [43], the authors proposed a spatial residual module (SRM) for volumetric glioma complexity representation, utilizing a 3D CNN design. The authors integrated Swin UNETR, a pre-trained segmentation model, to enhance the network without additional training. ResMT was tested on the BRATS 2019 dataset, achieving a maximum prediction accuracy of 97.01%. This work underscores the potential of hybrid CNN–Transformer models for classifying 3D magnetic resonance images.

According to the literature, binary image classification has been widely pursued in order to improve accuracy. A custom dataset of MRI scans was used in [16] to evaluate the performance of ResNet18, which achieved an accuracy of 95.54%. The study in [24] provides another example, using the BRATS 2020 dataset. The authors proposed a ConvNet-ResNeXt101 model for the classification of tumors, which achieved an accuracy of 99.27%.

Meanwhile, [28] compared ResNet50, EfficientNetB3, and VGG-19 models on an MRI dataset from Kaggle. EfficientNetB3 achieved a training accuracy of 99.44% but a validation accuracy of 89.47%, highlighting the overfitting problem. In [44], authors proposed a binary classification and detection approach to address this problem and to detect brain cancers earlier. This was made possible by the TL approach with pre-trained ResNet50, VGG19, and InceptionV3 DL models. The pre-trained methods InceptionV3, ResNet50, and VGG19 obtained accuracy rates of 99.72%, 98.84%, and 94.65%, respectively. In [29], the BRATS 2021 dataset was utilized for a binary classification task. Based on deeplabV3+, the authors developed the Neuro-XAI model, an explainable DL framework. In comparison to previous strategies, the approach demonstrated an improved performance, achieving 97% classification accuracy and 98% overall accuracy.

As reported in [45], the authors used the BRATS 2018 dataset to evaluate five pre-trained CNN models: AlexNet, VGG16, GoogleNet, ResNet18, and ResNet50, which were originally trained on the ImageNet database. The purpose of this study was to improve the accuracy and reliability of tumor classification by taking advantage of the model architectural diversity by using DL ensemble techniques. This approach was described by the authors as one that was very successful in the state of the art, achieving an accuracy rate of 97.47%.

A similar framework was developed by the authors of [46] for detecting and classifying HGGs or LGGs in brain MRI scans. As part of the second stage of tumor segmentation, skip connections and residual units were used in order to segment the tumor. It was found that the detection and classification stage of the BraTS 2017 dataset achieved an accuracy of 99.6% when using 1800 images from the dataset. The Dice score, specificity, and sensitivity metrics were used in this study.

The research presented in [47] used the Preet Viradiya Brain Tumor dataset. Different CNN models and a proposed CNN architecture were trained to optimize brain cancer detection. The proposed CNN model ranked higher than others, achieving an accuracy of 97.5%. Also, [17] presented an advanced Dual DCNN model for the purpose of successfully classifying malignant and non-malignant MRI scanned images. The model was assessed using the Br35H dataset from Kaggle. Using the method, impressive results were achieved: 99% accuracy, 99% precision, 98% recall, and 99% F1-Score.

The authors of [48] used the BRATS 2017 dataset to train a model named IMPA-Net, which was designed to enhance the interpretability and reliability of brain tumor classification results. Model performance demonstrated a classification accuracy of 92.12%.

In [49], the authors propose a hybrid system that could help in the early detection and classification of brain tumors. With the public dataset from Figshare, the system achieved a remarkable accuracy of 98.89% based on two classes. A number of state-of-the-art models, including AlexNet, VGG-16, DenseNet-201, VGG-19, GoogleNet, and ResNet-50, were significantly upgraded by this model.

The authors of [50] investigated Machine Learning (ML) techniques for classifying brain tumors, including gliomas and meningiomas. According to this study, the SVM model in combination with LBP and HOG achieved an accuracy of 97%, whereas a deep CNN model achieved 98%.

In [51], three models were trained to classify brain tumors: VGG19, Inception V3, and MobileNetV2. The researchers used the Kaggle Brain X-ray image dataset, which includes images from two sources, including BRATS 2020. Based on the results, VGG19 was the most accurate model, with an accuracy of 98.58%.

In addition, a new attention-based glioma grading network (AGGN) was proposed in [52]. According to the authors, the AGGN model was evaluated on the BRATS 2018 and BRATS 2019 datasets and demonstrated an accuracy of 96.12%.

The authors of [53] developed a Coarse-to-Fine Feature Fusion Network (CFNet) to integrate multimodal visual information through modal interaction, semantic perception, and feature fusion. In order to evaluate the proposed CFNet, two publicly available datasets were used: BraTS2019 and BraTS2020.

Generally, the reviewed literature indicates that glioma classification performance is often reported under heterogeneous experimental protocols, which makes comparisons difficult and can overstate applicability in the real world. Common shortcomings include (i) gaps in transparency regarding data partitioning and the risk of patient-level leakage, (ii) tests that often remain confined to one benchmark without external testing under a domain shift, (iii) an emphasis on accuracy with insufficient attention to computational cost, memory/energy demands, and feasibility in resource-constrained settings, and (iv) a lack of clinically grounded interpretability analyses that would help establish trust and support decision-making. Based on these limitations, the present study adopted a clearly defined training–testing procedure, benchmarked multiple modern architectures under consistent conditions, reported both predictive performance and computational complexity, and utilized Grad-CAM representations as a first step toward more accurate and clinically interpretable brain tumor classification.

3. Materials and Methods

As part of this research, six DL algorithms were trained and evaluated to classify brain tumors into two classes, LGGs and HGGs, in various imaging settings. TL methodologies were used to improve the binary classification. Figure 1 illustrates the methodology of this study.

In this work, the parameter optimization approach proposed in [54,55] is analyzed and applied, focusing on binary classification of brain tumors. This study also utilizes the model training methodology previously implemented in [3,55,56].

The proposed methodology consists of four main steps. A detailed description of the dataset, BRATS 2019, is provided in subsequent sections. A variety of preprocessing techniques are used for the dataset, including labeling, resizing, color adjustments, rotation, and random flipping. The second stage of the process consists of training and validating pre-trained CNN models in order to assess their performance, as shown in Figure 1. During validation, the accuracy of the training process is ensured, while testing is conducted using images that are different from those used for training and validation.

As a final step, key performance metrics are used to evaluate the efficacy of the proposed models, including accuracy, precision, recall, the F1-Score, and the Matthews correlation coefficient (MCC). In order to identify which model produces the best results with the specified dataset, each pre-trained model is tested.
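
The metrics listed above can all be derived from the four entries of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study's results.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # The MCC denominator is zero when a row or column of the matrix is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical counts with HGG treated as the positive class (illustration only).
metrics = binary_metrics(tp=100, fp=5, fn=8, tn=106)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside the F1-Score.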

The pre-trained models employed in this study are described below:

3.1. Inception_v4

The Inception family was introduced by Google in 2014, and the Inception_v4 model builds on its predecessors’ foundational principles. Essentially, the concept behind Inception involves the use of multiple concurrent branches to apply different types of convolutional layers, which enables the extraction of information at varying levels and dimensions. As a significant improvement over earlier versions, Inception_v4 includes factorized convolutions, residual connections, and label smoothing [54,57].

3.2. Xception41

Based on the Inception model, the Xception41 CNN uses depthwise separable convolutions as its primary innovation. A depthwise separable convolution approximates an Inception module and consists of a depthwise convolution, applied independently to each channel, followed by a pointwise (1 × 1) convolution that combines information across channels [7,58].
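
To make the efficiency argument concrete, the weight count of a depthwise separable convolution can be compared with that of a standard convolution. This is a minimal sketch: the layer sizes are hypothetical, bias terms are ignored, and the function names are introduced here only for illustration.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise k x k convolution plus a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 128, 256
standard = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)
print(f"standard: {standard}, separable: {separable}, "
      f"ratio: {standard / separable:.2f}")
```

For these illustrative sizes, the separable layer needs roughly 8.7 times fewer weights than the standard one.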

3.3. EfficientNet_b0

EfficientNet is designed to maximize both the accuracy and efficiency of image classification. Using a compound coefficient, the resolution, width, and depth of the network are scaled uniformly. This design allows the model to adapt to varying resource constraints and datasets. Model components include a stem convolution layer and seventeen inverted bottleneck residual blocks, each equipped with squeeze-and-excitation modules [54,56,59].
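
Compound scaling can be sketched as follows. A single coefficient φ sets the multipliers for network depth, width, and input resolution. The base constants below (α = 1.2, β = 1.1, γ = 1.15) are those reported in the original EfficientNet paper, not values taken from this study. Here φ = 0 denotes the unscaled baseline, which is the sense in which EfficientNet_b0 is treated in this sketch.

```python
# Base multipliers from the original EfficientNet paper (not from this study).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi: float) -> dict:
    """Depth, width, and resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


# The constants were chosen so that alpha * beta^2 * gamma^2 is close to 2,
# i.e., each unit increase of phi roughly doubles the compute cost.
print(f"alpha * beta^2 * gamma^2 = {ALPHA * BETA**2 * GAMMA**2:.3f}")

for phi in range(4):
    scales = compound_scale(phi)
    print(phi, {name: round(value, 3) for name, value in scales.items()})
```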

3.4. Data Efficient Image Transformer

The DeiT3_base_patch16_224 model is based on the transformer architecture, which was introduced by Google in 2017 for Natural Language Processing (NLP). Transformers use self-attention mechanisms to prioritize the most relevant parts of an input sequence, making them highly efficient. Vision transformers (ViTs) have become very popular for their ability to handle visual data. Transformer-based models can be optimized by transferring knowledge from large pre-trained teacher models to more compact student models, facilitating faster learning [54,60].

3.5. Swin Transformer

Swin_tiny_patch4_window7_224 uses the shifted-window (Swin) attention mechanism. Self-attention is computed within local, non-overlapping windows of image patches, and the window partition is shifted between consecutive layers so that information can flow across window boundaries. As a result, the model can capture long-range dependencies while reducing the computational cost of attention [54,61].
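
The cost reduction can be illustrated with simple counting. Swin computes self-attention inside local windows of patches rather than over all patches at once, so the number of token pairs grows linearly with the number of windows instead of quadratically with the number of tokens. The sketch below reads the input size (224), patch size (4), and window size (7) from the model name. It covers only the first, highest-resolution stage and counts token pairs rather than exact floating-point operations.

```python
def attention_pairs(image_size: int = 224, patch_size: int = 4,
                    window_size: int = 7) -> tuple:
    """Count token pairs for global vs. windowed self-attention (first stage only)."""
    tokens_per_side = image_size // patch_size        # 224 / 4 = 56
    n_tokens = tokens_per_side ** 2                   # 56 * 56 patches
    windows_per_side = tokens_per_side // window_size
    n_windows = windows_per_side ** 2
    tokens_per_window = window_size ** 2
    global_pairs = n_tokens ** 2                      # every token attends to every token
    windowed_pairs = n_windows * tokens_per_window ** 2
    return n_tokens, n_windows, global_pairs, windowed_pairs


n_tokens, n_windows, global_pairs, windowed_pairs = attention_pairs()
print(f"tokens: {n_tokens}, windows: {n_windows}")
print(f"global pairs: {global_pairs}, windowed pairs: {windowed_pairs}")
print(f"reduction factor: {global_pairs // windowed_pairs}")
```

Under these assumptions the reduction factor equals the number of windows, since each window contributes the same number of pairs.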

3.6. Convnextv2_tiny

Convnextv2_tiny belongs to the ConvNeXt V2 family, a modernized, ResNet-style Convolutional Neural Network (ConvNet) design that incorporates ideas from vision transformers to produce a more efficient and effective system. Performance is improved through the combination of self-supervised pre-training with masked autoencoders and a Global Response Normalization (GRN) layer. These models are trained and evaluated on public benchmarks such as ImageNet, used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [62].

3.7. Procedure for Model Training

The following items are prerequisites to training the model:

Dataset preparation: A dataset of images of HGGs and LGGs is arranged into two folders, Train and Test, each containing the two classes: HGG and LGG. There is patient-wise separation in the dataset, and the images in the training and testing folders are from different patients.
Selection of programming environment: Google Colab [63] is a cloud-based programming environment equipped with advanced hardware, including GPUs and TPUs, which enable faster model training compared to CPUs. This work utilizes the Nvidia A100 GPU.
Programming language: Python 3 [64].
Library selection: PyTorch [65] and scikit-learn [66] are used for model training and evaluation, and Weights & Biases (wandb) [67] is used for tracking and monitoring the training process.
Image data augmentation: This preprocessing method enhances the robustness and diversity of the dataset [68].
K-fold cross-validation: The training stage is evaluated using three folds generated by the standard cross-validation approach, which are used in training and validation stages.
Transfer learning strategy during training: Using this method, knowledge is transferred from the source domain to the target domain, thereby addressing the issue of insufficient training data. This has an important impact on the advancement of several complex fields with limited training data [69].
Model testing: Metrics are calculated to evaluate the performance of the six previously trained DNNs.
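
The three-fold cross-validation step listed among the prerequisites above can be sketched as follows. This plain-Python version only makes the mechanics explicit; scikit-learn offers equivalent utilities. The function name `k_fold_indices`, the seed, and the toy sample count are hypothetical.

```python
import random


def k_fold_indices(n_samples: int, k: int = 3, seed: int = 0) -> list:
    """Return k (train_indices, val_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]  # each fold serves exactly once as the validation set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits


# Toy example with 12 samples (illustration only).
splits = k_fold_indices(n_samples=12, k=3)
for fold_id, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold_id}: {len(train_idx)} train, {len(val_idx)} validation")
```

When several slices come from the same patient, the folds would need to be grouped by patient to avoid leakage between training and validation. This section states patient-wise separation for the Train and Test folders but does not say how the folds themselves were formed.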

3.7.1. BRATS 2019 Dataset

The study utilizes brain MRI images from the Brain Tumor Segmentation (BraTS) 2019 dataset [70]. BraTS 2019 is the official MICCAI BraTS Challenge benchmark distributed by CBICA (University of Pennsylvania) and comprises routinely acquired pre-operative, multi-parametric MRI scans from real glioma patients collected across multiple clinical institutions with heterogeneous scanners and protocols. The public release is de-identified (no direct personal identifiers), and the resulting variability across sites supports reproducible evaluation under realistic acquisition differences; moreover, the dataset is released with a pre-defined split (separate Train and Test folders) containing different patients.

The BraTS 2019 dataset includes four MRI modalities: T1, contrast-enhanced T1 (T1ce), T2, and FLAIR. T1ce images were used because of their enhanced contrast, as this modality is commonly used in cancer research for tumor identification [1,6]. In total, the T1ce data comprise 335 brain volumes (one per patient case), 259 of which are HGG cases and 76 of which are LGG cases.

As part of the training phase, 335 patient cases from both categories were incorporated using the T1ce modality. Slices were extracted from each image in the dataset using a segmentation technique, and the optimal slices depicting the tumor lesion were selected. In the case of the HGG class, 1074 axial slice images were obtained; in the case of the LGG class, there were 1111. The test dataset consisted of 10% of the images in each class, resulting in 108 images of HGGs and 111 images of LGGs. To avoid overlap between training/validation and testing, different patients are included in the Train and Test folders (patient-wise division). A representative sample of the images included in the dataset is shown in Figure 2.

The class distribution is near-balanced in the training set, as shown in Figure 3. With only slight variation between the HGG and LGG samples, the class counts and percentages indicate a well-balanced dataset. This distribution promotes equitable reporting of per-class results and lessens the possibility of bias toward the majority class. Furthermore, the imbalance indicators in Figure 4 confirm minimal skew, which supports the use of standard training without imbalance-driven corrections. The imbalance ratio (IR) and entropy balance (EB) indicators were computed following the definitions reported in [55].
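As an illustration of the two indicators, the following minimal sketch computes an imbalance ratio and an entropy-based balance score from the per-class slice counts quoted above (1074 HGG and 1111 LGG). The definitions used here (majority-to-minority ratio, and Shannon entropy normalized by the logarithm of the number of classes) are common conventions and are an assumption; the exact definitions in [55] may differ.

```python
import math

def imbalance_ratio(counts):
    """Majority-to-minority ratio (assumed definition); 1.0 means perfectly balanced."""
    return max(counts.values()) / min(counts.values())

def entropy_balance(counts):
    """Shannon entropy of class proportions divided by log(n_classes) (assumed definition)."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))

# Per-class slice counts quoted in Section 3.7.1 (before the test split).
slice_counts = {"HGG": 1074, "LGG": 1111}
ir = imbalance_ratio(slice_counts)
eb = entropy_balance(slice_counts)
print(f"IR = {ir:.3f}, EB = {eb:.4f}")
```

Under these assumed definitions, an IR close to 1 and an EB close to 1 both indicate a near-balanced dataset.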

3.7.2. Image Data Augmentation

CNNs have proven effective in numerous computer vision applications. However, artificial neural networks require extensive datasets to minimize the risk of overfitting.

During overfitting, the neural network constructs a function with excessive variance, which produces erroneous outputs and limits the network's capacity to generalize. Image data augmentation modifies and expands the existing dataset within the data space in order to mitigate data scarcity, so that DL models can be trained more effectively. Methods of augmenting image data include geometric transformations, color space alterations, kernel filtering, image mixing, random blurring, and feature space augmentation [68].

Instead of creating a static augmented dataset, image augmentation was performed as dynamic, probabilistic PyTorch transformations, as shown in Table 1, which present randomized versions of the training images each time they are loaded during training. In Figure 5, examples of randomly augmented images from each class demonstrate a range of light levels, angles, and placements.

3.7.3. Hyperparameter Setup

Table 2 presents the hyperparameters that were used. The number of epochs was set to 10, ensuring a manageable training time while maintaining model performance. Batch sizes were set to 32 and 64. We used a small dataset for initial experimentation, limiting data size to 10% and 25%. Subsequently, larger subsets of 50%, 75%, and 100% of the dataset were employed for training.
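The reduced-data subsets can be illustrated with a short sketch. It assumes that each subset is drawn per class (stratified) with a fixed seed, which the text does not state explicitly; the function name `stratified_subset`, the file names, and the training class sizes (966 HGG and 1000 LGG, obtained here by subtracting the test images from the totals in Section 3.7.1) are illustrative assumptions.

```python
import random

def stratified_subset(samples, fraction, seed=0):
    """Return roughly `fraction` of the (path, label) pairs, sampled per class."""
    rng = random.Random(seed)  # local RNG so the draw is repeatable
    by_class = {}
    for path, label in samples:
        by_class.setdefault(label, []).append((path, label))
    subset = []
    for label in sorted(by_class):
        items = by_class[label]
        k = max(1, round(fraction * len(items)))
        subset.extend(rng.sample(items, k))
    return subset

# Hypothetical file list; only the class sizes matter for the illustration.
train = [(f"hgg_{i}.png", "HGG") for i in range(966)] + \
        [(f"lgg_{i}.png", "LGG") for i in range(1000)]

fractions = (0.10, 0.25, 0.50, 0.75, 1.00)
sizes = {f: len(stratified_subset(train, f)) for f in fractions}
for f, n in sizes.items():
    print(f"{int(f * 100):3d}% -> {n} training images")
```

Sampling within each class keeps the HGG/LGG proportions of the full training set in every subset, so differences between subset sizes are not confounded by changes in class balance.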

The optimizer employed was Adam (Adaptive Moment Estimation), a widely used optimization algorithm in DL. The initial learning rate ranged from $1 \times 10^{-5}$ to $1 \times 10^{-3}$. The ReduceLROnPlateau scheduler was used to adjust the learning rate dynamically during training, reducing the learning rate based on validation performance after each epoch.
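To make the reduce-on-plateau rule concrete, the sketch below re-implements its core logic in plain Python. It is a simplified illustration and not the PyTorch class itself: thresholds and cooldown are omitted, and the `factor`, `patience`, and validation-loss values are assumptions, since the scheduler settings are not given in the text.

```python
class PlateauLR:
    """Simplified reduce-on-plateau rule (no threshold or cooldown)."""

    def __init__(self, lr, factor=0.1, patience=2, min_lr=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the current LR."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                # No improvement for more than `patience` epochs: shrink the LR.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr

# Hypothetical validation losses over a 10-epoch run.
val_losses = [0.50, 0.30, 0.31, 0.32, 0.33, 0.29, 0.28, 0.28, 0.28, 0.28]
sched = PlateauLR(lr=1e-3, factor=0.1, patience=2)
lrs = [sched.step(v) for v in val_losses]
print(lrs)
```

The learning rate stays constant while the validation loss keeps improving and is multiplied by `factor` only after the loss has stalled for more than `patience` consecutive epochs.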

To enable exact re-runs and reduce variance due to uncontrolled randomness, we fixed the random seeds and deterministic settings across the entire pipeline. Specifically, we set random.seed(SEED), numpy.random.seed(SEED), torch.manual_seed(SEED), and torch.cuda.manual_seed_all(SEED), enabled deterministic CuDNN behavior, and ensured DataLoader worker seeding. Regarding data handling, preprocessing was applied first (resizing and normalization), followed by dynamic, probabilistic PyTorch augmentations, which were executed on the fly during training (Table 1) rather than generating a static augmented dataset. All architectures were allocated a fixed training budget of 10 epochs to regulate computational costs between models and ensure a fair comparison, consistent with previous studies [54,55,56]. Under the transfer-learning configuration with the ReduceLROnPlateau scheduler, the networks typically reach stable validation behavior within this brief schedule, as reflected by the rapid stabilization of validation accuracy and the corresponding decrease in validation loss.
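The seeding steps listed above can be gathered into one helper, as in the following sketch. The seed value, the function names `seed_everything` and `seed_worker`, and the guarded imports (so that the snippet also runs where NumPy or PyTorch is not installed) are assumptions for illustration; only the individual seeding calls are taken from the text.

```python
import random

def seed_everything(seed: int = 42) -> None:
    """Seed the Python, NumPy and PyTorch RNGs (the last two only if installed)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)            # silently ignored without a GPU
        torch.backends.cudnn.deterministic = True   # deterministic cuDNN kernels
        torch.backends.cudnn.benchmark = False      # no autotuned kernel selection
    except ImportError:
        pass

def seed_worker(worker_id: int) -> None:
    """Candidate `worker_init_fn` for a DataLoader: per-worker NumPy/Python seeds."""
    import numpy as np
    import torch
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```

Calling `seed_everything` before building the model and the data loaders, and passing `seed_worker` as the loader's `worker_init_fn`, is one common way to obtain repeatable runs; some GPU operations may remain non-deterministic even with these settings.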

3.7.4. Performance Metrics

The performance metrics used in this study were also employed in prior works such as [14]. Accuracy is one of the most fundamental and intuitive metrics: it is the ratio of correctly predicted observations to total observations, as given in Equation (1). Accuracy is most informative when datasets are balanced and false positives and false negatives occur at similar rates.

Accuracy = \frac{T P + T N}{T P + F P + T N + F N},

(1)

where

$T P$ : True Positives (correctly classified positive cases).
$T N$ : True Negatives (correctly classified negative cases).
$F P$ : False Positives (incorrectly classified as positive).
$F N$ : False Negatives (incorrectly classified as negative).

Equation (2) defines precision as the ratio of correctly predicted positive observations to the total predicted positive observations. This metric, also known as Positive Predictive Value, evaluates the model’s ability to identify positive cases accurately.

Precision = \frac{T P}{T P + F P} .

(2)

Recall, also known as sensitivity, hit rate, or true positive rate, measures the proportion of correctly predicted positive observations to the total positive observations in the actual class. Equation (3) provides the formula for calculating recall, which is often considered a critical performance metric for evaluating model quality.

Recall = \frac{T P}{T P + F N} .

(3)

The F1-Score is a harmonic mean that combines precision and recall into a single comprehensive metric. While precision and recall independently provide valuable insights, neither can fully capture a model’s overall performance. For instance, a model may achieve high precision but low recall or vice versa. The F1-Score addresses this limitation by balancing the trade-off between precision and recall in a single value.

As seen in Equation (4), the F1-Score may be determined by integrating the precision and recall metrics after a binary or multiclass classification task.

F1\text{-}Score = \frac{2 \times Recall \times Precision}{Recall + Precision}

(4)

The Matthews Correlation Coefficient (MCC) [71] is a robust metric used to evaluate the quality of binary (two-class) classifications. Unlike more straightforward metrics such as accuracy, the MCC is particularly valuable when dealing with imbalanced datasets, as it considers all elements of the confusion matrix. The MCC is calculated using Equation (5):

MCC = \frac{T P \times T N - F P \times F N}{\sqrt{(T P + F P) (T P + F N) (T N + F P) (T N + F N)}}

(5)

Typical performance metrics such as accuracy, precision, recall, and the F1-Score have been used extensively in prior works [7,10,40,48,52,72,73]. In contrast, the MCC was specifically employed in [54] for its ability to provide a balanced assessment of model performance, even in scenarios with disproportionate class distributions.
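Equations (1)–(5) can be checked numerically with a few lines of Python. The sketch below is a direct transcription of the formulas; the confusion-matrix counts are hypothetical (chosen to add up to a 219-image test set with 108 positives and 111 negatives), and treating HGG as the positive class is an assumption.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1-Score and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)                   # Equation (1)
    precision = tp / (tp + fp)                                   # Equation (2)
    recall = tp / (tp + fn)                                      # Equation (3)
    f1 = 2 * recall * precision / (recall + precision)           # Equation (4)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))           # Equation (5)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical counts: 108 HGG (positive) and 111 LGG (negative) test images.
m = binary_metrics(tp=107, tn=110, fp=1, fn=1)
for name, value in m.items():
    print(f"{name}: {value:.5f}")
```

The function does not guard against empty denominators (for example, a model that never predicts the positive class), which a production implementation would need to handle.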

3.7.5. Training Phase Using a Transfer Learning Methodology

In this stage, six different DNN models are employed. Through a standard cross-validation approach with $k = 3$, each model architecture is iteratively trained and evaluated.

Figure 6 depicts k-fold cross-validation with $k = 3$. Using this approach, the dataset is partitioned into k folds; in each iteration, one fold serves as the validation subset and the remaining folds form the training subset. The value of k represents the number of folds.
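The fold rotation can be sketched in a few lines. This is a plain, shuffled k-fold split written for illustration; whether the folds in the study were stratified or grouped by patient is not stated, so neither is assumed here, and the helper name `k_fold_indices` is hypothetical.

```python
import random

def k_fold_indices(n_samples, k=3, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=12, k=3))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: {len(train_idx)} training / {len(val_idx)} validation samples")
```

With $k = 3$, each model is trained three times, each time on two thirds of the training data and validated on the remaining third.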

After training, a test dataset that includes different images is used to evaluate the performance of each model. The model with the best test performance is then identified as the most effective for distinguishing LGGs from HGGs.

4. Results and Discussion

4.1. Test-Set Performance Across Hyperparameter Configurations

All hyperparameter exploration was performed on the training split (using the internal k-fold validation procedure), and the test set never guided model selection or hyperparameter adjustment. To determine the final generalization performance of each configuration/model, the test set was kept fixed and examined only after training, which ensured that the reported test accuracy was independent of the optimization process. In Figure 7, the parallel coordinate plot displays the optimization process across the deep learning architectures used in this study with the hyperparameters specified in Table 2; training loss is the primary objective function, and test accuracy is reported as a post hoc evaluation of each configuration. The configuration described in Table 2 was modified iteratively across the six proposed architectures.

A large share of the architectures converge towards high accuracy, with performance on the independent test set of unseen images consistently approaching 99%, indicating that the optimization procedure is reliable. Figure 8 further illustrates the success of the hyperparameter approach, demonstrating tighter convergence of the evaluated models. The results shown in the figure were obtained when 25% of the dataset was used, with the initial learning rate set at $1 \times 10^{-5}$. The test accuracy is included in the diagram only as a post hoc performance indicator to show that the selected configuration generalizes well on unseen data; the test set was not used during hyperparameter optimization or model selection, and it is reported solely for independent evaluation.

Across all architectures, the test accuracy approaches 99%, with minimal loss, when batch sizes and learning rates are optimized. Because the test set is strictly separated from the training and validation sets at the patient level, these results support the generalizability of the optimization pipeline across all of the networks evaluated.

4.2. Statistical Summary of Test Performance

Performance variability on an independent test set was examined to enhance the statistical rigor of the given point estimates. This section provides the mean and standard deviation (Std Dev) of the test metrics for each design and a distributional visualization (boxplots) to study consistency over multiple measurements in order to show dispersion and outliers. Together, these descriptions provide a more detailed overview of the reliability of the models than would be possible with the test accuracy scores alone.

As a clear indicator of performance variability between runs, Table 3 presents the Std Dev of the primary test metrics (accuracy, F1-Score, precision, recall, and MCC) for each tested model. Higher Std Dev values imply that the model’s test performance is more susceptible to the specific training settings (e.g., fold assignment and stochastic optimization effects), whereas lower Std Dev values show more consistent generalization.

This comparison shows that architectures such as Xception41 and Inception_v4 exhibit comparatively smaller Std Devs across multiple metrics, indicating more stable behavior under the evaluated protocol. Conversely, swin_tiny_patch4_window7_224 and convnextv2_tiny show larger dispersions, particularly in F1-Score, recall, and MCC, indicating higher variability in the balance between sensitivity and specificity across runs. Overall, this analysis contextualizes the high performance values reported earlier. It shows not only how well the models perform but also how reliably they sustain performance under repeated evaluations.

The Avg columns report the mean test performance across repeated runs, providing an estimate of each model’s typical generalization level under the evaluated protocol. When interpreted together with the corresponding Std Dev, these averages enable a more rigorous assessment of robustness: higher Avg reflects stronger expected performance, while lower Std Dev indicates greater stability and reduced sensitivity to training stochasticity or fold variability. Based on this joint interpretation, Xception41 and Inception_v4 emerge as the most robust models, combining consistently high mean test performance with the lowest dispersion across metrics, while deit3_base_patch16_224 and efficientNet_B0 follow with strong averages and moderate variability.
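The joint reading of Avg and Std Dev can be illustrated with the standard library. The per-run accuracies below are hypothetical and are not the values of Table 3, and the use of the sample standard deviation (n − 1 denominator) is an assumption.

```python
import statistics

# Hypothetical test accuracies from repeated runs of two models.
runs = {
    "model_a": [0.991, 0.993, 0.992, 0.990],
    "model_b": [0.999, 0.975, 0.996, 0.982],
}

summary = {
    name: (statistics.mean(values), statistics.stdev(values))
    for name, values in runs.items()
}
for name, (avg, std) in summary.items():
    print(f"{name}: Avg = {avg:.4f}, Std Dev = {std:.4f}")
```

In this toy example, `model_b` reaches the highest single-run accuracy, but `model_a` has both the higher mean and the smaller dispersion, which is the sense in which a model is called more robust here.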

In addition to the Std Dev summary, Figure 9 shows the whole distribution of test-set performance across runs for each architecture. Boxplots show the median and interquartile range (IQR), while dispersed dots and whiskers draw attention to outliers and dispersion. A more thorough examination of model stability that goes beyond single-point accuracy ratings is supported by this perspective.
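The quantities a boxplot displays can be computed directly, as in the following sketch. The scores are hypothetical, and the 1.5 × IQR rule for flagging outliers is the usual plotting default, assumed here rather than taken from the text.

```python
import statistics

def box_stats(values, whisker=1.5):
    """Median, IQR and outliers (points beyond whisker * IQR from the quartiles)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"median": median, "iqr": iqr, "outliers": outliers}

# Hypothetical per-run scores for one model and one metric.
scores = [0.990, 0.991, 0.992, 0.992, 0.993, 0.994, 0.950]
stats = box_stats(scores)
print(stats)
```

A single poor run leaves the median almost unchanged but appears as an outlier, which is why the boxplot view complements the mean and Std Dev summary.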

Greater consistency among models is shown by tighter boxes and shorter whiskers, whereas sensitivity to training circumstances is indicated by broad IQRs and a large number of outliers. The concentrated distribution of Inception-v4 and the continuous strong central tendency across measures demonstrate stable generalization. Additionally, Xception41 shows a compact distribution with a relatively small spread for accuracy, precision, and recall, demonstrating dependable behavior under repeated assessment. However, EfficientNet-B0 retains a high median performance across measures despite having more outliers and a wider dispersion than most compact models. This indicates generally good results with sporadic fluctuations. In terms of central tendency, deit3_base_patch16_224 likewise performs well on tests; nevertheless, in comparison to the most compact distributions, its boxplots have a larger spread and more outliers in various metrics (most notably F1 and MCC).

By contrast, swin_tiny_patch4_window7_224 and convnextv2_tiny display wider distributions, especially in F1-Score, recall, and MCC, revealing higher variability in clinically relevant trade-offs between sensitivity and specificity across runs. Generally, the boxplot analysis confirms that high peak performance should be interpreted alongside stability: models with consistently high medians and low dispersion provide more reliable performance over time.

Based on robustness (narrow IQR and fewer severe outliers) and central tendency (high medians), Inception-v4, Xception41, and EfficientNet-B0 appear to be the three most statistically reliable architectures in our investigation. These models are favored when consistency is prioritized in addition to peak accuracy because they combine good test performance with relatively narrower distributions across the evaluated parameters.

In summary, the combined Std Dev table and boxplot provide distribution-aware evidence of performance robustness, supporting a more statistically grounded interpretation of the reported test results.

4.3. Training Dynamics and Convergence Analysis

Figure 10 illustrates the performance metrics during the training stage: (a) training accuracy, (b) training loss, and (c) validation accuracy. By the end of training, the architectures inception_v4, convnextv2_tiny, and deit3_base_patch16_224 nearly reached an accuracy of 100%, while efficientnet_b0, xception41, and swin_tiny_patch4_window7_224 achieved close to 99% accuracy. This behavior is also reflected in the training loss, as shown in Figure 10b. Furthermore, Figure 10c highlights that xception41 is the only architecture that fails to achieve high validation accuracy. The curves in Figure 10 indicate that the selected configuration achieves stable performance within the 10-epoch budget, supporting the fixed-epoch choice used throughout the benchmarking. The metrics used to evaluate these architectures (recall, accuracy, precision, F1-Score, and Matthews Correlation Coefficient) were computed with the scikit-learn package. The performance of the DNNs is evaluated using 219 test instances.

Figure 11 shows the accuracy of each model in a bubble plot that relates accuracy to the number of parameters in the model. The specific conditions are an initial learning rate of $1 \times 10^{-5}$, a batch size of 64, and a training dataset size of 25%. The accuracy of four of the models employed in this study reaches 99%, with the position of each model on the plot depending on its number of parameters, measured in millions. First on the left is efficientnet_b0 with 5.3 million parameters, followed by xception41 with 22.9 million, swin_tiny_patch4_window7_224 with 28.1 million, and convnextv2_tiny with 28.6 million parameters. Continuing to the right, inception_v4 has 42.7 million parameters, followed by deit3_base_patch16_224 with 86.1 million parameters.

While all models are highly accurate, accuracy alone is not enough to choose among them, because their parameter counts range from about 5 million to more than 86 million. At first glance, efficientnet_b0 seems to be the most appropriate option due to its lower parameter count. Nonetheless, a more comprehensive evaluation is required to establish whether other models would be more efficient for this binary classification task.

Table 4 presents the performance results of the models evaluated in this study for one configuration: a learning rate of $1 \times 10^{-5}$, a batch size of 64, and 25% of the dataset in the training stage. The models deit3_base_patch16_224, inception_v4, xception41, swin_tiny_patch4_window7_224, convnextv2_tiny, and efficientnet_b0 were tested. Accuracy, precision, recall, F1-Score, and MCC were the performance metrics considered in this study.

Based on all the measured metrics, the architecture deit3_base_patch16_224 is the best-performing model. An accuracy of 0.99401, precision of 0.99201, recall of 0.99610, F1-Score of 0.99403, and MCC of 0.98807 indicate exceptional performance in terms of accuracy and balance across classification metrics. However, it has the highest number of parameters, resulting in a higher computational cost, as shown in Figure 11. As a result, it may not be the most economical choice in terms of computational costs.

Inception_v4 has an accuracy of 0.99212, with notable values in other metrics: a precision of 0.99126, a recall of 0.99325, an F1-Score of 0.99222, and an MCC of 0.98431. While it is inferior to deit3_base_patch16_224, the results produced by this model are very balanced and remain highly competitive. As shown in Figure 11, inception_v4 has fewer parameters, which results in a decrease in computational cost, making it a superior option for classification.

With an accuracy of 0.99175, Xception41 exhibits remarkable performance. Similarly, with an accuracy of 0.99139, swin_tiny_patch4_window7_224 is close to the models previously mentioned. Convnextv2_tiny demonstrates moderate performance, with an accuracy of 0.98125. Despite a commendable precision of 0.98252, its recall of 0.97922 suggests a slight tendency to misclassify some positive instances.

A final point to consider is that efficientnet_b0 displays the lowest accuracy of 0.95274 among the evaluated models, according to the values observed in Table 4.

The metrics in Table 4 were obtained using Equations (1)–(5).
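One way to read the accuracy and parameter-count trade-off together is to keep only the models for which no alternative is both smaller and more accurate. The sketch below applies this filter to the parameter counts and accuracies quoted in this section; it is an illustrative reading rather than an analysis performed in the study, and parameter count is only a rough proxy for computational cost.

```python
# (parameters in millions, test accuracy) as quoted in Section 4.3
# for a learning rate of 1e-5, a batch size of 64 and 25% of the training data.
models = {
    "efficientnet_b0": (5.3, 0.95274),
    "xception41": (22.9, 0.99175),
    "swin_tiny_patch4_window7_224": (28.1, 0.99139),
    "convnextv2_tiny": (28.6, 0.98125),
    "inception_v4": (42.7, 0.99212),
    "deit3_base_patch16_224": (86.1, 0.99401),
}

def pareto_front(candidates):
    """Keep models for which no other model is at least as small and as accurate."""
    front = []
    for name, (params, acc) in candidates.items():
        dominated = any(
            p <= params and a >= acc and (p < params or a > acc)
            for other, (p, a) in candidates.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

front = pareto_front(models)
print(front)
```

Under this criterion, each model that remains offers either fewer parameters or higher accuracy than every other remaining model, so the final choice depends on how much accuracy is worth relative to model size.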

4.4. ML-Model Explorer

This study used the ML-Model Explorer tool, as suggested in [54,74]. It allows users to evaluate and select multi-class classifiers based on their confusion matrices, with particular attention to class imbalance.

Convnextv2_tiny was identified as a weak classifier based on the evaluation. Meanwhile, efficientnet_b0, swin_tiny_patch4_window7_224, and xception41 were categorized as moderate classifiers. Inception_v4 and deit3_base_patch16_224 were ranked as strong classifiers in descending order in Figure 12.

Figure 13 provides a graphic representation of the Std Dev of the recall of each model.

The X-axis in Figure 13 represents the Std Dev of the recall metrics achieved by each model across classes. Having a lower Std Dev indicates improved consistency between predictions. The Y-axis indicates the overall accuracy of the model based on the dataset. A higher level of accuracy indicates that the model correctly classifies a greater number of samples.
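The two axes can be reproduced from a confusion matrix, as sketched below. The counts are hypothetical, and the use of the population standard deviation across classes is an assumption, since the exact convention used by the tool is not stated.

```python
import statistics

def per_class_recall(confusion):
    """confusion[true][pred] -> count; recall of a class = diagonal / row sum."""
    return {c: confusion[c][c] / sum(confusion[c].values()) for c in confusion}

# Hypothetical confusion matrix on the 219 test images.
confusion = {
    "HGG": {"HGG": 107, "LGG": 1},
    "LGG": {"HGG": 2, "LGG": 109},
}

recalls = per_class_recall(confusion)
recall_std = statistics.pstdev(list(recalls.values()))   # X-axis value
correct = sum(confusion[c][c] for c in confusion)
total = sum(sum(row.values()) for row in confusion.values())
accuracy = correct / total                                # Y-axis value
print(recalls, recall_std, accuracy)
```

A model in the upper-left corner of such a plot is both accurate overall and similarly reliable on HGG and LGG; the per-class error is simply one minus the per-class recall.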

In this study, swin_tiny_patch4_window7_224 (indicated by a pink dot) achieved the highest accuracy (≈0.9995) and a very low recall Std Dev (≈0.001). Among the models compared, it offers the best balance between overall performance and stability across classes.

The model deit3_base_patch16_224 (symbolized by a blue dot) achieved an accuracy of ≈0.999, with a very low recall Std Dev ≈ 0.0015. Compared to the Swin model, its performance is approximately the same, although its overall accuracy is slightly lower.

EfficientNet_B0, indicated by the brown dot, achieved an accuracy of ≈0.999, similar to deit3_base_patch16_224, but with a higher recall Std Dev ≈ 0.003, indicating less consistent performance across classes.

Xception41 (dark purple dot) achieved an accuracy close to 0.9995, comparable to the Swin model, but with a slightly higher recall Std Dev (≈0.002). Despite its high accuracy, this model is therefore slightly less consistent across classes.

Convnextv2_tiny, symbolized by a black dot, achieved an accuracy of ≈0.997 with a recall Std Dev of ≈0.002. In spite of its consistency, its accuracy is significantly lower when compared to the other models mentioned so far.

Inception_V4, marked with a green dot, showed the lowest accuracy ≈ 0.9965 and the highest recall Std Dev ≈ 0.005, making it the least effective and consistent model among those compared.

The models demonstrating the highest accuracy were swin_tiny_patch4_window7_224 and Xception41, while the most reliable designs were swin_tiny_patch4_window7_224 and deit3_base_patch16_224, both exhibiting lower Std Devs, suggesting effective handling of class balancing. Conversely, the least effective model was Inception_V4, which showed the lowest performance on both measures.

For applications requiring high accuracy and consistency, the swin_tiny_patch4_window7_224 model appears to be the optimal choice. However, if inter-class stability is less critical, deit3_base_patch16_224 may be a viable alternative.

In Figure 14, a boxplot illustrates the distribution of three essential metrics: F1-Score, precision, and recall. Metric values are shown on the Y-axis, while models are labeled on the X-axis. The plot allows each model's central tendency and variability across these metrics to be compared.

Swin_tiny_patch4_window7_224 has an F1-Score that consistently approaches 1, with a compact box and short whiskers. Precision and recall are almost identical (≈1) with minimal variation, making it the most consistent and accurate model, with near-perfect metrics overall.

For deit3_base_patch16_224, the F1-Score is ≈0.999, with a compact box and limited range, signifying stability. As with swin_tiny_patch4_window7_224, its precision and recall remain consistently high.

The third-place model, EfficientNet_b0, shows solid metrics but demonstrates greater performance variability than the leading models swin_tiny_patch4_window7_224 and deit3_base_patch16_224.

Figure 15 provides error metrics for LGG and HGG classes, enabling comparisons of model errors between the two classes. The y-axis represents the magnitude of error for each model; a greater error indicates a lower degree of effectiveness.

The swin_tiny_patch4_window7_224 model exhibits the lowest error rates, with nearly negligible values (≈0.000) for both classifications (Low Grade and High Grade). This makes it the most robust and effective model for categorization. Deit3_base_patch16_224 demonstrates minimal errors of approximately 0.001 for Low Grade and 0.002 for High Grade, indicating excellent reliability and performance.

Efficientnet_b0 shows slightly higher errors, with approximately 0.002 for Low Grade and approximately 0.003 for High Grade. Although it does not achieve the same level of accuracy as the top-performing models, it maintains satisfactory consistency across both categories. Compared with EfficientNet at ≈0.003, Xception41 displays higher errors, with approximately 0.004 for Low Grade and 0.003 for High Grade. This makes it a less accurate model in general.

Both swin_tiny_patch4_window7_224 and deit3_base_patch16_224 are the most balanced models, exhibiting low error discrepancies across classes.

4.5. DNN Training Runtime, Computational Complexity, and GPU Power Usage

Figure 16a illustrates the percentage of GPU use. The X-axis represents the duration of model training, while the Y-axis displays the percentage of GPU consumption.

The models deit3_base_patch16_224 (pink) and convnextv2_tiny (green) show significant variations in GPU consumption, occasionally reaching 100%. Efficientnet_b0 (yellow) and Xception41 (orange) demonstrate lower average consumption, though with considerable fluctuations.

Inception_v4 (purple) demonstrates relatively stable GPU consumption, consistently remaining below 40%. The swin_tiny_patch4_window7_224 (light green) model exhibits moderate consumption with variations but is less unstable than its counterparts.

Figure 16b illustrates the CPU consumption of the procedure (%). The X-axis represents the elapsed training time, while the Y-axis indicates the proportion of CPU consumption attributed to the training process.

The CPU usage rates of all models are relatively low and stable. Efficientnet_b0 (yellow line) has the highest CPU consumption, approaching 40%. In addition, other models such as deit3_base_patch16_224, convnextv2_tiny, and inception_v4 exhibit usage rates in the range of 20% to 30%.

A majority of models are dependent on GPUs, with oscillations possibly related to computationally intensive tasks. A significant reduction in CPU demand can be observed for parallel processing tasks that are dominated by GPUs.

More consistent GPU utilization, as seen for inception_v4 and Efficientnet_b0, may suggest efficient resource usage or a decreased reliance on sporadic high-intensity operations. More GPU variability is seen in models like deit3_base_patch16_224, which may be due to their resource-intensive training needs.

Models with stable usage patterns are generally more predictable in production settings. In contrast, models with greater variability in GPU utilization might be more susceptible to hardware availability issues. Low CPU utilization indicates that these training procedures are not CPU-bound.

Figure 17 displays GPU memory allocation (%). The X-axis represents the model training time, while the Y-axis indicates the percentage of GPU memory utilized by each model during training.

The model with the lowest GPU memory utilization among the assessed architectures is Inception_v4 (purple line), which continuously utilizes less than 20% of GPU RAM. Efficientnet_b0 (orange line) exhibits consistent behavior during training, maintaining a steady memory allocation of about 40%. With consistent performance comparable to Efficientnet_b0, Convnextv2_tiny (dark green line) uses between 30% and 40% of GPU RAM.

Xception41 (yellow line) uses the most GPU memory, roughly 40%, with continuous usage. Deit3_base_patch16_224 (pink line) exhibits consistent behavior over time and consumes 20% of GPU RAM. Swin_tiny_patch4_window7_224 (bright green line) likewise uses about 20% of GPU memory with little variation.

Models such as Inception_v4, deit3_base_patch16_224, and swin_tiny_patch4_window7_224 require less GPU memory (about 20% or less), making them more suitable for memory-constrained environments.

On the other hand, models such as Efficientnet_b0, Convnextv2_tiny, and Xception41 require about 40% of memory allocation, due either to their bigger batch sizes or more complex designs. Because they use less memory, Inception_v4, deit3_base_patch16_224, and swin_tiny_patch4_window7_224 are superior choices when it comes to memory efficiency.

Figure 18 shows GPU power usage over time, with the X-axis denoting the duration of model training and the Y-axis indicating the power consumption of the GPU (in watts) for each model during training. Inception_v4 (purple) exhibits the lowest power usage among all evaluated models, consistently remaining below 100 W. This behavior highlights its superior energy efficiency; however, this efficiency may come at the expense of performance compared to more advanced architectures.

Similar to inception_v4, efficientnet_b0 (orange line) uses very little power. It almost always stays below 100 W. This architecture may be ideal for tasks requiring a balance between performance and energy efficiency.

Convnextv2_tiny (dark green line) shows significant variations in power usage, along with a maximum power consumption of about 300 W. As a result of this pattern, it can be inferred that GPU utilization is more dependent on data or key training moments.

Xception41 (yellow line) shows moderate consumption, between 100 W and 200 W. Compared to other models, it does not reach excessive peak levels.

Deit3_base_patch16_224 (pink line) has the highest power consumption, occasionally coming close to 300 W. This model prioritizes performance despite the high energy cost.

Finally, swin_tiny_patch4_window7_224 (bright green) exhibits significant power consumption peaks (≈300 W), though they occur less often than with deit3_base_patch16_224. It is a powerful but less energy-efficient variant.

Low and steady energy usage make inception_v4 and efficientnet_b0 the most efficient.

The complexity and processing power of newer models such as deit3_base_patch16_224 and swin_tiny_patch4_window7_224 may explain their more significant power usage.

In this investigation, the most energy-efficient model is inception_v4, with low power usage (<100 W) and stability, as shown in Figure 18.

deit3_base_patch16_224 and swin_tiny_patch4_window7_224 are the most powerful solutions for optimal performance independent of energy cost, but they need more resources.
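The qualitative comparison of power traces can be made quantitative by summarizing each trace with its mean, peak, dispersion, and total energy. The sketch below uses hypothetical traces sampled every 10 s, not the measurements of Figure 18, and the helper name `summarize_power` is illustrative.

```python
import statistics

def summarize_power(trace):
    """trace: list of (seconds, watts) samples sorted by time."""
    watts = [w for _, w in trace]
    # Trapezoidal rule: energy in joules is the integral of power over time.
    joules = sum((t1 - t0) * (w0 + w1) / 2
                 for (t0, w0), (t1, w1) in zip(trace, trace[1:]))
    return {
        "mean_w": statistics.mean(watts),
        "peak_w": max(watts),
        "std_w": statistics.pstdev(watts),
        "energy_wh": joules / 3600,
    }

# Hypothetical traces: one steady below 100 W, one with peaks near 300 W.
steady = [(t, 90) for t in range(0, 60, 10)]
bursty = [(0, 120), (10, 300), (20, 110), (30, 290), (40, 100), (50, 280)]

for name, trace in (("steady", steady), ("bursty", bursty)):
    print(name, summarize_power(trace))
```

Reporting energy in watt-hours alongside the peak makes it possible to distinguish a model that draws high power briefly from one that does so for the whole training run.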

4.6. Grad-CAM

While the results above illustrate the usefulness of specific models, model behavior can also be represented graphically [54]. The heatmap produced by the Grad-CAM approach defines the focal area and helps visualize the regions that the classification model examines for a specific prediction [48]. This approach has been used in other works, including [14,38,54,72,75]. Grad-CAM helps users gain insight into how a model behaves in general and how it might be improved. Predictions based on convolutional layers are highly sensitive to the specific computed gradients; these preserve spatial information from the interpreted features of a region of interest, provide a basis for error rate calculation, and support precision when the final result is incorporated into diagnostic interpretation. The analysis of an individual patient's brain tissue is preceded by computing integrated gradients with respect to a given baseline in the tissue images, which are then accumulated to determine the change. The gradients at different points along a path from the baseline to the input image are averaged and multiplied by the difference between the input and the baseline to determine the integrated gradient for each image feature. The resulting feature attributions sum to the difference between the model's output for the input and its output for the baseline. Figure 19 illustrates the Grad-CAM for each of the six models in the last run. These six representations emphasize the areas of the highest importance for predicting the LGG class. Clinical interpretation was made according to intensity, density, space, volume, and tissue diffusion, all of which are routine considerations in neuroradiology visual analysis procedures [76].
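For reference, the two attribution schemes mentioned above are usually written as follows. The notation is the standard one from the literature and is an assumption here, since the text does not define symbols: $A^{k}$ is the $k$-th feature map of the chosen convolutional layer, $y^{c}$ the score for class $c$, $Z$ the number of spatial positions, $F$ the model output, $x$ the input image, and $x'$ the baseline.

```latex
% Grad-CAM: channel weights from spatially averaged gradients, then a ReLU-weighted sum
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}},
\qquad
L_{\text{Grad-CAM}}^{c} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^{c} A^{k} \right)

% Integrated gradients: path-averaged gradient times (input - baseline)
\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1}
\frac{\partial F\left(x' + t\,(x - x')\right)}{\partial x_i} \, dt,
\qquad
\sum_i \mathrm{IG}_i(x) = F(x) - F(x')
```

The last identity is the completeness property: the attributions add up to the difference between the model's output for the input and its output for the baseline.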

The inception_v4 model highlights some brain regions, as seen in Figure 19a, with the top left region showing noticeably high intensity. By focusing on specific areas, it can identify important LGG-related characteristics. In this image, intensity, showing a well-defined and distinguished area of interest; space, showing scarce possibility for localization but providing a possible cortical lobe approach; volume; and tissue diffusion indicate a possible low-grade brain tumor due to its large diffusion and low space accuracy.

The xception41 model’s heatmap is more dispersed in Figure 19b, with its focal points mainly centered in the top center and diffused throughout several locations. Accordingly, the model may concentrate on fewer specific characteristics and employ a more comprehensive detection strategy. Intensity facilitates a clear depiction of the region of interest, while density and space support localization accuracy, volume suggests the plausible clinical condition of the gray matter according to temporal properties, and tissue diffusion remains well-defined in the outcome.

The convnextv2_tiny model is shown in Figure 19c, where the highlighted regions are shifted laterally and bright colors emerge close to the inferior border of the brain. The emphasis on ancillary characteristics suggests a distinctive but imprecise method of detecting the target. Intensity and density remain high in the areas of interest, with cortical detection in terms of space and a large volume of identified tissue in the gray matter, with possible low-grade diffusion between both cerebral hemispheres.

Figure 19d shows a high-intensity activation that is mostly concentrated in a central location within the swin_tiny_patch4_window7_224 model. Based on the heatmap’s clear resolution, this model is able to accurately locate relevant locations. This run presents the most pronounced reduction in intensity and space, identifying large tissue diffusion, which may suggest white matter malignant tissue formation.

Compared to other models, Figure 19e, which corresponds to efficientnet_b0, displays scattered points of focus with less intensity. This could be a sign of a broader, more generalized classification strategy, accompanied by a clear density spectrum of notable tissue volume located in the medial interhemispheric and well-delimited left fronto-temporal regions.

The deit3_base_patch16_224 model illustrates regions of consistent attention throughout the brain (Figure 19f). As a result, the model examines several crucial areas, suggesting a more exact but somewhat troublesome approach. There could be confounding factors for this detection strategy, resulting in regions showing several diffuse, large areas of high density in the brain, with notable high volume dimensions distributed in both hemispheres.

The models Inception_v4 (Figure 19a) and swin_tiny_patch4_window7_224 (Figure 19d) are found to be more accurate in areas related to LGGs. This is consistent with earlier assessments that emphasized them as high-performing and energy-efficient choices. Despite requiring more resources, deit3_base_patch16_224 (Figure 19f) is still a viable option.

The Grad-CAM for the inception_v4 model is displayed in Figure 20a,b, demonstrating its capacity to detect LGG- and HGG-relevant features in specific regions. The model shows a focal point for the low-grade class, indicating that it is successful in classifying low-grade tumors based on unique and localized characteristics. Because HGG tumors are aggressive and diffuse, the model finds more global patterns in the brain core for the HGG class. This finding might be indicative of how well the model adapts to the complex and wide-ranging patterns typical of high-grade malignancies.

The peripheral and constrained emphasis of the LGG heatmap is consistent with the defined, less invasive character of LGG tumors. The HGG heatmap, on the other hand, is larger and more centralized, demonstrating the widespread and invasive nature of HGG tumors.

The InceptionV4 model effectively differentiates between the two classes by generating attention maps that align with the expected anatomical and clinical characteristics of LGG and HGG malignancies. Its ability to focus on specific regions for LGGs and cover broader areas for HGGs underscores its suitability for this classification task.

As previously noted, transformer-based models can complement and even outperform CNNs in this classification task because self-attention aggregates information across the whole slice, capturing global contextual cues and long-range spatial relationships that primarily local convolutional receptive fields might overlook. This is consistent with our Grad-CAM results, which show that informative responses may span wider, dispersed patterns rather than being limited to a single compact location. In this regard, CNNs remain useful for local, texture-based characteristics, but attention-based backbones, mainly Vision Transformers (e.g., Swin Transformer, DeiT), are better suited to incorporating such global evidence.

4.7. Comparison with Other Similar Studies

The classification performance, energy efficiency, and resource consumption of InceptionV4, the best deep neural network identified in this study, were evaluated in comparison to other methods. The methods include specialized models such as SADO-Net, BrainNet, and Neuro-XAI, as well as popular architectures such as ResNet, VGG, and EfficientNet.

Table 5 compares recent studies, which mainly use BraTS datasets (2017–2023) and specialty datasets such as Kempanna, Br35H, and Kaggle. Most of the studies assessed address binary classification problems, which is the main focus of the comparison.

4.8. Real-Time Inference Benchmarking of the DL Models in Cloud and Edge Environments

This subsection evaluates real-time feasibility by benchmarking per-image inference latency for all trained architectures in both cloud and edge environments. To reflect a clinically realistic usage pattern, all measurements were performed with a batch size of 1, reporting the mean and standard deviation (Std Dev) of the inference time across repeated runs.
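A batch-size-1 latency measurement of this kind can be sketched as follows. The sketch is illustrative only: the warm-up count, the number of timed runs, and the stand-in workload `dummy_inference` are assumptions, since the text does not report how many repetitions were used.

```python
import statistics
import time
from typing import Callable, Tuple


def benchmark_latency_ms(
    infer_once: Callable[[], object],
    warmup: int = 10,
    runs: int = 100,
) -> Tuple[float, float]:
    """Return (mean, sample std dev) of per-call latency in milliseconds."""
    for _ in range(warmup):
        infer_once()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical stand-in for one forward pass on a single image.
def dummy_inference() -> int:
    return sum(i * i for i in range(10_000))


mean_ms, std_ms = benchmark_latency_ms(dummy_inference, warmup=3, runs=20)
```

When the timed call launches GPU work through PyTorch, the device should be synchronized (for example with `torch.cuda.synchronize()`) before each timer reading, because CUDA kernels execute asynchronously and the wall-clock interval would otherwise exclude part of the computation.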

For the cloud setting, inference was executed on Google Colab using an NVIDIA H100 (GH100, Hopper; 80 GB HBM3) under PyTorch FP16. For the edge setting, inference was executed on an NVIDIA Jetson Orin Nano (Ampere, 16 GB LPDDR5) running Ubuntu 20.04 with NVIDIA JetPack in the 7 W power mode. On the Jetson platform, five architectures (xception41, inception_v4, efficientNet_B0, swin_tiny_patch4_window7_224, and deit3_base_patch16_224) were benchmarked using a TensorRT-optimized FP16 engine (TensorRT v10.3.0) within our in-house containerized runtime to reflect deployment-oriented embedded inference. In contrast, convnextv2_tiny was benchmarked exclusively under PyTorch 2.4.0 FP16 due to runtime constraints preventing TensorRT execution for this specific model; for this case, the Jetson measurement was obtained using the Jetson Containers software environment [78].

Table 6 summarizes the resulting latencies. In the cloud benchmark, the lowest mean inference time was obtained by deit3_base_patch16_224 (2.98 ms), followed by convnextv2_tiny (4.11 ms), EfficientNet_B0 (4.38 ms), and Xception41 (4.81 ms), whereas Inception_v4 exhibits the highest latency (10.04 ms). On the edge computing device (embedded hardware), EfficientNet_B0 yields the lowest mean latency among the reported models (8.66 ms), while Inception_v4 (13.32 ms) and deit3_base_patch16_224 (20.36 ms) show higher inference times, highlighting the expected trade-off between architectural complexity and embedded execution constraints.

convnextv2_tiny was benchmarked on the Jetson Orin Nano using PyTorch 2.4.0 FP16 (rather than TensorRT) because a TensorRT FP16 engine could not be generated for this architecture with our setup. Therefore, its reported edge latency (27.96 ms) should be interpreted as a PyTorch-based embedded reference, while the remaining five models reflect TensorRT-optimized FP16 inference. Even so, this measurement remains useful as a practical baseline of how long the model takes to run without TensorRT acceleration on an edge computing device. Overall, deit3_base_patch16_224 is the fastest model in the cloud setting, whereas EfficientNet_B0 achieves the best latency on the edge device.
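To put these latencies in perspective, the short sketch below converts the mean values quoted in the text into an implied single-stream throughput and an edge-to-cloud slowdown factor. Only the figures stated above are used; models whose edge latency is not quoted in the text are omitted, and the helper `images_per_second` is a name introduced here for illustration.

```python
# Mean batch-size-1 latencies (ms) quoted in the text for Table 6.
cloud_ms = {
    "deit3_base_patch16_224": 2.98,
    "convnextv2_tiny": 4.11,
    "efficientnet_b0": 4.38,
    "xception41": 4.81,
    "inception_v4": 10.04,
}
edge_ms = {
    "efficientnet_b0": 8.66,
    "inception_v4": 13.32,
    "deit3_base_patch16_224": 20.36,
    "convnextv2_tiny": 27.96,  # PyTorch FP16 on the Jetson, not TensorRT
}


def images_per_second(latency_ms: float) -> float:
    """Throughput implied by a per-image latency at batch size 1."""
    return 1000.0 / latency_ms


edge_fps = {name: images_per_second(ms) for name, ms in edge_ms.items()}
# Edge-to-cloud slowdown for the models with both measurements quoted.
slowdown = {name: edge_ms[name] / cloud_ms[name] for name in edge_ms}
fastest_cloud = min(cloud_ms, key=cloud_ms.get)
fastest_edge = min(edge_ms, key=edge_ms.get)
```

The slowdown factors should be read with care, since the cloud figures come from PyTorch FP16 while most edge figures come from TensorRT FP16 engines, so the ratio mixes hardware and runtime effects.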

4.9. Scenario of Real-World Usage

An end-to-end, cloud-native clinical workflow for AI-assisted brain MRI study analysis is shown in Figure 21. It is divided into five successive stages connected by directional arrows. First, imaging data are obtained during standard MRI scanning at the point of care. Representative brain images are displayed to illustrate the kind of radiological input entering the pipeline. Second, the collected data are moved to a cloud storage layer (shown as a cloud database), which allows for remote access across clinical sites and serves as the central repository for further processing.

Third, an API-driven orchestration layer initiates automatic processing when data are saved. An API icon, a webhook/event sign, and gears graphically represent this step, emphasizing that the process may be started programmatically (for example, upon upload) and managed by modular services. A security icon is included to highlight that this phase is carried out in a controlled setting that upholds access restrictions and facilitates the safe management of clinical data. In this approach, preprocessing is expected to operate on multiple slices, matching real-world pipelines where studies are broken down into 2D inputs or selected slice sets before inference.

As a fourth step, the processed inputs are routed to a cloud GPU inference module (GPU server icon), representing elastic computing resources that are typical of commercial cloud providers (e.g., AWS, GCP, or comparable platforms). As part of this stage, AI outputs are generated, as are interpretability artifacts (such as heatmap overlays), which are intended to facilitate model transparency during clinical review. It is important to emphasize that this component is positioned as a computational infrastructure for the generation of inference and explanation rather than an autonomous clinical component.

Fifth, standard clinical workstations or mobile devices can access the results via a web-based visualization interface (monitor and clinician review images). Here, the AI system is presented as a clinical decision support tool that speeds up review by providing visual evidence and summarizing model results. The radiologist remains responsible for interpretation and for the final decision. This positioning matches real-world deployment constraints, where AI is integrated into standard clinical reporting pathways through a cloud-accessible interface and increases workflow speed and consistency without replacing expert judgment. This cloud-based implementation opens the possibility of multi-user systems with separate patient databases for each hospital.
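The upload-triggered stages of this workflow can be sketched as a single event handler. This is only an illustration under stated assumptions: `on_upload`, `preprocess`, `StudyResult`, the mean-probability aggregation across slices, the 0.5 decision threshold, and the heatmap path pattern are all hypothetical and do not come from the paper, and `classify` stands in for any of the trained models.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Slice = List[float]  # a flattened 2D slice, for illustration only


@dataclass
class StudyResult:
    study_id: str
    predicted_grade: str  # "HGG" or "LGG"
    confidence: float
    heatmaps: List[str] = field(default_factory=list)
    # The output is decision support; a radiologist makes the final call.
    status: str = "pending_radiologist_review"


def preprocess(slices: List[Slice]) -> List[Slice]:
    """Placeholder per-slice min-max normalisation."""
    out = []
    for s in slices:
        lo, hi = min(s), max(s)
        span = (hi - lo) or 1.0  # avoid division by zero on flat slices
        out.append([(v - lo) / span for v in s])
    return out


def on_upload(
    study_id: str,
    slices: List[Slice],
    classify: Callable[[Slice], float],
) -> StudyResult:
    """Run when a study lands in cloud storage (stages three and four)."""
    inputs = preprocess(slices)
    probs = [classify(s) for s in inputs]  # P(HGG) for each slice
    p_hgg = sum(probs) / len(probs)        # assumed aggregation rule
    grade = "HGG" if p_hgg >= 0.5 else "LGG"
    confidence = p_hgg if grade == "HGG" else 1.0 - p_hgg
    heatmaps = [f"{study_id}/slice_{i}_gradcam.png" for i in range(len(inputs))]
    return StudyResult(study_id, grade, confidence, heatmaps)


# Hypothetical classifier: mean normalised intensity as a stand-in score.
result = on_upload(
    "study-001",
    [[0.0, 10.0, 20.0], [5.0, 5.0, 30.0]],
    classify=lambda s: sum(s) / len(s),
)
```

Keeping the result in a `pending_radiologist_review` state mirrors the decision-support role described above, in which the model output is never released as a final diagnosis.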

4.10. Limitations of the Study

In this study, a key limitation was the lack of sufficiently large and clinically diverse datasets that matched the target classification task. Although BraTS 2019 provides a well-established benchmark, access to current, multi-center imaging data remains challenging due to privacy constraints, acquisition heterogeneity, and curation and annotation costs. In this context, open-source platforms (e.g., Kaggle and similar repositories) have become valuable for facilitating data sharing and reproducibility; however, they may still provide a limited representation of real-world clinical settings. Our assessment is predicated on a 2D best-slice approach, which minimizes computational load but may exclude clinically significant inter-slice context, including 3D spatial continuity, tumor size, and heterogeneity between slices. As a result, the claimed performance may not immediately translate to multi-slice or complete volumetric MRI analysis and should instead be considered proof of viability in a reduced scenario.

Another significant limitation concerns computational complexity. Training and evaluating modern deep learning architectures can be resource-intensive, which may restrict reproducibility in low-resource environments. Cloud-based solutions such as the free edition of Google Colab can partially address this barrier for prototyping and educational use, but large-scale experimentation and practical deployment still typically require more consistent access to high-performance hardware (GPU-enabled workstations or dedicated edge devices) to meet time and efficiency constraints in applied scenarios.

Finally, Grad-CAM supports interpretability but does not constitute clinical validation.

4.11. Future Work

Enhancing generalizability by increasing the size and variety of the data will be the main focus of future research. Although BraTS 2019 offers a well-established multi-institutional benchmark, future work will include additional public glioma MRI cohorts and, where practical, multi-center clinical data in order to better represent real-world diversity in acquisition techniques, scanners, and patient demographics. The pipeline will be extended to include multi-slice and 3D volumetric learning (such as automated slice/volume selection and multi-modal fusion).

Future research will also focus on reducing computational complexity to facilitate practical deployment. To achieve lower memory and energy requirements while maintaining accuracy, we will investigate model compression strategies and optimize inference via efficient runtimes and hardware-aware tuning. With this approach, experimentation will be more accessible in low-resource settings, and inference on embedded devices will be feasible in real time or near-real time.

Quantitative overlap with tumor masks/segmentation and expert-driven validation of the Grad-CAM will be included to expand the interpretability of the study. To assess the clinical plausibility and consistency of the highlighted regions with tumor-related patterns, we specifically plan to collaborate with neuro-radiologists and neuro-oncology specialists. We will also use structured protocols (such as region-of-interest review and inter-rater reliability) to measure agreement. In addition, future work will incorporate a clinically oriented error analysis on new, unseen patients, where HGG-LGG misclassifications will be reviewed by specialists to identify failure modes and improve robustness.
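One way the planned quantitative overlap could be measured is to threshold the Grad-CAM heatmap and compare it with the tumor segmentation mask using the Dice coefficient and intersection over union (IoU). The sketch below is an assumption about how such a check might look, not a description of the authors' protocol; the 0.5 threshold and the toy 4 × 4 arrays are illustrative.

```python
from typing import List, Tuple


def binarize(heatmap: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Mark pixels whose normalised activation reaches the threshold."""
    return [[1 if v >= threshold else 0 for v in row] for row in heatmap]


def dice_and_iou(pred: List[List[int]], mask: List[List[int]]) -> Tuple[float, float]:
    """Dice coefficient and IoU between two binary masks of equal shape."""
    inter = sum(p & m for pr, mr in zip(pred, mask) for p, m in zip(pr, mr))
    pred_area = sum(map(sum, pred))
    mask_area = sum(map(sum, mask))
    union = pred_area + mask_area - inter
    # Two empty masks are treated as a perfect match.
    dice = 2.0 * inter / (pred_area + mask_area) if (pred_area + mask_area) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# Toy normalised heatmap and tumor mask (hypothetical values).
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.2],
    [0.0, 0.1, 0.6, 0.1],
]
tumor_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
dice, iou = dice_and_iou(binarize(heatmap), tumor_mask)
```

Because the score depends on the chosen threshold, reporting it over a range of thresholds would give a more robust picture than a single cut-off.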

Neurodevelopment and neuropathologies can be influenced by cultural and biosocial factors, such as ethnicity/race, gender, and nutrition, among others. Future work will improve the incorporation of demographic diversity among patients to reduce biases and performance gaps between different groups. To achieve this, it will be crucial to expand the dataset with information from more medical institutions and different regions of the world.

5. Conclusions

This paper proposed a novel DL-based method that utilizes six distinct DL models trained using parameter optimization: inception_v4, xception41, convnextv2_tiny, swin_tiny_patch4_window7_224, efficientnet_b0, and deit3_base_patch16_224. This work demonstrates that DL approaches can classify MRI images of gliomas as high-grade or low-grade. As shown in Table 4, these models demonstrated superior classification ability even when the training data were relatively limited (25% of the total) and the class distribution was unbalanced.

With an accuracy of 99.40% and an F1-Score of 99.40%, the deit3_base_patch16_224 model is the most accurate, exhibiting a remarkable balance between precision and recall. It is followed by inception_v4, which has an accuracy of 99.21% and an F1-Score of 99.22%, and xception41, which offers dependable performance with 99.18% and 99.16%, respectively.
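For reference, the relationship between accuracy, precision, recall, and the F1-Score in a binary task can be written out directly. The confusion-matrix counts below are hypothetical and are not taken from the paper, which may also average the per-class scores in a way this sketch does not reproduce.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2.0 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts with HGG as the positive class.
metrics = binary_metrics(tp=90, fp=5, fn=10, tn=95)
```

Because F1 ignores true negatives while accuracy counts them, the two values can differ slightly, particularly when the classes are unbalanced.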

The models swin_tiny_patch4_window7_224 and convnextv2_tiny achieved accuracies of 99.14% and 98.12% and competitive F1-Scores of 99.09% and 98.08%, respectively. In this analysis, efficientnet_b0 has the lowest accuracy, 95.27%, with an F1-Score of 95.36%; however, due to its computational efficiency, it is still a good option.

Based on this proposed approach, inception_v4 displayed distinct, focused, and cohesive attention mappings for both LGG and HGG classes in Grad-CAM. Its comparatively low power consumption (<100 W) makes it practical for deployment in energy-constrained locations.

Due to their robust and evenly distributed attention maps, deit3_base_patch16_224 and swin_tiny_patch4_window7_224 are the best choices for applications that prioritize accuracy and performance. However, this choice must take into account the need for more modern hardware or a significant increase in computing resources.

Grad-CAM maps for LGG reveal confined attention patterns focusing on specific brain regions, consistent with the less invasive characteristics of low-grade tumors. Models such as inception_v4 and deit3_base_patch16_224 effectively capture these patterns. In contrast, HGG attention maps display a more global and centralized focus, demonstrating the models’ ability to recognize the aggressive and diffuse nature of high-grade tumors. Models like inception_v4 and swin_tiny_patch4_window7_224 exemplify these characteristics.

The optimal model selection was aided by the suggested approach. Inception_v4 would be the best option in clinical settings when quick deployment and energy efficiency are crucial. For advanced studies requiring peak accuracy and unlimited computing resources, deit3_base_patch16_224 would be a suitable option.

DNN research holds considerable promise for solving challenging classification problems in the future. This study focused specifically on resource-efficient solutions and transfer learning approaches for the classification of brain tumors in magnetic resonance images. For brain cancer specialists, the suggested work can be a useful clinical support tool that reduces their workload and increases diagnostic precision in urgent medical situations.

In conclusion, the suggested approach offers a variety of models with superior performance and resource consumption profiles, demonstrating a reliable, flexible solution for the categorization of brain tumors. This study enabled the choice of the appropriate model based on particular requirements: while deit3_base_patch16_224 is the best option for advanced studies that prioritize peak accuracy in scenarios with sufficient computing resources, Inception_v4 is the best option for clinical scenarios with limited computing resources.

Author Contributions

Conceptualization, E.I.-G. and L.J.-B.; data curation, O.A.A.-C. and J.J.E.-E.; formal analysis, C.T.-G. and G.M.G.-A.; funding acquisition, E.I.-G.; investigation, M.A.G.-G. and E.E.G.-G.; methodology, M.A.G.-G. and E.R.R.-A.; project administration, E.I.-G.; resources, E.E.G.-G.; software, M.A.G.-G. and E.R.R.-A.; supervision, E.I.-G. and L.J.-B.; validation, C.T.-G. and G.M.G.-A.; visualization, J.J.E.-E. and O.A.A.-C.; writing—original draft, M.A.G.-G.; writing—review and editing, E.I.-G. and E.E.G.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Universidad Autónoma de Baja California (UABC) through its 25th Internal Call for Research Projects under grant numbers 215/2/C/63/24 and 402/6/C/53/25. The authors also extend their gratitude to the Secretaría de Ciencia, Humanidades, Tecnología e Innovación (SECIHTI) for the scholarships awarded to M. A. Gómez-Guzmán and E. R. Ramos-Acosta.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Available at IEEE DataPort, Brats MICCAI Brain tumor dataset https://dx.doi.org/10.21227/hdtd-5j88.

Acknowledgments

We express our sincere gratitude to the Universidad Autónoma de Baja California (UABC) for the support provided to this research. We also thank SECIHTI for the scholarships awarded to M.A.G.-G. and E.R.R.-A.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study, in the collection, analysis, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

References

Sarala, B.; Sumathy, G.; Kalpana, A.V.; Hephzipah, J.J. Glioma brain tumor detection using dual convolutional neural networks and histogram density segmentation algorithm. Biomed. Signal Process. Control 2023, 85, 104859.
Disci, R.; Gurcan, F.; Soylu, A. Advanced brain tumor classification in MR images using transfer learning and pre-trained deep CNN models. Cancers 2025, 17, 121.
Nisa, Z.U.; Bhatti, S.M.; Jaffar, A.; Mazhar, T.; Shahzad, T.; Ghadi, Y.Y.; Almogren, A.; Hamam, H. Beyond accuracy: Evaluating certainty of AI models for brain tumour detection. Comput. Biol. Med. 2025, 193, 110375.
Poornam, S.; Angelina, J.J.R. BrainNeuroNet: Advancing brain tumor detection with hierarchical transformers and multiscale attention. Int. J. Inf. Technol. 2024, 16, 4749–4756.
Qureshi, S.A.; Sadiq, T.; Usman, A.; Khawar, A.; Shah, S.T.H.; Rehman, A.U.; Rasheed, A.; Hussain, M. SAlexNet: Superimposed AlexNet using residual attention mechanism for accurate and efficient automatic primary brain tumor detection and classification. Results Eng. 2025, 25, 104025.
Li, J.; Zhang, L.; Zhong, K.; Qian, G. A discrepancy-aware self-distillation method for multi-modal glioma grading. Knowl.-Based Syst. 2024, 295, 111858.
Bouguerra, O.; Attallah, B.; Brik, Y. MRI-based brain tumor ensemble classification using two stage score level fusion and CNN models. Egypt. Inform. J. 2024, 28, 100565.
Dutta, A.K.; Bokhari, Y.; Alghayadh, F.; Alsubai, S.; Sait, A.R.W. SADO-Net: A spatial adaptive dart optimized network model for an automated brain tumor diagnosis using MRIs. Alex. Eng. J. 2024, 109, 884–902.
Velayudham, A.; Kumar, K.M.; Priya, M.S.; Baskaran, R. Enhancing clinical diagnostics: Novel denoising methodology for brain MRI with adaptive masking and modified non-local block. Med. Biol. Eng. Comput. 2024, 62, 3043–3056.
ZainEldin, H.; Gamel, S.A.; El-Kenawy, E.-S.M.; Alharbi, A.H.; Khafaga, D.S.; Ibrahim, A.; Talaat, F.M. Brain tumor detection and classification using deep learning and sine-cosine fitness grey wolf optimization. Bioengineering 2022, 10, 18.
Sun, J.; Chen, K.; He, Z.; Ren, S.; He, X.; Liu, X.; Peng, C. Medical image analysis using improved SAM-Med2D: Segmentation and classification perspectives. BMC Med. Imaging 2024, 24, 241.
Farnoosh, R.; Noushkaran, H. Development of an unsupervised pseudo-deep approach for brain tumor detection in magnetic resonance images. Knowl.-Based Syst. 2024, 300, 112171.
Ullah, M.S.; Khan, M.A.; Almujally, N.A.; Alhaisoni, M.; Akram, T.; Shabaz, M. BrainNet: A fusion assisted optimal framework of residual blocks and stacked autoencoders for multimodal brain tumor classification. Sci. Rep. 2024, 14, 5895.
Sánchez-Moreno, L.; Perez-Peña, A.; Duran-Lopez, L.; Dominguez-Morales, J.P. Ensemble-based convolutional neural networks for brain tumor classification in MRI: Enhancing accuracy and interpretability using explainable AI. Comput. Biol. Med. 2025, 195, 110555.
Tan, S.; Cai, Y.; Zhao, Y.; Hu, J.; Chen, Y.; He, C. FM-LiteLearn: A lightweight brain tumor classification framework integrating image fusion and multi-teacher distillation strategies. In Proceedings of the International Conference on AI in Healthcare, Swansea, UK, 4–6 September 2024; Springer: Cham, Switzerland, 2024; pp. 89–103.
Kempanna, S.R.; Rangappa, A.A.; Maheshappa, S.; Siddaraju, D.K.; Gowda, K.P.; Ramachandragowda, S.K.; Tagare, T.S. Revolutionizing brain tumor diagnoses: A ResNet18 and focal loss approach to magnetic resonance imaging-based classification in neuro-oncology. Int. J. Electr. Comput. Eng. 2024, 14.
Saeed, Z.; Bouhali, O.; Ji, J.X.; Hammoud, R.; Al-Hammadi, N.; Aouadi, S.; Torfeh, T. Cancerous and non-cancerous MRI classification using dual DCNN approach. Bioengineering 2024, 11, 410.
Mathur, D.; Barnacle, B.D.; Magera, R.W.; Fazal, Z.; Zafar, A.M. System-based strategies for mitigating burnout in radiology. Emerg. Radiol. 2024, 31, 845–849.
Chatterjee, S.; Das, A. A novel systematic approach to diagnose brain tumor using integrated type-II fuzzy logic and ANFIS (adaptive neuro-fuzzy inference system) model. Soft Comput. A Fusion Found. Methodol. Appl. 2020, 24, 11731–11754.
Nair, A.; Ong, W.; Lee, A.; Leow, N.W.; Makmur, A.; Ting, Y.H.; Lee, Y.J.; Ong, S.J.; Tan, J.J.H.; Kumar, N.; et al. Enhancing Radiologist Productivity with Artificial Intelligence in Magnetic Resonance Imaging (MRI): A Narrative Review. Diagnostics 2025, 15, 1146.
Jayachandran, A.; Anisha, N. Multi-class brain tumor classification system in MRI images using cascades neural network. Comput. Intell. 2024, 40, e12687.
Höller, Y.; Butz, K.H.G.; Thomschewski, A.C.; Schmid, E.V.; Hofer, C.D.; Uhl, A.; Bathke, A.C.; Staffen, W.; Nardone, R.; Schwimmbeck, F.; et al. Prediction of Cognitive Decline in Temporal Lobe Epilepsy and Mild Cognitive Impairment by EEG, MRI, and Neuropsychology. Comput. Intell. Neurosci. 2020, 2020, 8915961.
Papadomanolakis, T.N.; Sergaki, E.S.; Polydorou, A.A.; Krasoudakis, A.G.; Makris-Tsalikis, G.N.; Polydorou, A.A.; Afentakis, N.M.; Athanasiou, S.A.; Vardiambasis, I.O.; Zervakis, M.E. Tumor diagnosis against other brain diseases using T2 MRI brain images and CNN binary classifier and DWT. Brain Sci. 2023, 13, 348.
Gunasekaran, S.; Bai, P.S.M.; Mathivanan, S.K.; Rajadurai, H.; Shivahare, B.D.; Shah, M.A. Automated brain tumor diagnostics: Empowering neuro-oncology with deep learning-based MRI image analysis. PLoS ONE 2024, 19, e0306493.
Aloraini, M.; Khan, A.; Aladhadh, S.; Habib, S.; Alsharekh, M.F.; Islam, M. Combining the transformer and convolution for effective brain tumor classification using MRI images. Appl. Sci. 2023, 13, 3680.
Ishfaq, Q.U.A.; Bibi, R.; Ali, A.; Jamil, F.; Saeed, Y.; Alnashwan, R.O.; Chelloug, S.A.; Muthanna, M.S.A. Automatic Smart Brain Tumor Classification and Prediction System Using Deep Learning. Sci. Rep. 2025, 15, 14876.
Sharma, P.; Nayak, D.R.; Balabantaray, B.K.; Tanveer, M.; Nayak, R. A survey on cancer detection via convolutional neural networks: Current challenges and future directions. Neural Netw. 2024, 169, 637–659.
Muftic, F.; Kadunic, M.; Musinbegovic, A.; Almisreb, A.A.; Jaafar, H. Deep learning for magnetic resonance imaging brain tumor detection: Evaluating ResNet, EfficientNet, and VGG-19. Int. J. Electr. Comput. Eng. 2024, 14.
Saeed, T.; Khan, M.A.; Ameer, H.; Shah, M.; Zada, K.W.; Fatimah, A.; Jaleel, L.; Jamel, B. Neuro-XAI: Explainable deep learning framework based on deeplabV3+ and Bayesian optimization for segmentation and classification of brain tumor in MRI scans. J. Neurosci. Methods 2024, 410, 110247.
Rasool, M.; Nafees, A.; Ghanem, H.; Ismail, N.A.; Wan, M.S.W.M. Brain tumor classification using deep learning: A state-of-the-art review. Eng. Technol. Appl. Sci. Res. 2024, 14, 16586–16594.
Montalbo, F.J.P. TUMbRAIN: A transformer with a unified mobile residual attention inverted network for diagnosing brain tumors from magnetic resonance scans. Neurocomputing 2025, 611, 128583.
Chen, J.; Mei, J.; Li, X.; Lu, Y.; Yu, Q.; Wei, Q.; Luo, X.; Xie, Y.; Adeli, E.; Wang, Y.; et al. TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers. Med. Image Anal. 2024, 97, 103280.
Rajpoot, R.; Jain, S.; Semwal, V.B. BioTransX: A novel bi-former based hybrid model with bi-level routing attention for brain tumor classification with explainable insights. Comput. Biol. Med. 2025, 195, 110515.
Nath, P.; Trivedi, N.K. Review and Evaluation of Existing Deep Learning Methods for Brain Tumor Detection. Cuest. Fisioter. 2025, 54, 622–630.
Bagban, T.I.; Pise, S.P.; Magdum, H.P.; Mathapati, P.M.; Koshti, P.N. Brain Tumor Detection from MRI Images using a Convolutional Neural Network (CNN) Approach. Cuest. Fisioter. 2025, 54, 5237–5247.
Al-Milaji, Z.; Yousif, H. Lightweight deep learning model optimization for medical image analysis. Int. J. Imaging Syst. Technol. 2024, 34, e23173.
Zheng, Y.; Huang, D.; Feng, Y.; Hao, X.; He, Y.; Liu, Y. CSF-Glioma: A causal segmentation framework for accurate grading and subregion identification of gliomas. Bioengineering 2023, 10, 887.
Hekmat, A.; Zhang, Z.; Khan, S.U.R.; Shad, I.; Bilal, O. An attention-fused architecture for brain tumor diagnosis. Biomed. Signal Process. Control 2025, 101, 107221.
Divya, B.; Nair, R.P.; Prakashini, K.; Menon, G.; Litvak, P.; Mandava, P.; Vijayasenan, D.; Rao, A. Generalizable DNN model for brain tumor sub-structure segmentation from low-resolution 2D multimodal MR images. Biomed. Signal Process. Control 2025, 100, 106916.
Rahman, T.; Islam, M.S.; Uddin, J. MRI-based brain tumor classification using a dilated parallel deep convolutional neural network. Digital 2024, 4, 529–554.
Khan, M.A.; Ashraf, I.; Alhaisoni, M.; Damaševičius, R.; Scherer, R.; Rehman, A.; Bukhari, S.A.C. Multimodal brain tumor classification using deep learning and robust feature selection: A machine learning application for radiologists. Diagnostics 2020, 10, 565.
Bentahar, H.; Djerioui, M.; Beghriche, T.; Zerguine, A.; Beghdadi, A. Customized CNN for multi-class classification of brain tumor based on MRI images. Arab. J. Sci. Eng. 2024, 49, 16903–16918.
Cui, H.; Ren, Z.; Xu, Z.; Liu, X.; Ding, J.; Guo, D. ResMT: A hybrid CNN-transformer framework for glioma grading with 3D MRI. Comput. Electr. Eng. 2024, 120, 109745.
Singh, S.; Saxena, V. A fine-tuned pre-trained model for classification of brain tumor using magnetic resonance imaging. Grenze Int. J. Eng. Technol. 2024, 10, 463–472.
Ashimgaliyev, M.; Bakhyt, M.; Alibek, B.; Yessirkepov, L.R.; Ainur, Z. Accurate MRI-based brain tumor diagnosis: Integrating segmentation and deep learning approaches. Appl. Sci. 2024, 14, 7281.
Abdelhamid, K.M.; Abdelgawad, I.A.; Abdelazeem, M.I.; Abdelrahman, K.A. Automatic brain tumor diagnosis using cascaded deep convolutional neural networks with symmetric U-Net and asymmetric residual blocks. Sci. Rep. 2024, 14, 9501.
Martínez-Del-Río-Ortega, R.; Civit-Masot, J.; Luna-Perejón, F.; Domínguez-Morales, M. Brain tumor detection using magnetic resonance imaging and convolutional neural networks. Big Data Cogn. Comput. 2024, 8, 123.
Xie, Y.; Zaccagna, F.; Rundo, L.; Testa, C.; Zhu, R.; Tonon, C.; Lodi, R.; Manners, D.N. IMPA-Net: Interpretable multi-part attention network for trustworthy brain tumor classification from MRI. Diagnostics 2024, 14, 997.
Agarwal, M.; Rani, G.; Kumar, A.; Kumar, P.; Manikandan, R.; Gandomi, A.H. Deep learning for enhanced brain tumor detection and classification. Results Eng. 2024, 22, 102117.
Ramaha, N.T.A.; Mahmood, R.M.; Hameed, A.A.; Fitriyani, N.L.; Alfian, G.; Syafrudin, M. Brain pathology classification of MR images using machine learning techniques. Computers 2023, 12, 167.
Gomaa, M.M.; Elabdeen, A.G.Z.; Elnashar, A.; Zaki, A.M. Brain tumor X-ray images enhancement and classification using anisotropic diffusion filter and transfer learning models. Int. J. Inf. Technol. 2024, 16, 3771–3779.
Wu, P.; Wang, Z.; Zheng, B.; Li, H.; Alsaadi, F.E.; Zeng, N. AGGN: Attention-based glioma grading network with multi-scale feature extraction and multi-modal information fusion. Comput. Biol. Med. 2023, 152, 106457.
Cheng, Y.; Zheng, Y.; Wang, J. CFNet: Automatic multi-modal brain tumor segmentation through hierarchical coarse-to-fine fusion and feature communication. Biomed. Signal Process. Control 2025, 99, 106876.
Ramos-Acosta, E.R.; García-Guerrero, E.E.; López-Bonilla, O.R.; Tamayo-Pérez, U.J.; Aguirre-Castro, O.A.; Ramírez-Rios, L.Y.; Inzunza-González, E. A novel system for the classification of zinc-plated components by benchmarking deep neural networks. Expert Syst. Appl. 2024, 255, 124866.
Gómez-Guzmán, M.A.; Jiménez-Beristain, L.; García-Guerrero, E.E.; Aguirre-Castro, O.A.; Esqueda-Elizondo, J.J.; Ramos-Acosta, E.R.; Galindo-Aldana, G.M.; Torres-Gonzalez, C.; Inzunza-Gonzalez, E. Enhanced Multi-Class Brain Tumor Classification in MRI Using Pre-Trained CNNs and Transformer Architectures. Technologies 2025, 13, 379.
Gómez-Guzmán, M.A.; Jiménez-Beristáin, L.; García-Guerrero, E.E.; López-Bonilla, O.R.; Tamayo-Pérez, U.J.; Esqueda-Elizondo, J.J.; Palomino-Vizcaíno, K.; Inzunza-González, E. Classifying brain tumors on magnetic resonance imaging by using convolutional neural networks. Electronics 2023, 12, 955.
Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; p. 887.
Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; PMLR: Cambridge, MA, USA; pp. 6105–6114. [Google Scholar]
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers and distillation through attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 10347–10357. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 16133–16142. [Google Scholar]
Bisong, E. Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Apress: Berkeley, CA, USA, 2019; pp. 59–64. [Google Scholar]
Rossum, G.V.; Drake, F.L. Introduction to Python 3: Python Documentation Manual Part 1; CreateSpace: Scotts Valley, CA, USA, 2009. [Google Scholar]
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 721. [Google Scholar]
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
Biewald, L. Weights & Biases. Software for Experiment Tracking with Weights and Biases. 2020. Available online: https://wandb.ai/site/ (accessed on 29 November 2025).
Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
Tan, C.; Sun, F.; Kong, T.; Zhang, W.; Yang, C.; Liu, C. A survey on deep transfer learning. In Proceedings of the 27th International Conference on Artificial Neural Networks (ICANN), Rhodes, Greece, 4–7 October 2018; Springer: Cham, Switzerland; pp. 270–279. [Google Scholar]
Bakas, S. BRATS MICCAI Brain Tumor Dataset; IEEE Dataport: Piscataway, NJ, USA, 2020. [Google Scholar] [CrossRef]
Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef] [PubMed]
Nazir, M.I.; Akter, A.; Wadud, M.A.H.; Uddin, M.A. Utilizing customized CNN for brain tumor prediction with explainable AI. Heliyon 2024, 10, e38997. [Google Scholar] [CrossRef] [PubMed]
Yeboah, D.; Dequan, L.; Agordzo, G.K. Enhancing brain MRI data visualization accuracy with UNET and FPN networks. Biomed. Signal Process. Control 2024, 96, 106418. [Google Scholar] [CrossRef]
Theissler, A.; Vollert, S.; Benz, P.; Meerhoff, L.A.; Fernandes, M. ML-ModelExplorer: An explorative model-agnostic approach to evaluate and compare multi-class classifiers. In Proceedings of the 4th International Cross-Domain Conference on Machine Learning and Knowledge Extraction (CD-MAKE), Dublin, Ireland, 25–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 281–300. [Google Scholar]
Haque, R.; Khan, M.A.; Rahman, H.; Khan, S.; Siddiqui, M.I.H.; Limon, Z.H.; Swapno, S.M.M.R.; Appaji, A. Explainable deep stacking ensemble model for accurate and transparent brain tumor diagnosis. Comput. Biol. Med. 2025, 191, 110166. [Google Scholar] [CrossRef] [PubMed]
West, J. Quantitative Magnetic Resonance Imaging of the Brain: Applications for Tissue Segmentation and Multiple Sclerosis; Linköpings Universitet: Linköping, Sweden, 2014. [Google Scholar]
Yu, Z.; Li, X.; Li, J.; Chen, W.; Tang, Z.; Geng, D. HSA-net with a novel CAD pipeline boosts both clinical brain tumor MR image classification and segmentation. Comput. Biol. Med. 2024, 170, 108039. [Google Scholar] [CrossRef] [PubMed]
Franklin, D. Jetson Containers: Machine Learning Containers for Jetson and JetPack, NVIDIA, Software. Available online: https://github.com/dusty-nv/jetson-containers (accessed on 10 January 2026).

Figure 1. Illustration of the proposed brain tumor classification strategy.

Figure 2. Sample of the MRIs in the dataset.

Figure 3. Training-set class distribution for the 2D best-slice dataset. The bar plot reports the number of images per class (HGG = 1074, LGG = 1111), and the donut chart shows the corresponding proportions (49.2% vs. 50.8%) for a total of 2185 training images.

Figure 4. Dataset balance indicators for the training split. The imbalance ratio (IR = 1.03) and entropy balance (EB = 1.000) confirm an almost perfectly balanced class distribution between HGG and LGG images in the constructed 2D training set.
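
As a quick check on the indicators in Figure 4, the sketch below recomputes them from the class counts given in Figure 3. It assumes that the imbalance ratio is the majority count divided by the minority count, and that entropy balance is the Shannon entropy of the class proportions divided by its maximum (the logarithm of the number of classes). The caption does not restate the exact definitions, so both formulas are assumptions, and the function names are illustrative only.

```python
import math

def imbalance_ratio(counts):
    """Majority-class count divided by minority-class count (assumed definition)."""
    return max(counts.values()) / min(counts.values())

def entropy_balance(counts):
    """Shannon entropy of the class proportions, normalized to [0, 1] (assumed definition)."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))

# Class counts reported in Figure 3.
counts = {"HGG": 1074, "LGG": 1111}
ir = imbalance_ratio(counts)
eb = entropy_balance(counts)
print(f"IR = {ir:.2f}, EB = {eb:.3f}")
```

Under these assumed definitions the values round to IR = 1.03 and EB = 1.000, which agrees with the figures quoted in the caption.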

Figure 5. Sample of the MRIs after the data augmentation stage.

Figure 6. K-fold cross-validation, with k = 3 used for each architecture throughout the training and validation phases.
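
The splitting scheme in Figure 6 can be sketched as follows. This is a minimal, unstratified k-fold split written with the standard library. Whether the authors stratified by class, which seed they used, and which library produced their folds are not stated in the caption, so `k_fold_indices`, the seed, and the use of 2185 samples (the training-set size from Figure 3) are assumptions for illustration.

```python
import random

def k_fold_indices(n_samples, k=3, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=2185, k=3))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={len(train_idx)}, val={len(val_idx)}")
```

With k = 3, each model is trained three times, and every image serves as validation data in exactly one of the three runs.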

Figure 7. The parallel chart depicts hyperparameter variability across all executions. Training loss is an important target while training the models, while test accuracy measures performance. Test accuracy is reported for post-hoc comparison only; the test set was not used for hyperparameter tuning or model selection.

Figure 8. A parallel chart illustrating hyperparameter variations, filtering with a dataset size of 25% and an initial learning rate set at $1 \times 10^{-5}$. Training loss is an important objective during training, while test accuracy is used as the evaluation metric. Test accuracy is reported for post-hoc comparison only; the test set was not used for hyperparameter tuning or model selection.

Figure 9. Boxplot distributions of test-set metrics for all evaluated models. To show variability and robustness, test accuracy, F1-Score, precision, recall, and MCC are summarized over multiple runs.

Figure 10. Accuracy, training loss, and validation accuracy during the training stage. (a) Training accuracy, (b) training loss, and (c) validation accuracy.

Figure 11. Accuracy versus the number of parameters for each tested architecture.

Figure 12. Ranking of models generated using the ML-ModelExplorer tool.

Figure 13. An evaluation of the accuracy and Std Dev of recall in DL models. The chart was generated with the help of the ML-ModelExplorer application.

Figure 14. This boxplot depicts F1-Score, precision, and recall distributions across CNN and transformer models.

Figure 15. Error rates per class (Low Grade and High Grade) for CNN models.

Figure 16. Resources used throughout the training process. (a) GPU usage during training and (b) CPU usage during training.

Figure 17. GPU memory allocation during the training phase.

Figure 18. GPU power consumption throughout the training procedure.

Figure 19. Results of the Grad-CAM approach applied to the trained DL models for the LGG class: (a) Inception_v4, (b) Xception41, (c) Convnextv2_tiny, (d) Swin_tiny_patch4_window7_224, (e) Efficientnet_b0, and (f) Deit3_base_patch16_224.

Figure 20. Visualization of the Grad-CAM results for the inception_v4 model. (a) Regions of interest identified for LGG classifications and (b) regions of interest identified for HGG classifications.
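
For readers unfamiliar with how the heatmaps in Figures 19 and 20 are formed, the standard Grad-CAM formulation is summarized below. It is the general definition of the method, not a description of this paper's specific implementation. Here $A^k$ is the $k$-th feature map of the chosen layer, $y^c$ is the pre-softmax score for class $c$ (LGG or HGG), and $Z$ is the number of spatial positions in the feature map. Which layer was used for each architecture, and how the token outputs of the transformer models were reshaped into a spatial grid, are not stated in these captions.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k},
\qquad
L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c A^k \right)
```

The ReLU keeps only the regions that raise the score of class $c$, which is why the maps highlight areas supporting the predicted grade rather than areas arguing against it.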

Figure 21. Cloud-based clinical workflow for AI-assisted brain MRI analysis. The pipeline illustrates (i) MRI acquisition, (ii) secure cloud storage of imaging data, (iii) API-driven triggering and preprocessing (slice extraction/standardization) under access-control safeguards, (iv) scalable GPU-based inference in the cloud (e.g., via AWS, GCP, or equivalent cloud providers) with generation of model outputs and visual explanations, and (v) web-based result visualization to support radiologist review and clinical decision-making.

Table 1. Image data augmentation settings.

Augmentation	Settings
Resize	Image dimensions resized to 224 × 224 pixels.
Rotation	Random rotations are applied with a maximum of 30°.
Brightness	Brightness adjusted randomly within a range of 0–20%.
Contrast	Contrast adjusted randomly within a range of 0–20%.
Saturation	Saturation adjusted randomly within a range of 0–20%.
Horizontal flip	Training subset horizontally flipped.
Vertical flip	Training subset vertically flipped.

Table 2. Hyperparameters employed during the training stage.

Parameter	Value
Epochs	10
Batch size	32, 64
Data size	10%, 25%, 50%, 75%, 100%
Initial learning rate	$1 \times 10^{- 5}$ , $1 \times 10^{- 4}$ , $1 \times 10^{- 3}$
Optimizer	Adam
Learning rate scheduler	ReduceLROnPlateau
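
To make the size of the search space in Table 2 concrete, the sketch below enumerates every combination of the varied values. It assumes a full grid over batch size, data size, and initial learning rate. The table does not say whether the sweep was exhaustive or sampled, and the key names (`batch_size`, `data_fraction`, `learning_rate`) and the helper `expand_grid` are hypothetical.

```python
from itertools import product

# Values listed in Table 2 that vary between runs.
search_space = {
    "batch_size": [32, 64],
    "data_fraction": [0.10, 0.25, 0.50, 0.75, 1.00],
    "learning_rate": [1e-5, 1e-4, 1e-3],
}
# Settings held fixed across all runs, also from Table 2.
fixed = {"epochs": 10, "optimizer": "Adam", "scheduler": "ReduceLROnPlateau"}

def expand_grid(space):
    """Return one dict per combination of the listed values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]

configs = [{**fixed, **combo} for combo in expand_grid(search_space)]
print(len(configs))  # 2 * 5 * 3 combinations per architecture
```

Under the full-grid assumption this gives 30 configurations per architecture, each of which would be repeated for every cross-validation fold.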

Table 3. Mean (Avg) and standard deviation (Std Dev) of test-set metrics across runs for each evaluated architecture.

Model	Accuracy (Std Dev)	Accuracy (Avg)	F1-Score (Std Dev)	F1-Score (Avg)	Precision (Std Dev)	Precision (Avg)	Recall (Std Dev)	Recall (Avg)	MCC (Std Dev)	MCC (Avg)
xception41	0.08131	0.95234	0.11762	0.94708	0.10590	0.95748	0.12730	0.94347	0.17361	0.90300
inception_v4	0.08856	0.94457	0.08131	0.94563	0.10935	0.93948	0.07101	0.96071	0.15503	0.89683
deit3_base_patch16_224	0.12155	0.90716	0.19821	0.88466	0.18682	0.90021	0.21111	0.88760	0.23914	0.81641
efficientnet_b0	0.12315	0.89334	0.12251	0.89713	0.13626	0.89341	0.12030	0.91079	0.24179	0.78871
swin_tiny_patch4_window7_224	0.21679	0.82140	0.40546	0.71160	0.40782	0.70462	0.41461	0.73814	0.43660	0.64144
convnextv2_tiny	0.22516	0.80541	0.38611	0.72581	0.39435	0.70500	0.39818	0.76957	0.44807	0.61111

Table 4. Performance results of the evaluated models with an initial learning rate of $1 \times 10^{-5}$, a batch size of 64, and a training dataset size of 25%.

Model	Accuracy	Precision	Recall	F1-Score	MCC
deit3_base_patch16_224	0.99401	0.99201	0.99610	0.99403	0.98807
inception_v4	0.99212	0.99126	0.99325	0.99222	0.98431
xception41	0.99175	0.99025	0.99316	0.99163	0.98360
swin_tiny_patch4_window7_224	0.99139	0.99279	0.98907	0.99092	0.98275
convnextv2_tiny	0.98125	0.98252	0.97922	0.98080	0.96256
efficientnet_b0	0.95274	0.96130	0.94625	0.95360	0.90564
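
The five metrics reported in Tables 3 and 4 all derive from the binary confusion matrix. The sketch below shows the standard definitions in plain Python. The counts in the usage line are hypothetical and are not taken from the paper, and the function omits guards for empty classes, which would cause a division by zero.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1-Score, and MCC from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # MCC uses all four cells, so it stays informative even when classes are imbalanced.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical counts for illustration only; not taken from the paper.
m = binary_metrics(tp=95, fp=2, fn=5, tn=98)
print({name: round(value, 5) for name, value in m.items()})
```

Because MCC ranges from -1 to 1 rather than 0 to 1, it is numerically lower than accuracy for the same predictions, which matches the pattern in both tables.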

Table 5. Comparison with other works in the same field.

Reference	Model	Dataset	Classes	Best Accuracy
[1]	Dual CNN method	BraTS-IXI	2	98.85%
[6]	ResNet-101, DenseNet-169, ConvNeXt-S	BRATS 2018, BRATS 2019	2	93.6%
[8]	SADO-Net	BRATS 2018, 2019, 2020	2	99.2%
[13]	BrainNet	BRATS 2020, 2021	2	99.9%
[16]	ResNet18	Kempanna dataset	2	95.54%
[17]	Dual DCNN	Br35H	2	99%
[24]	ConvNet-ResNeXt101	BRATS 2020	2	99.27%
[25]	TECNN-based model	BRATS 2018	2	96.75%
[28]	ResNet50, EfficientNet, VGG-19	Kaggle dataset	2	99.44%
[29]	Neuro-XAI	BRATS 2021	2	98%
[43]	ResMT	BRATS 2019	2	97.01%
[45]	AlexNet, VGG16, GoogleNet, ResNet18, ResNet50	BRATS 2018	2	97.47%
[46]	Cascaded CNN with Symmetric U-Net	BRATS 2017	2	99%
[47]	Proposed CNN model	Preet Viradiya dataset	2	97.5%
[51]	VGG19	Kaggle X-ray dataset	2	98.58%
[72]	Customized CNN	Br35H	2	98.67%
[77]	HSA-Net	BRATS 2021, BRATS 2023	2, 3	95.35%
This work	InceptionV4	BRATS 2019	2	99.21%

Table 6. Inference times. Cloud results were measured using PyTorch FP16 on Google Colab (batch size = 1). Edge results (embedded hardware) were measured on a Jetson Orin Nano (batch size = 1) using a TensorRT-optimized FP16 engine (v10.3.0) when available; convnextv2_tiny was measured under PyTorch 2.4.0 FP16 due to runtime constraints. Therefore, the inference times reflect deployment-oriented runtimes rather than identical inference backends.

Model	Google Colab Mean (ms)	Google Colab Std (ms)	Jetson Orin Nano Mean (ms)	Jetson Orin Nano Std (ms)
xception41	4.81	0.29	11.52	2.32
inception_v4	10.04	0.46	13.32	2.48
efficientnet_b0	4.38	0.34	8.66	2.60
convnextv2_tiny	4.11	0.17	27.96	1.68
swin_tiny_patch4_window7_224	5.57	0.35	11.26	2.03
deit3_base_patch16_224	2.98	0.12	20.36	1.96
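
The latencies in Table 6 are easier to compare when converted to throughput and to an edge-to-cloud ratio. The sketch below performs that arithmetic on the mean values copied from the table. The helper `summarize` and the dictionary layout are illustrative, and images per second is simply the reciprocal of the batch-size-1 latency, which is an assumption about sequential processing.

```python
# Mean latencies (ms) copied from Table 6: (Google Colab, Jetson Orin Nano).
latency_ms = {
    "xception41": (4.81, 11.52),
    "inception_v4": (10.04, 13.32),
    "efficientnet_b0": (4.38, 8.66),
    "convnextv2_tiny": (4.11, 27.96),
    "swin_tiny_patch4_window7_224": (5.57, 11.26),
    "deit3_base_patch16_224": (2.98, 20.36),
}

def summarize(cloud_ms, edge_ms):
    """Convert mean latency to images per second and compute the edge/cloud ratio."""
    return {
        "cloud_fps": 1000.0 / cloud_ms,
        "edge_fps": 1000.0 / edge_ms,
        "edge_over_cloud": edge_ms / cloud_ms,
    }

summary = {name: summarize(*ms) for name, ms in latency_ms.items()}
# Print models from smallest to largest edge slowdown.
for name, s in sorted(summary.items(), key=lambda kv: kv[1]["edge_over_cloud"]):
    print(f"{name:30s} cloud={s['cloud_fps']:6.1f} img/s  "
          f"edge={s['edge_fps']:6.1f} img/s  ratio={s['edge_over_cloud']:.2f}x")
```

On these numbers inception_v4 shows the smallest slowdown when moving to the Jetson Orin Nano and deit3_base_patch16_224 the largest. As the caption notes, the backends differ between platforms and between models, so the ratios describe deployment runtimes rather than a like-for-like hardware comparison.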

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gómez-Guzmán, M.A.; Esqueda-Elizondo, J.J.; Jiménez-Beristain, L.; Galindo-Aldana, G.M.; Aguirre-Castro, O.A.; Ramos-Acosta, E.R.; Torres-Gonzalez, C.; García-Guerrero, E.E.; Inzunza-Gonzalez, E. Enhancing Glioma Classification in Magnetic Resonance Imaging Using Vision Transformers and Convolutional Neural Networks. Electronics 2026, 15, 434. https://doi.org/10.3390/electronics15020434

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
