Swin–YOLOv12: A Hybrid Transformer-Based Deep Learning Approach for Enhanced Real-Time Brain Tumor Detection in MRI Images

Tariq, Mubashar; Choi, Kiho

doi:10.3390/math14091447


by Mubashar Tariq and Kiho Choi *

Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin 17104, Gyeonggi-do, Republic of Korea

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(9), 1447; https://doi.org/10.3390/math14091447

Submission received: 1 April 2026 / Revised: 22 April 2026 / Accepted: 23 April 2026 / Published: 25 April 2026

(This article belongs to the Special Issue Applications of Artificial Intelligence in Biomedical Image Processing)


Abstract

Brain tumors (BTs) arise from the abnormal growth of cells within brain tissue and may spread rapidly, making them a major cause of mortality worldwide. Early detection of BTs remains highly challenging due to the brain’s complex structure and the heterogeneous nature of tumors. Magnetic Resonance Imaging (MRI) provides detailed information about tumor size, location, and shape, thereby supporting clinical decision-making for treatments such as chemotherapy, radiation therapy, and surgery. Traditional machine learning (ML) approaches mainly rely on manual feature extraction, whereas recent advances in Computer-Aided Diagnosis (CAD) and deep learning (DL) have enabled more accurate detection of small and complex tumor regions. To improve automated tumor detection, we propose a hybrid Swin–YOLO framework that combines the Swin Transformer (ST) with the latest CNN-based YOLOv12 model. In this framework, the Swin Transformer serves as the main backbone for feature extraction, while the Feature Pyramid Network (FPN) and Path Aggregation Network (PANet) are employed in the neck to better capture multi-scale features. For training, we used the publicly available Br35H dataset and applied data augmentation to enhance the model’s robustness and generalization capability. The experimental results show that the proposed framework achieved 99.7% accuracy, 99.4% mAP@50, and 87.2% mAP@50:95. Furthermore, we incorporated Explainable Artificial Intelligence (XAI) techniques, including Grad-CAM and SHAP, to improve the interpretability of the model by visually highlighting the tumor regions that contributed most to the prediction. In addition, we developed NeuroVision AI, a web-based application designed to support faster and more accurate clinical decision-making. Although the proposed model demonstrated strong performance on the dataset, these results should be interpreted within the context of the current experimental setting.

Keywords:

brain tumors; Magnetic Resonance Imaging; YOLOv12; vision transformers; Swin transformer; deep learning; Swin–YOLO; Explainable Artificial Intelligence

MSC:

68U10

1. Introduction

Cancer remains one of the most widely studied diseases worldwide. Brain tumors (BTs) are among its most dangerous forms and occur when abnormal cells grow uncontrollably within the brain [1]. Normally, the human body replaces old cells with new ones. However, when this natural process fails, abnormal cells can grow rapidly and form a tumor in the affected area [2]. The uncontrolled growth of tumors damages normal tissues and compresses surrounding cells, which may lead to cell death [3]. BTs are a serious global health concern that affects individuals of all age groups, including infants, children, young adults, and older adults [4]. According to a 2024 report, about 321,731 cases of primary malignant brain tumors were diagnosed worldwide in 2022, including approximately 173,699 men and 148,032 women [5]. Biopsy procedures for BTs are more complicated than those for other tumors because they require surgery [6]. Manual tumor diagnosis is often time-consuming and error-prone, which may place patients at serious risk. Therefore, medical professionals rely on various diagnostic methods, including neurological examinations, sample analysis, and digital screening [7].

Imaging modalities are becoming increasingly popular in medical diagnosis because they offer safe, accurate detection and minimize risks for patients. Common modalities used for the detection of BTs include Computed Tomography (CT) scans, X-rays, radiography, tomography, Magnetic Resonance Imaging (MRI), and echocardiography (ECHO) [8]. Unlike CT scans and X-rays, MRI provides a highly detailed visualization of brain structures without exposing patients to harmful radiation. It offers a more precise analysis, which is essential for assessing a patient’s condition [9]. Consequently, there has been a growing focus on artificial intelligence (AI), which has demonstrated remarkable ability across many domains. In particular, AI-driven applications have had a strong impact on medical imaging (MI), where they improve tumor diagnostic accuracy and support timely intervention [10]. In the past few years, machine learning (ML) has greatly improved Computer-Aided Diagnosis (CADx) systems in MI, especially for BT detection, leading to significant progress in accuracy and reliability [11]. ML techniques play an important role in identifying the most relevant features in MI, which helps make tumor diagnoses more accurate [12]. However, some ML methods have shown limitations in achieving high accuracy, generally due to weak predictive performance and the complex nature of medical data. As a result, many researchers have turned to other learning-based techniques to enhance detection accuracy [13] and have increasingly moved to advanced Convolutional Neural Network (CNN)-based approaches, which learn complex features without manual extraction, for the diagnosis of various medical conditions [14].

Deep learning (DL)-based approaches have shown significant success in analyzing medical images for BT detection [15]. Moreover, these systems have become valuable tools for enhancing and automating early tumor diagnosis, thereby reducing the need for direct human involvement [16]. These approaches not only assist in tumor detection and monitoring but also help doctors make informed decisions about suitable treatment options, ultimately improving patient care [17]. DL models often outperform traditional ML-based models because of their superior learning capacity and efficient feature extraction capabilities. Furthermore, they have demonstrated exceptional pattern recognition on large-scale MI datasets and can learn complex representations that separate healthy brain tissue from tumor regions [18]. Nevertheless, work continues on improving their efficiency, accuracy, and availability, giving medical professionals better tools to treat patients based on tumor detection results [19]. Currently, two fundamental approaches are used for the detection of BTs. The first is the “two-stage” method, in which CNNs such as Region-Based CNN (R-CNN) [20] and Faster R-CNN (FR-CNN) [21] are deployed for object identification, though they tend to be less efficient. The second is the “single-stage” method, represented by the “You Only Look Once” (YOLO) model, which is widely used in recent research. The YOLO series emphasizes real-time capability and improves accuracy on small objects by predicting their bounding boxes as a regression task [22]. YOLOv12 [23] uses a residual shortcut that connects the input and output of each block, with a small default scaling factor of 0.01. This design is similar to layer scaling, which is commonly used to improve the optimization of deep vision transformers. However, applying layer scaling only to the attention regions is not sufficient to fully address the optimization problem and may also increase inference latency. These observations indicate that the model’s stable convergence is influenced not only by the attention mechanism but also by the ELAN architecture itself. This further supports the effectiveness of the R-ELAN block proposed in YOLOv12. After incorporating this block, the larger YOLOv12-X model achieved its best performance, reaching an mAP@0.5 of 72.0 and an mAP@0.5:0.95 of 55.2.
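The scaled residual shortcut mentioned above can be illustrated with a minimal sketch. It assumes the block output is multiplied by the scaling factor (0.01, the default quoted above) before being added back to the input; `scaled_residual` and `toy_block` are hypothetical names used only for illustration and are not taken from the YOLOv12 code base.

```python
def scaled_residual(x, block, scale=0.01):
    """Return x + scale * block(x), element-wise, for a list of floats.

    Assumption: the shortcut adds the scaled block output to the input,
    in the spirit of layer scaling.
    """
    fx = block(x)
    return [xi + scale * fi for xi, fi in zip(x, fx)]


def toy_block(x):
    # Hypothetical stand-in for an attention/ELAN block: doubles each value.
    return [2.0 * xi for xi in x]


x = [1.0, -2.0, 0.5]
y = scaled_residual(x, toy_block)
print(y)  # values close to [1.02, -2.04, 0.51]
```

Because the scale is small, the output stays close to the input, which is one plausible reading of why such a shortcut helps stabilize early optimization.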

Vision Transformers (ViTs) have been widely applied to vision-based tasks such as image classification, instance segmentation, pose estimation, and detailed tumor detection. They are effective at capturing long-range dependencies between image patches and extracting more informative visual patterns [24]. Among recent advancements, the Swin Transformer (ST) [25] is an important development that employs a hierarchical architecture with a shifted-window mechanism. In this approach, self-attention establishes connections among local windows, enabling the model to capture information from small tumor regions while maintaining the global context of BTs. Current BT detection models still face several challenges, including low efficiency, limited accuracy, and difficulty in handling large-scale complex data. To overcome these limitations, a ViT-based ST model was integrated into the STCNet backbone to enhance feature extraction and improve global contextual modeling. In addition, a PANet-based path aggregation framework with a residual structure was used to create the feature pyramid, thereby strengthening multi-scale and global feature representation [26]. The Swin-Small model outperformed the larger model at a lower computational cost, mainly because the larger model contains nearly four times more parameters. In addition, Swin-Small achieved faster inference, demonstrating superior overall efficiency. Both simple and extensive data augmentation strategies were implemented to improve model robustness [27]. An advanced DL-based model was proposed by integrating a Hybrid Shifted-Window Multi-Head Self-Attention (HSW-MSA) block into a refined framework. This integration improved classification accuracy, reduced memory consumption, and decreased training complexity. Furthermore, the conventional MLP was replaced with a residual-based MLP (ResMLP), resulting in higher accuracy, faster training, and better parameter efficiency [28].

Another study described Swin-MedNet, an ST-based framework for brain tumor (BT) diagnosis in medical imaging. Swin-MedNet utilizes a hierarchical ViT architecture with self-attention to effectively capture both local and global features while maintaining linear computational complexity. Moreover, its multi-stage encoder gradually merges patches, improving the scalability and efficiency of deep feature representation learning [29]. Furthermore, a hybrid deep learning (DL) method was introduced by integrating YOLOv11 with a transformer-based detection head, using the Swin Transformer (ST) as the backbone. Through transfer learning, the model leveraged pre-trained ST weights to achieve stronger feature extraction at a lower computational cost. In addition, the model was extensively evaluated on a brain tumor (BT) dataset with bounding box annotations and demonstrated effective detection performance. These findings highlight that the proposed model is highly suitable for medical diagnostic applications because of its enhanced tumor localization capability, faster training process, and higher classification accuracy [30].

Motivated by the limitations of earlier Swin- and YOLO-based frameworks, our proposed model integrates the latest CNN-based YOLOv12 architecture with the Swin Transformer (ST) to enhance global contextual feature learning while preserving efficient tumor localization. In particular, the ST backbone is designed to capture both fine local features and long-range contextual information, which are especially important for brain MRI images characterized by low contrast, irregular boundaries, and diverse tumor appearances. The multi-scale features extracted by the transformer are passed through a hybrid FPN + PANet neck, which strengthens cross-scale feature fusion by combining high-level semantic information with localization-sensitive low-level details. The refined pyramid representation is then fed into the YOLOv12 detection head to produce accurate tumor predictions and bounding box localization. Finally, after training, the best-performing model was selected, and XAI techniques, including Grad-CAM and SHAP, were applied to enhance transparency and verify whether the model focuses on clinically meaningful tumor-related regions. Therefore, the main contribution of the proposed Swin–YOLO approach lies in integrating the latest YOLOv12 detection framework with the contextual learning capability of the ST, supported by enhanced multi-scale feature fusion and interpretability analysis to achieve more accurate BT detection. Figure 1 presents sample MRI images from the Br35H dataset, including (a) abnormal brain scans containing tumor regions and (b) normal brain scans without visible abnormalities. This visual comparison shows the differences in tumor appearance and intensity that the proposed model relies on to distinguish between normal and abnormal cases.

In summary, the key contributions and innovations of this study are as follows:

Novel architecture design: This study presents a deep learning framework with strong real-time tumor detection ability and feature representation. The Swin–YOLO model builds on the strengths of DL for MI. Such methods are increasingly used in healthcare through CAD systems and help physicians make better decisions and treat tumor patients accurately at an early stage.
Customized CNN model: We replaced the YOLOv12 backbone with an ST, which extracts features through shifted-window self-attention and captures both fine and broad tumor boundaries in MRI scans. This integration improved the accuracy and reliability of tumor diagnosis.
Integrated FPN and PANet: Using an FPN and PANet in the YOLOv12 neck strengthens multi-scale tumor feature extraction for better localization and consistent recognition. This helps the model handle tumor properties such as varying shapes, sizes, and spatial distributions, and supports more accurate tumor localization, particularly in low-contrast or noisy MRI images, where traditional CNN-based methods struggle to recognize objects clearly.
Explainable AI (XAI) techniques: Grad-CAM and SHAP provide a more precise, interpretable, and clear visualization of the model’s predictions.

The remainder of the paper is organized as follows:

Section 2 reviews previous BT detection studies using ML, DL, and TL methods, identifying their limitations and motivating the proposed Swin–YOLO model. Section 3 describes how the proposed model combines YOLOv12 and the ST to improve MRI-based tumor detection. Section 4 describes the experimental training process. Section 5 discusses the results of the proposed method. Section 6 presents the XAI tools, such as Grad-CAM and SHAP, and the “NeuroVision AI” web app, along with a consideration of future work. Section 7 summarizes the overall conclusions of the study and the future research directions.

2. Related Work

Over the last two decades, MI research has achieved remarkable results owing to its wide range of applications in healthcare, particularly the diagnosis of BTs using MRI scans. In this section, we discuss ML, DL, transfer learning (TL), and YOLO-based approaches to tumor detection. Drawing on recent studies, we focus on the major advancements made in this field as well as the drawbacks that still exist. These techniques can help improve patient care and provide more reliable medical treatment.

2.1. Machine Learning Algorithm

Accurate BT detection is very important for improving treatment and contributes to reducing cancer-related deaths. Many ML algorithms have been applied to automate tumor detection in MRI scans. For instance, clustering approaches based on Particle Swarm Optimization (PSO) have been shown to accurately identify small tumor areas [31]. Additionally, the Discrete Wavelet Transform (DWT) is extensively used as a feature extraction technique, transforming images to capture the data essential for analysis. Principal Component Analysis (PCA) is commonly applied afterwards to reduce the dimensionality of the image features, yielding a clearer and more efficient process [32].

In BT detection, Support Vector Machines (SVMs) are among the most reliable supervised learning algorithms for classifying tumors, although their computational complexity leads to long training times on large datasets. They are suitable for multi-feature data and effectively handle complex tumor identification problems [33]. Random Forest (RF), an ensemble-based learning approach, improves detection performance by combining multiple decision trees and is used for challenging tumor diagnosis cases in medical imaging [34]. The K-Nearest Neighbors (KNN) algorithm is used in many tumor detection tasks; it makes decisions based on nearby data points and works efficiently with small datasets [35].

2.2. Deep Learning-Based Techniques

The rapid development of DL frameworks has significantly transformed MI studies by enabling highly complex feature extraction. Deep neural network (DNN) and transformer-based models are very efficient at representing the detailed spatial patterns in MRI data. In particular, for multiclass BT detection tasks, CNN-based methods have been shown to analyze large-scale image datasets effectively [36]. Integrating an attention mechanism within the CNN framework enables the models to highlight the relevant regions of interest (ROIs), thereby improving both segmentation and detection accuracy for tumor grading. Such approaches demonstrate the value of attention for delineating tumors clearly [37].

In a recent study, a hybrid model was introduced that combines Neutrosophy with a CNN. In this approach, the Neutrosophy technique is first used to identify tumor regions, which the CNN then processes for detailed feature extraction and analysis [38]. Fine-tuning a pre-trained model such as DenseNet achieved high accuracy on small-scale datasets. This result highlights the role of pre-trained knowledge in improving performance when labeled data are limited [39]. Figure 2 illustrates how a deep neural network (DNN) gradually transforms raw MRI scans into useful features.

2.3. Transfer Learning-Based Approaches

TL has become a highly effective technique in medical analysis, especially when working with MRI images. Its main advantage is that it can reuse a model pre-trained on a large dataset such as ImageNet, which contains millions of images distributed across different categories. The pre-trained weights are used for initialization and then fine-tuned for the specific tumor detection task. TL allows models to attain strong performance even with small amounts of medical data and shows excellent diagnostic accuracy in clinical applications. A further advantage is that it reduces computational requirements, as this approach generally requires training only the fully connected (FC) layers rather than the entire network [40]. One study adapted a fine-tuned VGG19 model for the detection of multiple BTs [41], while another implemented the AlexNet architecture for brain abnormality diagnosis on a small dataset of only 291 images [42]. Overfitting can still occur in these layers even when performance remains high [43]. Feature-based frameworks are used to capture deep features, and self-attention-based models can enrich contextual feature representation and improve generalization in BT diagnostics [44].

The ConvNeXt V2 backbone is used to extract clear spatial features, while a parallel lightweight ConvNeXt-based branch takes frequency-domain inputs. These features are subsequently fused and refined using an ST-V2 to capture long-range contextual relationships [45]. ViTs provide a more detailed understanding of MRI scans by modeling long-range dependencies between image patches through self-attention. A comparison of transfer learning strategies was performed using VGG16 and a ViT-based model [46]. Combining multiple feature extraction approaches can introduce major computational complexity, and deploying them in real-world scenarios leads to longer inference times. Researchers have addressed this issue through techniques such as model pruning and quantization [47]. Hybrid ViT and ST models explicitly model long-range dependencies and global context, often achieving state-of-the-art results. While both CNNs and ViTs have demonstrated high accuracy, the current state of the art often includes hybrid architectures that leverage the strengths of both [48]. A new TwinFormer model has been described that combines global and local attention mechanisms to enhance multi-scale feature extraction and incorporates a Segmented Cross-Flow Stage (SCFStage) to preserve semantic–spatial features for better tumor localization. In addition, a combined loss function based on CIoU loss, focal loss, and BCE loss was introduced to improve training stability [49].

2.4. YOLO-Based CNN Models

YOLO algorithms have made remarkable progress over time, achieving higher accuracy, faster processing speeds, and improved real-time performance on the MS COCO dataset [50]. YOLOv1 [51] was implemented for BT detection with a split size of two, where each grid cell generates two prediction boxes. The model considered two classes: “Tumor,” representing the presence of an object, and “No Tumor,” indicating its absence. YOLOv2 [52] adds further layers to enable both tumor feature extraction and identification in MRI images. YOLOv3 [53] introduces multi-scale prediction and further advances detection with the DarkNet-53 backbone, which collects important features at several stages and combines them through residual connections. The series evolved further with YOLOv4 [54], which applied CSPDarkNet-53 [55] and PANet [56] along with multi-image augmentation to enhance detection accuracy. YOLOv5 [57] offered a more flexible and efficient framework oriented toward model deployment and practical implementation.
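As a rough illustration of the YOLOv1 configuration described above, the sketch below computes the size of the prediction tensor using the standard YOLOv1 layout of S × S × (B · 5 + C), where each box carries four coordinates plus a confidence score. It assumes that the split size of two corresponds to a 2 × 2 grid (S = 2) with B = 2 boxes per cell and C = 2 classes; the function name is hypothetical, and the cited study may have used a different layout.

```python
def yolo_v1_output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of a YOLOv1-style prediction tensor: S x S x (B*5 + C)."""
    # Each box contributes x, y, w, h and one confidence score (5 values).
    values_per_cell = boxes_per_cell * 5 + num_classes
    return (grid_size, grid_size, values_per_cell)


# Assumed setting: S = 2, B = 2, C = 2 ("Tumor" and "No Tumor").
shape = yolo_v1_output_shape(grid_size=2, boxes_per_cell=2, num_classes=2)
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)  # (2, 2, 12) 48
```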

YOLOv6 [58] progressively refines PANet for feature assessment and merges CSPNet with adaptive refinement of anchor boxes. YOLOv7 [59] addressed these problems through joint model acceleration and compression, using an EfficientRep backbone and multi-stage labeling. This advancement makes the model more efficient at learning deep BT features, thereby minimizing errors and enhancing accuracy. YOLOv8 [60] features an upgraded PANet with Dynamic Kernel Attention, which significantly enhances the model’s efficiency and accuracy on MRI images. YOLOv9 [61] introduced programmable gradient information (PGI), which improves weight updates by providing more accurate gradient information. YOLOv10 [62] upgrades the system design using a lightweight detection head, channel transformations, and different locations. YOLOv11 [63] included the C3k2 block in its backbone and used C2PSA for enhanced spatial attention.

2.5. Identified Gaps and Study Motivation

Despite significant advances in DL-based BT analysis, several important research gaps still exist in the early-stage detection of brain cancer. First, many existing CNN-based approaches mainly focus on local feature extraction and often struggle to capture long-range dependencies and global contextual relationships, which are important for analyzing complex brain MRI structures. These limitations may reduce detection reliability, particularly when tumors present heterogeneous textures, low contrast, blurred boundaries, or irregular shapes. Although YOLO-based methods provide fast and efficient detection, conventional variants may still face difficulties in preserving fine tumor boundary details and robustly detecting small lesion regions across multiple scales. In medical imaging, this issue is particularly critical, as missing small or subtle abnormal regions may directly compromise diagnostic reliability. Previous ViT-based and hybrid CNN–transformer models have demonstrated improved contextual feature learning; however, many of these approaches still suffer from higher computational complexity, increased inference cost, or insufficient emphasis on precise tumor localization in practical medical detection frameworks.

Furthermore, several existing hybrid methods place greater emphasis on classification performance than on accurate and efficient tumor detection with precise localization capability. Another important research gap is that many recent studies have not adequately addressed the trade-off between detection accuracy, computational efficiency, and model scalability. Although these models may achieve high performance, their practical value in real-time or clinical support systems is limited when they become overly complex or resource-intensive. A more effective detection model is therefore needed that can simultaneously capture local details and global contextual information.

In addition, several studies focus mainly on detection accuracy without sufficiently addressing the joint need for accurate localization, multi-scale representation, and clinically meaningful interpretability. Motivated by these gaps, the proposed Swin–YOLO framework is designed as a hybrid medical image detection model that combines the global contextual modeling capability of the ST with the efficient localization strength of YOLOv12. Furthermore, the integration of the FPN + PANet hybrid neck improves multi-scale feature fusion for more robust tumor boundary representation, while the added XAI analysis supports interpretability and transparency in medical decision-support scenarios. Therefore, this study aims to address the limitations of existing methods by improving the balance between accuracy, localization precision, computational efficiency, and clinical interpretability in brain tumor MRI detection. Table 1 summarizes CNN-based models, accuracies, and identified research gaps.

3. Methodology

This section presents the details of the proposed Swin–YOLO model adopted to improve tumor diagnosis performance and describes how ViTs are integrated into the CNN-based YOLOv12 model to overcome the important challenges of attaining targeted tumor detection.

3.1. Model Architecture

An overview of the proposed architecture is presented in Figure 3. The model is implemented and evaluated on the Br35H dataset. As the most advanced model in the YOLO series developed by Ultralytics, the YOLOv12 framework substantially enhances training performance and inference speed while maintaining high accuracy with fewer parameters. This model is highly efficient and suitable for real-time clinical applications that require reliable tumor detection. Moreover, it includes area attention, combined feature aggregation, and Flash Attention, which collectively enhance detection accuracy, particularly in cases involving small or irregular tumor shapes in low-contrast MRI images.

In contrast, the ST introduces a hierarchical ViT with a shifted-window scheme, enabling it to capture local and global contextual information while maintaining computational efficiency. We chose YOLOv12 as the baseline model because it offers high BT detection accuracy, fast inference, and efficient computation, all of which matter for practical medical image analysis. The ST was selected because it can effectively learn both local features and global contextual information through its hierarchical attention mechanism. This capability is particularly beneficial for MRI images, in which tumors often exhibit irregular shapes.

3.2. Input Stage

The proposed method consists of three basic parts: backbone, neck, and head. The input stage begins with MRI brain scans in the form of individual 2D slices. These images are preprocessed and then divided into patches for transformer-based feature extraction. An MRI slice is represented as

X \in \mathbb{R}^{H \times W \times C}

(1)

where X denotes a 2D image and \mathbb{R} represents the set of real numbers. H, W, and C denote the height, width, and number of channels of the image, respectively. For this study, the image dimensions are set to 224 × 224 because the ST backbone expects this input size.

3.3. Backbone

We replaced the YOLOv12 backbone with an ST because the CNN backbone showed limitations in global feature modeling, scalability, gradient flow, and multi-scale fusion. The ST addresses these problems by applying self-attention across shifted windows, capturing both fine tumor details and overall brain regions. Its hierarchical design produces multi-resolution features, and its attention layers improve gradient flow and model stability. This context-driven attention mechanism further enhances the model’s ability to handle low-contrast images, noise, and irregular tumor shapes while maintaining a lightweight, robust, and efficient design for real-time tumor detection. The combination improves both spatial recognition and contextual understanding in MRI images. Beyond backbone replacement, the contribution of the proposed model lies in the task-specific integration of the ST within the YOLOv12 detection pipeline for BT MRI analysis. In the Swin–YOLO architecture, the transformer serves not only as a feature extractor but also as a dedicated representation-learning module that improves the quality of the features passed to the multi-scale fusion and localization stages. This integration strengthens the interaction between global contextual information and local tumor boundary features, which is essential for challenging MRI cases with low contrast, irregular shapes, and mixed appearance. The key contribution of this work is therefore not merely the replacement of one architecture with another, but the development of a hybrid medical image detection framework that integrates transformer-based feature learning to enhance tumor localization performance on brain MRI images.

3.3.1. Patch Partitioning

In this block, the MRI images are divided into small, equal-sized patches (4 × 4 px) since the ST works on sequences rather than continuous images. Each patch acts as a token, which is subsequently transformed into an embedding (a numerical feature vector) in the next block.

N = \frac{H}{P} \times \frac{W}{P},

(2)

where N represents the total number of patches; H and W denote the height and width of the input image, respectively; and P is the patch size. Both H/P and W/P must be integers, meaning that the image height and width must be divisible by P to allow non-overlapping patch partitioning. This constraint ensures that the image is divided evenly into uniform patches, with no partial patches at the borders. Moreover, such regular partitioning is important for ST-based processing, since every patch is treated as a visual token for subsequent embedding and hierarchical attention-based feature learning.
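As a worked instance of Equation (2), the short sketch below computes the number of patches for the 224 × 224 input and 4 × 4 patch size stated in the text, and enforces the divisibility condition. The helper name `num_patches` is hypothetical and used only for illustration.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping patches, N = (H / P) * (W / P)."""
    if height % patch_size != 0 or width % patch_size != 0:
        raise ValueError("H and W must both be divisible by the patch size P")
    return (height // patch_size) * (width // patch_size)


# Values from the text: H = W = 224, P = 4, giving 56 * 56 patches.
n = num_patches(224, 224, 4)
print(n)  # 3136
```

An input whose height or width is not a multiple of P (for example 225 × 224) raises an error here; in practice such an image would need to be resized or padded first.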

3.3.2. Linear Embedding (Patch Projection Layer)

In this block, each flattened patch is projected into a higher-dimensional embedding space of dimension D. This step converts the image patches into feature embeddings that the transformer can process. After patch partitioning, all patches are arranged into a sequence. Each flattened patch vector X_i, of dimension P²C, is mapped into a new embedding as

Z_{i} = W_{E} X_{i} + b_{E},

(3)

where

X_{i} \in R^{P^{2} C}

is the flattened representation of the i-th patch,

W_{E} \in R^{D \times P^{2} C}

denotes the learnable projection matrix,

b_{E} \in R^{D}

is the bias term, and

Z_{i} \in R^{D}

is the corresponding embedded patch token. D denotes the embedding dimension, which defines the size of the feature representation used by the transformer. This embedding step transforms low-level pixel information into a more discriminative feature space, enabling the transformer to process each image patch as a token in a sequential representation-learning framework. For grayscale MRI images, where C = 1, each P × P patch is flattened into a vector of length P² before projection into the D-dimensional embedding space.
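The projection in Equation (3) can be illustrated with a minimal pure-Python sketch for the grayscale case (C = 1). The helper names, the random weights, and the embedding dimension D = 8 are illustrative assumptions; a trained model would use learned weights and a much larger D.

```python
import random


def flatten_patch(patch):
    """Flatten a P x P grayscale patch (C = 1) into a vector of length P^2."""
    return [value for row in patch for value in row]


def embed(x, weight, bias):
    """Compute Z_i = W_E X_i + b_E for one flattened patch."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


P, D = 4, 8  # D = 8 is an illustrative value, not taken from the paper
rng = random.Random(0)
W_E = [[rng.uniform(-0.1, 0.1) for _ in range(P * P)] for _ in range(D)]  # D x P^2
b_E = [0.0] * D

patch = [[rng.random() for _ in range(P)] for _ in range(P)]
x_i = flatten_patch(patch)   # length P^2 = 16
z_i = embed(x_i, W_E, b_E)   # length D = 8
print(len(x_i), len(z_i))
```

The shape of `W_E` (D rows, P²C columns) is what allows a vector of length P²C to be mapped to a token of length D.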

3.3.3. Swin Block

After the image is divided into small patches (4 × 4), the patches are converted into fixed-length feature vectors (tokens) and passed into the first ST block. This is highly effective for BT detection because it captures both fine local details and broader global features in MRI scans. Each block begins with a window partitioning step, where the feature map is partitioned into non-overlapping windows of size M × M. The number of windows is given by

N_{w} = \frac{H^{'}}{M} \times \frac{W^{'}}{M}

(4)

where N_w represents the total number of non-overlapping windows, H′ and W′ denote the height and width of the current feature map, and M is the window size used for local self-attention. In the Swin Transformer, self-attention is not computed over the entire feature map at once. Instead, the feature map is divided into smaller fixed-size local windows, and attention is calculated independently within each window. This design significantly reduces computational complexity compared with global self-attention, while still allowing the model to learn meaningful local contextual relationships. More specifically,

\frac{H^{'}}{M}

indicates how many windows are formed along the height dimension and

\frac{W^{'}}{M}

indicates how many windows are formed along the width dimension. Multiplying these two values yields the total number of windows in the current transformer stage. This formulation is valid when both H′ and W′ are divisible by M, ensuring that the feature map can be partitioned into equal-sized windows without incomplete boundary regions. Therefore, Equation (4) gives the total number of non-overlapping windows processed within an ST block.
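To make Equation (4) and the cost argument concrete, the following sketch counts the windows and compares the number of token pairs scored by global attention with the number scored by window attention. The 160 × 160 feature map and the window size M = 8 are assumed values chosen for illustration, and the function names are hypothetical.

```python
def count_windows(h: int, w: int, m: int) -> int:
    """N_w = (H' / M) * (W' / M) for a feature map of size H' x W'."""
    if h % m or w % m:
        raise ValueError("H' and W' must both be divisible by M")
    return (h // m) * (w // m)


def global_attention_pairs(h: int, w: int) -> int:
    """Every token attends to every token: (H' * W')^2 pairs."""
    return (h * w) ** 2


def window_attention_pairs(h: int, w: int, m: int) -> int:
    """Each window holds M^2 tokens that attend only to each other."""
    return count_windows(h, w, m) * (m * m) ** 2


h, w, m = 160, 160, 8  # assumed illustrative sizes
print(count_windows(h, w, m))           # 400
print(global_attention_pairs(h, w))     # 655360000
print(window_attention_pairs(h, w, m))  # 1638400
```

For these sizes the windowed variant scores 400 times fewer token pairs, which is the ratio H′W′ / M².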

3.3.4. Patch Merging

During patch merging, every group of 2 × 2 neighboring tokens is combined, reducing the total number of tokens by a factor of four (a factor of two along each of the height and width dimensions), as shown in Figure 4. The output of patch merging can be formulated as follows:

Y \in R^{\frac{H}{2} \times \frac{W}{2} \times 2 C}

(5)

where Y denotes the output feature map after patch merging and H, W, and C denote the height, width, and channel dimensions of the input token grid, respectively. Specifically, four neighboring tokens are first concatenated, producing an intermediate feature of dimension 4C, which is then linearly projected into a 2C-dimensional representation. Consequently, the spatial dimensions are reduced by half, while the channel dimension is doubled, enabling hierarchical representation learning with improved contextual modeling and reduced computational complexity in deeper ST stages. Projecting to 2C channels avoids the higher computational cost of carrying 4C channels forward and enables the model to learn more detailed context-aware features at each stage. Each merged token then represents a larger receptive field with twice the channel capacity, which captures contextual connections more efficiently. The linear layer also balances the feature distribution and prepares it for further transformation inside the ST blocks.
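The concatenate-then-project step behind Equation (5) can be sketched in plain Python as follows. The function name `patch_merge`, the toy token values, and the averaging weight matrix are assumptions made for illustration; the normalization layer that usually precedes the projection in Swin implementations is omitted here.

```python
def patch_merge(tokens, weight):
    """Merge 2 x 2 token groups: H x W x C -> (H/2) x (W/2) x 2C.

    tokens: H x W grid of C-dimensional vectors (nested lists).
    weight: 2C x 4C projection matrix applied after concatenation.
    """
    h, w = len(tokens), len(tokens[0])
    if h % 2 or w % 2:
        raise ValueError("H and W must be even")
    merged = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            # Concatenate the four neighbors into one 4C-dimensional vector.
            cat = tokens[i][j] + tokens[i + 1][j] + tokens[i][j + 1] + tokens[i + 1][j + 1]
            # Linear projection from 4C down to 2C.
            row.append([sum(wk * v for wk, v in zip(w_row, cat)) for w_row in weight])
        merged.append(row)
    return merged


C = 2
tokens = [[[float(i), float(j)] for j in range(4)] for i in range(4)]  # 4 x 4 x C
weight = [[1.0 / (4 * C)] * (4 * C) for _ in range(2 * C)]             # toy averaging weights
out = patch_merge(tokens, weight)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 4
```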

3.3.5. Multi-Stage Extracted Features (C3–C6)

As an MRI image passes through the ST backbone, its features are progressively refined, allowing the extraction of deeper and more valuable tumor representations. Starting from the initial patch embeddings (i.e., Z0) produced by the linear embedding block, the first stage generates the feature map C3, which primarily captures fine-grained local tumor details. The feature map C4 builds on the previous stage to identify regional tumor properties, for instance, growth patterns with irregular boundaries. The feature map C5 emphasizes long-range dependencies, relating the tumor to larger brain structures. Finally, the feature map C6 encodes deep contextual information. The equation below describes the feature extraction process across all stages.

C_{i + 1} = F_{STG_{i + 1}} (C_{i})

(6)

where C_i denotes the input feature map to the (i + 1)-th stage, F_STGi+1(⋅) represents the transformation function of that stage, and C_i+1 is the output feature map. Each stage performs a sequence of window-based self-attention, feed-forward transformation, and optional patch merging, thereby progressively refining the feature representation. Starting from the initial patch embedding, Z₀, the backbone generates hierarchical features, C3, C4, C5, and C6, where shallow stages preserve fine local structures and deeper stages capture increasingly abstract semantic and contextual information. This staged feature extraction enables the network to model both subtle tumor details and broader anatomical relationships, which is important for accurate brain tumor detection in MRI images.
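The staged recursion in Equation (6) can be traced at the level of tensor shapes. The sketch below assumes a 640 × 640 input, 4 × 4 patches, and a base embedding width of 96 channels (the Swin-T default, which this section does not state), with patch merging applied before every stage after the first; the function name is hypothetical.

```python
def swin_stage_shapes(height, width, patch=4, embed_dim=96, num_stages=4):
    """Return (name, H, W, channels) for the backbone outputs C3..C6."""
    h, w, c = height // patch, width // patch, embed_dim
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            # Patch merging halves the resolution and doubles the channels.
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((f"C{stage + 3}", h, w, c))
    return shapes


for name, h, w, c in swin_stage_shapes(640, 640):
    print(name, h, w, c)
```

Under these assumptions the four outputs have resolutions 160, 80, 40, and 20 with 96, 192, 384, and 768 channels, which is why a channel-unifying step is needed before the maps can be fused in the neck.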

3.4. Neck

Previous CNN-based models missed important features spanning the range of scales encountered in tumor detection. To address this limitation, we employed the FPN + PANet hybrid in the neck of YOLOv12 for multi-scale, detail-preserving feature fusion. Through the integration of FPN and PANet, the neck achieves more effective multi-scale feature aggregation by combining high-level semantic context with low-level spatial detail. This design allows the proposed model to handle tumors with substantial variations in size, appearance, and structural complexity.

3.4.1. FPN + PANet Hybrid

The proposed FPN + PANet hybrid neck enhances multi-scale feature fusion for brain tumor (BT) detection. Specifically, the FPN establishes a top-down pathway that integrates high-level semantic information with low-level detailed features, enabling the model to capture both global tumor context and fine boundary information. In parallel, PANet strengthens bottom-up information flow by transferring localization-sensitive features from lower layers to higher layers, thereby improving spatial precision and detection robustness. Unlike a conventional neck design, this hybrid module plays a critical role in the proposed Swin–YOLO framework by effectively redistributing the contextual representations learned by the Swin Transformer backbone across different pyramid levels. As a result, the model becomes more effective in detecting tumors with varying sizes, irregular boundaries, and low-contrast appearance. Therefore, the FPN + PANet hybrid neck serves as an essential component of the proposed architecture, and its contribution is analyzed separately in the ablation study.

Moreover, in the top-down path, the backbone feature map at level i is fused with the up-sampled feature from the deeper pyramid level (i + 1). In the bottom-up path, the feature map is further enriched by the down-sampled feature propagated from the shallower pyramid level (i − 1). The hybrid fusion process is formulated as

{P^{'}}_{i} = \mathrm{Conv}(\mathrm{Concat}({C^{'}}_{i}, \mathrm{Up}(P_{i + 1}), \mathrm{Down}(P_{i - 1})))

(7)

where

{P^{'}}_{i}

denotes the refined hybrid feature map at level i,

{C^{'}}_{i}

is the lateral feature map from the backbone, Up(P_i+1) represents the up-sampled feature from the deeper level, and Down(P_i−1) represents the down-sampled feature from the shallower level. The concatenation operation merges these multi-scale features along the channel dimension, while the convolution layer smooths and refines the fused representation. This bidirectional fusion strategy enables the network to combine high-level semantic information with fine-grained localization cues, thereby improving multi-scale feature representation for accurate tumor detection.

3.4.2. Lateral Conv

A Lateral Conv is a 1 × 1 convolution layer applied to the feature maps from the backbone stages (C3, C4, C5, and C6). It not only changes the number of channels but also adjusts the weights to emphasize useful features and suppress irrelevant noise in a tumor image. In doing so, it unifies all feature maps to the same fixed number of channels (256), making them compatible with the FPN + PANet structure. The equation for Lateral Conv can be expressed as follows:

{C^{'}}_{i} = \mathrm{Conv}_{1 \times 1} (C_{i}), C_{i} \in R^{H_{i} \times W_{i} \times D_{i}}, {C^{'}}_{i} \in R^{H_{i} \times W_{i} \times 256}

(8)

where

C_{i} \in R^{H_{i} \times W_{i} \times D_{i}}

denotes the input feature map from the i-th backbone stage, with spatial dimensions Hi × Wi and channel depth D_i, and

{C^{'}}_{i} \in R^{H_{i} \times W_{i} \times 256}

represents the transformed output feature map after the 1 × 1 lateral convolution. The 1 × 1 lateral convolution projects C_i into C′_i while preserving the spatial dimensions and unifying the channel dimension to 256. This transformation makes the multi-scale backbone features compatible with subsequent fusion in the FPN + PANet neck. It also preserves the positional structure of tumor-related patterns while ensuring that the backbone features (C3, C4, C5, and C6) share a common channel dimension.
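Because a 1 × 1 convolution acts on each spatial position independently, Equation (8) amounts to applying the same linear map to every pixel's channel vector. The sketch below illustrates this with tiny assumed sizes (3 input channels projected to 2 output channels as a stand-in for D_i → 256); the function name and weights are hypothetical.

```python
def conv1x1(feature, weight, bias):
    """Apply a 1 x 1 convolution to an H x W x D_in map (nested lists).

    weight: D_out x D_in matrix shared by all spatial positions.
    Returns an H x W x D_out map; the spatial size is unchanged.
    """
    return [
        [
            [sum(wk * v for wk, v in zip(w_row, pixel)) + b for w_row, b in zip(weight, bias)]
            for pixel in row
        ]
        for row in feature
    ]


feature = [[[1.0, 2.0, 3.0] for _ in range(3)] for _ in range(2)]  # 2 x 3 x 3
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]                        # D_out = 2, D_in = 3
bias = [0.0, 1.0]
lateral = conv1x1(feature, weight, bias)
print(len(lateral), len(lateral[0]), len(lateral[0][0]))  # 2 3 2
```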

3.4.3. Up and Down Sample Block

In MRI brain scans, tumors can appear in different forms; for instance, they can be small in height and width, occupying low-resolution areas with fine textures, or appear as large masses spread across broad regions. In the FPN top-down path, semantic features are up-sampled to align with higher-resolution layers (i.e., C3). In the bottom-up path, the network down-samples these fine-grained maps into smaller, more semantic representations that are easier to combine with deeper layers (i.e., C4, C5, and C6), which naturally hold stronger semantic meaning. The equation for the up-sampling and down-sampling operations can be expressed as follows:

{F^{'}}_{i} = \mathrm{Conv}(\mathrm{Up}(C_{i}, \mathrm{scale} = 2)) + \mathrm{Conv}_{k \times k, s = 2}(C_{i})

(9)

where

k \times k

represents the convolution kernel, s denotes the stride that reduces the spatial resolution, and

{F^{'}}_{i}

represents the feature map that combines both fine (up-sampled) and coarse (down-sampled) information for accurate tumor detection.
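The two resampling directions can be illustrated on a single-channel map with a minimal sketch. Nearest-neighbor up-sampling and plain strided sub-sampling are used here as simple stand-ins; the learned k × k, stride-2 convolution of Equation (9) would additionally mix neighboring values. Both function names are hypothetical.

```python
def upsample_nearest(fmap, scale=2):
    """Repeat each value scale times along both axes (nearest-neighbor)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out


def downsample_stride(fmap, stride=2):
    """Keep every stride-th sample; a stand-in for a stride-2 convolution."""
    return [row[::stride] for row in fmap[::stride]]


coarse = [[1, 2], [3, 4]]
fine = upsample_nearest(coarse)   # 2 x 2 -> 4 x 4
back = downsample_stride(fine)    # 4 x 4 -> 2 x 2
print(fine)
print(back)
```

Up-sampling by 2 followed by stride-2 sub-sampling returns the original map in this toy case, which shows how the two operations align feature maps from adjacent pyramid levels.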

3.4.4. Pyramid Feature Maps (PFMs)

The PFMs are constructed by combining backbone outputs (i.e., C3–C6) through lateral convolution, up-sampling, and down-sampling operations. Each pyramid level (i.e., P3–P6) is designed to detect tumors at a different spatial scale. The P3 level maintains the high-resolution spatial details needed for detecting small tumors. The P4 level captures mid-scale tumor regions by balancing structural resolution with contextual information, whereas P5 improves the detection of large tumors by preserving global structure. Finally, P6 provides a deep semantic representation, which helps identify very large tumors by analyzing deep contextual information. Overall, the multi-scale pyramid design enables the model to preserve fine-grained spatial details while simultaneously strengthening high-level semantic information. This balanced representation is particularly important for brain MRI analysis, where tumors may appear in different sizes with irregular boundaries and heterogeneous textures. Consequently, PFMs improve the model’s capability to achieve more accurate and reliable tumor detection.

3.5. Head

The detection head is the final stage of the proposed method, where processed feature maps are transformed into tumor detection results. It converts the BT features into bounding boxes with confidence scores by combining the regression, objectness, and classification terms under a joint loss function. Furthermore, features at different pyramid levels (i.e., P3–P6) are used to predict bounding boxes so that both small and large abnormalities are captured.

3.5.1. Bounding Boxes

Figure 5 illustrates the bounding boxes used by the proposed model to identify BTs. The detection head uses an anchor-free approach to generate bounding boxes before producing the final detection results: MRI images are split into an S × S grid, where each grid cell is responsible for predicting detection boxes. These cells estimate the class confidence values, which are used to determine the class of each object. Within a single YOLO model, all tumor bounding boxes in a given image are predicted simultaneously. For each predicted box, the model estimates the center coordinates, width, and height, which together define the spatial extent of the suspected tumor region. This design allows the network to directly localize tumors of different shapes and sizes without relying on predefined anchor templates.

3.5.2. Inner-GIoU

The Inner-GIoU loss function improves bounding box regression in our model. It enhances bounding box accuracy by making it easier to measure the difference between the predicted (i.e., anchor) boxes and the actual GT (i.e., target) boxes, as shown in Figure 6. A scaling ratio is applied to adjust the size of the auxiliary bounding boxes, making the training process more flexible and adaptive. Through this loss, the model learns to adjust bounding boxes more accurately, which increases the accuracy of tumor localization. Moreover, computing the loss for samples with different IoU values through the scaling factor is also essential in model training. This approach delivers more reliable performance and is also helpful for detecting small, irregular tumors. In this method, both the predicted box and the GT box are transformed into smaller inner regions by shrinking their width and height according to a predefined ratio. These inner boxes concentrate more on the central and informative object area rather than the outer boundaries alone. As a result, the model becomes more sensitive to subtle localization differences between the predicted and target boxes. This is especially important in medical imaging, where even a small positional deviation may lead to incorrect lesion localization.

The IoU and GIoU are expressed as follows:

\mathrm{IoU} = \frac{| b \cap b^{g t} |}{| b \cup b^{g t} |},

(10)

where b denotes the estimated bounding box, b^gt represents the ground-truth bounding box, ∣b ∩ b^gt∣ is the intersection area between the two boxes, and ∣b ∪ b^gt∣ is their union area. Thus, Equation (10) is used to calculate the overlap ratio between the predicted and target boxes, where a larger IoU value signifies a closer spatial match and better localization accuracy. The IoU value ranges from 0 to 1, where a value of 0 indicates that the predicted and ground-truth boxes do not overlap at all and a value of 1 indicates perfect overlap between them. Because the overlap is measured in a normalized manner, IoU values are comparable across boxes of different sizes. IoU is widely used in object detection tasks to evaluate how well the predicted box aligns with the actual lesion or object region.

The GIoU extends IoU by considering the smallest enclosing box covering both the predicted and ground-truth boxes and is defined as

\mathrm{GIoU} = \mathrm{IoU} - \frac{| C - (b \cup b^{g t}) |}{|C|},

(11)

where C represents the smallest enclosing box covering both the predicted bounding box b and the ground-truth bounding box b^gt, and ∣C − (b ∪ b^gt)∣ denotes the area inside C that is not occupied by the union of the two boxes. In other words, this term measures the extra background region enclosed by C beyond the combined area of the predicted and target boxes. Therefore, Equation (11) extends the conventional IoU by introducing a geometric penalty that reflects how far the two boxes are from each other, even when they do not overlap. In particular, when the predicted and ground-truth boxes overlap well, the penalty term becomes small and the GIoU value approaches the IoU value. However, when the two boxes are far apart or do not overlap, the penalty term becomes larger, which reduces the GIoU score. In this way, GIoU provides more informative guidance than IoU alone because IoU becomes zero for non-overlapping boxes and cannot describe the spatial separation between them. By contrast, GIoU still captures the geometric relationship between the two boxes through the enclosing region C.
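Equations (10) and (11) can be checked numerically with a short sketch for axis-aligned boxes. The corner format (x1, y1, x2, y2), the function names, and the example boxes are assumptions made for illustration; both boxes are assumed to have positive area.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_giou(b, b_gt):
    """Return (IoU, GIoU) following Equations (10) and (11)."""
    inter = area((max(b[0], b_gt[0]), max(b[1], b_gt[1]),
                  min(b[2], b_gt[2]), min(b[3], b_gt[3])))
    union = area(b) + area(b_gt) - inter
    iou = inter / union
    # C: smallest box enclosing both b and b_gt.
    c_area = area((min(b[0], b_gt[0]), min(b[1], b_gt[1]),
                   max(b[2], b_gt[2]), max(b[3], b_gt[3])))
    giou = iou - (c_area - union) / c_area
    return iou, giou


print(iou_giou((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap
print(iou_giou((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint: IoU is 0, GIoU is negative
print(iou_giou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical: both equal 1
```

The disjoint pair shows the behavior described above: IoU is zero regardless of the gap, while GIoU becomes more negative as the boxes move apart.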

To further improve regression performance, the Inner-GIoU formulation introduces a scaling factor that controls the influence of inner auxiliary boxes and is expressed as

\mathrm{Inner\text{-}GIoU} = \mathrm{IoU} - \lambda \times \frac{| C_{inner} - (b \cup b^{g t}) |}{| C_{inner} |},

(12)

where λ is a scaling coefficient that controls the contribution of the inner penalty term and C_inner denotes the smallest enclosing box computed from the inner auxiliary boxes derived from the predicted and ground-truth boxes. The term

| C_{i n n e r} - (b \cup b^{g t}) |

represents the area inside the inner enclosing box that is not covered by the union of the predicted and ground-truth boxes. Therefore, this penalty measures the spatial discrepancy between the two boxes within a more focused internal region rather than across the entire outer enclosing area. To be more specific, Inner-GIoU extends conventional GIoU by focusing on the inner auxiliary regions of the predicted and target boxes. By emphasizing the more informative central area rather than only the global enclosing box, it becomes more sensitive to small localization errors, which is especially useful for small objects and precise alignment tasks.

To define the inner auxiliary box for the ground-truth target, let the GT box center be

x_{c}^{g t}

,

y_{c}^{g t}

, with width w^gt and height h^gt. Using a scaling factor, r (ratio), the left and right boundaries of the inner GT box are defined as

b_{l}^{g t} = x_{c}^{g t} - \frac{w^{g t} \times r}{2}, b_{r}^{g t} = x_{c}^{g t} + \frac{w^{g t} \times r}{2},

(13)

where

x_{c}^{g t}

and

y_{c}^{g t}

denote the center coordinates of the ground-truth box, w^gt and h^gt denote its width and height, and r is a scaling ratio used to generate the inner auxiliary box. The terms

b_{l}^{g t}

and

b_{r}^{g t}

represent the left and right boundaries of the scaled inner ground-truth box, respectively. This formulation preserves the center position of the original ground-truth box while proportionally adjusting its width according to r when r < 1. The resulting inner box focuses on the more central and informative target region, which is beneficial for fine-grained localization.

Similarly, the top and bottom boundaries of the inner GT box are defined as

b_{t}^{g t} = y_{c}^{g t} - \frac{h^{g t} \times r}{2}, b_{b}^{g t} = y_{c}^{g t} + \frac{h^{g t} \times r}{2},

(14)

where

b_{t}^{g t}

and

b_{b}^{g t}

represent the top and bottom boundaries of the scaled inner ground-truth box. Thus, Equation (14) defines the vertical boundaries of the inner auxiliary ground-truth box by scaling the original box height around its center coordinate. More specifically, the vertical center,

y_{c}^{g t}

, remains unchanged, while the original height,

h^{g t}

, is multiplied by the scaling factor, r. As a result, the top and bottom boundaries are symmetrically adjusted with respect to the center of the original ground-truth box.

This ensures that the inner auxiliary box preserves the original target location while reducing or refining its vertical extent. For the predicted box, let the center coordinates be (x_c, y_c), with width w and height h. The left and right boundaries of the corresponding inner predicted box are computed as

b_{l} = x_{c} - \frac{w \times r}{2}, b_{r} = x_{c} + \frac{w \times r}{2},

(15)

where b_l and b_r denote the left and right boundaries of the inner auxiliary predicted box, respectively; x_c and y_c are the center coordinates of the predicted bounding box; w and h denote its width and height; and r is the scaling ratio. Equation (15) proportionally adjusts the horizontal extent of the predicted box while preserving its center position.

This formulation is consistent with the ground-truth inner box definition and enables a fair inner-region comparison during Inner-GIoU-based regression. Likewise, the top and bottom boundaries of the inner predicted box are given by

b_{t} = y_{c} - \frac{h \times r}{2}, b_{b} = y_{c} + \frac{h \times r}{2},

(16)

where b_t and b_b denote the top and bottom boundaries of the inner auxiliary predicted box, respectively; y_c is the vertical center coordinate of the predicted bounding box; h is its height; and r is the scaling ratio. Equation (16) proportionally adjusts the vertical span of the predicted box while preserving its center position.
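The inner auxiliary boxes of Equations (13)–(16) can be sketched as a single helper that scales a box around its center. The function names, the ratio r = 0.75, and the example boxes are illustrative assumptions; the comparison at the end uses a plain IoU on the inner boxes only to show the effect of shrinking and is not the full loss of Equation (12).

```python
def inner_box(xc, yc, w, h, r):
    """Scale a (center, width, height) box by r around its center.

    Returns (b_l, b_t, b_r, b_b) as in Equations (13)-(16).
    """
    half_w, half_h = w * r / 2, h * r / 2
    return (xc - half_w, yc - half_h, xc + half_w, yc + half_h)


def iou(a, b):
    """Plain IoU of two (left, top, right, bottom) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


r = 0.75                       # assumed ratio; r < 1 shrinks the boxes
gt = (50.0, 40.0, 20.0, 10.0)  # hypothetical GT box: xc, yc, w, h
pred = (52.0, 41.0, 20.0, 10.0)

inner_gt = inner_box(*gt, r)
inner_pred = inner_box(*pred, r)
full_iou = iou(inner_box(*gt, 1.0), inner_box(*pred, 1.0))
inner_iou = iou(inner_gt, inner_pred)
print(inner_gt)              # (42.5, 36.25, 57.5, 43.75)
print(full_iou, inner_iou)
```

For the same center offset, the overlap between the shrunken boxes is lower than between the full boxes, which illustrates why inner boxes with r < 1 respond more strongly to small localization errors.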

In Figure 6, the target and anchor boxes are shown separately, with the target box (i.e., the blue solid box), which represents the GT bounding box, on the left side and the predicted or reference bounding box (i.e., the orange solid box) on the right side. The smaller inner target region (blue dashed box) on the left side is derived from the target box by shrinking it proportionally along the width (w_inner) and height (h_inner) dimensions.

3.6. YOLOv12 Architectural Advancements

YOLOv12 is composed of three main parts: the backbone, neck, and head. First, the input image is fed into the backbone, where an image of size 640 × 640 × 3 is progressively processed through convolutional layers and R-ELAN (Residual Efficient Layer Aggregation Network) blocks to extract hierarchical multi-scale features. The convolution layers gradually reduce the spatial resolution from 640 × 640 to 320 × 320, 160 × 160, 80 × 80, 40 × 40, and finally 20 × 20, while increasing the depth of the feature representations. At each stage, the R-ELAN blocks enhance feature learning, preserve important information, and strengthen the representation capability of the network. The extracted multi-scale features are then passed through a Position Perceiver module to preserve spatial and positional information before being forwarded to the Flash Attention [70] A2 modules for further refinement. Figure 7 shows the detailed architecture of YOLOv12.

3.6.1. A2 Module

This module plays a very important role in further feature processing, as it refines the extracted features by efficiently capturing long-range dependencies and global contextual information, thereby enabling the backbone to generate stronger and more informative feature maps for subsequent detection tasks. In addition, A2 maintains a large receptive field while simplifying the attention mechanism to reduce the computational cost and improve the inference speed. It also incorporates segmented feature processing with Flash Attention, which reduces computational complexity by 50 percent through spatial reshaping while preserving broad contextual coverage. Moreover, A2 supports real-time detection at a fixed input resolution of n = 640 by utilizing optimized memory access patterns for more efficient execution.

3.6.2. R-ELAN Module

The neck is responsible for aggregating and refining the multi-scale features received from the backbone before they are forwarded to the detection head. It combines feature maps of different resolutions through up-sampling, concatenation, and convolution operations, enabling effective fusion of high-level semantic information with low-level spatial details. Starting from the deeper 20 × 20 feature map, the features are progressively up-sampled to 40 × 40 and 80 × 80, where they are concatenated with the corresponding features from earlier layers to enrich the representation. After each fusion stage, the combined features are processed by the R-ELAN + A2 modules, which further enhance feature extraction and contextual understanding through attention. In the downward path, convolution layers reduce the spatial resolution again, while concatenation merges features across different scales, allowing the network to preserve strong semantic information at the 40 × 40 and 20 × 20 resolutions.

3.6.2.1. CSPNet

The R-ELAN module is built upon a CSPNet [71]-inspired design, in which the input feature map is partially divided into different paths to improve gradient flow and reduce redundant computation. One portion of the features is forwarded directly, while the other passes through a sequence of convolutional and transformation operations before both branches are merged through concatenation. This structure helps preserve original information, enhance feature reuse, lower computational complexity, and improve learning efficiency. By incorporating this CSP-based strategy, the R-ELAN module is able to extract richer and more stable feature representations in both the backbone and the neck.

Figure 7. YOLOv12 architecture [72].

3.6.2.2. R-ELAN

The R-ELAN module is built upon a CSPNet-inspired design, in which the input feature map is partially divided into different paths to improve gradient flow and reduce redundant computation. One portion of the features is forwarded directly, while the other passes through a sequence of convolutional and transformation operations before both branches are merged through concatenation. This structure helps preserve original information, enhance feature reuse, lower computational complexity, and improve learning efficiency. By incorporating this CSP-based strategy, the R-ELAN module is able to extract richer and more stable feature representations in both the backbone and the neck. This design also strengthens feature propagation across layers.

3.6.3. C3K2

The R-ELAN module incorporates a C3K2 [73]-style CSP-based structure, in which the input features are first divided into multiple branches and then processed through convolutional layers, repeated blocks, and transition layers before being merged through concatenation. This design improves feature reuse, enhances gradient flow, reduces redundant computation, and enables more efficient and richer feature representation in both the backbone and the neck. Moreover, the integration of the C3K2-style CSP structure supports effective multi-branch feature learning without introducing excessive computational burden. It also facilitates better information propagation across layers.

3.6.4. Multi-Scale Detection Head

The refined multi-scale features from the neck are forwarded to three Flash Attention A2-based detection branches to generate the final predictions. Each branch operates at a different feature scale, enabling the model to detect objects of various sizes more effectively. Within each branch, the Flash Attention A2 module further enhances feature representation by capturing important contextual relationships and emphasizing relevant regions before prediction. Following this refinement, the Detect layer performs the final object detection task by predicting object locations, class probabilities, and confidence scores. The outputs from all detection scales are then combined to produce the final detection result.

3.7. Swin Transformer Architecture

The ST is a modern ViT framework that captures strong visual information using self-attention and attains high accuracy by enhancing multi-scale feature representation in MRI scans. These advancements enable the efficient processing of large-scale image data and constitute a highly effective architecture for resolving complex issues in computer vision tasks. The overall structure of the ST is shown in Figure 8. The architecture comprises four main stages, where the input image is initially partitioned into small patches and subsequently processed through multiple transformer blocks within the backbone to extract hierarchical feature representations. In the initial Stage 0, the MRI brain images are taken as input:

X \in R^{H \times W \times 3}

. Next, the patch partitioning step divides the image into disjoint 4 × 4 patches called “tokens”. In Stage 2, the model captures broader regional tumor characteristics, such as clustered or irregular shapes. In this stage, every 2 × 2 group of neighboring patches is concatenated into a single 4C-dimensional feature vector, which is then projected to an output dimension of 2C. This reduces the number of tokens, halving the spatial resolution while doubling the channel depth, so that each token covers a wider contextual view of a larger tumor region; the resulting resolution is H/8 × W/8. This patch merging and feature transformation process is executed twice more, in “Stage 3” and “Stage 4”, with output resolutions of H/16 × W/16 and H/32 × W/32, respectively. Finally, the SoftMax function produces per-class probabilities, which determine whether the brain image contains a tumor. In the final Stage 5, classification involves determining the decision boundary between the two possible classes, “Tumor” and “No Tumor”.

3.7.1. Window-Based Self-Attention (WSA) Architecture

In the ST block, WSA is a module that divides the feature map into small windows and calculates self-attention within each window, while shifted windows allow information to be exchanged across window boundaries and thus across the whole image. In this block, the model first analyzes large regions to capture the overall brain structure, then focuses on smaller regions to extract tumor details. As illustrated in Figure 9, two types of windows, local and shifted, are used to enable clearer detection of the tumor region. In the local window self-attention mechanism, the whole MRI image is split into many smaller, fixed-size windows (e.g., 4 × 4, 8 × 8, or 16 × 16). Each window acts as an enclosed region, where self-attention is computed only among the patches inside that specific window. Moreover, by using only local attention within a window, the model becomes highly sensitive to minor pixel-level abnormalities, which are important for detecting tumors that might be too small or too low in contrast for conventional methods to identify. This design makes the attention process computationally efficient because it attends to local neighborhoods rather than the entire image, significantly reducing the computational cost relative to global self-attention. Furthermore, this localized attention mechanism is highly suitable for brain MRI analysis, where tumor regions often exhibit subtle structural and intensity variations.

3.7.2. Window-Based Multi-Head Self-Attention (W-MSA)

In each window of the ST, an MSA operation is applied to separately capture fine tissue patterns, brightness variations, and local relationships in the specific brain region. This part is very important for tumor detection because it ensures that abnormalities and irregular tissue growth are clearly distinguished from the neighboring windows. In Figure 10, the MRI image is divided into four windows, and each window is processed individually to capture localized tumor features within particular brain regions. Following this, the results from all windows are merged to give the model a full picture of the brain image. This localized attention strategy enables the model to extract discriminative regional features while maintaining computational efficiency. It is particularly beneficial for brain MRI analysis, where tumors may appear with a small size, irregular shape, or low contrast against surrounding tissues. The independent analysis of each window supports precise modeling of local structural variations and abnormal patterns. After processing all windows, the outputs are combined:

\mathrm{W\text{-}MSA}(X) = \bigcup_{i = 1}^{n} \mathrm{MSA}(X^{i})

(17)

where X^{i} indicates the feature extracted from the i-th window of the MRI image. Once all the windows (i = 1 to n) are processed, the outputs are recombined (i.e., \bigcup_{i = 1}^{n} X^{i}) to form the complete feature map. In this step, each window output is placed back in its correct spatial location (see Algorithm 1).
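The window partitioning and recombination described above can be sketched with NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the attention is single-head with identity query/key/value projections (an assumption made to keep the example short), and window shifting is omitted.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size, window_size, C).
    """
    H, W, C = x.shape
    assert H % window_size == 0 and W % window_size == 0
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    """Place each window back at its original spatial location."""
    C = windows.shape[-1]
    x = windows.reshape(H // window_size, W // window_size, window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def self_attention(tokens):
    """Scaled dot-product self-attention over (N, C) tokens.

    Assumption: identity Q/K/V projections and a single head.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def window_attention(x, window_size):
    """Apply self-attention independently inside each window, then recombine."""
    H, W, C = x.shape
    windows = window_partition(x, window_size)
    out = np.stack([
        self_attention(w.reshape(-1, C)).reshape(window_size, window_size, C)
        for w in windows
    ])
    return window_reverse(out, window_size, H, W)
```

Because each window attends only to its own tokens, the cost grows with the number of windows times the square of the window area, instead of the square of the full image area.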

Algorithm 1. Proposed method architecture

Input: Brain MRI image X
Output: Tumor/healthy prediction Y and bounding box localization B

Begin

1: P ← PatchPartition(X, 4 × 4) // divide the MRI image into non-overlapping patches

2: T ← LinearEmbedding(P) // convert image patches into token embeddings

3: C₃ ← SwinStage₁(T) // extract shallow local-contextual features

4: C₄ ← SwinStage₂(PatchMerging(C₃)) // reduce resolution and learn deeper features

5: C₅ ← SwinStage₃(PatchMerging(C₄)) // capture richer multi-scale representations

6: C₅′ ← SwinStage₄(PatchMerging(C₅)) // refine deepest global contextual features

7: Ftd ← FPN({C₃, C₄, C₅′}) // fuse multi-scale features through top-down pathway

8: Fbu ← PANet(Ftd) // enhance localization-sensitive features via bottom-up pathway

9: {P₃, P₄, P₅, P₆} ← PyramidFeatures(Fbu) // generate enriched pyramid feature maps

10: (Y, B) ← YOLOv12Head({P₃, P₄, P₅, P₆}) // predict class scores and bounding boxes

11: Return (Y, B) // output final classification and localization result

End
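To make the resolution changes in steps 1 to 6 concrete, the following sketch traces the spatial size of the token grid through the four stages. It assumes a 224 × 224 input, 4 × 4 patches, and that each patch-merging step halves the grid side; the function name is hypothetical, and the channel widths are not modeled.

```python
def swin_stage_resolutions(image_size=224, patch_size=4, num_stages=4):
    """Return the token-grid side length at the output of each Swin stage.

    Assumption: stage 1 keeps the patch-grid resolution and every later
    stage is preceded by a patch-merging step that halves the side length.
    """
    side = image_size // patch_size
    sizes = []
    for _ in range(num_stages):
        sizes.append(side)
        side //= 2
    return sizes

print(swin_stage_resolutions())  # [56, 28, 14, 7]
```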

4. Experimental Settings

In this section, we provide an overview of the dataset and preprocessing steps, detail how the raw MRI images were prepared for model training, and compare the dataset division before and after augmentation. Moreover, we present the structure of the proposed Swin–YOLO model and illustrate how its components are organized from input preprocessing to the final stage of tumor detection.

4.1. Dataset

In this study, we used the Br35H dataset [74], which contains 3000 brain MRI images in JPEG format for binary classification. Of these, 1500 images belong to the tumor class and 1500 to the healthy class. The original dataset was randomly divided into training, validation, and testing sets in a 70:20:10 ratio, resulting in 2100 training images, 600 validation images, and 300 testing images. This split maintained class balance between healthy and tumorous cases across all subsets.
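The 70:20:10 division can be reproduced on index lists as follows. This is a sketch under assumptions: the text states only that the split was random and class-balanced, so the per-class (stratified) shuffling, the seed, and the function name are illustrative choices.

```python
import random

def stratified_split(labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split sample indices into train/val/test while keeping class balance.

    Each class is shuffled and divided separately, so every subset keeps
    the class proportions of the full dataset.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# 1500 tumor and 1500 healthy images, as in Br35H
labels = ["tumor"] * 1500 + ["healthy"] * 1500
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 2100 600 300
```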

4.2. Data Preprocessing

We applied various data preprocessing techniques to ensure that the dataset was well structured and well suited for training the Swin–YOLO model. This step was essential before feeding MRI data into the network. During preprocessing, all images were adjusted according to the model requirements, including input size, training suitability, and pixel intensity normalization to a fixed range for balanced learning. After normalization, several augmentation techniques were applied, including vertical flipping, horizontal flipping, 90° rotation, and width shifting (0.4), as shown in Figure 11. This procedure generated additional diverse samples, enabling the model to better capture spatial variations and tumor patterns.

4.2.1. Data Normalization

MRI images often exhibit variations in brightness levels due to differences in scanning conditions, imaging devices, or patient-specific factors. To address this variability, the pixel intensity values were normalized to a fixed range of [0, 1].

X_{norm} = \frac{X - μ}{σ}

(18)

where X denotes the original MRI image, μ represents the mean pixel intensity, and σ refers to the standard deviation of the pixel intensities.
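As a small illustration, the sketch below implements the standardization of Equation (18) alongside a min-max rescaling. The second function is included only because the text mentions a fixed [0, 1] range, which standardization alone does not guarantee; both function names and the `eps` guard against division by zero are assumptions.

```python
import numpy as np

def zscore_normalize(image, eps=1e-8):
    """Standardize pixel intensities: (X - mean) / std, as in Equation (18)."""
    image = image.astype(np.float64)
    return (image - image.mean()) / (image.std() + eps)

def minmax_normalize(image, eps=1e-8):
    """Rescale pixel intensities linearly into the range [0, 1]."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo + eps)
```

The standardized output has approximately zero mean and unit variance and can contain negative values, whereas the min-max output is bounded between 0 and 1.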

4.2.2. Image Resizing

All images were resized to a fixed spatial resolution of 224 × 224 pixels to maintain uniformity throughout the dataset. This preprocessing step ensured that all images had identical dimensions before being input into the model. Such resizing not only standardized the dataset but also supported a more stable and efficient training process by reducing variability in the input space.

X' = T\big(F\big(R_{S}(θ) \times X,\; 224 \times 224\big),\; Δw\big)

(19)

where θ denotes the rotation angle that produces X_rotated, F denotes a function that reflects the image horizontally or vertically to produce X_flipped, and T(X, Δw) produces X_shifted, where Δw is the width shift.
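The individual transformations (flip, 90° rotation, and width shift) can be sketched with NumPy as below. This is an illustrative sketch, not the pipeline used in the study: the function names are hypothetical, the shift is applied deterministically with zero filling, and resizing to 224 × 224 is left out because it requires an image library.

```python
import numpy as np

def horizontal_flip(img):
    """Reflect the image left to right."""
    return img[:, ::-1]

def vertical_flip(img):
    """Reflect the image top to bottom."""
    return img[::-1, :]

def rotate_90(img, k=1):
    """Rotate the image counter-clockwise by k * 90 degrees."""
    return np.rot90(img, k)

def width_shift(img, fraction):
    """Shift the image horizontally by a fraction of its width.

    Positive fractions shift right, negative fractions shift left.
    Assumption: vacated columns are filled with zeros.
    """
    w = img.shape[1]
    shift = int(round(fraction * w))
    out = np.zeros_like(img)
    if shift > 0:
        out[:, shift:] = img[:, :w - shift]
    elif shift < 0:
        out[:, :shift] = img[:, -shift:]
    else:
        out[:] = img
    return out
```

Note that 90° rotation swaps height and width for non-square inputs, so in practice it is usually applied after resizing to a square resolution.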

4.2.3. Data Augmentation

To further improve feature learning and strengthen the generalization capability of the proposed model, data augmentation was applied to the original MRI images. After augmentation, the dataset size increased from 3000 to 6300 images. The augmented dataset was then divided into 4410 training images, 1260 validation images, and 630 testing images, following the same 70:20:10 ratio. The detailed class-wise distribution of both the original and augmented Br35H datasets is presented in Table 2.

4.3. Performance Metrics

The performance of the proposed model was analyzed using several metrics, which demonstrate the model’s capability to accurately classify non-tumor areas while identifying tumor-affected regions. The following metrics provide insights into the model’s predictive capability across the training, validation, and test datasets. These metrics were used to detect overfitting and underfitting problems, analyze the effect of hyperparameter refinements, and obtain a broad understanding of the model’s diagnostic performance.

4.3.1. Precision–Recall Curve (PR)

The PR curve for tumor detection reliability was plotted with recall on the x-axis and precision on the y-axis. Precision measures the validity of the predicted tumor regions, while recall indicates how many actual tumors were detected. The calculations are as follows:

\mathrm{Recall} = \frac{TP}{TP + FN},

(20)

\mathrm{Precision} = \frac{TP}{TP + FP} .

(21)

4.3.2. F1-Score

The F1-score is a commonly used metric of diagnostic accuracy, as it provides a single measure that balances precision and recall. It is the harmonic mean of precision and recall and takes values between 0 and 1. The metric can be formulated as follows:

F_{1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2TP}{2TP + FP + FN} .

(22)
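Precision, recall, and the F1-score can be computed directly from the TP, FP, and FN counts. The sketch below uses the count-based form 2TP / (2TP + FP + FN) for F1, which is algebraically identical to the harmonic mean of precision and recall; the function name and the zero-division convention (returning 0.0) are assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts.

    Assumption: a metric whose denominator is zero is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

# Illustrative counts, not results from the study
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r)  # 0.9 0.75
```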

4.3.3. Mean Average Precision (mAP)

mAP is an evaluation metric widely used in object detection to measure the localization accuracy of a model. In this study, mAP was used to assess how effectively the proposed model detects tumor regions and how accurately the predicted bounding boxes overlap with the corresponding ground-truth tumor locations. The metric is derived from the precision–recall (PR) curve, where the Average Precision (AP) for each class is calculated as the area under the corresponding PR curve. In object detection, the correctness of a predicted bounding box is determined by the Intersection over Union (IoU), which measures the overlap between the predicted box and the ground-truth box.
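For reference, IoU between two axis-aligned boxes can be computed as in the following sketch. The (x1, y1, x2, y2) corner format and the function name are assumptions; detection frameworks often store boxes in other formats, such as center coordinates with width and height.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2 × 2 boxes offset by one unit in each direction overlap in a 1 × 1 region, giving an IoU of 1/7, which would not count as a correct detection at a threshold of 0.5.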

In this work, two common mAP variants are reported. First, mAP@0.5 is computed using a fixed IoU threshold of 0.5, meaning that a predicted box is considered correct if the overlap with the ground-truth box is at least 50%. This metric reflects the general detection capability of the model. Second, mAP@0.5:0.95 is a stricter and more comprehensive metric that averages the AP values over multiple IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. It provides a more robust evaluation because it measures model performance under increasingly strict localization requirements. Therefore, while mAP@0.5 indicates whether the model can generally detect tumors, mAP@0.5:0.95 better reflects the precision of tumor boundary localization. This distinction is especially important in medical imaging, where accurate localization of abnormal regions is essential for reliable diagnosis and clinical interpretation. The final mAP is then obtained by averaging the AP values across all classes, as expressed below:

mAP = \frac{1}{N} \sum_{i = 1}^{N} AP_{i} .

(23)

where N denotes the total number of classes and AP_{i} represents the Average Precision of the i-th class.
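The relationship between the PR curve, AP, and mAP can be sketched as follows. The AP function uses all-point interpolation of the PR curve, which is one common convention and not necessarily the one used by the evaluation toolkit in this study; the function names are hypothetical.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve using all-point interpolation.

    `recall` must be sorted in increasing order, with `precision`
    holding the matching precision values.
    """
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float), [0.0]))
    # Replace each precision by the maximum precision at any higher recall
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """Equation (23): mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

# IoU thresholds used for mAP@0.5:0.95 (0.50, 0.55, ..., 0.95)
IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)
```

Under this convention, mAP@0.5:0.95 is obtained by computing the mAP once per threshold in `IOU_THRESHOLDS` and averaging the ten values.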

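The Matthews correlation coefficient (MCC), reported in Section 5.6.3, is computed from the confusion-matrix counts as follows:

MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}}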

(24)
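Equation (24) translates directly into code. In the sketch below, the function name and the convention of returning 0.0 when the denominator vanishes are assumptions.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Assumption: returns 0.0 when any marginal total is zero, which
    would otherwise make the denominator vanish.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The value ranges from −1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction correct).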

4.4. Training Configuration

The performance of the developed model can be enhanced by using multiple approaches and settings during training. In our experiments, we first applied the pre-trained YOLOv12 model in all its versions (n, s, m, l, and x), using various image sizes ranging from 128 px to 1024 px. This approach helps the model learn tumor recognition from multi-scale feature information, ultimately improving its ability to detect tumors. Table 3 presents the computational setup used during model training on a high-performance workstation, which efficiently handles large matrix operations.

4.5. Hyperparameters

We focused on optimizing the parameters that most strongly affect model training in order to increase predictive performance. These hyperparameters were selected through experimental tuning based on iterative trials. In particular, key settings such as image size, batch size, optimizer, learning rate, weight decay, momentum, patience, and the number of workers were optimized and evaluated to achieve an appropriate balance between predictive performance, training stability, and computational efficiency. Table 4 provides a detailed overview of the model hyperparameter configuration.

4.6. Model Training

In DL model training, both overfitting and underfitting commonly affect a model’s capability and performance. To tackle these issues, we implemented preprocessing approaches, such as data augmentation, to improve the model’s overall performance. Figure 12 presents the entire workflow for model training, starting with data acquisition and augmentation of the Br35H dataset, followed by dataset splitting, and then training the Swin–YOLOv12 model, which ultimately classifies MRI scans as tumorous or healthy. According to Ultralytics recommendations for YOLOv12 training, each model was trained for 100 epochs. We first evaluated the model’s performance over 50 epochs, and training was stopped early if no meaningful improvement was observed. We then increased the training epochs to 100 and optimized the performance. In the present study, we tested both the AdamW [75] and Stochastic Gradient Descent (SGD) [76] optimizers. To better explain the model training pipeline, the pseudo-code is provided below (Algorithm 2).

Algorithm 2. Pseudo-code training pipeline

Input: Br35H dataset D, epochs E = 100, early-check epoch Esp = 50, optimizers OP = {AdamW, SGD}

Output: Trained Swin–YOLOv12 model M* and final classification results

Begin

1: Load Br35H dataset D

2: Apply preprocessing and data augmentation

3: Split D into training, validation, and test sets

4: For each optimizer o in OP do

(a) Train M on the training set for up to E epochs.

(b) Validate the model after each epoch.

(c) Save best-performing weights.

(d) If no meaningful improvement is observed by epoch Esp, then stop training early.

End for

5: Select the best-performing model M*

6: Evaluate M* on the test set

7: Classify MRI scans as Tumor or Healthy

8: Return M* and results

End
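The control flow of Algorithm 2 can be sketched in plain Python. This is an illustrative skeleton under stated assumptions: Esp is interpreted as a patience window (training stops once 50 epochs pass without a validation improvement larger than `min_delta`), real training and validation are replaced by a callable that returns a validation score per epoch, and all names and the toy score curves are hypothetical.

```python
def train_one(score_at_epoch, max_epochs=100, patience=50, min_delta=1e-4):
    """Run up to max_epochs and stop early when validation stalls.

    score_at_epoch: callable mapping an epoch number to a validation score;
    it stands in for one epoch of training followed by validation.
    Returns (best_score, best_epoch, last_epoch_run).
    """
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        score = score_at_epoch(epoch)
        if score > best_score + min_delta:
            # Corresponds to "save best-performing weights"
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_score, best_epoch, epoch
    return best_score, best_epoch, max_epochs

def select_best(runs):
    """Pick the optimizer name whose run reached the highest best_score."""
    return max(runs, key=lambda name: runs[name][0])

# Hypothetical validation curves for two optimizers
runs = {
    "AdamW": train_one(lambda e: min(e, 10) / 10),  # plateaus after epoch 10
    "SGD": train_one(lambda e: 0.9 * e / 100),      # improves slowly throughout
}
print(runs["AdamW"])  # (1.0, 10, 60)
```

In this toy run, the first curve stops at epoch 60 because no improvement occurred for 50 epochs, while the second runs for the full 100 epochs.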

The proposed model implements a combined FPN + PANet neck because it provides stronger multi-scale feature fusion than alternative single-path feature aggregation methods. Moreover, FPN enhances the top-down flow of high-level semantic features, while PANet enhances bottom-up propagation of low-level spatial information. Their combination enables the network to maintain both contextual and detailed features, which is particularly significant in brain MRI analysis, where tumor regions vary in size, shape, and boundary clarity. As a result, FPN + PANet offers more robust feature representation and improved performance compared with simpler neck designs. This integrated neck structure promotes more effective bidirectional information exchange between feature levels. It ensures that semantically rich features are delivered to high-resolution maps while detailed localization cues are preserved through bottom-up reinforcement. Such a design is particularly beneficial for medical images, where lesion appearance may vary significantly across patients and imaging conditions.

Furthermore, the proposed neck architecture strengthens cross-scale dependency learning by integrating semantically enriched and spatially refined features in a unified framework. This characteristic is highly valuable for medical imaging applications, where abnormalities may differ widely in size, texture, and edge visibility. Consequently, the FPN + PANet combination improves the discriminative capability of the network and contributes significantly to robust multi-scale brain tumor detection.

5. Results and Discussion

This section describes the experimental results and performance evaluation of two approaches: YOLOv12, a CNN-based model, and Swin–YOLO, a hybrid ViT framework. A comparative analysis focuses on the advantages and drawbacks of each model in terms of accuracy, stability, and training observations.

5.1. YOLOv12 Metric Configuration

Table 5 shows the performance of the YOLOv12 model in single-class BT detection. The precision (97.2%), recall (97.9%), and F1-score (97.6%) illustrate that the model attains high accuracy in identifying small tumor regions. Furthermore, the mAP50 is 98.8%, which further confirms the model’s ability to localize tumors correctly, and the mAP50:95 score, averaged over the standard IoU thresholds, is 77.6%. Overall, the results demonstrate that YOLOv12 has a stable and accurate tumor detection performance for the single class.

Table 6 shows the class-wise results for YOLOv12. For the tumor class, the model attains its highest precision of 98.4%, with an F1-score of 97.3% and a recall of 96.4%, and strong localization accuracy is reflected in the mAP50 (98.9%) and mAP50:95 (79.2%) values. For the healthy class, the model reached a nearly perfect recall of 99.7% and a precision of 95.1%, leading to an F1-score of 97.4%, supported by an mAP50 of 98.4% and an mAP50:95 of 79.2%. Figure 13 presents a graphical visualization of the mAP50 and F1-scores for both classes of YOLOv12 models. Figure 14 presents the YOLOv12 tumor detection results. Table 7 and Table 8 compare the results with those of other CNNs and YOLO variants to clarify the baseline for YOLOv12.

5.2. Performance Analysis of the Proposed Method Compared to the Baseline Model

This subsection compares the evaluation results of YOLOv12 and the Swin–YOLO model. Figure 15 shows that the proposed method consistently outperformed the baseline model on all evaluation metrics. In particular, it achieved a precision of 98.9% and a recall of 99.4%, outperforming the base model, which attained 97.2% and 97.9%, respectively. Moreover, the F1-score of the Swin–YOLO model, which accounts for both FNs and FPs, is higher: 99.2%. The proposed framework also exhibits higher mAP50 (99.4%) and mAP50:95 scores, with improved stability across IoU thresholds compared to the YOLOv12 model, which obtained 87.82% and 77.9%, respectively. Figure 16 and Figure 17 show a strong regularization effect and reduced overfitting, as evidenced by consistent training and validation accuracy (99.7%) and loss (0.107%) trends. Although the proposed model achieved 99.7% accuracy on the Br35H dataset, this result should be interpreted within the current experimental setting. Table 9 presents a comparison of the results of the proposed method with those of other hybrid CNNs.

Figure 18 shows augmented BT detection results, and Table 10 and Table 11 present the comparison of the results of the proposed method with those of other YOLO and ViT models.

5.3. Layer Structure of Proposed Method

Table 12 describes the detailed layer structure of the proposed architecture. In the Swin backbone, the first layer has a from index of −1 and is a timm.models module. It has around 27.5 million parameters, with a patch size of 16 and a window size of 7. Layers 0–2 are ConvBNAct modules with the following dimensions: [786, 256, 1], [384, 256, 1], and [192, 256, 1]. In the neck (top-down), the model focuses more on essential tumor regions through layers 3, 5, 7, and 9, which include two AreaAttention2 modules with 8-head attention over 8 × 8 windows. This part uses approximately 5.9 million parameters to model long-range dependencies for multiple BT sizes. In the neck (bottom-up), two more RELANBlock modules are utilized in layers 4, 6, 8, and 10. Inside this block, a channel expansion ratio of 0.5 and 0.15 million parameters are used to make the neck segments more efficient, lightweight, and generalizable. Layers 11–13 consist of three ConvBNAct modules. Finally, YOLOv12Head is the last detection module; layers 14 to 16 take 256-channel features from the neck. Figure 19, Figure 20, Figure 21, Figure 22, Figure 23 and Figure 24 illustrate the F1-score, precision, recall, specificity, mAP@0.5, and mAP@50:95 results of the proposed method.

5.4. Confusion Matrix and Detection Summary for Swin–YOLO

Figure 25 presents a confusion matrix in which the vertical axis represents the predicted labels (No Tumor or Tumor) and the horizontal axis represents the true labels. For the No Tumor cases, 95% were correctly predicted, giving a high True-Negative Rate (TNR), and the remaining 5% were misclassified as Tumor (False-Positive Rate, FPR). Similarly, for the Tumor cases, the model correctly detected 99% (True-Positive Rate, TPR), with only 1% missed (False-Negative Rate, FNR). To support consistent evaluation, the overall class distribution of the dataset is nearly balanced, with 51% Tumor and 49% No Tumor cases. Figure 26 displays the detection summary of the model’s performance for the two classes: Class 0, No Tumor, and Class 1, Tumor. For Class 0, the number of correctly detected cases (TPs) is 566, with 103 FPs and 64 missed cases (FNs), indicating moderate reliability. For Class 1, the model correctly identified 557 images as TPs, while generating 293 FPs and 73 FNs, showing weaker performance than for Class 0.

5.5. Receiver Operating Characteristic Curve (ROC)

The ROC curve is a statistical tool for evaluating a model’s classification ability; it illustrates the relationship between the TPR and FPR across different decision thresholds. The Area Under the Curve (AUC) summarizes the model’s overall discriminative performance, where a value of 1.0 indicates ideal classification and 0.5 corresponds to random guessing, as presented in Figure 27. For the YOLOv12 model, the ROC curve yields an AUC of 0.949, indicating reliable discrimination between abnormal and healthy brain regions. In comparison, the Swin–YOLO model attained a higher AUC of 0.987, demonstrating a superior BT diagnosis ability. This higher AUC is attributed to the integration of the ST into the YOLO model, which enhances local feature extraction through its hierarchical architecture and shifted-window attention mechanism. Furthermore, this combination helps the model to capture tumor boundaries as well as broader spatial relationships within MRI images, and it shows good potential for clinical BT detection compared to traditional CNN-based approaches.
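As a minimal sketch of how an ROC curve and its AUC are obtained from labels and confidence scores, the code below sweeps the decision threshold over the sorted scores and integrates with the trapezoidal rule. The function names are hypothetical, and the sketch assumes that both classes are present and that no two samples share the same score.

```python
import numpy as np

def roc_curve_points(y_true, y_score):
    """Return (fpr, tpr) arrays obtained by lowering the threshold step by step.

    Assumptions: labels are 0/1, both classes occur, and scores are distinct.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")  # highest score first
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted == 1)  # true positives accepted so far
    fps = np.cumsum(y_sorted == 0)  # false positives accepted so far
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

With distinct scores, the resulting AUC equals the fraction of (tumor, healthy) pairs in which the tumor sample receives the higher score.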

5.6. Ablation Study

In this section, we further analyze the contribution of each major component of the proposed framework for brain tumor (BT) detection. To clearly demonstrate the effectiveness of the architecture, the ablation study compares three settings: YOLO-only, Swin Transformer-only, and the proposed hybrid Swin–YOLO model, as summarized in Table 13. The comparison is based on inference time, GFLOPs, mAP@50, and mAP@50:95. Among the YOLOv12-only models, the computational complexity increases progressively from YOLOv12-N (6.3 GFLOPs) to YOLOv12-X (198.5 GFLOPs). This increase in model capacity leads to gradual performance improvement, where mAP@50 rises from 97.1% in YOLOv12-N to 98.9% in YOLOv12-X, while mAP@50:95 improves from 73.0% to 79.5%. These results confirm that the YOLO-only backbone provides strong detection capability with relatively efficient inference. To isolate the effect of the transformer branch, we also evaluated a Swin Transformer-only configuration. This model achieved a 99.1% mAP@50 and an 81.0% mAP@50:95, showing that the Swin Transformer is highly effective in capturing discriminative global contextual information in brain MRI images. However, this improvement comes with an increased computational cost, requiring 211.1 GFLOPs and a 3.5 ms inference time. Although the Swin-only model outperformed all YOLO-only variants in detection accuracy, its localization precision remained lower than that of the final hybrid model.

Finally, the proposed Swin–YOLO hybrid model achieved the best overall performance, with a 99.4% mAP@50 and an 87.2% mAP@50:95, as described in Figure 21 and Figure 22, indicating that integrating the Swin Transformer with YOLOv12 allows the model to benefit from both global contextual feature learning and efficient multi-scale object localization. Although the hybrid model has the highest computational demand (730 GFLOPs) and inference time (25.0 ms), it provides the most accurate and robust detection results. Therefore, the ablation study confirms that the performance gain is not due to either the CNN-based detector or the transformer alone, but rather their effective hybridization. Figure 28 and Figure 29 describe the ST mAP@0.5 and mAP@50:95 results, and Figure 30 shows the results for BT detection of tumor regions with a bounding box.

5.6.1. Quantitative Contribution of the Swin Transformer Backbone and Hybrid Neck

We performed an ablation study to systematically investigate the key components of the proposed architecture, including the ST backbone and the hybrid FPN + PANet neck, to evaluate their individual contributions. The baseline YOLOv12 model was first used to represent the original CNN-based detector, after which the ST backbone was incorporated and analyzed to assess the effect of transformer-based global contextual learning. Finally, the proposed Swin–YOLO model, which integrates the Swin Transformer backbone with a hybrid FPN + PANet neck, was validated. These ablation results indicate that the ST backbone improves global feature representation and captures long-range dependencies in brain MRI images, while the hybrid FPN + PANet neck enhances multi-scale feature aggregation and preserves localization-sensitive information essential for accurate tumor boundary detection. These findings confirm that the performance improvement of the proposed model results from the combined contribution of the transformer backbone and the hybrid neck design, rather than from backbone substitution alone.

5.6.2. Performance Comparison of Swin–YOLO Variants

Table 14 compares the performance of four Swin–YOLO versions (Base, Tiny, Small, and Large) for BT detection using standard metrics: mAP50, mAP50:95, recall, precision, and F1-score. Among all the models, the Swin–YOLO Small model achieved the best results, with an mAP50 of 99.4%, an mAP50:95 of 87.2%, a recall of 99.7%, a precision of 98.9%, and an F1-score of 99.2%, demonstrating the best balance between computational efficiency and model accuracy. The Large model also performed well on all metrics; its mAP50 and recall are also very good, being close to the Small variant results (99.2% and 99.4%). Overall, these results confirm that the Swin–YOLO architecture attains reliability, making it a strong and flexible option compared to the baseline framework.

5.6.3. Matthews Correlation Coefficient (MCC)

MCC was used to measure how well the predictions of the baseline and proposed models correlate with the actual tumor classification results. It combines the TP, TN, FP, and FN counts, producing a reliable evaluation metric for binary classification problems such as tumor vs. non-tumor detection. In Figure 31, the MCC value achieved by the Swin–YOLO framework (0.981) is higher than the YOLOv12 value of 0.951. This improvement indicates a more stable correlation between predicted and true classes with fewer misclassifications. Moreover, these improvements can be associated with the implementation of the ST hierarchical attention mechanism, which allows the model to capture fine-grained tumor boundaries and contextual information in MRI scans. Overall, the Swin–YOLO model presents superior reliability, confirming that the integration of transformer-based features significantly improves the predictive stability and accuracy of BT detection compared to the YOLOv12 framework.

6. Explainable AI (XAI)

XAI plays a key role in the detection of small BT regions because it provides visual explanations of a model’s decision process. Moreover, it helps radiologists to discern fine patterns and subtle details, allowing them to detect abnormalities more precisely. Many DL models are used for diagnosing BTs, but they are often treated as “black boxes,” which limits their acceptance in medical practice. In contrast, XAI visualization methods such as Grad-CAM, SHAP, and LIME improve model transparency by allowing visual inspection of the tumor features that support the decision-making procedure in patient healthcare. In addition, these highlighted regions help establish the reliability of predictions for medical applications, supporting a strong connection between AI-driven analysis and real-world diagnostic practice. Figure 32 displays how XAI enhances tumor detection by presenting a comprehensive model.

6.1. Contribution of the XAI Module

We included the XAI module as an interpretability component composed of Grad-CAM and SHAP to further analyze the contribution of the proposed framework. Unlike the ST backbone and the FPN + PANet hybrid neck, the XAI module is not a trainable part of the detector and therefore does not directly improve numerical detection metrics. Its contribution lies in enhancing model transparency by showing whether the trained Swin–YOLO model focuses on clinically meaningful tumor regions during prediction. In this study, Grad-CAM highlights the important spatial regions responsible for the decision, while SHAP explains the contribution of individual features. Thus, the XAI module strengthens the practical and clinical relevance of the proposed framework by improving interpretability and trustworthiness in brain tumor MRI analysis.

6.2. Gradient-Weighted Class Activation Mapping (Grad-CAM)

In BT detection, Grad-CAM acts as a visualization tool that generates heatmaps highlighting the most important small tumor regions rather than irrelevant parts such as background or normal tissue, thereby enhancing reliability in clinical applications. Moreover, it helps doctors to interpret visual AI results and to link machine predictions with medical expertise. Figure 33 illustrates how Grad-CAM visualization in BT workflows supports XAI systems that combine diagnostic accuracy with descriptive clarity. Furthermore, it helps a radiologist to focus on cancerous tumor regions, thereby supporting clinical decision-making and the analysis of critical diagnostic errors during the testing phase. In addition, this feedback helps verify that the AI model’s evaluation remains consistent with human radiological decisions. The Grad-CAM equation and algorithm (Algorithm 3) are described below.

L_{Grad\text{-}CAM}^{c} = \mathrm{ReLU}\left(\sum_{k} α_{k}^{c} A^{k}\right)

(25)

where L_{Grad\text{-}CAM}^{c} denotes the class-discriminative localization map for class c, A^{k} represents the k-th activation map extracted from the final convolutional layer, and α_{k}^{c} denotes the corresponding importance coefficient that measures the contribution of the feature map to the target class. The weighted summation \sum_{k} α_{k}^{c} A^{k} combines the spatial activation patterns according to their relevance to the target class. Finally, the ReLU operation suppresses negative responses and preserves only the positive evidence associated with class c, thereby generating a class-discriminative heatmap that highlights the most influential image regions in the model’s prediction.

Algorithm 3. Grad-CAM

Begin

1: Input image X and target class c

2: Perform a forward pass through the network

3: Obtain class score y^{c}

5: Extract last convolutional feature maps A^{k}

6: Compute gradients \partial y^{c} / \partial A_{i j}^{k}

7: Compute weights α_{k}^{c} by global average pooling

8: Generate a heatmap using the weighted sum of feature maps

9: Apply ReLU to keep a positive influence only

10: Resize heatmap to input image size

11: Visualize a heatmap on the input image

End
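The core arithmetic of Algorithm 3 (pooling the gradients into weights, forming the weighted sum, and applying ReLU) can be sketched with NumPy once the activations and gradients have been extracted from the network. Obtaining those arrays depends on the deep learning framework and is not shown; the function name, the (K, H, W) array layout, and the final scaling to [0, 1] for display are assumptions, and resizing to the input image size is omitted.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from precomputed arrays.

    activations: (K, H, W) feature maps A^k from the last conv layer.
    gradients:   (K, H, W) gradients of the class score w.r.t. A^k.
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    # alpha_k^c: global average pooling of the gradients over H and W
    alpha = gradients.mean(axis=(1, 2))
    # Weighted sum of feature maps over the channel index k
    cam = np.tensordot(alpha, activations, axes=1)
    # ReLU keeps only positive evidence for the target class
    cam = np.maximum(cam, 0.0)
    # Assumption: scale by the maximum for visualization
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```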

6.3. Shapley Additive Explanations (SHAP)

SHAP quantifies how each pixel affects the prediction, which makes the model’s decisions on small tumor regions easier to interpret. Figure 34 presents a color-coded heatmap visualizing how different regions of an MRI scan affect model prediction. The SHAP values show consistent patterns across multiple MRI samples, further reinforcing the reliability of the model’s attention. The red areas display the positive SHAP values, showing the tumor regions that increase the model’s confidence in the presence of a tumor, while the blue areas indicate the negative SHAP values, which reduce the likelihood of a tumor prediction. The SHAP value equation and algorithm (Algorithm 4) are described below.

Φ_{i} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (M - |S| - 1)!}{M!} \left[ f(S \cup \{i\}) - f(S) \right]

(26)

where Φ_{i} denotes the SHAP value of the i-th feature, representing its contribution to the model prediction; F denotes the full set of input features; S ⊆ F∖{i} represents all possible subsets that exclude feature i; the term |S| indicates the number of features in subset S; and M is the total number of input features. The weighting factor \frac{|S|! \, (M - |S| - 1)!}{M!} assigns a fair importance to each subset based on its size. Further, f(S) denotes the model output for subset S, and f(S ∪ {i}) denotes the output after adding feature i to that subset. Thus, the difference [f(S ∪ {i}) − f(S)] measures the marginal contribution of feature i. By combining these weighted marginal contributions over all possible subsets, the SHAP value provides a reliable explanation of the influence of each feature on the final prediction.

Algorithm 4. SHAP value computation

Begin

1: Input image X and best model f

2: Determine all features F

3: For each feature i:

a. Generate all subsets S ⊆ F∖{i}

b. Compute f(S)

c. Compute f(S ∪ {i})

d. Find marginal contribution f(S ∪ {i}) − f(S)

e. Apply Shapley weighting factor

f. Sum all weighted contributions to obtain ϕi

4: Return SHAP values for all features

End
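For a small number of features, Equation (26) and Algorithm 4 can be evaluated exactly by enumerating every subset, as in the sketch below. The function name and the toy value functions are hypothetical. The enumeration grows exponentially with the number of features, so this exact form is not practical at the pixel level, where approximation methods are normally used.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, num_features):
    """Exact Shapley values by enumerating all subsets of the other features.

    value_fn: callable taking a frozenset of feature indices S and
              returning the model output f(S).
    """
    M = num_features
    phi = [0.0] * M
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            # Shapley weight |S|! (M - |S| - 1)! / M! for subsets of this size
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for subset in combinations(others, size):
                S = frozenset(subset)
                # Marginal contribution of feature i when added to S
                phi[i] += weight * (value_fn(S | {i}) - value_fn(S))
    return phi

# Toy additive model: each feature contributes a fixed amount
contrib = [1.0, 2.0, 3.0]
phi = shapley_values(lambda S: sum(contrib[j] for j in S), 3)
```

For an additive model such as this toy example, each Shapley value equals the feature's own contribution, and the values always sum to f(F) − f(∅).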

6.4. Qualitative Evaluation of Model Interpretability

To improve the transparency of the proposed framework, the XAI analysis in this study is mainly treated as a qualitative interpretability evaluation. In particular, Grad-CAM highlights the spatial regions that most strongly influence the model prediction, while SHAP explains the relative contribution of influential features to the final decision. These visual and feature-level explanations help determine whether the trained Swin–YOLO model focuses on clinically meaningful tumor regions instead of irrelevant background information. Therefore, the XAI results provide supportive evidence of the interpretability and trustworthiness of the proposed model. However, the present study does not yet include a dedicated quantitative interpretability evaluation, which remains an important direction for future work.
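To make the idea of a quantitative interpretability check more concrete, the sketch below shows one possible option, which was not performed in this study: testing whether the peak of an attribution heatmap falls inside the annotated tumor box, and measuring what fraction of the positive attribution lies inside that box. The function names, the box convention (inclusive pixel coordinates), and the 4 × 4 heatmap are all assumptions made for illustration.

```python
def peak_location(heatmap):
    """Return (row, col) of the largest value in a 2D list heatmap."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best


def inside(box, row, col):
    """box = (x_min, y_min, x_max, y_max), inclusive pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    return x_min <= col <= x_max and y_min <= row <= y_max


def pointing_hit(heatmap, box):
    """True if the heatmap maximum lies inside the ground-truth box."""
    row, col = peak_location(heatmap)
    return inside(box, row, col)


def attribution_inside_box(heatmap, box):
    """Fraction of the positive attribution mass that falls inside the box."""
    total = 0.0
    inner = 0.0
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > 0:
                total += value
                if inside(box, r, c):
                    inner += value
    return inner / total if total > 0 else 0.0


# Hypothetical 4 x 4 attribution map and tumor box, for illustration only
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.6, 0.0],
    [0.0, 0.4, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
tumor_box = (1, 1, 2, 2)

hit = pointing_hit(heatmap, tumor_box)
fraction = attribution_inside_box(heatmap, tumor_box)
print(hit, round(fraction, 2))
```

Averaging such scores over a test set would give a single number that complements the visual inspection described above.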

6.5. NeuroVision AI Tumor Detection Application

Figure 35 presents the interface of the proposed NeuroVision AI application, where the upper panel illustrates a positive case (tumor detected) and the lower panel illustrates a negative case (no tumor detected). The application was developed as an interactive prototype to demonstrate how the best-performing trained version of the proposed Swin–YOLO model can be integrated into a user-friendly tumor detection workflow. In this system, the backend detection process is entirely performed by the best-trained proposed model, while the front-end interface is designed to enable simple image upload, prediction execution, and result visualization. The application was implemented using Streamlit, an open-source Python framework for building interactive machine learning and computer vision interfaces. From a usability perspective, the system includes an uploaded image panel, a prediction result panel, scan history, and detection statistics, along with functional buttons such as Upload MRI Scan, Detect Tumor, Clear, and Save Results. After an MRI image is uploaded, the backend model processes the image and returns the prediction result. For positive cases, the detected tumor region is highlighted using a bounding box together with the corresponding confidence score. For negative cases, the interface displays a No Tumor prediction with its confidence value. The confidence values shown in the interface are the prediction scores produced by the backend model for the displayed example cases and are included only to demonstrate the application output format. They should not be interpreted as the complete performance distribution of the system. In this study, the NeuroVision AI application is presented as a prototype decision-support interface for AI-assisted brain MRI analysis rather than as a fully validated clinical deployment tool. 
Therefore, although it demonstrates the practical usability of the proposed model in an accessible interface, formal usability testing, user-based evaluation, and real-world clinical validation remain important directions for future work.
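The upload, detect, and display flow described above can be sketched with Streamlit as follows. This is a hedged outline, not the code of the NeuroVision AI application: `summarize`, `run_app`, the detection dictionary layout, the `"tumor"`/`"no_tumor"` labels, and the 0.5 threshold are assumptions, and the `detect` callable is a placeholder for the trained Swin–YOLO model. Streamlit and Pillow are imported only inside `run_app`, so the result-formatting logic can be exercised without them.

```python
from typing import Callable, Dict, List

# Assumed detection format: {"label": str, "confidence": float, "box": tuple or None}
Detection = Dict[str, object]


def summarize(detections: List[Detection], threshold: float = 0.5) -> Dict[str, object]:
    """Reduce raw detections to the single result shown in the interface."""
    tumors = [
        d for d in detections
        if d["label"] == "tumor" and d["confidence"] >= threshold
    ]
    if tumors:
        # Positive case: report the most confident tumor box
        best = max(tumors, key=lambda d: d["confidence"])
        return {"result": "Tumor", "confidence": best["confidence"], "box": best["box"]}
    # Negative case: report the best "no_tumor" score if the detector provides one
    negatives = [d["confidence"] for d in detections if d["label"] == "no_tumor"]
    confidence = max(negatives) if negatives else None
    return {"result": "No Tumor", "confidence": confidence, "box": None}


def run_app(detect: Callable[[object], List[Detection]]) -> None:
    """Minimal Streamlit front end; start it with `streamlit run <script>.py`."""
    import streamlit as st
    from PIL import Image

    st.title("NeuroVision AI (prototype sketch)")
    uploaded = st.file_uploader("Upload MRI Scan", type=["png", "jpg", "jpeg"])
    if uploaded is not None:
        image = Image.open(uploaded)
        st.image(image, caption="Uploaded image")
        if st.button("Detect Tumor"):
            # detect() is the placeholder for the backend model call
            st.write(summarize(detect(image)))
```

Keeping `summarize` separate from the interface code means the decision logic can be unit-tested independently of the model and the web front end.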

6.6. Limitations and Future Directions

6.6.1. Limitations

Although the proposed framework achieved promising results in brain tumor MRI detection, several limitations should be acknowledged. First, the primary experiments in this study were conducted using the augmented dataset configuration, and an explicit ablation study comparing augmented and non-augmented training was not performed. As a result, the precise contribution of data augmentation to the observed performance improvements could not be fully quantified. Second, external validation, cross-validation, and testing on additional public datasets were not included in the current study, which may limit a broader assessment of the model’s robustness, generalization ability, and applicability across different datasets and clinical settings. Therefore, the findings should be interpreted with caution, and overgeneralization should be avoided. Future work will address these issues through detailed augmentation-based ablation studies, external validation, cross-validation, and broader cross-dataset evaluation to further strengthen the reliability, robustness, and clinical applicability of the proposed method.

6.6.2. Future Work

Future research will aim to further strengthen the robustness, reproducibility, interpretability, and clinical applicability of the proposed framework. Since the main experiments in the present study were conducted using the augmented dataset configuration, a dedicated ablation study comparing augmented and non-augmented training will be an important future direction to more clearly quantify the contribution of data augmentation to the overall performance. In addition, the proposed model will be evaluated on larger, more diverse, and multi-center MRI datasets, including additional public datasets such as Figshare, Kaggle-based datasets, and BraTS, to support broader external validation, cross-dataset testing, and segmentation-oriented assessment. Repeated experiments with different random seeds, 5-fold cross-validation, and stronger statistical analyses, reporting means ± standard deviations, will also be conducted to provide more reliable evidence of model stability and reproducibility. Moreover, future work will include a dedicated quantitative interpretability evaluation to more rigorously assess the explainability of the proposed method. Finally, although the current study demonstrates the practical usability of the framework through an accessible interface, formal usability testing, user-based evaluation, and real-world clinical validation remain necessary to confirm its effectiveness in practical clinical environments.
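The planned 5-fold cross-validation with mean ± standard deviation reporting could be organized along the following lines. This is a sketch using only the Python standard library: `stratified_kfold` and `mean_std` are hypothetical helper names, the label list mirrors the 1500/1500 class balance of the original Br35H images, and the per-fold scores are placeholder numbers chosen for illustration, not results from this study.

```python
import random
import statistics


def stratified_kfold(labels, k=5, seed=0):
    """Split sample indices into k folds while preserving the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets an equal share
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def mean_std(scores):
    """Mean and sample standard deviation across folds or seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


# 0 = healthy, 1 = tumor; matches the 1500/1500 balance of the original images
labels = [0] * 1500 + [1] * 1500
folds = stratified_kfold(labels, k=5, seed=42)

# Each fold serves once as the held-out set; the other four are used for training
for held_out, fold in enumerate(folds):
    train_size = sum(len(f) for j, f in enumerate(folds) if j != held_out)
    print(held_out, len(fold), train_size)

# Placeholder per-fold scores (NOT measured values), to show the reporting format
placeholder_scores = [0.90, 0.92, 0.91, 0.93, 0.89]
mean, std = mean_std(placeholder_scores)
print(f"{mean:.3f} ± {std:.3f}")
```

Repeating the same loop with several random seeds and reporting the spread would give the stability evidence that the single train/validation/test split cannot provide.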

7. Conclusions

In this study, we proposed a hybrid Swin–YOLO framework to address the limitations of the conventional CNN-based YOLOv12 model for brain tumor (BT) detection. The proposed framework integrates the efficient local feature extraction capability of YOLOv12 with the strong global contextual representation ability of the ST. To establish a clear baseline, all YOLOv12 variants (n, s, m, l, and x) were first trained and comparatively evaluated. Based on these results, the backbone and neck of the YOLOv12 architecture were redesigned to develop the proposed hybrid model, aiming to improve feature representation, localization precision, and detection robustness. Experimental results on the Br35H dataset demonstrated that the proposed model achieved strong performance, attaining an accuracy of 99.7%, a precision of 98.8%, a recall of 99.7%, an F1-score of 99.2%, an mAP@50 of 99.4%, and an mAP@50:95 of 87.2%. In addition, the confusion matrix, ROC analysis, MCC, training and validation curves, and detection results further confirmed the model’s effectiveness and stability. The integration of XAI approaches further enhanced transparency by identifying clinically relevant tumor regions in MRI images, while the developed NeuroVision AI application demonstrated the potential practical utility of the framework in an assistive clinical environment. However, these promising findings should be interpreted with caution, as the limited size and diversity of the Br35H dataset may restrict the generalizability of the results to broader clinical settings. In addition, the hybrid Swin–YOLO design improves computational efficiency compared with YOLO-only and Swin Transformer-only models; however, repeated-run experiments, statistical significance analysis, and cross-validation were not fully investigated in the current study. 
Future work will focus on validating the proposed model on larger, more diverse, and multi-center MRI datasets, together with more rigorous statistical analysis, to further confirm its robustness and clinical applicability.

Author Contributions

Conceptualization, M.T.; Methodology, M.T.; Writing—original draft, M.T.; Writing—review and editing, K.C.; Supervision, K.C.; Project administration, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare of Korea (RS-2025-02220492), and by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (RS-2025-16069081).

Data Availability Statement

The datasets analyzed during the current study are available at https://www.kaggle.com/datasets/ahmedhamada0/brain-tumor-detection (accessed on 2 February 2026). The running code and application are available at https://github.com/Mubashir-Tariq/NeoroVision-AI-Tomor-detection-application (accessed on 1 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. van den Bent, M.J.; Geurts, M.; French, P.J.; Smits, M.; Capper, D.; Bromberg, J.E.C.; Chang, S.M. Primary brain tumours in adults. Lancet 2023, 402, 1564–1579.
2. Lapointe, S.; Perry, A.; Butowski, N.A. Primary brain tumours in adults. Lancet 2018, 392, 432–446.
3. Elshaikh, B.G.; Garelnabi, M.; Omer, H.; Sulieman, A.; Habeeballa, B. Recognition of brain tumors in MRI images using texture analysis. Saudi J. Biol. Sci. 2021, 28, 2381–2387.
4. Akinyelu, A.A.; Zaccagna, F.; Grist, J.T.; Castelli, M.; Rundo, L. Brain tumor diagnosis using machine learning, convolutional neural networks, capsule neural networks, and vision transformers applied to MRI: A survey. J. Imaging 2022, 8, 205.
5. Bray, F.; Laversanne, M.; Sung, H.; Ferlay, J.; Siegel, R.L.; Soerjomataram, I.; Jemal, A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2024, 74, 229–263.
6. Sharif, M.I.; Li, J.P.; Khan, M.A.; Kadry, S.; Tariq, U. M3BTCNet: Multi-model brain tumor classification using metaheuristic deep neural network feature optimisation. Neural Comput. Appl. 2024, 36, 95–110.
7. Tandel, G.S.; Biswas, M.; Kakde, O.G.; Tiwari, A.; Suri, H.S.; Turk, M.; Laird, J.R.; Asare, C.K.; Ankrah, A.A.; Khanna, N.N.; et al. A review on a deep learning perspective in brain cancer classification. Cancers 2019, 11, 111.
8. Shanmuga Priya, S.; Saran Raj, S.; Surendiran, B.; Arulmurugaselvi, N. Brain tumor detection in MRI using deep learning. In Proceedings of the International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA), Singapore, 2020; Springer: Singapore, 2020; pp. 395–403.
9. Khan, M.F.; Iftikhar, A.; Anwar, H.; Ramay, S.A. Brain tumor segmentation and classification using optimised deep learning. J. Comput. Biomed. Inform. 2024, 7, 632–640.
10. Gull, S.; Akbar, S.; Khan, H.U. Automated detection of brain tumor through magnetic resonance images using a convolutional neural network. Biomed. Res. Int. 2021, 2021, 3365043.
11. Dehghan, S.; Rabiei, R.; Choobineh, H.; Maghooli, K.; Nazari, M.; Vahidi-Asl, M. Comparative study of machine learning approaches integrated with a genetic algorithm for IVF success prediction. PLoS ONE 2024, 19, e0310829.
12. Soomro, T.A.; Zheng, L.; Afifi, A.J.; Ali, A.; Soomro, S.; Yin, M.; Gao, J. Image segmentation for MR brain tumor detection using machine learning: A review. IEEE Rev. Biomed. Eng. 2023, 16, 70–90.
13. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
14. Iqbal, S.; Qureshi, A.N.; Ullah, A.; Li, J.; Mahmood, T. Improving the robustness and quality of biomedical CNN models through adaptive hyperparameter tuning. Appl. Sci. 2022, 12, 11870.
15. Coşkun, D.; Karaboğa, D.; Baştürk, A.; Akay, B.; Nalbantoğlu, Ö.U.; Doğan, S.; Paçal, I.; Karagöz, M.A. A comparative study of YOLO models and a transformer-based YOLOv5 model for mass detection in mammograms. Turk. J. Electr. Eng. Comput. Sci. 2023, 31, 1294–1313.
16. Fujita, H. AI-based computer-aided diagnosis (AI-CAD): The latest review to read first. Radiol. Phys. Technol. 2020, 13, 6–19.
17. Amin, J.; Sharif, M.; Haldorai, A.; Yasmin, M.; Nayak, R.S. Brain tumor detection and classification using machine learning: A comprehensive survey. Complex Intell. Syst. 2022, 8, 3161–3183.
18. Taher, F.; Shoaib, M.R.; Emara, H.M.; Abdelwahab, K.M.; El-Samie, F.E.A.; Haweel, M.T. Efficient framework for brain tumor detection using different deep learning techniques. Front. Public Health 2022, 10, 959667.
19. Karaman, A.; Pacal, I.; Basturk, A.; Akay, B.; Nalbantoglu, U.; Coskun, S.; Sahin, O.; Karaboga, D. Robust real-time polyp detection system design based on YOLO algorithms by optimising activation functions and hyper-parameters with artificial bee colony (ABC). Expert Syst. Appl. 2023, 224, 119741.
20. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; Volume 28.
22. Tariq, M.; Choi, K. YOLO11-driven deep learning approach for enhanced detection and visualisation of wrist fractures in X-ray images. Mathematics 2025, 13, 1419.
23. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
24. Celard, P.; Iglesias, E.L.; Sorribes-Fdez, J.M.; Romero, R.; Vieira, A.S.; Borrajo, L. A survey on deep learning applied to medical images: From simple artificial neural networks to generative models. Neural Comput. Appl. 2023, 35, 2291–2323.
25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 10012–10022.
26. Zhao, K.; Lu, R.; Wang, S.; Yang, X.; Li, Q.; Fan, J. ST-YOLOA: A Swin-transformer-based YOLO model with an attention mechanism for SAR ship detection under a complex background. Front. Neurorobot. 2023, 17, 1170163.
27. Cakmak, Y.; Pacal, I. Comparative analysis of transformer architectures for brain tumor classification. Explor. Med. 2025, 6, 1001377.
28. Pacal, I. A novel Swin transformer approach utilizing a residual multi-layer perceptron for diagnosing brain tumors in MRI images. Int. J. Mach. Learn. Cybern. 2024, 15, 3579–3597.
29. Kumar, S.S.; Kriplani, K.; Riadhusin, R.; Sahoo, N.P.; Srinivas, V.; Salaman, Z.N.; Meqdad, M.N.; Abushraida, A.A.J. Swin Transformer Architecture for Accurate Brain Tumor Classification and Localization in MRI-Based Medical Diagnosis. In Proceedings of the 2025 3rd International Conference on Cyber Resilience (ICCR), Dubai, United Arab Emirates, 3–4 July 2025; pp. 1–7.
30. Bhuvaneswari, E.; Ranjath, G.A.; Steni Dev, S.A.; Vishva, V.; Valan, R.F.; Harish, V. Advanced Brain Tumour Detection Using YOLOv11 with Swin Transformer and Transfer Learning. In International Conference on Web Intelligence and Human-Machine Interaction; Springer Nature: Singapore, 2025; pp. 367–389.
31. Sharif, M.I.; Khan, M.A.; Alhussein, M.; Aurangzeb, K.; Raza, M. A decision support system for multimodal brain tumor classification using deep learning. Complex Intell. Syst. 2022, 8, 3007–3020.
32. Greenacre, M.; Groenen, P.J.; Hastie, T.; d’Enza, A.I.; Markos, A.; Tuzhilina, E. Principal component analysis. Nat. Rev. Methods Primers 2022, 2, 100.
33. Nazir, M.; Shakil, S.; Khurshid, K. Role of deep learning in brain tumor detection and classification (2015–2020): A review. Comput. Med. Imaging Graph. 2021, 91, 101940.
34. Isensee, F.; Jäger, P.F.; Full, P.M.; Vollmuth, P.; Maier-Hein, K.H. nnU-Net for brain tumor segmentation. In Proceedings of the International Workshop on Brain Lesion (BrainLes), Lima, Peru, 4 October 2020; Springer: Cham, Switzerland, 2021; pp. 118–132.
35. Liu, Z.; Tong, L.; Chen, L.; Jiang, Z.; Zhou, F.; Zhang, Q.; Zhang, X.; Jin, Y.; Zhou, H. Deep learning-based brain tumor segmentation: A survey. Complex Intell. Syst. 2023, 9, 1001–1026.
36. Kumar, S.; Dhir, R.; Chaurasia, N. Brain tumor detection analysis using CNN: A review. In Proceedings of the International Conference on Artificial Intelligence and Smart Systems (ICAIS); IEEE: Piscataway, NJ, USA, 2021.
37. Tripathi, P.C.; Bag, S. An attention-guided CNN framework for segmentation and grading of glioma using 3D MRI scans. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 1890–1904.
38. Özyurt, F.; Sert, E.; Avci, E.; Dogantekin, E. Brain tumor detection based on a convolutional neural network with neutrosophic expert maximum fuzzy sure entropy. Measurement 2019, 147, 106830.
39. Ullah, N.; Khan, J.A.; Khan, M.S.; Khan, W.; Hassan, I.; Obayya, M.; Negm, N.; Salama, A.S. An effective approach to detect and identify brain tumors using transfer learning. Appl. Sci. 2022, 12, 5645.
40. Shahin, A.I.; Aly, W.; Aly, S. MBTFCN: A novel modular fully convolutional network for MRI brain tumor multi-classification. Expert Syst. Appl. 2023, 212, 118776.
41. Swati, Z.N.K.; Zhao, Q.; Kabir, M.; Ali, F.; Ali, Z.; Ahmed, S.; Lu, J. Brain tumor classification for MR images using transfer learning and fine-tuning. Comput. Med. Imaging Graph. 2019, 75, 34–46.
42. Lu, S.; Lu, Z.; Zhang, Y.-D. Pathological brain detection based on AlexNet and transfer learning. J. Comput. Sci. 2019, 30, 41–47.
43. Kumar, R.L.; Kakarla, J.; Isunuri, B.V.; Singh, M. Multi-class brain tumor classification using residual network and global average pooling. Multimed. Tools Appl. 2021, 80, 13429–13438.
44. Goyal, B.; Hans, R.; Sharma, S.K.; Singh, H. Deep Learning Based Brain Tumor Diagnosis with Pre-Trained and Self-Attention Based Models Using MRI Scans: A Systematic Literature Review. In Archives of Computational Methods in Engineering; Springer Nature: Berlin/Heidelberg, Germany, 2026; pp. 1–49.
45. Arockia Selvarathinam, A.L.X.R.; Lilhore, U.K.; Alroobaea, R.; Alsafyani, M.; Baqasah, A.M.; Algarni, S.; Khan, M. MM FD ConvFormer: Multimodal Frequency Aware Deformable CNN Transformer Network for Robust Brain Tumor Classification. Sci. Rep. 2026, 16, 12669.
46. Gupta, S.; Abd Aziz, A. Advancing Brain Tumor Classification through Pre-Trained Transformer and Transfer Learning Models. Frankl. Open 2026, 14, 100493.
47. Singh, A.; Shrivastava, R.K.; Srivastava, A. Efficient and Compressed Deep Learning Model for Brain Tumour Classification with Explainable AI for Smart Healthcare and Information Communication Systems. Expert Syst. 2025, 42, e13770.
48. Aamir, M.; Rahman, Z.; Choudhry, N.; Bhutto, J.A.; Abro, W.A.; Zhu, Z. From CNNs to Transformers: A Review of Evolving Deep Learning Architectures for Brain Tumor Classification. IEEE Access 2025, 13, 184918–184936.
49. Raju, N.; Srinivas, K.; Rajesh, C.; Chintakindi, B.M. Advanced Brain Tumor Detection Using YOLO-β11 in MRI Images. Alex. Eng. J. 2025, 132, 181–190.
50. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2014; pp. 740–755.
51. Hu, H.; Li, X.; Yao, W.; Yao, Z. Brain tumor diagnosis applying CNN through MRI. In Proceedings of the International Conference on Artificial Intelligence and Computer Engineering (ICAICE); IEEE: Piscataway, NJ, USA, 2021; pp. 430–434.
52. Huang, X.; Wang, X.; Lv, W.; Bai, X.; Long, X.; Deng, K.; Dang, Q.; Han, S.; Liu, Q.; Hu, X.; et al. PP-YOLOv2: A practical object detector. arXiv 2021, arXiv:2104.10419.
53. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
54. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
55. Li, D.; Wang, J.; Zhang, Z.; Dai, B.; Zhao, K.; Shen, W.; Yin, Y.; Li, Y. Cow-YOLO: Automatic cow mounting detection based on non-local CSPDarknet53 and multiscale neck. Int. J. Agric. Biol. Eng. 2024, 17, 193–202.
56. Piao, Y.; Jiang, Y.; Zhang, M.; Wang, J.; Lu, H. PANet: Patch-aware network for light field salient object detection. IEEE Trans. Cybern. 2023, 53, 379–391.
57. Jocher, G. Ultralytics YOLOv5. GitHub Repository 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 23 November 2022).
58. Srividya, A.; Raparthi, N.; Thouti, S.; Kumar, P.V.; Goddu, J.; Athiraja, A. Enhancing brain tumor detection: YOLOv6 algorithm for accurate identification in MRI scans. In Proceedings of the International Conference on Electrical, Electronics, Information and Communication Technology (ICEEICT); IEEE: Piscataway, NJ, USA, 2024; pp. 1–7.
59. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475.
60. Jocher, G. Ultralytics YOLOv8. GitHub Repository 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 January 2023).
61. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2024; pp. 1–21.
62. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
63. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
64. Zahid, U.; Ashraf, I.; Khan, M.A.; Alhaisoni, M.; Yahya, K.M.; Hussein, H.S.; Alshazly, H. BrainNet: Optimal deep learning feature fusion for brain tumor classification. J. Comput. Eng. 2022, 2022, 1465173.
65. Bashkandi, A.H.; Sadoughi, K.; Aflaki, F.; Alkhazaleh, H.A.; Mohammadi, H.; Jimenez, G. Combination of political optimiser, particle swarm optimiser, and convolutional neural network for brain tumor detection. Biomed. Signal Process. Control 2023, 81, 104434.
66. Vo, D.-D.; Ngo, B.-V.; Nguyen, T.-D.; Nguyen, T.-N. A convolutional neural network with the VGG-16 model for classifying human brain tumors. In Proceedings of the International Conference on Green Technology and Sustainable Development (GTSD); IEEE: Piscataway, NJ, USA, 2022; pp. 714–719.
67. Sinha, A.; Rai, R.; Kumar, A.; Varma, S.K.; Sen, S. Explainable-AI-based model for brain tumor detection. J. Comput. Sci. Eng. 2023, 12.
68. Nazir, M.I.; Akter, A.; Wadud, M.A.H.; Uddin, M.A. Utilising customised CNN for brain tumor prediction with explainable AI. J. Comput. Sci. Eng. 2024, 10, e38997.
69. Khushi, H.M.T.; Masood, T.; Jaffar, A.; Akram, S.; Bhatti, S.M. Performance analysis of state-of-the-art CNN architectures for brain tumor detection. Int. J. Imaging Syst. Technol. 2024, 34, e22949.
70. Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv 2023, arXiv:2307.08691.
71. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone that can enhance the learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
72. Sapkota, R.; Flores-Calero, M.; Qureshi, R.; Badgujar, C.; Nepal, U.; Poulose, A.; Zeno, P.; Vaddevolu, U.B.P.; Khan, S.; Shoman, M.; et al. YOLO advances to its genesis: A decadal and comprehensive review of the You Only Look Once (YOLO) series. Artif. Intell. Rev. 2025, 58, 274.
73. Zhang, J.; Peng, X.L. YOLOv11-LC: Accurate Detection of Benign and Malignant Endometrial Lesions Using LCA-Enhanced YOLOv11 with CMA-C3k2. Signal Image Video Process. 2026, 20, 15.
74. Br35H: Brain Tumor Detection 2020. Kaggle 2020. Available online: https://www.kaggle.com/datasets/ahmedhamada0/brain-tumor-detection (accessed on 2 February 2026).
75. Zhou, P.; Xie, X.; Lin, Z.; Yan, S. Towards understanding convergence and generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6486–6493.
76. Kabir, F.; Siddique, S.; Kotwal, M.R.A.; Huda, M.N. Bangla text document categorization using stochastic gradient descent (SGD) classifier. In Proceedings of the International Conference on Cognitive Computing and Information Processing (CCIP); IEEE: Piscataway, NJ, USA, 2015; pp. 1–4.
77. Jyoti, A.; Kumar, A.; Sachar, S. Comparative analysis of transfer learning models for brain tumor detection. Procedia Comput. Sci. 2025, 258, 2415–2424.
78. Islam, N.; Azam, S.; Islam, S.; Kanchan, M.H.; Parvez, A.S.; Islam, M. An improved deep learning-based hybrid model with ensemble techniques for brain tumor detection from MRI images. Inform. Med. Unlocked 2024, 47, 101483.
79. Hekmat, A.; Zuping, Z.; Bilal, O.; Khan, S.U.R. Differential evolution-driven optimised ensemble network for brain tumor detection. Int. J. Mach. Learn. Cybern. 2025, 16, 6447–6472.
80. Rasheed, Z.; Ma, Y.-K.; Bharany, S.; Shandilya, G.; Ullah, I.; Ali, F. Classification of MRI brain tumor with hybrid VGG19 and ensemble classifier approach. In Proceedings of the International Conference on Innovative Communication, Electrical and Computer Engineering (ICICEC); IEEE: Piscataway, NJ, USA, 2024; pp. 1–7.
81. Dip, S.R.; Meena, H.K. Enhanced brain tumor identification using graph Fourier transform. Comput. Electr. Eng. 2025, 127, 110575.
82. Sadr, H.; Nazari, M.; Yousefzadeh-Chabok, S.; Emami, H.; Rabiei, R.; Ashraf, A. Enhancing brain tumor classification in MRI images: A deep learning-based approach for accurate diagnosis. Image Vis. Comput. 2025, 159, 105555.
83. Amran, G.A.; Alsharam, M.S.; Blajam, A.O.A.; Hasan, A.A.; Alfaifi, M.Y.; Amran, M.H.; Gumaei, A.; Eldin, S.M. Brain tumor classification and detection using a hybrid deep tumor network. Electronics 2022, 11, 3457.
84. Sun, L.; Zheng, L.; Xiao, Z.; Xin, Y.; Jiang, L. STAR-YOLO: A high-accuracy and ultra-lightweight method for brain tumor detection. IEEE Access 2025, 13, 109914–109930.
85. Kang, M.; Ting, F.F.; Phan, R.C.W.; Ting, C.M. PK-YOLO: Pretrained knowledge-guided YOLO for brain tumor detection in multiplanar MRI slices. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2025; pp. 3732–3741.
86. Bai, R.; Xu, G.; Shi, Y. SCC-YOLO: An improved object detector for assisting in brain tumor diagnosis. In Proceedings of the International Conference on Health Big Data; Association for Computing Machinery: New York, NY, USA, 2025; pp. 114–120.
87. Yao, Q.; Zhuang, D.; Feng, Y.; Wang, Y.; Liu, J. Accurate detection of brain tumor lesions from medical images based on improved YOLOv8 algorithm. IEEE Access 2024, 12, 144260–144279.
88. Alnageeb, M.H.O.; Supriya, M.H. Real-time brain tumour diagnoses using a novel lightweight deep learning model. Comput. Biol. Med. 2025, 192, 110242.
89. Chhimpa, G.R.; Awasthi, S.; Bhati, N.; Yadav, P.; Wani, N.A. A transfer learning-driven fine-tuning of YOLOv10 for improved brain tumor detection in MRI images. Sci. Rep. 2025, 16, 98.
90. Li, Y.; Xu, H.; Zhu, X.; Huang, X.; Li, H. BTDet: Towards lightweight and enhanced feature aggregation network for brain tumor detection. Biomed. Signal Process. Control 2026, 114, 109283.
91. Zhuang, D.; Yao, Q.; Feng, Y.; Xv, H.; Zhang, C.; Wang, X. YOLO-BT: A novel brain tumor detector with gradient-guided feature diffusion. Biomed. Signal Process. Control 2026, 113, 108696.
92. Murala, P.R.; Rao, K.N. Deep Learning Approaches for Tumor Detection Using MRI Data. J. Theor. Appl. Inf. Technol. 2025, 103, 1117–1127.
93. Gade, V.S.R.; Cherian, R.K.; Rajarao, B.; Kumar, M.A. BMO based improved Lite Swin transformer for brain tumor detection using MRI images. Biomed. Signal Process. Control 2024, 92, 106091.
94. Tayefeh, M.M.; Ghahramani, S.A.G.; Hemmatyar, A.M.A. Advancing Brain Tumor Detection via ViRCNN: A Fusion of Vision Transformers and Faster R-CNN. In Proceedings of the 2025 15th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 28–29 October 2025; pp. 1–6.
95. Dash, S.; Mishra, S.R.; Mohapatra, H.; Sekhar, K.C. Brain Tumor Detection and Classification in MRI Images Using SWIN Transformer. In Proceedings of the 2025 International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC); IEEE: Piscataway, NJ, USA, 2025; pp. 1–6.
96. Panigrahi, S.; Adhikary, D.R.D.; Biswal, B.; Jena, B.; Mohapatra, K.N.; Patnaik, U. Vision Transformers for Brain Tumor Classification: A Novel Deep Learning Approach. In Proceedings of the 2025 International Conference on Innovations in Intelligent Systems: Advancements in Computing, Communication, and Cybersecurity (ISAC3); IEEE: Piscataway, NJ, USA, 2025; pp. 1–6.

Figure 1. Brain MRI Br35H dataset.

Figure 2. Feature extraction process of a DNN in tumor detection.

Figure 3. Swin–YOLO model architecture.

Figure 4. Patch merging process.

Figure 5. Grid partitioning for tumor bounding box prediction.

Figure 6. Comparison of target box and anchor box with Inner-GIoU.

Figure 8. Swin Transformer block diagram.

Figure 9. Window-based self-attention (WSA) box diagram.

Figure 10. Window-based MSA.

Figure 11. Comprehensive overview of data preprocessing.

Figure 12. Model training pipeline.

Figure 13. YOLOv12 performance (F1, mAP50): (a) single; (b) binary.

Figure 14. Brain tumor detection results with the YOLOv12 models.

Figure 15. Comparison results: YOLOv12 vs. Swin–YOLO.

Figure 16. Proposed model training and validation accuracy.

Figure 17. Proposed model training and validation loss.

Figure 18. Augmented BT detection results of the proposed model.

Figure 19. Proposed model F1-score.

Figure 20. Proposed model precision.

Figure 21. Proposed model recall.

Figure 22. Proposed model specificity.

Figure 23. Proposed model mAP@0.5.

Figure 24. Proposed model mAP@50:95.

Figure 25. The confusion matrix for the proposed method.

Figure 26. Detection summary of the Swin–YOLO model.

Figure 27. Detection summary: Swin–YOLO model.

Figure 28. Swin Transformer mAP@0.5 analysis.

Figure 29. Swin Transformer mAP@50:95 analysis.

Figure 30. Brain tumor detection results for proposed method.

Figure 31. MCC comparison of YOLOv12 and Swin–YOLO model.

Figure 32. XAI-based tumor identification.

Figure 33. Grad-CAM visualization.

Figure 34. SHAP value visualization.

Figure 35. NeuroVision AI application.

Table 1. Summary of relevant studies.

Study	Model	Accuracy (%)
Zahid et al. [64]	ResNet101	96.70
Bashkandi et al. [65]	Metaheuristic-CNN	97.09
Vo Dung et al. [66]	CNN-VGG16	97.30
Sinha et al. [67]	Multiple CNN	98.62
Nazir et al. [68]	CNN, XAI	98.67
Khushi et al. [69]	AlexNet	98.79

Table 2. Class-wise distribution of the MRI Br35H dataset.

Original Images
Tumor Class	Total Images (3000)	Training (70%)	Validation (20%)	Testing (10%)
Healthy	1500	1050	300	150
Cancerous	1500	1050	300	150
Total	3000	2100	600	300
Augmented Images
Tumor Class	Total Images (6300)	Training (70%)	Validation (20%)	Testing (10%)
Healthy	3150	2205	630	315
Cancerous	3150	2205	630	315
Total	6300	4410	1260	630
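The split sizes in Table 2 follow directly from the 70/20/10 ratios. The short sketch below reproduces the per-class and total counts with integer arithmetic; the function name `split_counts` is hypothetical and is used only for illustration, and assigning the remainder to the test split is an assumption of this sketch.

```python
def split_counts(total, train_pct=70, val_pct=20):
    """Return (train, val, test) image counts for percentage-based splits.

    Integer arithmetic avoids floating-point rounding surprises; whatever
    is left after the training and validation shares goes to the test set.
    """
    train = total * train_pct // 100
    val = total * val_pct // 100
    test = total - train - val
    return train, val, test


# Per-class and overall image counts taken from Table 2.
for label, count in (
    ("original, per class", 1500),
    ("original, total", 3000),
    ("augmented, per class", 3150),
    ("augmented, total", 6300),
):
    train, val, test = split_counts(count)
    print(f"{label}: train={train}, val={val}, test={test}")
```

Running the loop gives 1050/300/150 and 2100/600/300 for the original images, and 2205/630/315 and 4410/1260/630 for the augmented images, matching the table.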

Table 3. System configuration.

System Components	Specification
CPU	13th Gen Intel(R) Core(TM) i7-1360P (2.20 GHz)
Operating System	Arch Linux
Linux Kernel	6.16.3
GPU	NVIDIA RTX A6000
Driver Version	580.76.05
GPU Memory	48 GB VRAM
CUDA Version	13.0
Python	3.12.0
PyTorch	2.6.0

Table 4. Hyperparameter settings for model training.

Hyperparameter	Optimized Value
Image Size	224 × 224
Epochs	100
Batch Size	16, 32, 64 (depending on GPU memory)
Optimizer	SGD
Learning Rate	0.002
Weight Decay	0.0005
Momentum	0.0937
Patience	10
Workers	4
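As a minimal sketch, the settings in Table 4 can be collected in a single configuration object. The class and field names below are hypothetical, the batch size of 16 is just one of the three listed options, and the `EarlyStopper` helper assumes that "Patience" refers to early stopping on a validation metric where higher is better; the table itself does not state which metric is monitored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from Table 4; names are illustrative only."""
    image_size: int = 224
    epochs: int = 100
    batch_size: int = 16          # 16, 32, or 64 depending on GPU memory
    optimizer: str = "SGD"
    learning_rate: float = 0.002
    weight_decay: float = 0.0005
    momentum: float = 0.0937      # value as listed in Table 4
    patience: int = 10
    workers: int = 4


class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, metric):
        """Record one epoch's validation metric; return True to stop."""
        if metric > self.best:
            self.best = metric
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


config = TrainConfig()
stopper = EarlyStopper(config.patience)
```

In a training loop, `stopper.step(val_metric)` would be called once per epoch and the loop would exit when it returns `True` or when `config.epochs` is reached.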

Table 5. YOLOv12 single-class metrics (%).

Class	Precision	Recall	F1_Score	mAP50	mAP50:95
Tumor	97.2	97.9	97.6	98.8	77.6

Table 6. YOLOv12 binary-class metrics (%).

Class	Precision	Recall	F1_Score	mAP50	mAP50:95
Tumor	98.4	96.4	97.3	98.9	79.2
No Tumor	95.1	99.7	97.4	98.4	79.2
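The F1 values in Tables 5 and 6 can be checked against the listed precision and recall, since F1 is their harmonic mean. The sketch below is illustrative only; the helper name `f1_score` is hypothetical, and small differences from the reported values are expected because the tabulated precision and recall are already rounded to one decimal place.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from Tables 5 and 6.
rows = {
    "single-class, Tumor": (97.2, 97.9, 97.6),
    "binary, Tumor": (98.4, 96.4, 97.3),
    "binary, No Tumor": (95.1, 99.7, 97.4),
}

for name, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 = {computed:.2f}, reported F1 = {reported}")
```

Because the harmonic mean always lies between its two inputs, a quick sanity check for any row is that F1 falls between precision and recall.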

Table 7. Comparison results for the YOLO models.

YOLO Version [49]	Precision	Recall
YOLOv8	95.0	87.5
YOLOv9	87.1	86.9
YOLOv10	87.5	86.6
YOLOv11	92.5	92.26
YOLOv11 + Twinformer	93.4	92.5
YOLOv11 + SCFStage	93.0	93.8
YOLOv11 + Additional Detector	92.8	92.7
YOLOv11 + Hybrid Loss	92.0	93.3
YOLOβ11	93.6	92.7
YOLOv12 (ours)	97.2	97.9

Table 8. Performance comparison of CNN models with YOLOv12.

CNN Model [77]	Precision	Recall	F1-Score
MobileNetV3Large	71.11	84.76	77.34
MobileNet-V2	89.82	99.33	94.33
DenseNet121	95.51	100	95.56
AlexNet	92.59	99.33	95.84
DenseNet201	95.99	100	96.15
VGG16	96.02	96.02	96.02
InceptionResNetV2	96.64	95.36	95.99
YOLOv12 (ours)	97.2	97.9	97.6

Table 9. Comparative performance analysis (%) of hybrid CNN models.

References	Model	Precision	Recall	F1_Score	Accuracy
Islam et al. [78]	CNN-LSTM	97.3	96.2	94.7	97.9
Hekmati et al. [79]	DE	98.8	97.0	97.9	98.0
Rasheed et al. [80]	Hybrid-VGG19	98.3	98.2	98.2	98.2
Dilip et al. [81]	LGBM	97.5	99.0	98.6	98.5
Sard et al. [82]	Hybrid-VGG16	98.5	98.5	99.0	98.5
Amran [83]	Hybrid-GoogleNet	98.9	98.6	98.0	99.1
Proposed method	Swin–YOLO	98.9	99.7	99.2	99.7

Table 10. Comparative performance analysis (%) of hybrid YOLO models.

Study	Model	Precision	Recall	mAP50	mAP50:95
Sun et al. [84]	STAR-YOLO	93.70	85.2	79.4	64.2
Kang et al. [85]	PK-YOLO	85.80	89.6	94.7	68.1
Bai et al. [86]	SCC-YOLO	92.2	94.3	95.7	75.2
Yao et al. [87]	Hybrid-YOLO	95.4	93.9	96.9	74.8
Alnageeb et al. [88]	MK-YOLOv8	96.30	96.5	98.6	80.8
Proposed method	Swin–YOLO	98.9	99.7	99.4	87.2

Table 11. ViT-based performance.

ViT-Based Model	Accuracy (%)
Hybrid-YOLOv8 (YOLO-BT) [89]	92.6
YOLOv10-ViTs [90]	96.1
YOLOv8-ViTs (BTDet) [91]	96.6
ECNN-ViTs [92]	97.0
Light Swin-ViTs [93]	97.52
Hybrid-ViTs [94]	97.83
Swin-ViTs [95]	98.94
ViTB16 [96]	99.33
Swin–YOLO (Ours)	99.70

Table 12. Layer structure of the proposed Swin–YOLO model.

From	Params	Module	Arguments
Swin Backbone
−1	2,750,000	timm.models	[swin, batch16, window7_224]
0	262,144	ConvBNAct	[768, 256, 1]
1	131,072	ConvBNAct	[384, 256, 1]
2	65,536	ConvBNAct	[192, 256, 1]
Neck (Top-Down)
3	5,980,640	AreaAttention2D	[256, num_heads = 8, window = (8, 8)]
4	147,456	RELANBlock	[256, expansion = 0.5]
5	5,980,640	AreaAttention2D	[256, num_heads = 8, window = (8, 8)]
6	147,456	RELANBlock	[256, expansion = 0.5]
Neck (Bottom-up)
7	5,980,640	AreaAttention2D	[256, num_heads = 8, window = (8, 8)]
8	147,456	RELANBlock	[256, expansion = 0.5]
9	5,980,640	AreaAttention2D	[256, num_heads = 8, window = (8, 8)]
10	147,456	RELANBlock	[256, expansion = 0.5]
11	590,080	ConvBNAct	[256, 256, 3]
12	590,080	ConvBNAct	[256, 256, 3]
13	590,080	ConvBNAct	[256, 256, 3]
Head
14	885,437	YOLOv12Head	[256, num_classes = 2]
15	885,437	YOLOv12Head	[256, num_classes = 2]
16	885,437	YOLOv12Head	[256, num_classes = 2]
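For orientation, the per-row parameter counts in Table 12 can be summed, and the count for a convolutional row can be related to its arguments. The sketch below is a rough check, not the authors' accounting: `conv_params` is a hypothetical helper, and reading the 590,080 figure as a 3 × 3 convolution weight plus one extra parameter per output channel is an assumption, since the paper does not state how the counts were derived.

```python
def conv_params(c_in, c_out, kernel, bias=False):
    """Learnable parameters of a k x k convolution, optionally with a bias."""
    weights = c_in * c_out * kernel * kernel
    return weights + (c_out if bias else 0)


# (parameters per row, number of rows with that count) as listed in Table 12.
table12_rows = [
    (2_750_000, 1),   # Swin backbone row
    (262_144, 1),     # 1 x 1 ConvBNAct (row 0)
    (131_072, 1),     # 1 x 1 ConvBNAct (row 1)
    (65_536, 1),      # 1 x 1 ConvBNAct (row 2)
    (5_980_640, 4),   # AreaAttention2D
    (147_456, 4),     # RELANBlock
    (590_080, 3),     # 3 x 3 ConvBNAct
    (885_437, 3),     # YOLOv12Head
]

total_listed = sum(params * count for params, count in table12_rows)
print(f"sum of listed parameters: {total_listed:,}")

# One reading that reproduces the 3 x 3 ConvBNAct count in rows 11 to 13.
print(f"3 x 3 conv, 256 -> 256, with bias: {conv_params(256, 256, 3, bias=True):,}")
```

The listed rows sum to 32,147,687 parameters, and the 3 × 3 convolution with 256 input and output channels plus a per-channel term gives 590,080, matching rows 11 to 13.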

Table 13. Comparative analysis of YOLOv12 and the proposed model.

Model	Inference (ms)	GFLOPs	mAP50	mAP50:95
YOLOv12-N	0.7	6.3	97.1	73.0
YOLOv12-S	0.8	21.2	98.1	74.0
YOLOv12-M	11.1	67.1	98.4	76.3
YOLOv12-L	1.3	88.6	98.6	77.2
YOLOv12-X	2.0	198.5	98.9	79.5
Swin Transformer	3.5	211.1	99.1	81.0
Swin–YOLO	25.0	730	99.4	87.2
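To relate the inference times in Table 13 to throughput, latency in milliseconds can be converted to frames per second. This is a minimal sketch; the helper name `frames_per_second` is hypothetical, and it assumes the reported time is the per-image model latency without pre-processing or post-processing, which the table does not specify.

```python
def frames_per_second(latency_ms):
    """Convert a per-image latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Inference times (ms) for three rows of Table 13.
latencies_ms = {
    "YOLOv12-X": 2.0,
    "Swin Transformer": 3.5,
    "Swin–YOLO": 25.0,
}

for model, latency in latencies_ms.items():
    print(f"{model}: {latency} ms per image, about {frames_per_second(latency):.1f} FPS")
```

Under this assumption, the 25.0 ms reported for Swin–YOLO corresponds to 40 frames per second.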

Table 14. Comparative analysis (%) of Swin–YOLO variants.

Swin–YOLO	mAP50	mAP50:95	Precision	Recall	F1
Base	98.9	86.6	98.4	99.1	98.9
Tiny	99.0	86.9	98.7	99.2	99.0
Small	99.4	87.2	98.9	99.7	99.2
Large	99.2	87.1	98.7	99.4	99.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

