1. Introduction
Cancer remains one of the most widely studied diseases worldwide. Brain tumors (BTs) are among its most dangerous forms and occur when abnormal cells grow uncontrollably within the brain [1]. Normally, the human body replaces old cells with new ones. However, when this natural process does not work properly, abnormal cells can grow rapidly and form a tumor in the affected area [2]. The uncontrolled growth of tumors damages normal tissues and compresses surrounding cells, which may lead to cell death [3]. BTs are a serious global health concern that affects individuals of all age groups, including infants, children, young adults, and older adults [4]. According to a 2024 report, about 321,731 cases of primary malignant brain tumors were diagnosed worldwide in 2022, including approximately 173,699 men and 148,032 women [5]. Biopsy procedures for BTs are more complicated than those for other tumors because they require surgery [6]. Manual tumor diagnosis is often time-consuming and error-prone, which may place patients at serious risk. Therefore, medical professionals rely on various diagnostic methods, including neurological examinations, sample analysis, and digital screening [7].
Imaging modalities are becoming increasingly popular in medical diagnosis because they offer a safe approach with better detection and minimize risks for patients. Common modalities used for the detection of BTs include Computed Tomography (CT) scans, X-rays, radiography, tomography, Magnetic Resonance Imaging (MRI), and echocardiography (ECHO) [8]. Unlike CT scans and X-rays, MRI provides a highly detailed visualization of brain structures without exposing patients to harmful radiation. It offers a more precise and accurate analysis that is essential for assessing a patient’s condition [9]. Consequently, there has been a growing focus on artificial intelligence (AI), which has demonstrated remarkable ability across many domains. In particular, AI-driven applications have had a high impact on medical imaging (MI), where they improve tumor diagnostic accuracy and can thereby support timely intervention [10]. In the past few years, the use of machine learning (ML) has greatly improved Computer-Aided Diagnosis (CADx) systems in MI, especially for BT detection, leading to significant progress in accuracy and reliability [11]. ML techniques play an important role in identifying the most relevant features in MI, which helps make tumor diagnoses more accurate [12]. However, some ML methods have recently shown limitations in achieving high accuracy, generally due to the poor predictive power of the models and the complex nature of medical data. As a result, many researchers have turned to other learning-based techniques to enhance detection accuracy [13] and have increasingly moved to advanced Convolutional Neural Network (CNN)-based approaches, which learn complex features without manual extraction, for the diagnosis of various medical conditions [14].
Deep learning (DL)-based approaches have shown significant success in analyzing medical images for BT detection [15]. Moreover, these systems have become valuable tools for enhancing and automating early tumor diagnosis, thereby reducing the need for direct human involvement [16]. These approaches not only assist in tumor detection and monitoring but also help doctors make informed decisions about suitable treatment options, ultimately improving patient care [17]. DL models often outperform traditional ML-based models because of their superior learning capacity and efficient feature extraction capabilities. Furthermore, they have demonstrated exceptional capabilities in pattern recognition on large-scale MI datasets and can learn complex representations that separate healthy brain tissue from tumor regions [18]. Nevertheless, continued advancements are being made to improve their efficiency, accuracy, and availability. This research provides better techniques and helps medical professionals treat patients more effectively based on tumor detection results [19]. Currently, there are two fundamental approaches used for the detection of BTs. The first is the “two-stage” method, where CNNs such as Region-Based CNN (R-CNN) [20] and Faster R-CNN (FR-CNN) [21] are deployed for object identification, though they tend to be less efficient. The second is the “single-stage” method, represented by the “You Only Look Once” (YOLO) model, which has been primarily adopted in recent research. In this context, the YOLO series emphasizes real-time capability and improves accuracy on small objects by predicting their bounding boxes as a regression task [22]. YOLOv12 [23] uses a residual shortcut that connects the input and output of each block, with a small default scaling factor of 0.01. This design is similar to layer scaling, which is commonly used to improve the optimization of deep vision transformers. However, applying layer scaling only to the attention regions is not sufficient to fully address the optimization problem and may also increase inference latency. These observations indicate that the model’s stable convergence is influenced not only by the attention mechanism but also by the ELAN architecture itself. This further supports the effectiveness of the proposed R-ELAN block. After incorporating ELAN, the larger YOLOv12-X model achieved its best performance, reaching an mAP@0.5 of 72.0 and an mAP@0.5:0.95 of 55.2.
Vision Transformers (ViTs) have been widely applied to vision-based tasks such as image classification, instance segmentation, pose estimation, and detailed tumor detection. They are effective at capturing long-range dependencies between image patches and extracting more informative visual patterns [24]. Among recent advancements, the Swin Transformer (ST) [25] is an important development that employs a hierarchical architecture with a shifted-window mechanism. In this approach, self-attention establishes connections among local windows, enabling the model to capture information from small tumor regions while maintaining the global context of BTs. Current BT detection models still face several challenges, including low efficiency, limited accuracy, and difficulty in handling large-scale complex data. To overcome these limitations, one study integrated a ViT-based ST model into the STCNet backbone to enhance feature extraction and improve global contextual modeling. In addition, a PANet-based path aggregation framework with a residual structure was used to create the feature pyramid, thereby strengthening multi-scale and global feature representation [26]. In another study, the Swin-Small model outperformed the larger model at a lower computational cost, mainly because the larger model contains nearly four times more parameters. In addition, Swin-Small achieved faster inference, demonstrating superior overall efficiency. Both simple and extensive data augmentation strategies were implemented to improve model robustness [27]. An advanced DL-based model was also proposed by integrating a Hybrid Shifted-Window Multi-Head Self-Attention (HSW-MSA) block into a refined framework. This integration improved classification accuracy, reduced memory consumption, and decreased training complexity. Furthermore, the conventional MLP was replaced with a residual-based MLP (ResMLP), resulting in higher accuracy, faster training, and better parameter efficiency [28].
Another study described Swin-MedNet, an ST-based framework for BT diagnosis in medical imaging. Swin-MedNet utilizes a hierarchical ViT architecture with self-attention to effectively capture both local and global features while maintaining linear computational complexity. Moreover, its multi-stage encoder gradually merges patches, improving the scalability and efficiency of deep feature representation learning [29]. Furthermore, a hybrid DL method was introduced by integrating YOLOv11 with a transformer-based detection head, using the ST as the backbone. Through transfer learning, the model leveraged pre-trained ST weights to achieve stronger feature extraction at a lower computational cost. In addition, the model was extensively evaluated on a BT dataset with bounding box annotations and demonstrated effective detection performance. These findings highlight that the proposed model is highly suitable for medical diagnostic applications because of its enhanced tumor localization capability, faster training process, and higher classification accuracy [30].
Motivated by the limitations of earlier Swin- and YOLO-based frameworks, our proposed model integrates the latest CNN-based YOLOv12 architecture with the ST to enhance global contextual feature learning while preserving efficient tumor localization. In particular, the ST backbone is designed to capture both fine local features and long-range contextual information, which are especially important for brain MRI images characterized by low contrast, irregular boundaries, and diverse tumor appearances. The multi-scale features extracted by the transformer are then passed through a hybrid FPN + PANet neck, which strengthens cross-scale feature fusion by combining high-level semantic information with localization-sensitive low-level details. The refined pyramid representation is fed into the YOLOv12 detection head to produce accurate tumor predictions and bounding box localization. Finally, after training, the best-performing model was selected, and XAI techniques, including Grad-CAM and SHAP, were applied to enhance transparency and verify whether the model focuses on clinically meaningful tumor-related regions. Therefore, the main contribution of the proposed Swin–YOLO approach lies in integrating the latest YOLOv12 detection framework with the contextual learning capability of the ST, supported by enhanced multi-scale feature fusion and interpretability analysis, to achieve more accurate BT detection.
Figure 1 presents sample MRI images from the Br35H dataset, including (a) abnormal brain scans containing tumor regions and (b) normal brain scans without visible abnormalities. This visual comparison shows the differences in tumor appearance and intensity that the proposed model uses to distinguish between normal and abnormal cases.
In summary, the key contributions and innovations of this study are as follows:
Novel architecture design: This study presents a deep learning framework with strong real-time tumor detection ability and rich feature representation. The Swin–YOLO model builds on the strengths of DL for MI. Such methods have been rapidly adopted in healthcare through CADx systems and help physicians make better decisions so that tumor patients receive accurate care at an early stage.
Customized CNN model: We replaced the YOLOv12 backbone with an ST, which performs feature extraction through shifted-window self-attention and captures both small and broad tumor boundaries in MRI scans. This integration enhanced the accuracy and reliability of tumor diagnosis.
Integrated FPN and PANet: Implementing FPN and PANet in the YOLOv12 neck strengthens multi-scale tumor feature extraction for better localization and consistent recognition. Moreover, this design keeps the model effective across tumor properties such as varying shapes, sizes, and spatial distributions. It supports more accurate tumor localization, particularly in low-contrast or noisy MRI images, compared to traditional CNN-based methods that struggle to recognize objects clearly.
Explainable AI (XAI) techniques: Grad-CAM and SHAP provide a more precise, interpretable, and clear visualization of the model’s predictions.
The remainder of the paper is organized as follows:
Section 2 presents a review of previous BT detection studies using ML, DL, and TL methods, identifying their limits and motivating the proposed Swin–YOLO model.
Section 3 demonstrates how the proposed model combines YOLOv12 and the ST to improve MRI-based tumor detection.
Section 4 describes the experimental training process.
Section 5 discusses the results of the proposed method.
Section 6 presents the XAI analysis based on Grad-CAM and SHAP and the “NeuroVision AI” web app, along with a consideration of future work.
Section 7 summarizes the overall conclusions of the study and the future research directions.
3. Methodology
This section presents the details of the proposed Swin–YOLO model adopted to improve tumor diagnosis performance and describes how ViTs are integrated into the CNN-based YOLOv12 model to overcome the important challenges of attaining targeted tumor detection.
3.1. Model Architecture
An overview of the proposed architecture is presented in Figure 3. The model is implemented and evaluated using the Br35H dataset. YOLOv12, one of the most recent models in the YOLO series and available through the Ultralytics framework, substantially improves training performance and inference speed while maintaining high accuracy with relatively few parameters. It is efficient and suitable for real-time clinical applications that require reliable tumor detection. Moreover, it includes area attention, residual efficient layer aggregation (R-ELAN), and Flash Attention, which collectively enhance detection accuracy, particularly for small or irregular tumor shapes in low-contrast MRI images.
In contrast, the ST is a hierarchical ViT with a shifted-window scheme, which enables it to capture local and global contextual information while maintaining computational efficiency. We chose YOLOv12 as the baseline model because it offers high BT detection accuracy, fast inference, and efficient computation, which are important for practical medical image analysis. The ST was selected because its hierarchical attention mechanism effectively learns both local features and global context, a capability that is particularly beneficial for MRI images, in which tumors often exhibit irregular shapes.
3.2. Input Stage
The proposed method consists of three basic parts: backbone, neck, and head. The input stage begins with MRI brain scans in the form of individual 2D slices. These images are preprocessed and then divided into patches for transformer-based feature extraction. An MRI image slice is represented as
$$X \in \mathbb{R}^{H \times W \times C} \tag{1}$$
where X denotes a 2D image and $\mathbb{R}$ represents the set of real numbers. H, W, and C denote the height, width, and number of channels of the image, respectively. For this study, the image dimensions are set to 224 × 224 because this is the input size expected by the ST.
3.3. Backbone
We replaced the YOLOv12 backbone with an ST because the CNN backbone showed limitations in global feature modeling, scalability, gradient flow, and multi-scale fusion. The ST addresses these problems by applying self-attention across shifted windows, capturing both fine tumor details and overall brain regions. Its multi-level design provides multi-resolution features, and its attention layers improve gradient flow and model stability. This context-driven attention mechanism further enhances the model’s ability to handle low-contrast images, noise, and irregular tumor shapes while maintaining a lightweight, robust, and efficient design for real-time tumor detection. Furthermore, this combination enhances both spatial recognition and contextual understanding in MRI images. Beyond backbone replacement, the novelty of the proposed model lies in the task-specific integration of the ST within the YOLOv12 detection pipeline for BT MRI analysis. In the Swin–YOLO architecture, the transformer serves not only as a feature extractor but also as a dedicated representation-learning module that improves the quality of the features passed to the multi-scale fusion and localization stages. Moreover, this integration strengthens the interaction between global contextual information and local tumor boundary features, which is essential for challenging MRI cases with low contrast, irregular shapes, and mixed appearance. The key contribution of this work is therefore not merely the replacement of one architecture with another, but the development of a hybrid medical image detection framework that integrates transformer-based feature learning to enhance tumor localization performance on brain MRI images.
3.3.1. Patch Positioning
In this block, the MRI images are divided into small, equal-sized patches (4 × 4 px), since the ST operates on sequences of tokens rather than on whole images. Each patch acts as a token, which is subsequently transformed into an embedding (a numerical feature vector) in the next block. The number of patches is given by
$$N = \frac{H}{P} \times \frac{W}{P} \tag{2}$$
where N represents the total number of patches; H and W denote the height and width of the input image, respectively; and P is the patch size. $H/P$ and $W/P$ must be integers, meaning that both the image height and width must be divisible by P to allow non-overlapping patch partitioning. This constraint ensures that the image is evenly divided into consistent patches without incomplete patches at the image border. Moreover, such regular partitioning is important for ST-based processing, since every patch is treated as a visual token for subsequent embedding and hierarchical attention-based feature learning.
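The partitioning rule above can be illustrated with a minimal pure-Python sketch. The function names `partition_patches` and `num_patches` and the toy 8 × 8 image are illustrative assumptions, not part of the authors' implementation.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping P x P patches in an H x W image."""
    if height % patch_size or width % patch_size:
        raise ValueError("H and W must both be divisible by P")
    return (height // patch_size) * (width // patch_size)


def partition_patches(image, patch_size):
    """Split a 2D image (list of rows) into P x P patches in row-major order."""
    height, width = len(image), len(image[0])
    num_patches(height, width, patch_size)  # validates divisibility
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([row[left:left + patch_size]
                            for row in image[top:top + patch_size]])
    return patches


# Toy 8 x 8 image (hypothetical data) whose pixel value encodes its position.
toy_image = [[r * 8 + c for c in range(8)] for r in range(8)]
toy_patches = partition_patches(toy_image, 4)
```

For the 224 × 224 input and 4 × 4 patches described in the text, `num_patches(224, 224, 4)` evaluates to 56 × 56 = 3136 tokens.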
3.3.2. Linear Embedding (Patch Projection Layer)
In this block, each flattened patch is projected into a higher-dimensional embedding space of dimension D. This step converts the image patches into feature embeddings that the transformer can process. After patch partitioning, the patches are arranged into a sequence, and each patch vector $x_i$ of dimension $P^2 C$ is mapped into a new embedding as
$$z_i = W_e x_i + b_e \tag{3}$$
where $x_i \in \mathbb{R}^{P^2 C}$ is the flattened representation of the i-th patch, $W_e \in \mathbb{R}^{D \times P^2 C}$ denotes the learnable projection matrix, $b_e \in \mathbb{R}^{D}$ is the bias term, and $z_i \in \mathbb{R}^{D}$ is the corresponding embedded patch token. D denotes the embedding dimension, which defines the size of the feature representation used by the transformer. This embedding step transforms low-level pixel information into a more discriminative feature space, enabling the transformer to process each image patch as a token in a sequential representation-learning framework. For grayscale MRI images, where C = 1, each P × P patch is flattened into a vector of length $P^2$ before projection into the D-dimensional embedding space.
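As a rough illustration of the patch projection, the sketch below flattens a grayscale P × P patch and applies an affine map of the form z = Wx + b. The embedding width D = 8, the random weights, and the helper names are assumptions chosen for readability; in the trained model the weights are learned, and the actual D is not stated in this section.

```python
import random


def flatten_patch(patch):
    """Flatten a P x P grayscale patch (list of rows) into a length P*P vector."""
    return [pixel for row in patch for pixel in row]


def embed(x, weight, bias):
    """Affine projection z = W x + b, with W given as D rows of len(x) values."""
    return [sum(w * v for w, v in zip(w_row, x)) + b
            for w_row, b in zip(weight, bias)]


P, D = 4, 8                      # D = 8 is an illustrative assumption
rng = random.Random(0)           # fixed seed so the sketch is repeatable
weight = [[rng.uniform(-0.1, 0.1) for _ in range(P * P)] for _ in range(D)]
bias = [0.0] * D

patch = [[float(r * P + c) for c in range(P)] for r in range(P)]
x = flatten_patch(patch)         # length P*P = 16 for C = 1
z = embed(x, weight, bias)       # length D
```

Each of the N patch tokens is projected with the same weight matrix and bias, so the layer adds only D × P² + D parameters regardless of image size.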
3.3.3. Swin Block
After the image is divided into small patches (4 × 4), the patches are converted into fixed-length feature vectors (tokens) and passed into the first ST block. This is highly effective for BT detection because it captures both fine local details and broader global features in MRI scans. Each block begins with a window partitioning step, where the feature map is divided into non-overlapping windows of size M × M. Each stage of the ST block is discussed in more detail below. The number of windows is given by
$$N_w = \frac{H'}{M} \times \frac{W'}{M} \tag{4}$$
where $N_w$ represents the total number of non-overlapping windows, H′ and W′ denote the height and width of the current feature map, and M is the window size used for local self-attention. In the Swin Transformer, self-attention is not computed over the entire feature map at once. Instead, the feature map is divided into smaller fixed-size local windows, and attention is calculated independently within each window. This design significantly reduces computational complexity compared with global self-attention, while still allowing the model to learn meaningful local contextual relationships. More specifically, $H'/M$ indicates how many windows are formed along the height dimension and $W'/M$ indicates how many windows are formed along the width dimension. Multiplying these two values yields the total number of windows in the current transformer stage. This formulation is valid when both H′ and W′ are divisible by M, ensuring that the feature map can be partitioned into equal-sized windows without incomplete boundary regions. Therefore, Equation (4) gives the total number of non-overlapping windows processed within an ST block.
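The window count and the token-to-window assignment can be sketched as follows. The window size M = 7 used in the usage note is the common Swin default and is an assumption here, since the section does not state the value used by the authors; the function names are likewise illustrative.

```python
def count_windows(height, width, window_size):
    """Total number of non-overlapping M x M windows on an H' x W' feature map."""
    if height % window_size or width % window_size:
        raise ValueError("H' and W' must both be divisible by M")
    return (height // window_size) * (width // window_size)


def window_index(row, col, width, window_size):
    """Row-major index of the window that contains the token at (row, col)."""
    windows_per_row = width // window_size
    return (row // window_size) * windows_per_row + (col // window_size)


def attention_pairs(height, width, window_size):
    """Token pairs scored by windowed attention: N_w windows of (M*M)^2 pairs."""
    tokens_per_window = window_size * window_size
    return count_windows(height, width, window_size) * tokens_per_window ** 2
```

For a 56 × 56 token grid with M = 7, `count_windows(56, 56, 7)` gives 64 windows and `attention_pairs(56, 56, 7)` gives 153,664 token pairs, compared with 3136² = 9,834,496 pairs for global self-attention over the same grid.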
3.3.4. Patch Merging
During patch merging, every group of 2 × 2 neighboring tokens is combined, reducing the total number of tokens by a factor of four (a factor of two along each of the height and width dimensions), as shown in Figure 4. The equation for patch merging can be formulated as follows:
$$Y \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 2C} \tag{5}$$
where Y denotes the output feature map after patch merging and H, W, and C denote the height, width, and channel dimensions of the input token grid, respectively. Specifically, four neighboring tokens are first concatenated, producing an intermediate feature of dimension 4C, which is then linearly projected into a 2C-dimensional representation. Consequently, the spatial dimensions are reduced by half, while the channel dimension is doubled, enabling hierarchical representation learning with improved contextual modeling and reduced computational complexity in deeper ST stages. The projection avoids the high computational cost of carrying 4C channels forward and enables the model to learn more detailed context-aware features at each stage. Each merged token then covers a larger receptive field with twice the channel capacity, which captures contextual connections more efficiently. The linear layer also balances the feature distribution and prepares it for further transformation inside the ST blocks.
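A minimal sketch of the merging step is given below: each 2 × 2 neighborhood is concatenated into a 4C vector and projected to 2C channels. The hand-picked projection weights (a mean and a difference) and the toy grid are assumptions for illustration; the normalization layer that Swin implementations typically apply before the projection is omitted.

```python
def make_projection(weight):
    """Return a function applying a fixed linear map (rows of `weight`)."""
    def project(vec):
        return [sum(w * v for w, v in zip(w_row, vec)) for w_row in weight]
    return project


def merge_patches(tokens, project):
    """Merge each 2 x 2 block of C-channel tokens into one projected token."""
    height, width = len(tokens), len(tokens[0])
    if height % 2 or width % 2:
        raise ValueError("token grid height and width must be even")
    merged = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            # Concatenate the four neighbours into a single 4C vector.
            concat = (tokens[r][c] + tokens[r + 1][c]
                      + tokens[r][c + 1] + tokens[r + 1][c + 1])
            row.append(project(concat))
        merged.append(row)
    return merged


# Toy 4 x 4 grid with C = 1; the projection maps 4C = 4 to 2C = 2 channels.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
projection = make_projection([[0.25, 0.25, 0.25, 0.25],
                              [1.0, 0.0, 0.0, -1.0]])
merged = merge_patches(grid, projection)
```

The 4 × 4 × 1 input becomes a 2 × 2 × 2 output, matching the halved spatial size and doubled channel count described above.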
3.3.5. Multi-Stage Extracted Features (C3–C6)
When an MRI image passes through the ST backbone, its features are gradually refined stage by stage, allowing the extraction of deeper and more valuable tumor representations. Starting from the initial patch embeddings (i.e., Z0) produced by the linear embedding block, the first stage generates the feature map C3, which primarily captures fine-grained local tumor details. The feature map C4 builds on the previous stage to identify regional tumor properties, for instance, growth patterns and irregular boundaries. The feature map C5 emphasizes the long-range dependencies between the tumor and larger brain structures. Finally, the feature map C6 captures deep contextual information. The equation below describes the feature extraction process across all stages:
$$C_{i+1} = F_{\mathrm{STG}_{i+1}}(C_i) \tag{6}$$
where $C_i$ denotes the input feature map to the (i + 1)-th stage, $F_{\mathrm{STG}_{i+1}}(\cdot)$ represents the transformation function of that stage, and $C_{i+1}$ is the output feature map. Each stage performs a sequence of window-based self-attention, feed-forward transformation, and optional patch merging, thereby progressively refining the feature representation. Starting from the initial patch embedding, $Z_0$, the backbone generates hierarchical features, C3, C4, C5, and C6, where shallow stages preserve fine local structures and deeper stages capture increasingly abstract semantic and contextual information. This staged feature extraction enables the network to model both subtle tumor details and broader anatomical relationships, which is important for accurate brain tumor detection in MRI images.
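To make the hierarchy concrete, the sketch below tracks only the feature-map shapes through four stages, assuming patch merging is applied before every stage except the first. The base embedding width of 96 (the Swin-Tiny value) and the mapping of the four outputs to C3–C6 are assumptions; the section does not state the channel widths used by the authors.

```python
def stage_shapes(image_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (height, width, channels) of the feature map after each stage."""
    size = image_size // patch_size   # token grid side after patch embedding
    channels = embed_dim              # assumed base width (Swin-Tiny default)
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            # Patch merging halves each spatial side and doubles the channels.
            size //= 2
            channels *= 2
        shapes.append((size, size, channels))
    return shapes


shapes = stage_shapes()
```

Under these assumptions the four outputs have shapes 56 × 56 × 96, 28 × 28 × 192, 14 × 14 × 384, and 7 × 7 × 768, which shows how spatial detail is traded for channel capacity from the shallowest to the deepest feature map.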
3.4. Neck
Previous CNN-based models tend to miss important features across the range of scales at which tumors appear. To address this limitation, we employed the FPN + PANet hybrid in the neck of YOLOv12 for multi-scale, detail-preserving feature fusion. Through the integration of FPN and PANet, the neck achieves more effective multi-scale feature aggregation by combining high-level semantic context with low-level spatial detail. This design allows the proposed model to handle tumors with substantial variations in size, appearance, and structural complexity.
3.4.1. FPN + PANet Hybrid
The proposed FPN + PANet hybrid neck enhances multi-scale feature fusion for brain tumor (BT) detection. Specifically, the FPN establishes a top-down pathway that integrates high-level semantic information with low-level detailed features, enabling the model to capture both global tumor context and fine boundary information. In parallel, PANet strengthens bottom-up information flow by transferring localization-sensitive features from lower layers to higher layers, thereby improving spatial precision and detection robustness. Unlike a conventional neck design, this hybrid module plays a critical role in the proposed Swin–YOLO framework by effectively redistributing the contextual representations learned by the Swin Transformer backbone across different pyramid levels. As a result, the model becomes more effective in detecting tumors with varying sizes, irregular boundaries, and low-contrast appearance. Therefore, the FPN + PANet hybrid neck serves as an essential component of the proposed architecture, and its contribution is analyzed separately in the ablation study.
Moreover, in the top-down path, the backbone feature map at level i is fused with the up-sampled feature from the deeper pyramid level (i + 1). In the bottom-up path, the feature map is further enriched by the down-sampled feature propagated from the shallower pyramid level (i − 1). The hybrid fusion process is formulated as
$$P_i^{\mathrm{hyb}} = \mathrm{Conv}\big(\mathrm{Concat}\big(L_i,\ \mathrm{Up}(P_{i+1}),\ \mathrm{Down}(P_{i-1})\big)\big) \tag{7}$$
where $P_i^{\mathrm{hyb}}$ denotes the refined hybrid feature map at level i, $L_i$ is the lateral feature map from the backbone, $\mathrm{Up}(P_{i+1})$ represents the up-sampled feature from the deeper level, and $\mathrm{Down}(P_{i-1})$ represents the down-sampled feature from the shallower level. The concatenation operation merges these multi-scale features along the channel dimension, while the convolution layer smooths and refines the fused representation. This bidirectional fusion strategy enables the network to combine high-level semantic information with fine-grained localization cues, thereby improving multi-scale feature representation for accurate tumor detection.
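The fusion described above can be sketched on toy feature maps stored as H × W grids of channel lists. Nearest-neighbor up-sampling, stride-2 subsampling in place of a learned strided convolution, and a 1 × 1 convolution as the refinement layer are simplifying assumptions, as are all function names.

```python
def upsample2x(fmap):
    """Nearest-neighbour up-sampling: repeat every cell twice in each direction."""
    out = []
    for row in fmap:
        wide = [list(cell) for cell in row for _ in range(2)]
        out.append(wide)
        out.append([list(cell) for cell in wide])
    return out


def downsample2x(fmap):
    """Keep every second cell (a stand-in for a learned stride-2 convolution)."""
    return [[list(cell) for cell in row[::2]] for row in fmap[::2]]


def conv1x1(fmap, weight):
    """Apply the same linear channel mixing at every spatial position."""
    return [[[sum(w * v for w, v in zip(w_row, cell)) for w_row in weight]
             for cell in row] for row in fmap]


def hybrid_fuse(lateral, deeper, shallower, weight):
    """Concatenate lateral, up-sampled and down-sampled maps, then refine."""
    up, down = upsample2x(deeper), downsample2x(shallower)
    fused = [[l + u + d for l, u, d in zip(l_row, u_row, d_row)]
             for l_row, u_row, d_row in zip(lateral, up, down)]
    return conv1x1(fused, weight)


# Toy single-channel maps at three neighbouring pyramid levels.
lateral = [[[1.0] for _ in range(4)] for _ in range(4)]     # level i, 4 x 4
deeper = [[[2.0] for _ in range(2)] for _ in range(2)]      # level i + 1, 2 x 2
shallower = [[[3.0] for _ in range(8)] for _ in range(8)]   # level i - 1, 8 x 8
fused = hybrid_fuse(lateral, deeper, shallower, [[1.0, 1.0, 1.0]])
```

All three inputs are brought to the 4 × 4 resolution of level i before concatenation, so the refinement layer sees three channels per position in this toy case.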
3.4.2. Lateral Conv
A Lateral Conv is a 1 × 1 convolution layer applied to the feature maps from the backbone stages (C3, C4, C5, and C6). It not only changes the number of channels but also reweights them to emphasize useful features and suppress irrelevant noise in a tumor image. It maps all feature maps to the same fixed number of channels (256), making them compatible with the FPN + PANet structure. The equation for Lateral Conv can be expressed as follows:
$$L_i = \mathrm{Conv}_{1 \times 1}(C_i), \qquad C_i \in \mathbb{R}^{H_i \times W_i \times D_i}, \quad L_i \in \mathbb{R}^{H_i \times W_i \times 256} \tag{8}$$
where $C_i$ denotes the input feature map from the i-th backbone stage, with spatial dimensions $H_i \times W_i$ and channel depth $D_i$, and $L_i$ represents the transformed output feature map after the 1 × 1 lateral convolution. The lateral convolution preserves the spatial dimensions and unifies the channel dimension to 256, which makes the multi-scale backbone features compatible with subsequent fusion in the FPN + PANet neck. It thus preserves the positional structure of tumor-related patterns while bringing the backbone features (C3, C4, C5, and C6) into a common representation space.
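Because a 1 × 1 convolution is a per-position linear map over channels, the channel unification can be sketched without any spatial kernel. The toy output width of 4 stands in for the 256 channels used in the paper, and the averaging weights are placeholders for learned parameters; both are assumptions made to keep the example small.

```python
def lateral_conv(fmap, weight, bias):
    """1 x 1 convolution: mix channels at each position, keep H and W unchanged."""
    return [[[sum(w * v for w, v in zip(w_row, cell)) + b
              for w_row, b in zip(weight, bias)]
             for cell in row] for row in fmap]


def averaging_weight(out_channels, in_channels):
    """Placeholder weights: every output channel averages the input channels."""
    return [[1.0 / in_channels] * in_channels for _ in range(out_channels)]


OUT_CHANNELS = 4  # toy value; the paper unifies all levels to 256 channels

# Two backbone maps with different resolutions and channel depths.
c_shallow = [[[1.0, 3.0] for _ in range(4)] for _ in range(4)]           # 4 x 4 x 2
c_deep = [[[2.0, 4.0, 6.0, 8.0] for _ in range(2)] for _ in range(2)]    # 2 x 2 x 4

l_shallow = lateral_conv(c_shallow, averaging_weight(OUT_CHANNELS, 2),
                         [0.0] * OUT_CHANNELS)
l_deep = lateral_conv(c_deep, averaging_weight(OUT_CHANNELS, 4),
                      [0.0] * OUT_CHANNELS)
```

After the projection both maps carry the same number of channels while keeping their own spatial sizes, which is what allows them to be concatenated or added in the neck once their resolutions are aligned.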
3.4.3. Up and Down Sample Block
In MRI brain scans, tumors can appear in different forms; for instance, they can be small in height and width, appearing as low-resolution regions with fine textures, or appear as large masses spread across broad regions. In the FPN top-down path, semantic features are up-sampled to align with higher-resolution layers (i.e., C3). In the bottom-up path, the network compresses these fine-grained maps into smaller, more semantic representations that are easier to combine with deeper layers (i.e., C4, C5, and C6), which naturally hold stronger semantic meaning. In Equation (9), which expresses the up-sampling and down-sampling operations, the down-sampling step is defined by a convolution kernel and a stride s that reduces the spatial resolution, and the resulting feature map combines both fine (up-sampled) and coarse (down-sampled) information for accurate tumor detection.
3.4.4. Pyramid Feature Maps (PFMs)
The PFMs are constructed by combining backbone outputs (i.e., C3–C6) through lateral convolution, up-sampling, and down-sampling operations. Each pyramid level (i.e., P3–P6) is designed to detect tumors at a different spatial scale. The P3 level maintains the high-resolution spatial details needed for detecting small tumors. The P4 level captures mid-scale tumor regions by balancing structural resolution with contextual information, whereas P5 improves the detection of large tumors by preserving global structure. Finally, P6 provides a deep semantic representation, which helps identify very large tumors by analyzing deep contextual information. Overall, the multi-scale pyramid design enables the model to preserve fine-grained spatial details while simultaneously strengthening high-level semantic information. This balanced representation is particularly important for brain MRI analysis, where tumors may appear in different sizes with irregular boundaries and heterogeneous textures. Consequently, PFMs improve the model’s capability to achieve more accurate and reliable tumor detection.
3.5. Head
The detection head is the final stage of the proposed method, where processed feature maps are transformed into tumor detection results. It converts the BT features into bounding boxes with confidence scores by combining regression, objectness, and classification under a joint loss function. Furthermore, the features at different pyramid levels (i.e., P3–P6) are used to predict bounding boxes for both small and large abnormalities.
3.5.1. Bounding Boxes
Figure 5 illustrates the bounding boxes that the proposed model uses to identify BTs. The detection head follows an anchor-free approach to generate bounding boxes before producing the final detection results: each MRI image is split into an S × S grid, where each grid cell is responsible for predicting detection boxes. These cells also estimate the class confidence values, which are used to determine the class of each object. As a single-stage model, YOLO predicts all the tumor bounding boxes in a given image simultaneously. For each predicted box, the model estimates the center coordinates, width, and height, which together define the spatial extent of the suspected tumor region. This design allows the network to directly localize tumors of different shapes and sizes without relying on predefined anchor templates.
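The box parameterization can be illustrated with a short sketch that converts a (center, width, height) prediction into corner coordinates and locates the grid cell containing the box center. The grid size S = 7, the rule that a box belongs to the cell containing its center, and the function names are assumptions for illustration and are not taken from the authors' detection head.

```python
def xywh_to_xyxy(cx, cy, w, h):
    """Convert (center x, center y, width, height) to (x1, y1, x2, y2)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def grid_cell(cx, cy, image_size, grid_size):
    """(row, col) of the grid cell that contains the box center."""
    cell = image_size / grid_size
    row = min(int(cy // cell), grid_size - 1)
    col = min(int(cx // cell), grid_size - 1)
    return (row, col)


# A hypothetical prediction on a 224 x 224 slice.
corners = xywh_to_xyxy(100.0, 80.0, 40.0, 20.0)
cell = grid_cell(100.0, 80.0, 224, 7)
```

Here the predicted box spans (80, 70) to (120, 90) in pixel coordinates, and its center falls in row 2, column 3 of the assumed 7 × 7 grid.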
3.5.2. Inner-GIoU
The Inner-GIoU loss function improves our model’s localization ability. It enhances bounding box accuracy by making it easier to measure the difference between the predicted (i.e., anchor) boxes and the actual GT (i.e., target) boxes shown in Figure 6. A scaling ratio is applied to adjust the size of the bounding boxes, making the training process more flexible and adaptive. Through this loss, the model learns to adjust bounding boxes more accurately, which increases the accuracy of tumor localization. Moreover, computing the loss for samples with different IoU values through the scaling factor is also essential in model training. This approach delivers more reliable model performance and is also helpful for detecting small, irregular tumors. In this method, both the predicted box and the GT box are transformed into smaller inner regions by shrinking their width and height according to a predefined ratio. These inner boxes concentrate more on the central and informative object area rather than the outer boundaries alone. As a result, the model becomes more sensitive to subtle localization differences between the predicted and target boxes. This is especially important in medical imaging, where even a small positional deviation may lead to incorrect lesion localization.
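The inner-box idea can be sketched by shrinking both boxes about their own centers and measuring the overlap of the shrunken boxes. The ratio value of 0.5 in the example and the function names are assumptions; how this inner overlap is combined with the GIoU term in the authors' loss is not specified in this section, so only the inner overlap itself is shown.

```python
def inner_box(box, ratio):
    """Scale a (x1, y1, x2, y2) box about its center by `ratio`."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * ratio / 2, (y2 - y1) * ratio / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def overlap_ratio(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def inner_iou(pred, gt, ratio):
    """Overlap of the two boxes after both are scaled by the same ratio."""
    return overlap_ratio(inner_box(pred, ratio), inner_box(gt, ratio))


pred, gt = (0.0, 0.0, 10.0, 10.0), (2.0, 0.0, 12.0, 10.0)
full = inner_iou(pred, gt, 1.0)    # ordinary overlap, 80 / 120
inner = inner_iou(pred, gt, 0.5)   # overlap of the shrunken boxes, 15 / 35
```

For the same 2-pixel horizontal offset, the overlap drops from about 0.67 on the full boxes to about 0.43 on the inner boxes, which illustrates why the inner regions react more strongly to small localization errors.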
The IoU and GIoU are expressed as follows:
$$\mathrm{IoU} = \frac{|b \cap b^{gt}|}{|b \cup b^{gt}|} \tag{10}$$
where $b$ denotes the estimated bounding box, $b^{gt}$ represents the ground-truth bounding box, $|b \cap b^{gt}|$ is the intersection area between the two boxes, and $|b \cup b^{gt}|$ is their union area. Thus, Equation (10) calculates the overlap ratio between the predicted and target boxes. The IoU value ranges from 0 to 1, where a value of 0 indicates that the predicted and ground-truth boxes do not overlap at all and a value of 1 indicates perfect overlap between them. Because IoU measures overlap in a normalized manner, a larger value reflects a closer spatial match and more accurate bounding box localization. IoU is widely used in object detection tasks to evaluate how well the predicted box aligns with the actual lesion or object region.
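The overlap ratio in Equation (10) can be sketched in a few lines of Python. This is a minimal illustration that assumes axis-aligned boxes in (x1, y1, x2, y2) corner format; the function names and sample boxes are hypothetical.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(b, b_gt):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle; its area is zero if the boxes are disjoint.
    ix1, iy1 = max(b[0], b_gt[0]), max(b[1], b_gt[1])
    ix2, iy2 = min(b[2], b_gt[2]), min(b[3], b_gt[3])
    inter = box_area((ix1, iy1, ix2, iy2))
    union = box_area(b) + box_area(b_gt) - inter
    return inter / union if union > 0 else 0.0


pred = (0.0, 0.0, 2.0, 2.0)
target = (1.0, 1.0, 3.0, 3.0)
print(iou(pred, target))  # intersection area 1 divided by union area 7
```

Identical boxes give a value of 1 and disjoint boxes give 0, which matches the range stated above.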
The GIoU extends IoU by considering the smallest enclosing box covering both the predicted and ground-truth boxes and is defined as
$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|C - (b \cup b^{gt})|}{|C|} \tag{11}$$
where $C$ represents the smallest enclosing box covering both the predicted bounding box $b$ and the ground-truth bounding box $b^{gt}$, and $|C - (b \cup b^{gt})|$ denotes the area inside $C$ that is not occupied by the union of the two boxes. In other words, this term measures the extra background region enclosed by $C$ beyond the combined area of the predicted and target boxes. Therefore, Equation (11) extends the conventional IoU by introducing a geometric penalty that reflects how far the two boxes are from each other, even when they do not overlap. In particular, when the predicted and ground-truth boxes overlap well, the penalty term becomes small and the GIoU value approaches the IoU value. However, when the two boxes are far apart or do not overlap, the penalty term becomes larger, which reduces the GIoU score. In this way, GIoU provides more informative guidance than IoU alone because IoU becomes zero for non-overlapping boxes and cannot describe the spatial separation between them. By contrast, GIoU still captures the geometric relationship between the two boxes through the enclosing region $C$.
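The behavior of Equation (11) for non-overlapping boxes can be checked with a short Python sketch. It assumes axis-aligned boxes in (x1, y1, x2, y2) format; the function name and the sample boxes are illustrative only.

```python
def giou(b, b_gt):
    """Generalized IoU of two (x1, y1, x2, y2) boxes."""

    def area(x1, y1, x2, y2):
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)

    inter = area(max(b[0], b_gt[0]), max(b[1], b_gt[1]),
                 min(b[2], b_gt[2]), min(b[3], b_gt[3]))
    union = area(*b) + area(*b_gt) - inter
    # Smallest axis-aligned box C that encloses both boxes.
    c_area = area(min(b[0], b_gt[0]), min(b[1], b_gt[1]),
                  max(b[2], b_gt[2]), max(b[3], b_gt[3]))
    if union <= 0 or c_area <= 0:
        return 0.0
    # IoU minus the fraction of C not covered by the union.
    return inter / union - (c_area - union) / c_area


target = (0.0, 0.0, 1.0, 1.0)
near = (2.0, 0.0, 3.0, 1.0)  # disjoint, one unit away from the target
far = (4.0, 0.0, 5.0, 1.0)   # disjoint, three units away from the target
print(giou(target, near), giou(target, far))
```

Both predictions have an IoU of zero, yet the farther box receives the lower GIoU, so the score still indicates which prediction is closer to the target.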
To further improve regression performance, the Inner-GIoU formulation introduces a scaling factor that controls the influence of inner auxiliary boxes and is expressed as
$$\text{Inner-GIoU} = \mathrm{IoU} - \lambda \, \frac{|C_{inner} - (b \cup b^{gt})|}{|C_{inner}|} \tag{12}$$
where $\lambda$ is a scaling coefficient that controls the contribution of the inner penalty term and $C_{inner}$ denotes the smallest enclosing box computed from the inner auxiliary boxes derived from the predicted and ground-truth boxes. The term $|C_{inner} - (b \cup b^{gt})|$ represents the area inside the inner enclosing box that is not covered by the union of the predicted and ground-truth boxes. Therefore, this penalty measures the spatial discrepancy between the two boxes within a more focused internal region rather than across the entire outer enclosing area. To be more specific, Inner-GIoU extends conventional GIoU by focusing on the inner auxiliary regions of the predicted and target boxes. By emphasizing the more informative central area rather than only the global enclosing box, it becomes more sensitive to small localization errors, which is especially useful for small objects and precise alignment tasks.
To define the inner auxiliary box for the ground-truth target, let the GT box center be $(x_c^{gt}, y_c^{gt})$, with width $w^{gt}$ and height $h^{gt}$. Using a scaling factor, $r$ (ratio), the left and right boundaries of the inner GT box are defined as
$$b_l^{gt} = x_c^{gt} - \frac{w^{gt} \cdot r}{2}, \qquad b_r^{gt} = x_c^{gt} + \frac{w^{gt} \cdot r}{2} \tag{13}$$
where $x_c^{gt}$ and $y_c^{gt}$ denote the center coordinates of the ground-truth box, $w^{gt}$ and $h^{gt}$ denote its width and height, and $r$ is a scaling ratio used to generate the inner auxiliary box. The terms $b_l^{gt}$ and $b_r^{gt}$ represent the left and right boundaries of the scaled inner ground-truth box, respectively. This formulation preserves the center position of the original ground-truth box while proportionally reducing its width according to $r$ when $r < 1$. The resulting inner box focuses on the more central and informative target region, which is beneficial for fine-grained localization.
Similarly, the top and bottom boundaries of the inner GT box are defined as
$$b_t^{gt} = y_c^{gt} - \frac{h^{gt} \cdot r}{2}, \qquad b_b^{gt} = y_c^{gt} + \frac{h^{gt} \cdot r}{2} \tag{14}$$
where $b_t^{gt}$ and $b_b^{gt}$ represent the top and bottom boundaries of the scaled inner ground-truth box. Thus, Equation (14) defines the vertical boundaries of the inner auxiliary ground-truth box by scaling the original box height around its center coordinate. More specifically, the vertical center, $y_c^{gt}$, remains unchanged, while the original height, $h^{gt}$, is multiplied by the scaling factor, $r$. As a result, the top and bottom boundaries are symmetrically adjusted with respect to the center of the original ground-truth box.
This ensures that the inner auxiliary box preserves the original target location while reducing or refining its vertical extent.
For the predicted box, let the center coordinates be $(x_c, y_c)$, with width $w$ and height $h$. The left and right boundaries of the corresponding inner predicted box are computed as
$$b_l = x_c - \frac{w \cdot r}{2}, \qquad b_r = x_c + \frac{w \cdot r}{2} \tag{15}$$
where $b_l$ and $b_r$ denote the left and right boundaries of the inner auxiliary predicted box, respectively; $x_c$ and $y_c$ are the center coordinates of the predicted bounding box; $w$ and $h$ denote its width and height; and $r$ is the scaling ratio. Equation (15) proportionally adjusts the horizontal extent of the predicted box while preserving its center position.
This formulation is consistent with the ground-truth inner box definition and enables a fair inner-region comparison during Inner-GIoU-based regression. Likewise, the top and bottom boundaries of the inner predicted box are given by
$$b_t = y_c - \frac{h \cdot r}{2}, \qquad b_b = y_c + \frac{h \cdot r}{2} \tag{16}$$
where $b_t$ and $b_b$ denote the top and bottom boundaries of the inner auxiliary predicted box, respectively; $y_c$ is the vertical center coordinate of the predicted bounding box; $h$ is its height; and $r$ is the scaling ratio. Equation (16) proportionally adjusts the vertical span of the predicted box while preserving its center position.
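The inner-box boundary definitions above can be illustrated with a short Python sketch. It scales a center-format box around its center by a ratio r and then compares the overlap of the full boxes with the overlap of their inner boxes for the same small shift. The ratio r = 0.5, the box values, and the function names are illustrative assumptions; the overlap measure used here is the plain IoU of the inner boxes, not the complete Inner-GIoU loss.

```python
def inner_box(xc, yc, w, h, r):
    """Scale a center-format box by ratio r about its center.

    Returns (left, top, right, bottom) of the inner auxiliary box.
    """
    half_w, half_h = w * r / 2, h * r / 2
    return (xc - half_w, yc - half_h, xc + half_w, yc + half_h)


def box_iou(a, b):
    """IoU of two (left, top, right, bottom) boxes with positive area."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


r = 0.5  # illustrative ratio, not the value used in the study
gt = (100.0, 100.0, 40.0, 20.0)    # (xc, yc, w, h), hypothetical target
pred = (104.0, 100.0, 40.0, 20.0)  # same size, shifted 4 pixels to the right

gt_inner = inner_box(*gt, r)
pred_inner = inner_box(*pred, r)
full_iou = box_iou(inner_box(*gt, 1.0), inner_box(*pred, 1.0))
inner_iou = box_iou(gt_inner, pred_inner)
print(gt_inner, full_iou, inner_iou)
```

For this example the same 4-pixel shift lowers the inner-box overlap more than the full-box overlap, which is the increased sensitivity to small localization errors described in the text.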
In Figure 6, the target and anchor boxes are shown separately: the target box (i.e., the blue solid box), which represents the GT bounding box, is on the left side, and the predicted or reference bounding box (i.e., the orange solid box) is on the right side. The smaller inner target region (blue dashed box) on the left side is derived from the target box by shrinking it proportionally along the width (w_inner) and height (h_inner) dimensions.
3.6. YOLOv12 Architectural Advancements
YOLOv12 is composed of three main parts: the backbone, neck, and head. First, the input image is fed into the backbone, where an image of size 640 × 640 × 3 is progressively processed through convolutional layers and R-ELAN (Residual Efficient Layer Aggregation Network) blocks to extract hierarchical multi-scale features. The convolution layers gradually reduce the spatial resolution from 640 × 640 to 320 × 320, 160 × 160, 80 × 80, 40 × 40, and finally 20 × 20, while increasing the depth of the feature representations. At each stage, the R-ELAN blocks enhance feature learning, preserve important information, and strengthen the representation capability of the network. The extracted multi-scale features are then passed through a Position Perceiver module to preserve spatial and positional information before being forwarded to the Flash Attention [70] A2 modules for further refinement.
Figure 7 shows the detailed architecture of YOLOv12.
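The resolution schedule of the backbone follows from repeated halving of the input size. The short Python sketch below reproduces the sequence listed above and derives the corresponding strides of the three coarsest maps, under the assumption that each downsampling step halves the spatial resolution; the function name is illustrative.

```python
def backbone_resolutions(input_size=640, num_downsamples=5):
    """Spatial sizes after successive stride-2 downsampling steps."""
    sizes = [input_size]
    for _ in range(num_downsamples):
        sizes.append(sizes[-1] // 2)
    return sizes


sizes = backbone_resolutions()
# Stride = input size divided by feature-map size for the 80, 40, and 20 maps.
strides = [sizes[0] // s for s in sizes[-3:]]
print(sizes, strides)
```

The 80 × 80, 40 × 40, and 20 × 20 maps are the ones fused by the neck in the following subsections.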
3.6.1. A2 Module
The A2 module refines the extracted features by efficiently capturing long-range dependencies and global contextual information, thereby enabling the backbone to generate stronger and more informative feature maps for subsequent detection tasks. In addition, A2 maintains a large receptive field while simplifying the attention mechanism to reduce the computational cost and improve the inference speed. It also incorporates segmented feature processing with Flash Attention, which reduces computational complexity by 50 percent through spatial reshaping while preserving broad contextual coverage. Moreover, A2 supports real-time detection at a fixed input resolution of n = 640 by utilizing optimized memory access patterns for more efficient execution.
3.6.2. R-ELAN Module
The neck is responsible for aggregating and refining the multi-scale features received from the backbone before they are forwarded to the detection head. It combines feature maps of different resolutions through up-sampling, concatenation, and convolution operations, enabling effective fusion of high-level semantic information with low-level spatial details. Starting from the deeper 20 × 20 feature map, the features are progressively up-sampled to 40 × 40 and 80 × 80, where they are concatenated with the corresponding features from earlier layers to enrich the representation. After each fusion stage, the combined features are processed by the R-ELAN + A2 modules, which further enhance feature extraction and contextual understanding through attention. In the downward path, convolution layers reduce the spatial resolution again, while concatenation merges features across different scales, allowing the network to preserve strong semantic information at the 40 × 40 and 20 × 20 resolutions.
3.6.2.1. CSPNet
The R-ELAN module is built upon a CSPNet [71]-inspired design, in which the input feature map is partially divided into different paths to improve gradient flow and reduce redundant computation. One portion of the features is forwarded directly, while the other passes through a sequence of convolutional and transformation operations before both branches are merged through concatenation. This structure helps preserve original information, enhance feature reuse, lower computational complexity, and improve learning efficiency. By incorporating this CSP-based strategy, the R-ELAN module is able to extract richer and more stable feature representations in both the backbone and the neck.
Figure 7. YOLOv12 architecture [72]. Modified YOLOv12 Architecture.
3.6.2.2. R-ELAN
The R-ELAN module is built upon a CSPNet-inspired design, in which the input feature map is partially divided into different paths to improve gradient flow and reduce redundant computation. One portion of the features is forwarded directly, while the other passes through a sequence of convolutional and transformation operations before both branches are merged through concatenation. This structure helps preserve original information, enhance feature reuse, lower computational complexity, and improve learning efficiency. By incorporating this CSP-based strategy, the R-ELAN module is able to extract richer and more stable feature representations in both the backbone and the neck. This design also strengthens feature propagation across layers.
3.6.3. C3K2
The R-ELAN module incorporates a C3K2 [73]-style CSP-based structure, in which the input features are first divided into multiple branches and then processed through convolutional layers, repeated blocks, and transition layers before being merged through concatenation. This design improves feature reuse, enhances gradient flow, reduces redundant computation, and enables more efficient and richer feature representation in both the backbone and the neck. Moreover, the integration of the C3K2-style CSP structure supports effective multi-branch feature learning without introducing excessive computational burden. It also facilitates better information propagation across layers.
3.6.4. Multi-Scale Detection Head
The refined multi-scale features from the neck are forwarded to three Flash Attention A2-based detection branches to generate the final predictions. Each branch operates at a different feature scale, enabling the model to detect objects of various sizes more effectively. Within each branch, the Flash Attention A2 module further enhances feature representation by capturing important contextual relationships and emphasizing relevant regions before prediction. Following this refinement, the Detect layer performs the final object detection task by predicting object locations, class probabilities, and confidence scores. The outputs from all detection scales are then combined to produce the final detection result.
3.7. Swin Transformer Architecture
The ST is a modern ViT framework that captures strong visual information using self-attention and enhances multi-scale feature representation for MRI scans, which helps it attain high accuracy. These properties enable the efficient processing of large-scale image data and make the ST a highly effective architecture for complex computer vision tasks. The overall structure of the ST is shown in Figure 8. The architecture comprises four main stages, where the input image is initially partitioned into small patches and subsequently processed through multiple transformer blocks within the backbone to extract hierarchical feature representations. In the initial Stage 0, the MRI brain images are taken as input. After that, the patch partitioning process divides the image into separate, non-overlapping 4 × 4 patches called “tokens”. In Stage 2, the model captures the broader regional tumor characteristics, such as clustered or irregular shapes. In this stage, every 2 × 2 group of neighboring patches is merged into a single feature, making a 4C-dimensional feature vector, which is then projected to an output dimension of 2C. This process reduces the number of tokens: it halves the image resolution to H/8 × W/8 while doubling the channel depth, which gives a wider contextual view of larger tumor regions. The same patch merging and feature transformation process is then executed twice more, in “Stage 3” and “Stage 4”, with output resolutions of H/16 × W/16 and H/32 × W/32, respectively. In the final Stage 5, classification determines the decision boundary between the two possible classes, “Tumor” and “No Tumor”: the SoftMax function outputs the per-class probabilities that decide whether the brain contains a tumor.
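The shape changes produced by patch partitioning and repeated patch merging can be traced with simple arithmetic. The Python sketch below lists the token-grid resolution and channel depth at the four feature resolutions (H/4 to H/32). The input size of 224 × 224 and the embedding dimension C = 96 are illustrative assumptions and are not stated in this section; the function name is hypothetical.

```python
def swin_stage_shapes(height, width, embed_dim, patch_size=4, num_stages=4):
    """Return (H, W, C) of the token grid at each hierarchical resolution."""
    # Patch partition: each patch_size x patch_size patch becomes one token.
    h, w, c = height // patch_size, width // patch_size, embed_dim
    shapes = [(h, w, c)]
    for _ in range(num_stages - 1):
        # Patch merging: each 2 x 2 neighborhood is concatenated (4C channels)
        # and projected to 2C, which halves the resolution.
        h, w, c = h // 2, w // 2, 2 * c
        shapes.append((h, w, c))
    return shapes


shapes = swin_stage_shapes(224, 224, embed_dim=96)
tokens = [h * w for h, w, _ in shapes]
print(shapes, tokens)
```

Each merging step cuts the number of tokens by a factor of four while doubling the channel depth, which is the trade-off between spatial detail and contextual coverage described above.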
3.7.1. Window-Based Self-Attention (WSA) Architecture
In the ST block, WSA is a module that divides an image into small windows and calculates self-attention within each window, while the shifted-window configuration connects neighboring windows so that information can propagate across the whole image. In this block, the model first analyzes large regions to capture the overall brain structure, then focuses on smaller regions to extract tumor details. As illustrated in Figure 9, two types of windows, local and shifted, are used to enable clearer detection of the tumor region. In the local window self-attention mechanism, the whole MRI image is split into many smaller, fixed-size windows (e.g., 4 × 4, 8 × 8, or 16 × 16). Each window acts as an enclosed region, where self-attention is computed only among the patches inside that specific window. By using only local attention within a window, the model becomes highly sensitive to minor pixel-level abnormalities, which are important for detecting tumors that might be too small or low-contrast for conventional methods to identify. This design makes the attention process computationally efficient because it attends to local neighborhoods rather than the entire image, significantly reducing the computational cost relative to global self-attention. Furthermore, this localized attention mechanism is highly suitable for brain MRI analysis, where tumor regions often exhibit subtle structural and intensity variations.
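The cost saving of window-based attention can be made concrete by counting the token pairs that attention has to score. The following Python sketch compares global self-attention with windowed self-attention on a token grid. The grid size of 56 × 56 and the window size of 7 are illustrative assumptions, and the pair count is a simplified proxy that ignores channel dimensions and projection layers.

```python
def attention_pairs_global(h_tokens, w_tokens):
    """Token pairs scored when every token attends to every other token."""
    n = h_tokens * w_tokens
    return n * n


def attention_pairs_windowed(h_tokens, w_tokens, window):
    """Token pairs scored when attention is restricted to each window."""
    if h_tokens % window or w_tokens % window:
        raise ValueError("token grid must be divisible by the window size")
    num_windows = (h_tokens // window) * (w_tokens // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


global_pairs = attention_pairs_global(56, 56)
window_pairs = attention_pairs_windowed(56, 56, window=7)
print(global_pairs, window_pairs, global_pairs / window_pairs)
```

Under these assumptions, the windowed variant scores 64 times fewer pairs, and its cost grows linearly with the number of tokens for a fixed window size rather than quadratically.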
3.7.2. Window-Based Multi-Head Self-Attention (MSA)
In each window of the ST, an MSA process separately examines fine tissue patterns, brightness variations, and local relationships in the specific brain region. This part is very important for tumor detection because it ensures that abnormalities and irregular tissue growth are clearly distinguished from the neighboring windows. In Figure 10, the MRI image is divided into four windows, and each window is observed individually to capture localized tumor features within particular brain regions. Following this, every window is analyzed, and the results are merged to give the model a full picture of the brain image. This localized attention strategy enables the model to extract discriminative regional features while maintaining computational efficiency. It is particularly beneficial for brain MRI analysis, where tumors may appear with a small size, irregular shape, or low contrast against surrounding tissues. The independent analysis of each window supports precise modeling of local structural variations and abnormal patterns. After processing all windows, the outputs are compiled:
where $X_i$ indicates the feature extracted from the ith window of the MRI image. Once all the windows (i = 1 to n) are processed, the outputs are recombined to form the complete feature map. In this step, each window output is placed back in its correct spatial location (see Algorithm 1).
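The split-and-recombine step described above can be sketched in plain Python. The example partitions a small grid into non-overlapping windows and then places each window back at its original position. The 4 × 4 grid, the window size of 2, and the function names are illustrative assumptions; a real implementation would operate on feature tensors and apply attention to each window between the two steps.

```python
def window_partition(grid, m):
    """Split a 2D grid into non-overlapping m x m windows (row-major order)."""
    h, w = len(grid), len(grid[0])
    if h % m or w % m:
        raise ValueError("grid size must be divisible by the window size")
    return [
        [row[left:left + m] for row in grid[top:top + m]]
        for top in range(0, h, m)
        for left in range(0, w, m)
    ]


def window_merge(windows, h, w):
    """Place each m x m window back at its original location in an h x w grid."""
    m = len(windows[0])
    per_row = w // m
    grid = [[None] * w for _ in range(h)]
    for i, win in enumerate(windows):
        top, left = (i // per_row) * m, (i % per_row) * m
        for dy, row in enumerate(win):
            grid[top + dy][left:left + m] = row
    return grid


# Toy "feature map" whose entries are their own flat indices.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = window_partition(grid, 2)   # four windows, as in a 2 x 2 layout
restored = window_merge(windows, 4, 4)
print(windows[0], restored == grid)
```

The round trip returns the original grid, which confirms that each window output lands at the position it was taken from.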
| Algorithm 1. Proposed method architecture |
| Input: Brain MRI image X |
| Output: Tumor/healthy prediction Y and bounding box localization B |
| Begin |
| 1: P ← Patch Partition (X, 4 × 4) // divide the MRI image into non-overlapping patches |
| 2: T ← LinearEmbedding (P) // convert image patches into token embeddings |
| 3: C3 ← SwinStage1 (T) // extract shallow local-contextual features |
| 4: C4 ← SwinStage2 (Patch Merging(C3)) // reduce resolution and learn deeper features |
| 5: C5 ← SwinStage3 (Patch Merging(C4)) // capture richer multi-scale representations |
| 6: C5′ ← SwinStage4 (Patch Merging(C5)) // refine deepest global contextual features |
| 7: Ftd ← FPN ({C3, C4, C5′}) // fuse multi-scale features through top-down pathway |
| 8: Fbu ← PANet (Ftd) // enhance localization-sensitive features via bottom-up pathway |
| 9: {P3, P4, P5, P6} ← Pyramid Features (Fbu) // generate enriched pyramid feature maps |
| 10: (Y, B) ← YOLOv12Head ({P3, P4, P5, P6}) // predict class scores and bounding boxes |
| 11: Return (Y, B) // output final classification and localization result |
| End |
7. Conclusions
In this study, we proposed a hybrid Swin–YOLO framework to address the limitations of the conventional CNN-based YOLOv12 model for brain tumor (BT) detection. The proposed framework integrates the efficient local feature extraction capability of YOLOv12 with the strong global contextual representation ability of the ST. To establish a clear baseline, all YOLOv12 variants (n, s, m, l, and x) were first trained and comparatively evaluated. Based on these results, the backbone and neck of the YOLOv12 architecture were redesigned to develop the proposed hybrid model, aiming to improve feature representation, localization precision, and detection robustness. Experimental results on the Br35H dataset demonstrated that the proposed model achieved strong performance, attaining an accuracy of 99.7%, a precision of 98.8%, a recall of 99.7%, an F1-score of 99.2%, an mAP@50 of 99.4%, and an mAP@50:95 of 87.2%. In addition, the confusion matrix, ROC analysis, MCC, training and validation curves, and detection results further confirmed the model’s effectiveness and stability. The integration of XAI approaches further enhanced transparency by identifying clinically relevant tumor regions in MRI images, while the developed NeuroVision AI application demonstrated the potential practical utility of the framework in an assistive clinical environment. However, these promising findings should be interpreted with caution, as the limited size and diversity of the Br35H dataset may restrict the generalizability of the results to broader clinical settings. In addition, the hybrid Swin–YOLO design improves computational efficiency compared with YOLO-only and Swin Transformer-only models; however, repeated-run experiments, statistical significance analysis, and cross-validation were not fully investigated in the current study. 
Future work will focus on validating the proposed model on larger, more diverse, and multi-center MRI datasets, together with more rigorous statistical analysis, to further confirm its robustness and clinical applicability.