1. Introduction
Currently, the global incidence and mortality rates of tumors continue to rise. According to data from the Surveillance, Epidemiology, and End Results program between 2017 and 2021, the overall incidence rate of brain tumors and other central nervous system (CNS) tumors in the United States was 6.2 cases per 100,000 people per year [1]. The GLOBOCAN 2022 report shows that in 2022, there were 321,476 new cases of brain tumors worldwide, accounting for 1.6% of all cancer cases, ranking 19th in incidence; the number of deaths reached 248,305, representing 2.6% of all cancer-related deaths, placing brain tumors 12th in mortality [2].
The growth of a brain tumor threatens human health: it compresses and damages brain tissue and interferes with normal brain function [3]. Depending on the location and nature of the tumor, it may lead to serious consequences such as blindness, paralysis, epilepsy, and sudden death, as well as a variety of neurological dysfunctions, severely affecting patients' quality of life [4]. Therefore, early and accurate detection of brain tumors is crucial for devising effective treatment strategies [5]. At present, brain tumor detection primarily relies on methods such as cranial CT scans [6], brain MRI [7], brain PET scans [8], biopsies, and neurological examinations.
In clinical practice, MRI is the primary modality for examining brain tumors and currently provides the highest achievable diagnostic accuracy. Medical imaging technology offers significant advantages in tumor detection [9], but manual analysis still faces many challenges: it is time-consuming and error-prone, and variations in clinicians' expertise and fatigue often lead to inconsistent diagnoses. This limitation is particularly evident when large-scale patient datasets must be processed, where the efficiency and accuracy of manual diagnosis often fail to meet the clinical need for rapid decision making. Automated tumor detection based on deep learning [10] has therefore become an important way to address this problem.
Currently, many deep learning models are being applied to tumor detection and segmentation tasks [11]. Olaf Ronneberger et al. proposed UNet, an encoder–decoder architecture that became a milestone in medical image segmentation due to its symmetric structure and skip connections [12]. Yaopeng Peng et al. introduced UNetV2, which improves detail preservation by automatically adapting network configurations and optimizing the feature fusion mechanism [13]. Hasib Zunair et al. developed Sharp UNet, which integrates an edge enhancement module to significantly improve the sharpness of tumor boundaries [14]. Berkay Eren introduced TransUNet, a structure that combines convolutional layers with a transformer architecture to enhance global context modeling [15]. Lijuan Yang proposed a novel network framework named MUNet, which focuses on multi-scale information fusion for more accurate brain tumor segmentation [16]. Zongwei Zhou et al. proposed UNet++, which redesigns the skip connections using nested and dense pathways to better capture multi-scale features and reduce the semantic gap between the encoder and decoder [17].
Despite the rich variety of UNet variants, several common drawbacks still hinder their application in clinical-grade brain tumor segmentation. For instance, while architectures like UNetV2 and UNet++ improve multi-scale feature fusion, they may struggle to simultaneously preserve boundary precision for very small lesions and capture global context for large, irregular tumors, especially under low-contrast or noisy MRI conditions. Even when edge enhancement modules (Sharp UNet) or attention mechanisms (TransUNet) are introduced, the purely convolutional backbone often remains insensitive to subtle texture variations, leading to incomplete or blurred segmentations. Moreover, these enhancements typically incur significant computational and memory overhead, making real-time inference on standard medical hardware challenging.
YOLO has demonstrated unique advantages in tumor detection tasks [18]. YOLO is a real-time object detection algorithm that divides the image into grids and performs efficient, end-to-end object detection through a single forward pass, predicting bounding boxes and class probabilities. The concept of YOLO was first introduced by Joseph Redmon et al. in 2015 [19]. Nadim Mahmud Dipu implemented seven different neural network-based object detection frameworks and algorithms, including YOLOv3 PyTorch, YOLOv4 Darknet, Scaled YOLOv4, YOLOv4 Tiny, YOLOv5, Faster R-CNN, and Detectron2 [20]. Chanu Maibam Mangalleibi et al. found that CNN- and YOLOv3-based approaches showed greater potential for brain tumor classification [21]. Akmalbek Bobomirzaevich Abdusalomov achieved higher accuracy by fine-tuning the YOLOv7 model through transfer learning [22]. Yuan Zizhong made a series of improvements to YOLOv3 and YOLOv5 by introducing Efficient-Rep GFPN and a decoupled head structure into YOLOv5, enhancing its feature extraction and semantic information transmission capabilities [23]. Rafia Ahsan et al. combined YOLOv5 with 2D U-Net, enabling both the detection of different types of brain tumors and the precise delineation of tumor regions within the predicted bounding boxes [24]. Naira Elazab incorporated ResNet into the YOLOv5 framework as a feature extractor, enabling accurate classification and localization of tumors in histopathological images [25]. Karacı Abdulkadir proposed a three-stage hybrid classification framework based on YOLO, DenseNet, and Bi-LSTM, capable of classifying gliomas, meningiomas, and pituitary tumors [26]. The latest version, YOLOv11, released on 30 September 2024, achieves significant improvements in both accuracy and efficiency.
Despite the advancements embodied in YOLOv11 as a state-of-the-art iteration of the YOLO family, its architectural optimization is primarily tailored for natural image domains. When directly applied to the task of brain tumor segmentation in medical imaging, the model exhibits several inherent limitations.
Specifically, the backbone of YOLOv11 emphasizes global semantic representation, rendering it insufficiently responsive to fine-grained boundary cues and small-scale abnormalities. This leads to frequent omission or incomplete detection of small tumors, especially under conditions of low texture contrast and blurred anatomical boundaries. Although YOLOv11 introduces certain cross-layer connections, its capacity for handling targets with drastic scale variation, which are typical in brain tumor imagery, remains inadequate, resulting in inconsistent performance across large and small tumor instances. Moreover, its reliance on bounding box regression for localization impairs its precision when delineating tumors with ambiguous or complex margins, often causing deviation in boundary alignment. Critically, the accurate interpretation of pathological regions in medical images necessitates the integration of contextual information, a capacity in which YOLOv11 is notably limited. Consequently, the model is vulnerable to misclassification induced by image noise and irrelevant artifacts.
In response to these challenges, this study introduces YOLO-BT, a multi-scale feature fusion framework built upon YOLOv11. We hypothesize that by deeply integrating hierarchical features with adaptive attention mechanisms, YOLO-BT can substantially enhance the segmentation fidelity of irregularly shaped brain tumors in MRI scans, while simultaneously reducing computational latency. Experimental validation on the Figshare Brain Tumor dataset substantiates this hypothesis; YOLO-BT not only surpasses YOLOv11 in both bounding box-level and mask-level performance metrics but also achieves significant improvements in mIoU (abbreviations are listed in Table 1) and the Dice coefficient, thereby affirming the proposed method's efficacy in elevating both detection efficiency and segmentation accuracy in the medical imaging context. The main contributions of this work are summarized as follows:
In this study, the network structure of YOLOv11 is improved and the YOLO-BT algorithm is proposed. Building on the original network, the algorithm strengthens feature extraction, improves the accuracy and efficiency of brain tumor detection and segmentation, and increases the model's adaptability to diverse brain tumor morphologies.
Experiments are carried out on the Figshare Brain Tumor dataset, where YOLO-BT outperforms the comparison algorithms. The experimental results show that YOLO-BT handles brain tumors with complex shapes and large size differences well and exhibits significantly improved segmentation accuracy and detection efficiency compared with traditional methods.
2. Materials and Methods
This section introduces the design and implementation of the proposed YOLO-BT algorithm, with a focus on the optimization of the model architecture, and concludes with the evaluation metrics used to assess the model's performance. Derived from YOLOv11, the proposed model features several critical modifications: UNetV2 is adopted as the backbone network, BiFPN is employed for multi-scale feature fusion, and the D-LKA mechanism is introduced to augment contextual information extraction. Each module's design rationale and implementation specifics are examined here, alongside their collective influence on the overall performance of the model.
YOLOv11 introduces three major improvements over YOLOv8: when C3k2 is set to True, the C2f module is replaced by C3k2; a new mechanism called C2PSA is proposed, which embeds a multi-head attention mechanism into the C2 module; and two depthwise separable convolution layers are added to the detection head in the head network [27].
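As an illustration of the third change, the sketch below shows a generic depthwise separable convolution block of the kind added to the detection head. This is a minimal PyTorch sketch; the module name and layer settings are illustrative assumptions, not the official Ultralytics implementation.

```python
import torch
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depthwise separable convolution (sketch): a per-channel (depthwise)
    spatial convolution followed by a 1x1 pointwise convolution that mixes
    channels. Names and settings are illustrative, not the Ultralytics code."""

    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Depthwise filtering, channel mixing, then normalization and activation.
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```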
Considering the performance and lightweight requirements of segmentation tasks, this study selects YOLOv11n-Seg as the research baseline. The network structure of YOLOv11n-Seg is shown in Figure 1.
2.1. YOLO-BT Algorithm
Based on YOLOv11n-Seg, we propose an algorithm better suited to brain tumor segmentation, named YOLO-BT. The YOLO-BT architecture and its input/output are shown in Figure 2, and the network structure layers are detailed in Table 2. Compared with YOLOv11, YOLO-BT has three major architectural features:
First, the UNetV2 network is used to replace the backbone. Compared with the original fully convolutional backbone structure, UNetV2 preserves shallow spatial features through long-distance skip connections, effectively addressing YOLOv11’s insensitivity to tumor boundaries while improving segmentation accuracy.
Second, BiFPN is introduced into the Neck network. Unlike traditional Concat or FPN structures, BiFPN features bidirectional information flow and weighted feature fusion, enhancing the model’s adaptability to brain tumors of varying sizes and addressing YOLOv11’s imbalance in response to small targets.
Third, a deformable large kernel attention mechanism is integrated into the neck structure. By leveraging deformable large kernel convolution, the model’s contextual awareness of tumor regions is enhanced, leading to more accurate tumor edge detection. This significantly improves YOLO-BT’s ability to model complex-shaped and poorly defined brain tumors, thereby effectively increasing segmentation precision.
The workflow of the YOLO-BT system, along with the corresponding mathematical derivations, is presented as follows.
Let the input MRI image be denoted as $X \in \mathbb{R}^{H \times W \times 3}$. Multi-scale feature maps are extracted using the UNetV2 encoder, defined as
$$F_i = E_i(X), \quad i = 1, 2, 3, 4,$$
where the downsampling rate increases with the level $i$, and the number of channels $C_i$ follows the expansion rule $C_{i+1} = 2C_i$ (e.g., 64, 128, 256, 512).
Let the encoder output features be $\{F_i\}_{i=1}^{4}$. Cross-level feature interaction is constructed through BiFPN to form a pyramid structure:
$$P_i^{out} = \mathrm{Conv}\left(\sum_{j} w_j \cdot \mathrm{Resize}_{j \to i}\big(P_j^{in}\big)\right),$$
where $i$ denotes the current level and $\mathrm{Resize}_{j \to i}(\cdot)$ is an adaptive sampling operator: bilinear upsampling is applied when $j > i$, and strided-convolution downsampling is applied when $j < i$. $P_j^{in}$ and $P_i^{out}$ represent the input and output features at levels $j$ and $i$, respectively.
Feature extraction is further enhanced by D-LKA:
$$F' = \mathrm{GELU}\big(\mathrm{Conv}(F)\big), \qquad \bar{F} = \mathrm{Attn}(F') \otimes F',$$
where $F$ denotes the input feature map, $F'$ denotes the intermediate feature map after convolution and GELU activation, and $\otimes$ represents element-wise multiplication.
The final outputs include detection and segmentation results.
Detection output:
$$Y_{det} \in \mathbb{R}^{S \times S \times (5 + C)};$$
Segmentation output:
$$Y_{seg} \in \mathbb{R}^{H \times W \times K},$$
where each $S \times S$ grid predicts 5 bounding box parameters $(x, y, w, h, c)$ and class probabilities for $C$ categories. The confidence score $c$ is output via a Sigmoid activation, and $K$ denotes the number of segmentation classes.
Thus, the final formulation of the YOLO-BT system can be expressed as
$$Y = H\Big(D\big(B(E(X))\big)\Big),$$
where $E$ represents the UNetV2 encoder, $B$ denotes BiFPN-based feature fusion, $D$ stands for D-LKA enhancement, and $H$ is the multi-task prediction head.
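To make this composition concrete, the following minimal PyTorch-style sketch expresses the same pipeline; the four sub-modules are hypothetical placeholders for the components described above, not the actual implementation.

```python
import torch
import torch.nn as nn

class YOLOBT(nn.Module):
    """Sketch of the pipeline Y = H(D(B(E(X)))). The encoder, BiFPN, D-LKA,
    and head modules are assumed placeholders standing in for the components
    described in the text."""

    def __init__(self, encoder: nn.Module, bifpn: nn.Module,
                 dlka: nn.Module, head: nn.Module):
        super().__init__()
        self.encoder = encoder  # E: UNetV2 encoder -> multi-scale features F_i
        self.bifpn = bifpn      # B: bidirectional weighted feature fusion
        self.dlka = dlka        # D: deformable large kernel attention
        self.head = head        # H: multi-task prediction (boxes + masks)

    def forward(self, x: torch.Tensor):
        feats = self.encoder(x)                    # features at four scales
        fused = self.bifpn(feats)                  # cross-level weighted fusion
        enhanced = [self.dlka(f) for f in fused]   # per-level context enhancement
        return self.head(enhanced)                 # detection and segmentation outputs
```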
2.2. UNetV2 Network
The backbone network of YOLOv11-Seg is mainly composed of convolutional layers and C3k2 modules, which are well-suited for general detection tasks. However, for brain tumor segmentation—where the environment is well defined yet highly complex—the UNetV2 network offers clear advantages. By enhancing skip connections and multi-scale feature fusion, UNetV2 enables more accurate reconstruction of segmentation details, especially when handling complex morphologies and irregular boundaries in medical images. This significantly improves segmentation precision and allows for richer contextual information capture.
The UNetV2 architecture is shown in Figure 3 and mainly consists of an encoder, a decoder, and the SDI (Semantics and Detail Infusion) modules that link them. The encoder generates four feature maps with different resolutions, which are then processed by the SDI module, applying both spatial and channel attention mechanisms to the feature maps:
$$F_i^{1} = \phi_i^{s}\big(\phi_i^{c}(F_i^{0})\big),$$
where $F_i^{0}$ denotes the input feature map at the $i$-th level, $\phi_i^{c}$ and $\phi_i^{s}$ represent the channel attention mechanism and spatial attention mechanism at the $i$-th level, respectively, and $F_i^{1}$ is the output feature map at the $i$-th level after applying $\phi_i^{c}$ and $\phi_i^{s}$.
Next, a 1 × 1 convolution is applied to reduce the channel dimension, yielding the feature map $F_i^{2} \in \mathbb{R}^{H_i \times W_i \times c}$, where $c$ is the unified channel number. For feature maps from different levels, adaptive average pooling, identity mapping, and bilinear interpolation are applied to generate feature maps $F_{ij}^{3}$ with the same resolution. These are then refined using a 3 × 3 smoothing convolution to produce the adjusted feature maps:
$$F_{ij}^{4} = \theta_{ij}\big(F_{ij}^{3}\big),$$
where $\theta_{ij}$ represents the parameters of the smoothing convolution and $F_{ij}^{4}$ denotes the $j$-th smoothed feature map at the $i$-th level.
Finally, all feature maps at the $i$-th level undergo element-wise Hadamard multiplication, resulting in a feature map that contains rich semantic information and details.
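A minimal sketch of this SDI fusion step is given below, assuming four encoder levels; the class name, the unified channel number, and the omission of the attention sub-steps are simplifications for illustration rather than the reference UNetV2 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDI(nn.Module):
    """SDI fusion (sketch): every level's feature map is reduced to c channels
    by a 1x1 conv, resized to the target level's resolution (adaptive average
    pooling to downsample, identity if equal, bilinear interpolation to
    upsample), smoothed by a 3x3 conv, and fused by Hadamard multiplication."""

    def __init__(self, channels_per_level, c: int = 64):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(ch, c, 1) for ch in channels_per_level)
        self.smooth = nn.ModuleList(nn.Conv2d(c, c, 3, padding=1)
                                    for _ in channels_per_level)

    def forward(self, feats, i: int) -> torch.Tensor:
        h, w = feats[i].shape[2:]
        out = None
        for j, f in enumerate(feats):
            f = self.reduce[j](f)                  # 1x1 conv: unify channels to c
            if f.shape[2] > h:                     # higher resolution -> pool down
                f = F.adaptive_avg_pool2d(f, (h, w))
            elif f.shape[2] < h:                   # lower resolution -> upsample
                f = F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
            f = self.smooth[j](f)                  # 3x3 smoothing convolution
            out = f if out is None else out * f    # element-wise (Hadamard) fusion
        return out
```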
By integrating the attention mechanism of UNetV2 into YOLOv11 and replacing the upper portion of its backbone network, YOLO-BT’s ability to extract features from key regions is significantly enhanced.
2.3. Bidirectional Feature Pyramid Network
In MRI images, most tumors vary in size, have irregular shapes, and possess blurred edges, which can lead to significant errors during detection. BiFPN addresses this issue by optimizing cross-scale connections and applying weighted feature fusion, thereby improving accuracy. It achieves comparable precision with fewer parameters and reduced computational cost, resulting in faster inference. As shown in Figure 4, BiFPN overcomes the limitation of traditional FPN's unidirectional information flow by introducing a bottom-up path aggregation network, removing nodes with only a single input edge, and eliminating nodes that contribute little to the feature network, resulting in a streamlined bidirectional architecture.
BiFPN adopts fast normalized fusion to effectively merge features of different resolutions. When fusing these multi-scale features, BiFPN assigns a learnable weight to each input, enabling the network to learn the relative importance of each feature:
$$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j} \cdot I_i,$$
where $I_i$ denotes the $i$-th input feature map and $w_i$ is a learnable weight. These weights are normalized into probabilities between 0 and 1, and ReLU is applied after each weight to ensure $w_i \geq 0$. A small constant $\epsilon$ is introduced to prevent numerical instability, and $O$ represents the normalized output feature map.
Taking level 6 as an example:
$$P_6^{td} = \mathrm{Conv}\left(\frac{w_1 \cdot P_6^{in} + w_2 \cdot \mathrm{Resize}\big(P_7^{in}\big)}{w_1 + w_2 + \epsilon}\right),$$
$$P_6^{out} = \mathrm{Conv}\left(\frac{w_1' \cdot P_6^{in} + w_2' \cdot P_6^{td} + w_3' \cdot \mathrm{Resize}\big(P_5^{out}\big)}{w_1' + w_2' + w_3' + \epsilon}\right),$$
where $P_6^{in}$ and $P_7^{in}$ represent the input features at levels 6 and 7, respectively; $w_1$, $w_2$, $w_1'$, $w_2'$, and $w_3'$ are the weight coefficients; $\mathrm{Resize}$ refers to upsampling or downsampling operations; and $\epsilon$ is a small constant to avoid numerical instability. $P_6^{td}$ is the intermediate feature at level 6 in the top-down path, while $P_5^{out}$ and $P_6^{out}$ represent the output features at levels 5 and 6 in the bottom-up path, respectively.
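The fast normalized fusion rule itself is compact enough to sketch directly; the snippet below is a minimal stand-alone version with ReLU-constrained learnable weights, assuming the inputs have already been resized to a common resolution (it is not the EfficientDet reference code).

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """O = sum_i (w_i / (eps + sum_j w_j)) * I_i, with w_i kept non-negative
    by a ReLU so the normalized weights lie between 0 and 1."""

    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # one learnable weight per input
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)          # enforce w_i >= 0
        w = w / (self.eps + w.sum())    # fast normalization without softmax
        return sum(wi * x for wi, x in zip(w, inputs))
```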
By integrating BiFPN into YOLOv11, replacing the original simple Concat connections, each feature level learns to adaptively fuse information from other levels through weighted fusion. This not only improves YOLO-BT’s feature representation capability but also enhances computational efficiency by pruning edge nodes and reducing the number of parameters.
2.4. Deformable Large Kernel Attention
Traditional convolutional neural networks face challenges in object detection across different scales when performing image segmentation. If an object exceeds the receptive field of the corresponding network layer, segmentation may be incomplete. On the other hand, a larger receptive field may incorporate background information that, relative to the actual size of the object, can adversely influence the prediction. To address this, D-LKA is introduced; its specific structure is shown in Figure 5.
Large Kernel Attention uses depthwise separable convolution layers and depthwise separable dilated convolutions to construct large convolution kernels with few parameters. For an input of size $H \times W$ with $C$ channels and an overall kernel size of $K$, the kernel sizes of the depthwise convolution and the depthwise dilated convolution are given in Equations (13) and (14):
$$k_{DW} = (2d - 1) \times (2d - 1), \quad (13)$$
$$k_{DW\text{-}D} = \left\lceil \frac{K}{d} \right\rceil \times \left\lceil \frac{K}{d} \right\rceil, \quad (14)$$
where $d$ is the dilation rate, $k_{DW\text{-}D}$ represents the kernel of the dilated depthwise separable convolution, $H \times W$ is the size of the input feature map, and $N$ is the total number of blocks.
Building upon Large Kernel Attention, D-LKA introduces deformable convolutions, which adjust the sampling grid using learned offsets, enabling free-form deformation. Additional convolutional layers learn the deformation from the feature map, creating an offset field. Learning deformation based on the features themselves yields an adaptive convolution kernel that improves the representation of deformed objects and sharpens the definition of tumor boundaries. The D-LKA module can be expressed as
$$F' = \mathrm{GELU}\big(\mathrm{Conv}(F)\big),$$
$$\mathrm{Attn} = \mathrm{Conv}_{1 \times 1}\Big(\mathrm{DW\text{-}D\text{-}Conv}\big(\mathrm{DW\text{-}Conv}(F')\big)\Big),$$
$$\bar{F} = \mathrm{Attn} \otimes F',$$
where $F$ represents the input feature map, with $F \in \mathbb{R}^{C \times H \times W}$; $F'$ denotes the intermediate feature map obtained after the convolution and GELU activation; $\mathrm{DW\text{-}D\text{-}Conv}$ refers to the dilated depthwise separable convolution, while $\mathrm{Attn}$ represents the attention map produced by the combination of the dilated depthwise separable convolution and a pointwise ($1 \times 1$) convolution. The symbol $\otimes$ denotes element-wise multiplication, and the final output is denoted as $\bar{F}$.
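For reference, a plain (non-deformable) large kernel attention block following this decomposition can be sketched as below, with $K = 21$ and $d = 3$ assumed for concreteness; in D-LKA the two depthwise convolutions would be replaced by their deformable counterparts, which are omitted here for brevity.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Large kernel attention (sketch): a 21x21 receptive field decomposed into
    a (2d-1)x(2d-1) = 5x5 depthwise conv, a ceil(K/d)xceil(K/d) = 7x7 depthwise
    dilated conv (d = 3), and a 1x1 pointwise conv; the result gates the input."""

    def __init__(self, c: int):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 5, padding=2, groups=c)                # 5x5 depthwise
        self.dw_d = nn.Conv2d(c, c, 7, padding=9, groups=c, dilation=3)  # 7x7 dilated depthwise
        self.pw = nn.Conv2d(c, c, 1)                                     # pointwise mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dw_d(self.dw(x)))  # attention map Attn
        return attn * x                        # element-wise multiplication
```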
In this study, the C3k2-D-LKA module is used to replace the C3k2 module in the neck of YOLOv11, enhancing the ability of C3k2 to fuse important features and improving the segmentation of tumor lesion boundaries.
2.5. Experimental Evaluation Metrics
Common evaluation metrics used in YOLO include Box_P, Box_R, Box_mAP50, Box_mAP50-95, Mask_P, Mask_R, Mask_mAP50, and Mask_mAP50-95. They are defined as
$$P = \frac{TP}{TP + FP},$$
$$R = \frac{TP}{TP + FN},$$
where $TP$ represents the number of samples correctly predicted as positive, $FP$ represents the number of samples incorrectly predicted as positive, and $FN$ represents the number of samples incorrectly predicted as negative.
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i,$$
where $N$ refers to the number of classes and AP stands for Average Precision, with its calculation formula shown in Equation (21):
$$AP = \int_0^1 P(R)\, dR. \quad (21)$$
Box_mAP50 refers to the mean Average Precision at an IoU threshold of 50%, while Box_mAP50-95 represents the average mAP calculated over IoU thresholds ranging from 50% to 95% with an interval of 0.05.
For segmentation, $TP$ denotes the total number of pixels correctly segmented as tumor, $FP$ indicates the number of pixels incorrectly segmented as tumor, and $FN$ refers to the number of pixels incorrectly segmented as non-tumor. IoU is a metric used in image segmentation tasks to evaluate the degree of overlap between the predicted results and the ground truth labels; its calculation formula is shown in Equation (25). IoU represents the ratio of the intersection area to the union area between the predicted region and the ground truth region; the larger the value, the higher the degree of matching. mIoU is the average obtained by summing the IoU values of all classes and taking the mean, as shown in Equation (26):
$$IoU = \frac{TP}{TP + FP + FN}, \quad (25)$$
$$mIoU = \frac{1}{N} \sum_{i=1}^{N} IoU_i. \quad (26)$$
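As a minimal illustration, the snippet below computes these metrics from pre-accumulated pixel counts; the helper names are ours, and this is a sketch rather than the actual evaluation script.

```python
import numpy as np

def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP)."""
    return tp / (tp + fp + 1e-9)

def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN)."""
    return tp / (tp + fn + 1e-9)

def iou(tp: int, fp: int, fn: int) -> float:
    """IoU = TP / (TP + FP + FN) for one class (Equation (25))."""
    return tp / (tp + fp + fn + 1e-9)

def miou(per_class_counts) -> float:
    """mIoU: mean of the per-class IoU values (Equation (26))."""
    return float(np.mean([iou(tp, fp, fn) for tp, fp, fn in per_class_counts]))

# mAP50-95 averages AP over the IoU thresholds 0.50, 0.55, ..., 0.95.
iou_thresholds = np.arange(0.50, 1.00, 0.05)
```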
4. Discussion
4.1. Comparative Analysis
The YOLO-BT algorithm proposed in this study demonstrates significant performance advantages in brain tumor detection tasks. Compared to the original YOLOv11, YOLO-BT incorporates a UNetV2-based backbone network that effectively fuses semantic and detailed features. This design alleviates the limitations of traditional models in extracting features from blurred tumor boundaries and significantly improves boundary localization accuracy. The BiFPN module enables dynamic, cross-level feature fusion, enhancing the model's adaptability to tumor structures of varying scales. In addition, the D-LKA module combines the global perception ability of large convolutional kernels with the local modeling capability of deformable convolutions, significantly improving the representation of irregular tumor shapes.
While DeepLabV3+ exhibits high sensitivity to large tumor regions, it tends to suffer from over-segmentation and false positives when handling small lesions, issues that YOLO-BT addresses with greater stability. HRNet performs well in contour recognition of meningiomas and gliomas but struggles with complex and blurry boundaries, whereas YOLO-BT's UNetV2 backbone offers a more robust solution. PSPNet can effectively identify the overall shapes of meningiomas and gliomas but lacks accuracy in segmenting small lesions and is less precise than YOLO-BT in handling tumors at multiple scales. UNet and its variant UNetV2 produce relatively clear segmentation contours, but they still suffer from occasional false detections and unstable boundary predictions.
Experimental results show that YOLO-BT outperforms YOLOv11 and other mainstream detection models in key metrics such as Accuracy, Recall, mIoU, and Dice coefficient. Notably, it demonstrates superior robustness and discriminative ability in identifying small lesions, making it a more reliable and effective solution for brain tumor detection and segmentation.
4.2. Limitation Analysis
Despite its outstanding performance, YOLO-BT has several limitations. The proposed model has been validated and tested only on the Figshare brain tumor dataset and has not yet been evaluated on real clinical MRI scans acquired from multiple centers or imaging devices. Therefore, its practical effectiveness in clinical settings remains unverified. The model’s segmentation performance may decline in images with low resolution, motion artifacts, or noise interference. Although YOLO-BT exhibits higher computational complexity compared to YOLOv11, it still meets the requirements for deployment on medical devices. In the future, model pruning or lightweight optimization strategies may be explored to further reduce inference latency and memory consumption. Furthermore, although YOLO-BT surpasses existing methods in terms of Recall, it still carries a risk of missing very small or low-contrast lesions. Future improvements could involve incorporating Transformer-based attention mechanisms to enhance feature representation or designing more robust loss functions (e.g., focal Tversky loss) to improve sensitivity to hard-to-detect samples.
4.3. Future Perspectives
From a clinical application standpoint, YOLO-BT exhibits strong capabilities in precise tumor segmentation and localization, offering radiologists an efficient and accurate diagnostic aid. Integrating the model into picture archiving and communication systems could enable real-time image processing and further extend its application to multi-organ tumor analysis, including lung and breast tumors. This would support the development of intelligent diagnostic platforms across disease types and promote the real-world translation and deployment of AI technologies in healthcare.