1. Introduction
With the rapid progress of intelligent transportation systems and deep learning technologies, object detection has become a key research focus; however, traditional methods are often susceptible to environmental interference in complex backgrounds, leading to reduced accuracy [1]. In real-world applications under challenging conditions such as adverse weather, low-light environments, and varying illumination, detecting fast-moving or small objects remains difficult. Currently, YOLO-based models are highly efficient for object detection [2]. Meanwhile, other powerful detection and segmentation methods have emerged in recent years, including large pre-trained vision foundation models such as CLIP, DINO, and SAM [3]. Notably, YOLO-based methods have advanced significantly, particularly in specialized domains like high-resolution and infrared imaging.
For object detection tasks driven by deep learning, the YOLO series, as a representative single-stage methodology, has demonstrated outstanding performance owing to its powerful feature extraction capabilities and computational efficiency during inference, often surpassing two-stage detectors [4]. However, while the C3K2 module of YOLOv11 [5] achieves multi-scale feature fusion and extraction through convolutional layers, it remains limited in modeling non-linear and high-order feature interactions. At present, prevalent approaches for high-order feature modeling primarily rely on explicit attention mechanisms or recursive gated convolutional structures [6]. For instance, Rao et al. [7] proposed the HorNet network, which uses recursive gated convolution to achieve arbitrary-order spatial interactions; however, it relies on explicit convolutional kernel design for feature interaction, constraining parameter efficiency. Yang et al. [8] investigated an attention mechanism employing a star operation within a star-attention backbone for small object detection, but it has limitations in cross-channel high-order feature interaction. Bi et al. [9] used the star operation and separable convolution to replace the C2f module of YOLOv8 and minimize object feature loss; however, this process may excessively compress the subtle semantic features of small objects, particularly in low-light scenarios where high-frequency information is prone to degradation. Qi et al. [10] applied the star operation to enhance the non-linear feature extraction of YOLOv8 for small objects, but additional convolutional layers are then required to normalize the feature distribution, increasing the complexity of the network structure. Since the star operation resembles a polynomial kernel function, performing pairwise multiplication across different channels and thus exhibiting strong non-linear modeling ability, it provides valuable inspiration for optimizing the C3K2 module in this study.
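The polynomial-kernel analogy can be made concrete with a small numerical sketch (the feature dimension and projection weights below are illustrative, not taken from any cited model): multiplying two linear projections element-wise expands each output channel into a weighted sum over all pairwise products of input channels, exactly the feature map of a degree-2 polynomial kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "star operation": element-wise product of two linear projections.
x = rng.standard_normal(4)          # input feature vector (C = 4 channels)
W1 = rng.standard_normal((4, 4))    # first projection (illustrative)
W2 = rng.standard_normal((4, 4))    # second projection (illustrative)

star = (W1 @ x) * (W2 @ x)          # star operation: (W1 x) * (W2 x)

# Each output channel i expands to sum_{j,k} W1[i,j] W2[i,k] x[j] x[k]:
# a weighted sum over all pairwise products x[j] * x[k].
expanded = np.array([
    sum(W1[i, j] * W2[i, k] * x[j] * x[k] for j in range(4) for k in range(4))
    for i in range(4)
])
assert np.allclose(star, expanded)
```

The check confirms that a single multiplication implicitly computes all cross-channel second-order terms without any extra parameters.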
Over the past few years, deep convolutional networks have been widely used for image restoration. Meanwhile, Transformer architectures have shown promise for vehicle object detection in complex scenes [11,12]. However, they often have limitations in capturing long-range pixel dependencies and can be computationally demanding for image restoration tasks. For instance, Wang et al. [13] developed a U-shaped Transformer for high-resolution image reconstruction, but its training and inference speed degrades when processing large-scale images. Thus, while Transformers show great potential for high-resolution reconstruction, challenges remain in computational efficiency and long-range pixel interaction modeling. Zamir et al. [14] proposed Restormer, a Transformer model specifically optimized for image restoration tasks that maintains strong global modeling capabilities; however, it suffers from high computational cost, particularly when restoring high-resolution images. Zhang et al. [15] presented multi-stage object detection with image enhancement via YOLOv12 and the Restormer module, but this method is highly dependent on input image quality, and detection performance may deteriorate significantly under extreme blur. Wang et al. [16] incorporated Restormer into the backbone of YOLOv8n to model long-range dependencies with self-attention, yet limitations remain in dynamic motion-blur scenarios where vehicles move quickly under low-light conditions. Qiu et al. [17] adopted Restormer for dynamic deblurring via multi-modal fusion image enhancement with YOLOv7; however, its adaptability to real-world scenarios depends on data matching, and performance may decline if the actual blur patterns differ significantly from the training data. In summary, Restormer offers powerful global modeling capabilities for tasks such as deblurring, which benefits object detection; the key challenge lies in optimizing its computational efficiency for practical deployment. Therefore, to enhance vehicle detection in complex scenarios, this work integrates the core principles of Restormer into YOLOv11n, focusing on optimizing the detection head to improve both accuracy and efficiency, particularly under challenging conditions such as motion blur and low light.
Recent studies have demonstrated that leveraging contextual information helps models achieve high-quality results in segmentation and detection tasks, effectively distinguishing vehicle objects from background interference and reducing false and missed detections [18]. However, the effectiveness of global context schemes depends on data quality and parameter tuning; if the training data is insufficient or noisy, feature fusion may fail. For example, Zhou et al. [19] leveraged multi-scale feature fusion and contextual information for dense small object detection, but when establishing long-range dependencies, excessive irrelevant context can degrade detection performance. Wang et al. [20] studied a spatial context-aware selection network for detecting small objects in intricate backgrounds; however, its original state-space modeling has an inherent weakness in local dependency. Generally, these methodologies fall short in global context modeling and multi-scale feature fusion in complex environments and have not overcome the computational limitations of traditional convolutional networks. Deng et al. [21] designed a decoupled head via a global context module with YOLOv5 for object detection; however, the multi-branch structure and multiple activation functions hinder deployment while significantly increasing the parameter count. Zheng et al. [22] incorporated feature enhancement in a spatial context-guided network via YOLOv8 for small object detection, but it adds extra connections and fusion operations without adopting a weighted fusion strategy. In contrast, the Context-guided module combines local features, surrounding contextual information, and global context, significantly optimizing memory and computational efficiency, and can efficiently exploit both global and local information in small object detection [23]. Building on these advantages, the C3K2 module in YOLOv11n can be enhanced with Context-guided modeling, particularly for efficient contextual modeling and multi-scale feature fusion, to improve detection accuracy in complex scenarios, especially for occluded objects and vehicle detection tasks.
Although these methodologies have made progress, challenges remain in small object detection, such as insufficient accuracy, increased model complexity, and poor adaptability to multi-scale objects. To address these challenges, we propose SRG-YOLO, a novel vehicle detection model that enhances YOLOv11 by integrating Star operation and Restormer mechanisms within a global context framework. The model refines the network architecture to significantly improve feature extraction capabilities. The specific innovations are as follows:
(1) The original C3K2 module in the YOLOv11n model is replaced with a more efficient feature fusion strategy. Specifically, a Star block is applied during multi-scale feature fusion and extraction, which improves non-linear representation and thereby enhances the ability to capture fine object details in complex environments, strengthening performance in challenging scenes.
(2) Building on the improved C3K2 module, a Restormer module is incorporated that integrates spatial prior information and optimizes the self-attention mechanism. This significantly reduces computational complexity while preserving global modeling capability and increasing sensitivity to local details, making it particularly suitable for handling complex backgrounds and enabling more efficient object detection.
(3) The detection head is optimized through a Context-guided module, which integrates local features with their contextual information to enhance detection accuracy. Specifically, the Context-guided module enhances the capture of fine object details by modeling local features and their context. This significantly improves detection robustness and provides notable advantages for small object detection.
2. Improved YOLOv11 Methodology
In object detection, YOLOv11 [24] introduces significant architectural advancements over earlier versions such as YOLOv3-YOLOv5 [25]. These innovations include enhanced multi-scale feature fusion, the incorporation of attention mechanisms, and adaptive computational optimization. Furthermore, by integrating deeper convolutional networks and dense connections, YOLOv11 employs more refined feature extraction modules [26]. Recent studies on YOLOv11 have focused on improving these multi-scale fusion techniques [27]. By integrating a self-attention mechanism, YOLOv11 effectively captures global features while preserving local information, thereby improving detection accuracy and robustness. Inspired by these findings, our work primarily focuses on three key improvements within the YOLOv11n framework: the C3K2 module, optimization of the detection head, and the introduction of the Context-guided module. The overall framework of our model is illustrated in Figure 1.
2.1. Star Block Module for YOLOv11n
In the YOLO series of object detection models, the C3K2 module fuses multi-scale features via convolutional layers. However, it exhibits limitations in non-linear feature fusion. Ma et al. [28] presented a Star block that enhances expressive capability without adding parameters, enabling better handling of multi-scale objects and complex backgrounds; it outperforms various other efficient models, such as MobileNetv3, EdgeViT, and FasterNet [29]. Because the C3K2 modules in existing YOLO-based detectors lack high-dimensional non-linear feature representation during fusion, this paper integrates the Star block of [28] with C3K2 to design a C3K2_StarBlock module. Traditional additive operations fail to exploit non-linear relationships between input features, especially in complex multi-object scenarios, leading to degraded detection performance. Thus, the Star block replaces conventional additive feature fusion with element-wise multiplication, enhancing the non-linear expressive power of features.
The C3K2 module in YOLO primarily consists of multiple convolutional layers (e.g., 3 × 3 convolutions) designed to extract and fuse features from different scales [30]. However, during high-dimensional feature fusion, the traditional C3K2 module suffers from low fusion efficiency, particularly in its non-linear representation. Because it fails to adequately model non-linear features, it exhibits limited detection accuracy in complex environments and often experiences performance degradation in later training stages. To address this limitation, the improved YOLOv11n incorporates the Star block into the C3K2 module to capture high-dimensional non-linear features. By leveraging element-wise multiplication for feature fusion, the Star block strengthens non-linear feature representation without increasing the parameter count: it multiplies input features element-wise to form new feature representations. Consequently, this enhancement improves the non-linear representational capacity of YOLOv11n, leading to greater detection precision and robustness.
In the C3K2 module, assume the input features are M and N, and denote the features processed through convolutional layers as M* = Conv(M) and N* = Conv(N). In the Star block, feature fusion is performed via element-wise multiplication, so the star operation can be expressed as

F = M* ⊙ N*,   (1)

where ⊙ denotes the element-wise multiplication operation, which significantly enhances the non-linear expressive power of features.
From Equation (1), the Star block fusion strategy enhances the non-linear expressive efficacy of features, particularly in complex scenes. Qi et al. [10] observe that the key advantage of the Star block is its ability to strengthen the non-linear relationships between input features without additional parameters, thereby improving the detection performance of YOLOv11n in complex scenarios. By integrating the Star block of [28] with the C3K2 module in YOLOv11n, the insufficient non-linear feature extraction during multi-scale feature fusion in the original C3K2 module is addressed. This integration enhances the effectiveness of feature fusion, improves detection accuracy and robustness, and delivers notable benefits in detecting small and occluded objects. The schematic diagram of integrating the Star block with the C3K2 module in YOLOv11n is shown in Figure 2.
More specifically, in the C3K2 module with the Star block of [28], the convolutional layers (e.g., 3 × 3 convolutions) form the basis for feature extraction, and after each convolutional operation, feature fusion is performed via the star operation to fully capture non-linear features in complex scenes. That is, for an input feature X, the star operation after each convolutional layer in C3K2 is given by

X* = Conv1(X) ⊙ Conv2(X).   (2)

As shown in Equation (2), this element-wise multiplication enhances the non-linear expression of high-dimensional features, particularly when processing complex objects, by better preserving and exploiting the relationships between features. The incorporation of the Star block makes the feature fusion process more efficient, especially for multi-scale feature fusion, and improves the robustness of YOLOv11n.
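As a minimal shape-level sketch of this fusion strategy (the tensor sizes and the 1 × 1 channel-mixing weights below are illustrative stand-ins for the C3K2 convolutions), the only change from conventional additive fusion is replacing the element-wise sum with an element-wise product:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1x1(x, w):
    """1x1 convolution on a (C, H, W) feature map: per-pixel channel mixing."""
    return np.einsum('oc,chw->ohw', w, x)

# Two branch features of the same shape; shapes and weights are illustrative.
M = rng.standard_normal((8, 16, 16))
N = rng.standard_normal((8, 16, 16))
Wm = rng.standard_normal((8, 8))
Wn = rng.standard_normal((8, 8))

additive = conv1x1(M, Wm) + conv1x1(N, Wn)   # conventional additive fusion
star = conv1x1(M, Wm) * conv1x1(N, Wn)       # star fusion (element-wise product)

assert star.shape == additive.shape == (8, 16, 16)
```

Both variants keep the same output shape and parameter count; the multiplicative form additionally injects second-order channel interactions.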
2.2. Restormer Module for YOLOv11n
In high-resolution images and complex backgrounds, YOLOv11 cannot effectively capture long-range pixel dependencies via traditional convolutional networks, which limits its accuracy and robustness [30]. Zamir et al. [14] presented the Restormer module, which achieves state-of-the-art results on multiple image restoration tasks and helps address challenges such as occlusion and dense object distributions in vehicle recognition. To enhance the representational capacity and detection performance of YOLOv11n in high-resolution and complex scenarios, this work introduces the efficient Transformer-based Restormer module of [14], originally designed for image restoration, to improve the C3K2 module in YOLOv11n, as shown in Figure 3. The Restormer architecture is composed of the multi-Dconv head transposed attention mechanism (MDTA) and the gated-Dconv feed-forward network (GDFN), enabling joint modeling of global context and local details while effectively controlling computational complexity. As Figure 3 shows, integrating the Restormer module into the backbone feature extraction network of YOLOv11n not only enhances the ability to recognize small objects but also improves the representation and utilization of multi-scale features. Building on this, this section systematically elaborates the key components of the Restormer module and further explains its integration strategy and functional role within YOLOv11n.
Thus, as shown in Figure 3, via the MDTA mechanism and the GDFN module, the Restormer module can address the computational bottlenecks of traditional convolutional neural networks in high-resolution image restoration. Furthermore, it reduces computational complexity while preserving image details and enhancing global contextual information [31]. The MDTA module utilizes a multi-head attention scheme that linearly projects its input features into multiple heads [14]. Its core function is to compute attention maps via channel-wise covariance, thereby capturing global context and preserving long-range dependencies through cross-channel interactions. Restormer then reduces computational complexity by performing self-attention in the channel dimension rather than the spatial dimension. The structure of the MDTA module is illustrated in Figure 4.
Specifically, the MDTA process involves the following steps:
Firstly, the layer-normalized input feature M is aggregated across channels via a 1 × 1 convolution and then processed by a 3 × 3 depth-wise convolution to encode local context. Thus, the query Q, key K, and value V can be generated [14] as

Q = W_d^Q W_p^Q M,  K = W_d^K W_p^K M,  V = W_d^V W_p^V M,   (3)

where W_p and W_d describe the learned 1 × 1 point-wise and 3 × 3 depth-wise convolutional weights, respectively, and M denotes the layer-normalized input features.
Secondly, the transposed attention map is computed as

Attention(Q̂, K̂, V̂) = V̂ · Softmax(K̂ · Q̂ / α),   (4)

where Q̂, K̂, and V̂ indicate the projected query, key, and value reshaped from Q, K, and V; the attention map is obtained by computing their dot product and normalizing via the Softmax activation; and α is a learnable scalar that scales the dot product of the query and key to ensure stability during attention computation. Assuming the channels are partitioned into n heads that are computed in parallel, the final multi-head output is fused via a 1 × 1 convolution W_p as

N = W_p · Concat(head_1, …, head_n) + M,   (5)

where M and N denote the mappings of input and output features.
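The key point of transposed attention, that the attention matrix grows with the number of channels rather than the number of pixels, can be seen in a toy NumPy sketch; the plain matrix projections below stand in for the point-wise and depth-wise convolutions, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

C, H, W = 8, 16, 16                  # illustrative channel and spatial sizes
X = rng.standard_normal((C, H, W))

# Channel-mixing matrices stand in for the conv projections.
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
Q = np.einsum('oc,chw->ohw', Wq, X).reshape(C, H * W)
K = np.einsum('oc,chw->ohw', Wk, X).reshape(C, H * W)
V = np.einsum('oc,chw->ohw', Wv, X).reshape(C, H * W)

alpha = 1.0                          # learnable temperature in the real module
A = softmax((Q @ K.T) * alpha)       # C x C attention map, not (HW) x (HW)
out = (A @ V).reshape(C, H, W)

assert A.shape == (C, C)             # scales with channels, not pixels
assert out.shape == X.shape
```

Because A is only C × C, the cost of the attention step is independent of image resolution, which is what keeps MDTA tractable on high-resolution inputs.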
The GDFN module employs a gated linear feature transformation to spatially transform the input features and enhance feature flow. Its gating mechanism ensures that each network layer selectively propagates important information. The structure of the GDFN module is shown in Figure 5.
The GDFN process begins as follows: for the normalized feature M, the gating result is integrated with the original feature mapping and is obtained [31] by

M̂ = W_p^0 Gating(M) + M,  Gating(M) = φ(W_d^1 W_p^1 LN(M)) ⊙ (W_d^2 W_p^2 LN(M)),   (6)

where the gated branch φ(W_d^1 W_p^1 LN(M)) is obtained through the convolutions followed by the GELU non-linear activation; φ denotes the GELU activation function; LN(M) is a layer normalization operation; and ⊙ indicates element-wise multiplication.
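The gating mechanism described above can be sketched in a few lines (layer normalization is reduced to a global standardization and each 1 × 1 plus depth-wise convolution pair to a single channel-mixing matrix, so all sizes and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

C, H, W = 8, 16, 16                       # illustrative sizes
M = rng.standard_normal((C, H, W))
M_ln = (M - M.mean()) / (M.std() + 1e-5)  # stand-in for layer normalization

# Channel-mixing matrices stand in for the conv pairs of the two branches.
W1, W2 = (rng.standard_normal((C, C)) for _ in range(2))
branch1 = gelu(np.einsum('oc,chw->ohw', W1, M_ln))   # gated branch (GELU)
branch2 = np.einsum('oc,chw->ohw', W2, M_ln)         # linear branch
gated = branch1 * branch2                            # element-wise gating
out = M + gated                                      # residual integration

assert gated.shape == M.shape
```

The element-wise product lets the GELU branch act as a learned per-position gate that suppresses or passes the linear branch's activations.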
As shown in Figure 3, Figure 4 and Figure 5, for object detection on high-resolution images, integrating the MDTA and GDFN modules of Restormer with the improved C3K2 module of YOLOv11n yields a significant improvement in handling complex backgrounds and detecting small objects. This requires combining C3K2, C3K, and Bottleneck_Restormer, which are detailed as follows:
The input feature map undergoes preliminary processing through the C3K2 module, which first adjusts the feature map via a Conv(1 × 1) layer; multiple C3K(3 × 3) convolutional layers then perform feature extraction. The output feature maps from the C3K layers are concatenated, and the concatenated maps are further processed through a Conv(1 × 1) layer to produce the final output features. This process can be expressed as

Y = Conv1×1(Concat(C3K_1(X'), …, C3K_n(X'))),  X' = Conv1×1(X).   (7)
The C3K module is responsible for processing the feature map output by the preceding modules. The input features are first adjusted by a Conv(1 × 1) layer, and multiple Bottleneck_Restormer sub-modules then deeply process the feature map, where each Bottleneck_Restormer module contains a CBS module and a Restormer module, effectively enhancing feature representation. The processed feature maps are concatenated and passed through a Conv(1 × 1) layer to produce the final outputs. This computational process can be expressed as

Y = Conv1×1(Concat(B_1(X'), …, B_m(X'))),  X' = Conv1×1(X),   (8)

where B_i denotes the i-th Bottleneck_Restormer sub-module.
For the Bottleneck_Restormer module, features are further reconstructed by combining the CBS and Restormer modules, and a residual connection adds the output to the input feature map, strengthening the information flow. The specific computation is given by

Y = X + Restormer(CBS(X)).   (9)
Here, the CBS module comprises Convolution, Batch Normalization, and an activation function. It extracts features via convolution, reduces internal covariate shift during training through batch normalization, and introduces non-linearity via the activation function, thereby enhancing the expressive capacity of the network. Additionally, the Restormer module processes the feature map layer by layer, extracting and reinforcing multi-level feature representations. For an input feature X, the output after the Restormer operation can be expressed as

X' = X + MDTA(LN(X)),  X_out = X' + GDFN(LN(X')).   (10)
From Equations (7)–(10), via the multi-scale fusion mechanism, the MDTA and GDFN of Restormer process features at different levels: MDTA captures global contextual information, while GDFN refines local details through selective gating. Owing to this multi-scale design, YOLOv11n can better handle complex backgrounds and small objects of varying sizes in complex scenarios.
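The composition described above can be traced with a shape-level NumPy sketch; simplified stand-ins replace the real CBS and Restormer computations, so only the data flow (1 × 1 adjustment, bottleneck processing with a residual, concatenation, 1 × 1 fusion) is meaningful here, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

def conv1x1(x, w):
    """1x1 convolution on a (C, H, W) map: per-pixel channel mixing."""
    return np.einsum('oc,chw->ohw', w, x)

def cbs(x):
    """CBS stand-in: global standardization (for BN) followed by SiLU."""
    x = (x - x.mean()) / (x.std() + 1e-5)
    return x / (1.0 + np.exp(-x))            # SiLU: x * sigmoid(x)

def restormer(x):
    """Stand-in for the MDTA + GDFN processing; scaling only."""
    return 0.5 * x

def bottleneck_restormer(x):
    # Residual form: output = input + Restormer(CBS(input)).
    return x + restormer(cbs(x))

X = rng.standard_normal((16, 32, 32))
W_in = rng.standard_normal((16, 16))
W_out = rng.standard_normal((16, 32))        # fuses the concatenated maps

# C3K-style flow: 1x1 adjust, two bottleneck branches, concat, 1x1 fuse.
t = conv1x1(X, W_in)
cat = np.concatenate([bottleneck_restormer(t), bottleneck_restormer(t)], axis=0)
out = conv1x1(cat, W_out)

assert cat.shape == (32, 32, 32)
assert out.shape == (16, 32, 32)
```

The residual addition keeps the bottleneck's input and output shapes identical, which is what allows the concatenation and final 1 × 1 fusion to recover the original channel count.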
2.3. Context-Guided Module for YOLOv11n
For object detection and semantic segmentation, integrating local features with contextual information is crucial, particularly in complex backgrounds [32]. Wu et al. [23] developed the Context-guided module to effectively capture spatial dependencies and semantic context across all stages. In complex scenes, object details and background context are key factors that determine whether the Context-guided module can accurately identify objects [33]. Some existing YOLO models neglect the effective fusion of background and contextual features, resulting in limited small object detection performance. To address this issue and boost the efficacy of YOLOv11n in complex vehicle detection scenarios, we integrate the Context-guided module of [23], which effectively combines local features with contextual cues to improve fine-grained recognition. As shown in Figure 1, the C3K2_ContextGuidedBlock strategy is a specific implementation of this module. It is integrated into both the backbone and neck networks, working collaboratively to enhance multi-scale feature extraction; the processed features are then passed to the detection head for object detection. The Context-guided module thus addresses the limitations of YOLOv11n in multi-scale feature fusion and strengthens its ability to capture object details in complex environments. Specifically, the module combines local information, surrounding context, and global context, generating feature maps enriched with enhanced spatial and semantic cues. By fully utilizing spatial and contextual information, it improves the object detection accuracy of YOLOv11n in challenging scenarios. The architecture of the Context-guided module [23] is illustrated in Figure 6.
This module integrates local and contextual information via the following steps:
(1) Local feature extraction: local features are extracted through a standard convolution to capture fine-grained details of an object region. A 3 × 3 convolutional layer produces the local feature f_loc, expressed as

f_loc = Conv3×3(I),   (11)

where I is an input image and Conv3×3(·) describes the standard convolution operation.
(2) Surrounding context feature extraction: dilated convolution is applied to extract surrounding contextual features. By expanding the receptive field, it captures information around the object region, enhancing the understanding of background context. The surrounding context feature is given by

f_sur = DConv3×3,r(I),   (12)

where r is the dilation rate, controlling the extent of receptive field expansion.
(3) Joint feature extraction and weighting: the local and surrounding context features are concatenated and then processed by a Batch Normalization layer and the PReLU activation function, forming a joint feature. The concatenation is expressed as

f_cat = Concat(f_loc, f_sur).   (13)

The normalized and activated joint feature is defined as

f_joi = PReLU(BN(f_cat)).   (14)
(4) Global context weighting: global context information is extracted via global average pooling and used to weight the joint feature, emphasizing semantically important elements and suppressing irrelevant background noise. The global context feature is obtained as

f_glo = GAP(f_joi).   (15)

Subsequently, the Context-guided module refines the joint feature by applying a global feature weighting operation, computed by

f_out = f_joi ⊙ σ(FC(f_glo)),   (16)

where ⊙ is element-wise multiplication, σ(·) is the sigmoid function that generates the global context weighting coefficients, and FC(·) denotes the transformation applied to the global feature.
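Steps (1)–(4) can be sketched end-to-end in NumPy; identity mappings stand in for the 3 × 3 standard and dilated convolutions, BN is reduced to global standardization, and the global weighting omits the learned transformation, so only the data flow is meaningful. The helper also checks the standard receptive-field rule for dilated convolution, k_eff = k + (k − 1)(r − 1):

```python
import numpy as np

rng = np.random.default_rng(6)

def effective_kernel(k: int, r: int) -> int:
    """Effective kernel size of a k x k convolution with dilation rate r."""
    return k + (k - 1) * (r - 1)

assert effective_kernel(3, 1) == 3   # ordinary 3x3 convolution
assert effective_kernel(3, 2) == 5   # same 9 weights, wider context

def prelu(x, a=0.25):
    return np.where(x > 0.0, x, a * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = rng.standard_normal((8, 16, 16))
f_loc = X                      # (1) local branch (3x3 conv stand-in)
f_sur = X                      # (2) surrounding branch (dilated conv stand-in)

# (3) concatenate along channels, normalize, activate -> joint feature.
joint = np.concatenate([f_loc, f_sur], axis=0)          # (16, 16, 16)
joint = prelu((joint - joint.mean()) / (joint.std() + 1e-5))

# (4) global average pooling -> per-channel sigmoid weights -> reweighting.
g = joint.mean(axis=(1, 2))                             # (16,)
out = joint * sigmoid(g)[:, None, None]

assert out.shape == (16, 16, 16)
```

The sketch makes the design choice visible: the global branch produces one scalar weight per channel, so semantically strong channels of the joint feature are amplified at negligible computational cost.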
Referring to [34], the Context-guided module in YOLOv11n addresses limitations in multi-scale feature fusion, improving the capability to capture object details while raising both accuracy and computational efficiency. In challenging scenarios where background noise and object occlusion often degrade traditional detectors, the module employs contextual modeling and global context weighting to enrich feature representation and boost detection accuracy for occluded objects. As a result, YOLOv11n equipped with the Context-guided module achieves more precise recognition and higher detection efficiency in vehicle detection tasks involving complex environments and dense objects, offering significant advantages in real-world applications. Within the YOLOv11n architecture, the Context-guided module is incorporated across multiple convolutional layers, particularly in the backbone and detection head. By embedding C3K2_ContextGuidedBlock at specific stages, YOLOv11n effectively incorporates contextual information across different scales. In the backbone, the module is inserted between convolutional and residual blocks, ensuring that contextual cues are captured at various feature extraction stages, especially in low-level and mid-level features, thus enhancing detection accuracy. In the detection head, it is fused with multi-scale features through concatenation operations.
In summary, the incorporation of the Context-guided module allows YOLOv11n to leverage contextual information more effectively, leading to superior performance in complex scenes and vehicle detection tasks. The core idea involves integrating local features with contextual cues to refine feature representation, which is further enhanced by global context weighting. Through this multi-level contextual fusion, our developed SRG-YOLO model maintains high efficiency and accuracy even on large-scale datasets, adapting robustly to diverse detection challenges.
3. Experiments and Analysis
3.1. Experimental Datasets and Evaluation Metrics
To evaluate the effectiveness of SRG-YOLO, we employed the widely recognized VisDrone2019 dataset. Object detection in aerial (UAV-captured) imagery is more challenging than in conventional ground-based images due to significant differences in perspective and scene composition. The VisDrone2019 dataset [9,10,35], collected and annotated by the AISKYEYE team, comprises 8629 images, split into 6471 for training, 1610 for testing, and 548 for validation. The annotations cover ten object categories: pedestrian, person, car, van, bus, truck, motorcycle, bicycle, awning-tricycle, and tricycle, with over 2.6 million bounding boxes in total. In addition, we selected the KITTI and UA-DETRAC datasets to evaluate the generalization of SRG-YOLO. To assess performance, we adopted four evaluation metrics from [9,36]: Precision (P), Recall (R), mAP@0.5, and mAP@0.5–0.95. Precision and Recall are defined as

P = TP / (TP + FP),   (17)
R = TP / (TP + FN),   (18)

where TP and FP describe the numbers of samples accurately predicted as positive and inaccurately predicted as positive, respectively, and FN indicates the number of actual positive samples that are inaccurately predicted as negative.
mAP (mean Average Precision) evaluates the average precision across all object categories. mAP@0.5 and mAP@0.5–0.95 are computed under different Intersection over Union (IoU) thresholds, and are given by

mAP@0.5 = (1/C) Σ_{i=1}^{C} AP_i^{0.5},   (19)
mAP@0.5–0.95 = (1/C) Σ_{i=1}^{C} (1/N) Σ_{j=1}^{N} AP_i^{j},   (20)

where C denotes the total number of object categories; AP_i^{0.5} indicates the Average Precision for the i-th category at an IoU threshold of 0.5; N refers to the number of IoU thresholds; and AP_i^{j} indicates the Average Precision for the i-th category at a specific IoU threshold j.
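The metric definitions above translate directly into code; the sample counts and the small precision-recall curve below are made-up values used only to exercise the formulas, and the AP helper uses the standard all-points interpolation (one common convention, not necessarily the exact evaluator behind the reported numbers):

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and Recall from detection counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r

# Example: 80 true positives, 20 false positives, 40 missed objects.
p, r = precision_recall(80, 20, 40)
assert abs(p - 0.8) < 1e-9
assert abs(r - 80 / 120) < 1e-9

def average_precision(recalls, precisions):
    """Area under a precision-recall curve via all-points interpolation."""
    rc = np.concatenate(([0.0], recalls, [1.0]))
    pr = np.concatenate(([1.0], precisions, [0.0]))
    pr = np.maximum.accumulate(pr[::-1])[::-1]      # precision envelope
    idx = np.where(rc[1:] != rc[:-1])[0]
    return float(np.sum((rc[idx + 1] - rc[idx]) * pr[idx + 1]))

# Tiny two-point PR curve: AP = 0.5 * 1.0 + 0.5 * 0.5 = 0.75.
ap = average_precision(np.array([0.5, 1.0]), np.array([1.0, 0.5]))
assert abs(ap - 0.75) < 1e-9
```

mAP@0.5 then averages such per-category AP values at IoU 0.5, and mAP@0.5–0.95 additionally averages over the IoU thresholds 0.5, 0.55, …, 0.95.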
3.2. Parameter Settings and Training Process
In these experiments, the network models are trained, validated, and tested on a cloud server. The hardware and software environments are summarized in Table 1, and the hyperparameter configurations for the training process are provided in Table 2. During training, the default data augmentation strategies of YOLOv11 are used for preprocessing.
3.3. Ablation Study
The efficacy improvements of SRG-YOLO achieved by incorporating the Star block, Restormer, and Context-guided modules are demonstrated in detail in Table 3, where a “√” indicates that a specific module was used. When the Star block, Restormer, or Context-guided module was integrated individually, the four metrics P, R, mAP@0.5, and mAP@0.5–0.95 are reported.
As shown in Table 3, these experiments indicate that each module contributes positively to enhancing the capability of capturing fine-grained details and modeling global contextual information for small objects. The detailed analysis is as follows:
(1) When only the Star block module was integrated into YOLOv11n, compared to the baseline, P increased by 0.44, R by 0.38, mAP@0.5 by 0.32, and mAP@0.5–0.95 by 0.06. With only the Restormer module added, P increased by 0.01, R by 0.10, while mAP@0.5 decreased by 0.05, and mAP@0.5–0.95 increased by 0.07. Integrating solely the Context-guided module resulted in an increase of 1.63 in P, 0.09 in R, 0.56 in mAP@0.5, and 0.27 in mAP@0.5–0.95 relative to the original YOLOv11n.
(2) The integration of both the Restormer and Context-guided modules into YOLOv11n (denoted as RC-YOLO) led to an increase of 0.34 in P, 0.76 in R, 0.48 in mAP@0.5, and 0.17 in mAP@0.5–0.95 compared to the baseline. In contrast, combining the Star block and Context-guided modules (denoted as SC-YOLO) resulted in a decrease of 0.17 in P, an increase of 0.10 in R, a reduction of 0.09 in mAP@0.5, and a decline of 0.18 in mAP@0.5–0.95. Incorporating both the Star block and Restormer modules (denoted as SR-YOLO) showed an increase of 0.24 in P, a slight decrease of 0.02 in R, an improvement of 0.21 in mAP@0.5, and an increase of 0.12 in mAP@0.5–0.95 compared to the baseline.
(3) The integration of all three modules (Star block, Restormer, and Context-guided) into YOLOv11n, i.e., SRG-YOLO, yielded consistent efficacy improvements. Compared to the baseline, SRG-YOLO achieved increases of 1.80 in P, 0.51 in R, 0.61 in mAP@0.5, and 0.34 in mAP@0.5–0.95. When evaluated against RC-YOLO, SRG-YOLO showed a gain of 1.46 in P, a decrease of 0.25 in R, an improvement of 0.13 in mAP@0.5, and an increase of 0.17 in mAP@0.5–0.95. Relative to SC-YOLO, it demonstrated a P that was 1.97 higher, an R that was 0.41 better, a 0.70 increase in mAP@0.5, and a 0.52 improvement in mAP@0.5–0.95. Finally, compared to SR-YOLO, SRG-YOLO showed gains of 1.56 in P, 0.53 in R, 0.40 in mAP@0.5, and 0.22 in mAP@0.5–0.95. The comparatively weaker results of certain two-module combinations, such as SC-YOLO, further highlight that the modules are complementary and deliver their full benefit only when combined. In summary, SRG-YOLO shows significant improvements in all four metrics, demonstrating strong synergistic effects among its constituent modules. It delivers particularly strong performance in complex environments and vehicle detection tasks, meeting the practical requirements for detecting small vehicle objects.
Overall, the experiments in Table 3 confirm that the efficient integration of the three modules works synergistically, significantly increasing the detection capability of SRG-YOLO in complex scenarios. The ablation studies further indicate that each of the three key modules contributes to improved detection accuracy, computational efficiency, and multi-scale feature fusion. The framework exhibits excellent detection effectiveness in challenging traffic environments characterized by dense objects, occlusions, varying image contrast, and background noise.
3.4. Experimental Verification
To more intuitively demonstrate the improvement in detection performance of SRG-YOLO, Figure 7 displays the evolution curves of key loss functions and evaluation metrics during its training process. Analysis of Figure 7 reveals favorable convergence and stability throughout training. The training losses, including train/box_loss, train/cls_loss, and train/dfl_loss, decrease steadily as training progresses. A similar downward trend is observed in the validation losses (val/box_loss, val/cls_loss, and val/dfl_loss), indicating that SRG-YOLO continuously improves its feature extraction and classification capabilities without overfitting on the validation set. Meanwhile, the curves of P and R show a consistent upward trend, particularly when smoothed, reflecting stable improvements in both precision and recall and demonstrating the enhanced capability of SRG-YOLO to capture details and detect objects effectively. Furthermore, the evaluation metrics mAP@0.5 and mAP@0.5–0.95 also improve over time, indicating progressively better performance at higher IoU thresholds and underscoring the strong detection precision and robustness of SRG-YOLO. Overall, the training process of SRG-YOLO is stable and leads to continuous gains across multiple measures, further validating its effectiveness and practical advantages.
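The "smoothed" P and R curves referred to above are typically produced with an exponential moving average of the raw per-epoch values. A minimal sketch is given below; the smoothing factor 0.6 and the sample curve are illustrative choices, not values taken from the experiments.

```python
def ema_smooth(values, alpha=0.6):
    """Exponential moving average, as commonly used to smooth noisy
    per-epoch precision/recall curves before plotting.
    alpha close to 1 keeps more of the previous value (heavier smoothing)."""
    smoothed = []
    prev = values[0]
    for v in values:
        prev = alpha * prev + (1 - alpha) * v
        smoothed.append(prev)
    return smoothed

# A noisy recall-like curve: the upward trend survives, the jitter does not.
raw = [0.20, 0.35, 0.30, 0.45, 0.42, 0.55]
print(ema_smooth(raw))
```
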
Figure 8 shows the confusion matrix, where the background class achieves high prediction accuracy, with clear distinctions from classes such as ‘people’ and ‘pedestrian’. Some confusion remains between visually similar categories, such as ‘car’ and ‘truck’, which is a common occurrence in object detection tasks.
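A confusion matrix such as the one in Figure 8 simply counts, for each ground-truth class, which class the detector assigned, with unmatched detections and missed objects falling into a ‘background’ row/column. A minimal sketch over toy labels follows; the class subset mirrors VisDrone naming, but the label lists themselves are hypothetical.

```python
from collections import Counter

# A small subset of VisDrone-style class names plus a background bucket.
CLASSES = ["car", "truck", "pedestrian", "background"]

def confusion_matrix(true_labels, pred_labels, classes=CLASSES):
    """Rows index the ground-truth class, columns the predicted class."""
    idx = {c: i for i, c in enumerate(classes)}
    counts = Counter(zip(true_labels, pred_labels))
    m = [[0] * len(classes) for _ in classes]
    for (t, p), n in counts.items():
        m[idx[t]][idx[p]] = n
    return m

# Hypothetical matched labels: one 'car' is confused with 'truck'.
true = ["car", "car", "truck", "pedestrian", "car"]
pred = ["car", "truck", "truck", "pedestrian", "car"]
print(confusion_matrix(true, pred))
```
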
Figure 9 shows that the F1-score is generally low at low confidence levels and gradually increases as the threshold rises. However, the rate of increase and the maximum F1-score achieved vary across different object categories. The blue curve shows the average F1-score across all classes, reaching a value of 0.36 at a confidence threshold of 0.174. Overall, specific categories (such as pedestrian) achieve higher F1-scores at higher confidence thresholds, while others (like truck) perform comparatively poorly.
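An F1–confidence curve of the kind in Figure 9 is obtained by sweeping the confidence threshold and recomputing precision, recall, and F1 from the detections that survive each threshold. The sketch below uses hypothetical detections, each a (confidence, is_true_positive) pair evaluated against a fixed number of ground-truth objects.

```python
def f1_at_threshold(dets, num_gt, thresh):
    """F1 when keeping only detections with confidence >= thresh.
    dets: list of (confidence, is_true_positive); num_gt: ground-truth count."""
    kept = [tp for conf, tp in dets if conf >= thresh]
    tp = sum(kept)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / num_gt if num_gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical detections against 4 ground-truth objects.
dets = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]

# Sweep thresholds and pick the one maximizing F1 (ties -> higher threshold).
best = max((f1_at_threshold(dets, num_gt=4, thresh=t / 100), t / 100)
           for t in range(0, 100, 5))
print(best)
```
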
3.5. Comparative Experiments
3.5.1. Module Validation
To evaluate the effectiveness of SRG-YOLO, comparative experiments were conducted against 12 state-of-the-art object detection algorithms across different YOLO variants, including YOLOv3-tiny [37], YOLOv5n [38], YOLOv5s [38], YOLOv6n [39], YOLOX-n [40], YOLOv8n [41], YOLOv9-t [42], YOLOv10n [43], YOLOv11n [43], YOLO-Drone [44], YOLOv12n [45], and YOLOv13n [45].
Table 4 summarizes the comparative results of 13 models across six metrics on the VisDrone2019 dataset, where ‘—’ indicates missing values. Following [40,43,44,45], we present the competitive results of these 13 YOLO variants.
As Table 4 shows, SRG-YOLO demonstrates superior effectiveness across multiple metrics when compared to the other 12 object detection models. Experimental outcomes show that SRG-YOLO achieves improvements on the four metrics of P, R, mAP@0.5, and mAP@0.5–0.95 over the other 12 competitive models. Specifically, the P metric of SRG-YOLO is 1.97% higher than that of the next best performer, YOLOX-n, ranking first among all models; and its R exceeds all comparative models, surpassing the second-place YOLOv13n by 0.35%. SRG-YOLO achieves the best mAP@0.5 score, significantly outperforming YOLOv13n and demonstrating optimal detection efficacy at the standard IoU threshold. It also ranks first on the comprehensive mAP@0.5–0.95 metric, surpassing models such as YOLOv13n and YOLOv8n. While maintaining top-tier detection accuracy, SRG-YOLO manages computational resources effectively. Its computational complexity is moderate, with GFLOPs in the medium range, far lower than those of YOLOv5s and YOLOv9-t; and its parameter count is reasonably optimized, being only slightly higher than that of the most parameter-efficient model, YOLOv13n, thereby maintaining lightweight characteristics while improving accuracy. These improvements may be attributed to the Star block and Restormer modules, which enhance fine-grained feature extraction and contextual modeling. Furthermore, compared to the competitive YOLOv13n, SRG-YOLO achieves comprehensive superiority: a 2.24% relative improvement in P, 0.35% in R, 0.6% in mAP@0.5, and 0.43% in mAP@0.5–0.95, while increasing computational load by only 1.5 GFLOPs. This demonstrates the superior balance between efficiency and accuracy achieved by SRG-YOLO. These results validate the strong efficacy of SRG-YOLO in complex scenarios, particularly in vehicle detection with challenging backgrounds, establishing it as one of the most competitive models in the current YOLO series.
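The mAP@0.5–0.95 metric reported throughout Table 4 is, by the common COCO convention, the average precision averaged over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. A minimal sketch of that averaging step follows; the per-threshold AP function is a hypothetical stand-in, not measured data.

```python
def map_50_95(ap_by_iou):
    """Mean of per-threshold APs over IoU = 0.50, 0.55, ..., 0.95 (COCO-style)."""
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    return sum(ap_by_iou(t) for t in thresholds) / len(thresholds)

def ap(iou):
    # Hypothetical model whose AP decays linearly as the IoU threshold tightens.
    return max(0.0, 0.60 - (iou - 0.50))

print(round(map_50_95(ap), 4))
```
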
3.5.2. Visualization and Analysis
To comprehensively illustrate the effectiveness of SRG-YOLO in complex traffic scenarios from the VisDrone2019 dataset, this section presents visualizations of its detection results under various environmental conditions. Figure 10 illustrates the detection outcomes in multi-frame urban road scenes. Figure 11 showcases its performance across multiple viewpoints and scenarios. Figure 12 focuses on detection results on nighttime urban roads. These visualizations provide an intuitive understanding of the performance of SRG-YOLO when handling challenges such as diverse backgrounds, occlusions, dense objects, and varying lighting conditions.
In the daytime urban road scenes shown in Figure 10 (from the VisDrone2019 dataset), SRG-YOLO effectively distinguishes between different types of vehicles and maintains detection accuracy even under crowded conditions or partial occlusion. In the multi-view traffic scenarios in Figure 11, SRG-YOLO demonstrates strong robustness: it reliably detects objects from various angles and maintains high detection accuracy in cluttered backgrounds. Furthermore, as illustrated in Figure 12 for low-light environments, SRG-YOLO continues to provide clear object detection despite poor illumination; the predicted bounding boxes align closely with the ground truth, highlighting its robustness under challenging lighting conditions. Collectively, the visualization results in Figure 10, Figure 11 and Figure 12 demonstrate that SRG-YOLO maintains high accuracy and robustness when handling multiple objects, occluded objects, varying viewpoints, and diverse lighting conditions in real-world traffic scenes. These results for small object detection confirm the broad applicability of SRG-YOLO in real-world traffic surveillance systems.
3.5.3. Generalization Verification
To further test the performance and generalization ability of SRG-YOLO, we conducted experiments on the KITTI and UA-DETRAC datasets. First, a visual comparison was conducted on the KITTI dataset, with the results shown in Figure 13; the left panel displays the detection results of YOLOv11, while the right panel shows those of SRG-YOLO. As shown in Figure 13, YOLOv11 can identify the most prominent objects in the scene but shows obvious limitations, including missing distant small vehicles and generating imprecisely localized bounding boxes, often accompanied by relatively low confidence scores. SRG-YOLO not only detects multiple distant small objects missed by YOLOv11, enhancing detection comprehensiveness, but also generates predicted bounding boxes that fit the object contours more closely, demonstrating higher localization accuracy. Additionally, in Figure 13, the confidence scores of the detection boxes from SRG-YOLO are generally higher than those from YOLOv11. This analysis indicates that SRG-YOLO outperforms the original YOLOv11 in complex urban scenarios in terms of detection P, R, and robustness, particularly for small objects.
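Localization quality of the kind compared in Figure 13 is conventionally quantified by the intersection-over-union (IoU) between a predicted box and its ground-truth box. A minimal sketch for axis-aligned (x1, y1, x2, y2) boxes, with hypothetical coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A tight prediction scores high; a loose box around the same object scores low.
gt = (10, 10, 50, 50)
print(iou(gt, (12, 12, 52, 52)), iou(gt, (10, 10, 90, 90)))
```
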
Next, we conducted a visual comparison on the UA-DETRAC dataset, as shown in Figure 14; the upper panel displays the detection performance of YOLOv11, while the lower panel shows that of SRG-YOLO. As shown in Figure 14, YOLOv11 performed poorly in scenarios with dense traffic flow and occlusions: it exhibited noticeable missed detections for some vehicle types, such as distant compact cars, and generated detection boxes with relatively low confidence scores, demonstrating its limitations in handling such complex dynamic scenes. In contrast, SRG-YOLO not only identified almost all vehicle objects in the scene, thereby improving detection completeness, but also successfully distinguished a wider variety of vehicle types. Most importantly, the confidence scores of its detection boxes were consistently and significantly higher than those of YOLOv11. This demonstrates that SRG-YOLO possesses superior robustness and adaptability to complex environments, especially for detecting small objects.
3.6. Discussion
(1) The integration of the Star block into the C3K2 module enhances feature fusion and improves the detection effect of YOLOv11n in complex scenes. By increasing the non-linear expressive power of features, the Star block facilitates more accurate capturing of object details, particularly in challenging environments. It excels specifically in detecting small objects and operating in cluttered backgrounds. Thus, SRG-YOLO achieves higher accuracy and robustness in vehicle detection without significantly increasing computational complexity.
(2) The incorporation of Restormer allows SRG-YOLO to explicitly leverage spatial prior information within the self-attention mechanism. This reduces computational complexity while strengthening global modeling. Combined with the Context-guided module, SRG-YOLO achieves a better balance between global and local feature extraction, further improving detection performance in complex scenes. Restormer thus enhances the efficacy of SRG-YOLO for vehicle detection, boosting both accuracy and robustness while maintaining computational efficiency.
(3) The Context-guided module strengthens feature fusion by effectively modeling the relationship between local features and their surrounding context. This scheme enhances the capability of SRG-YOLO to capture fine-grained object details, leading to improved detection accuracy. The module also significantly improves robustness in complex backgrounds, allowing SRG-YOLO to perform more reliably in various challenging scenarios, particularly for detecting small objects in visually cluttered environments.
4. Conclusions
To address the limitations of low detection accuracy for small objects and low computational efficiency in complex scenes, this paper proposes the SRG-YOLO scheme, designed to improve both detection accuracy and efficiency in such challenging scenarios. Specifically, the Star block enhances feature fusion capability, improving the accuracy of SRG-YOLO, particularly in detecting small and densely clustered objects. Restormer explicitly incorporates spatial prior information into the self-attention mechanism, achieving a better balance between global context and local details. Furthermore, the Context-guided module strengthens contextual modeling, enabling SRG-YOLO to capture object details more effectively in cluttered backgrounds, which further improves detection accuracy and operational efficiency. As a result, SRG-YOLO achieves significantly higher detection accuracy on the VisDrone2019, KITTI, and UA-DETRAC datasets, particularly for small or densely packed objects. These experimental results demonstrate that SRG-YOLO effectively mitigates the shortcomings of conventional detection methodologies in complex scenarios, showing considerable potential for real-world applications. Nevertheless, several limitations remain. SRG-YOLO has been validated only on these three datasets; its generalization to other specialized domains is unverified, and domain differences may cause performance degradation. Although the Restormer module reduces complexity, the Context-guided module and Star operation may increase memory consumption during inference. Finally, while small object detection is improved, performance on extremely sparse data has not been evaluated and may degrade due to insufficient feature learning.
In future work, to enhance the generalization ability of SRG-YOLO, we will employ adversarial domain adaptation to align features between source and target domains, and introduce a multi-task learning scheme with auxiliary tasks to improve feature representation. Furthermore, to optimize memory consumption and inference efficiency, we will explore knowledge distillation and dynamic inference mechanisms that adjust the computation path based on input complexity. We will also study model pruning and quantization techniques to remove redundant parameters, and utilize frameworks such as TensorRT or OpenVINO for optimized deployment.
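Of the future directions above, weight quantization is the most mechanical: weights are mapped to low-bit integers through a scale factor and dequantized at inference time. The framework-free sketch below illustrates symmetric per-tensor int8 quantization on a toy weight list; it is illustrative only, and a real deployment would rely on the quantization tooling of PyTorch, TensorRT, or OpenVINO.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ~ scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [scale * v for v in q]

# Hypothetical weight tensor flattened to a list.
w = [0.127, -0.254, 0.0635, 0.0]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
print(q, scale)
print(max(abs(a - b) for a, b in zip(w, restored)))  # quantization error
```

The rounding error per weight is bounded by half the scale step, which is the accuracy/compression trade-off that post-training quantization schemes manage.
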