Article

SDO-YOLO: A Lightweight and Efficient Road Object Detection Algorithm Based on Improved YOLOv11

School of Mechanical and Equipment Engineering, Hebei University of Engineering, Handan 056038, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(21), 11344; https://doi.org/10.3390/app152111344
Submission received: 5 October 2025 / Revised: 19 October 2025 / Accepted: 21 October 2025 / Published: 22 October 2025
(This article belongs to the Special Issue AI in Object Detection)

Abstract

Background: In the field of autonomous driving, existing object detection algorithms still face challenges such as excessive parameter counts and insufficient detection accuracy, particularly when handling dense targets, occlusions, distant small targets, and variable backgrounds in complex road scenarios, where balancing real-time performance and accuracy remains difficult. Methods: This study introduces the SDO-YOLO algorithm, an enhancement of YOLOv11n. First, to significantly reduce the parameter count while preserving feature representation capabilities, spatial-channel reconstruction convolution is employed to enhance the HGNetv2 network, streamlining redundant computations in feature extraction. Then, a large-kernel separable attention mechanism is introduced, decoupling two-dimensional convolutions into cascaded one-dimensional dilated convolutions, which expands the receptive field while reducing computational complexity. Next, to substantially improve detection accuracy, a reparameterized generalized feature pyramid network is constructed, incorporating CSPStage structures and dynamic channel regulation strategies to optimize multi-scale feature fusion efficiency during inference. Results: Evaluations on the KITTI dataset show that SDO-YOLO achieves a 2.8% increase in mAP@0.5 compared to the baseline, alongside reductions of 7.9% in parameters and 6.3% in computation. Generalization tests on BDD100K and UA-DETRAC datasets yield mAP@0.5 improvements of 1.9% and 3.7%, respectively, over the baseline. Conclusions: SDO-YOLO achieves improvements in both accuracy and efficiency, demonstrating strong robustness across diverse scenarios and adaptability across datasets.

1. Introduction

As autonomous driving technology advances swiftly, object detection algorithms for road scenarios are of paramount importance. Detecting traffic objects in road environments not only provides detailed information about the surrounding context but also delivers critical environmental feedback for the decision-making process of autonomous vehicles’ panoramic perception systems. This facilitates real-time, efficient, and accurate decisions by the decision-planning system, thereby enhancing driving safety. Improving the recognition accuracy and real-time performance of object detection algorithms is a key factor in advancing autonomous driving from assisted functions to fully autonomous capabilities.
However, the application environment of autonomous driving is characterized by complex road conditions and various factors that interfere with object detection tasks. In urban areas with high population density, vehicles must share the road with other traffic participants. To reduce the likelihood of accidents, accurately obtaining the location information of other traffic participants and obstacles is crucial. In practical applications, road environments present challenges such as dense targets, occlusions, distant small targets, and complex backgrounds [1], which inevitably affect detection accuracy. Moreover, object detection algorithms for autonomous driving commonly face the challenge of model lightweighting. In intricate road environments, object detection models must strike a balance between accuracy and real-time performance, while conforming to the rigorous computational constraints of in-vehicle edge devices. Real-time detection and mobility require models not only to be accurate but also to ensure rapid response times and low energy consumption.
Progress in deep neural network technology has significantly advanced the domain of road object detection. Neural network-based object detection techniques are typically classified into two primary categories: two-stage and single-stage methods. In two-stage object detection, the R-CNN series [2] (including Fast R-CNN [3] and Faster R-CNN [4]) employs selective search techniques to extract candidate regions, followed by convolutional neural networks (CNNs) for feature extraction, ultimately performing target classification and regression. These algorithms typically meet high-accuracy requirements but are less suitable for scenarios involving video stream data processing. In contrast, single-stage detection networks, such as SSD [5] and the YOLO series [6,7,8], simultaneously predict target categories and bounding box coordinates on convolutional feature maps, offering efficient real-time processing speeds, particularly suitable for dynamic video stream scenarios. Ghita et al. [9] compared the improvements of YOLOv5 to YOLOv9 in road environment object detection. Zhu et al. [10] proposed a vehicle detection model based on an improved YOLOv5, incorporating a coordinate attention mechanism to enhance the recognition rate of vehicle targets in dense object scenarios and introducing full-dimensional dynamic convolution to implement a multidimensional attention mechanism for improved feature extraction. Jiao et al. [11] proposed an improved YOLOv8s algorithm for road object detection, introducing a multi-scale path aggregation feature pyramid network to enhance the representation and perception capabilities of semantic features at different levels. They also adopted a weighted shuffle fusion algorithm to improve interlayer feature interaction and intralayer feature adjustment capabilities. Zhang et al. [12] proposed an improved YOLOv11s algorithm for road object detection, incorporating a global edge information transmission module to transfer edge information extracted from shallow layers to the model for fusion with features at different scales, and employing layer-adaptive amplitude-based pruning to remove redundant parameters. Gong et al. [13] addressed the challenge of small target detection by using an adaptive frequency-domain aggregation module to dynamically aggregate features, improving the frequency distribution of small object images while reducing computational costs through channel dimension compression. Zhao et al. [14] introduced the Real-Time Detection Transformer (RT-DETR), designing an efficient hybrid encoder that decouples intra-scale interactions and cross-scale fusion to rapidly process multi-scale features, thereby improving speed. They also proposed an uncertainty-minimized query selection to provide high-quality initial queries for the decoder, enhancing accuracy.
To mitigate the challenges posed by excessive parameter volumes and suboptimal detection precision in object detection architectures for autonomous driving environments, this study improves the classic YOLOv11 detection framework from three perspectives: streamlining feature extraction, decoupling the large-kernel convolutional attention mechanism, and optimizing feature fusion. The improved model is named SDO-YOLO, with its network architecture shown in Figure 1.
First, drawing on the lightweight design of the RT-DETR backbone HGNetV2, we propose the SC-HGNetv2 backbone network, where SCConv streamlines redundant features. The Spatial Reconstruction Unit (SRU) employs dynamic binary gating to filter key spatial features, while the Channel Reconstruction Unit (CRU) splits features and applies a soft attention mechanism to refine channel features. By integrating a hybrid architecture of spatial-channel reconstruction residual modules and depthwise separable convolutions, feature extraction is achieved, significantly reducing computational complexity while maintaining robust feature extraction capabilities.
Second, the original PSA attention mechanism is replaced with a large-kernel separable attention mechanism (LSKA). LSKA decouples large-kernel convolutions into depthwise convolutions, depthwise dilated convolutions, and pointwise convolutions, while further decomposing two-dimensional convolutions into cascaded one-dimensional convolutions, substantially reducing computational complexity and parameter count. Additionally, LSKA supports dynamic kernel size adjustment, providing a larger receptive field across multiple scales, thus addressing the shortcomings of the PSA attention mechanism in handling occlusion and small target detection.
Finally, RepGFPN is introduced as the feature fusion layer to optimize information flow across different scales. Through structural reparameterization, multi-branch structures are merged during inference, enhancing the efficiency of multi-scale feature fusion. This facilitates the network’s more effective interchange of abstract semantic representations and fine-grained spatial information, thereby substantially enhancing detection accuracy while introducing only a negligible increase in computational overhead.

2. Methods

This section elaborates on the core improvement modules of the SDO-YOLO algorithm. Addressing the key challenges of low object detection accuracy and excessive model parameter counts in autonomous driving scenarios, the system is designed from three dimensions: backbone network optimization, attention mechanism enhancement, and feature fusion improvement. These enhancements aim to resolve issues in YOLOv11n for road scenarios, such as redundant feature extraction, deficiencies in handling occlusions and small targets, and low efficiency in multi-scale information fusion, thereby achieving efficient and lightweight detection performance. The overall algorithm architecture is illustrated in Figure 1. Through the synergistic effects of these modules, detection accuracy is significantly improved while ensuring real-time performance.

2.1. SC-HGNetv2: Improved Backbone Based on HGNetv2

The backbone network serves as the core for feature extraction. To address the issues of parameter redundancy and computational burden in HGNetv2, we propose SC-HGNetv2. Figure 2 illustrates the overall architecture of the backbone network, which consists of four stages, each primarily comprising depthwise separable convolutions (DWConv) and SC-HGBlocks that incorporate spatial-channel reconstruction convolutions. The HGStem acts as the initial preprocessing layer, employing a grouped 3 × 3 depthwise convolution combined with a 1 × 1 pointwise convolution to achieve rapid downsampling and preliminary feature extraction from the input image. DWConv decomposes traditional convolutions into Depthwise + Pointwise operations, significantly reducing the parameter count, and automatically adjusts padding based on the input size to prevent feature map dimension errors. The extracted features are then fed into Spatial Pyramid Pooling Fast (SPPF) to obtain fixed-dimensional outputs.
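To make the depthwise-plus-pointwise decomposition concrete, the following minimal PyTorch sketch contrasts it with a standard convolution; the module name, channel counts, and the BatchNorm/SiLU choices are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DWConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) 3x3 conv
    followed by a 1x1 pointwise conv, as described for SC-HGNetv2 (sketch)."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        # groups=c_in makes the first conv operate on each channel independently
        self.dw = nn.Conv2d(c_in, c_in, k, s, padding=k // 2, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

# Parameter comparison against a standard 3x3 convolution (64 -> 128 channels):
std = nn.Conv2d(64, 128, 3, padding=1, bias=False)
dws = DWConv(64, 128)
print(sum(p.numel() for p in std.parameters()))   # 73728
print(sum(p.numel() for p in dws.parameters()))   # 64*9 + 64*128 + BatchNorm params
```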
Figure 2 also depicts the detailed structure of the SC-HGBlock, which is composed of multiple SCConv units. The Efficient Squeeze-and-Excitation (ESE) module models interdependencies among feature channels, including a squeeze component (utilizing global average pooling to expand the receptive field) and an excitation component (employing fully connected layers to enable the model to assign weights to channels), thereby reducing computational load. Compared to the original HGBlock module, which uses standard or lightweight convolutions (capable of some degree of feature extraction but limited in parameter efficiency and computational demands), the improved SC-HGBlock effectively reduces spatial and channel redundancies in features through SCConv, significantly lowering model parameters and computational overhead. This design directly resolves the redundancy issues in feature extraction, enabling SC-HGNetv2 to enhance lightweighting when replacing the original YOLOv11 backbone network, making it suitable for real-world autonomous driving systems.
As shown in Figure 2, at the heart of SC-HGNetv2 is the SCConv [15] module, a dual-unit convolution operator comprising the Spatial Reconstruction Unit (SRU) and the Channel Reconstruction Unit (CRU). The SRU receives the input feature map and first applies group normalization to reduce scale differences among feature maps. Specifically, consider an intermediate feature $X \in \mathbb{R}^{N \times C \times H \times W}$, where $N$ denotes the batch size, $C$ the number of channels, and $H \times W$ the spatial dimensions. To normalize inter-channel discrepancies, group normalization partitions the $C$ channels into $G$ groups; after computing the group-wise mean and variance, each element of $X$ is normalized. The normalization maps the original data to a distribution with zero mean and unit variance. Learnable parameters $\gamma$ and $\beta$ then apply an affine transformation, allowing the model to rescale and shift the data while preserving or adjusting key features, as follows:
$$\hat{X} = \gamma \, \frac{X - \mu_g}{\sqrt{\sigma_g^{2} + \varepsilon}} + \beta$$
where $\mu_g$ and $\sigma_g^{2}$ are the mean and variance of $X$ within a channel group, $\varepsilon$ is a small constant added to prevent division by zero, and $\hat{X}$ represents the normalized feature map.
The trainable parameters $\gamma \in \mathbb{R}^{C}$ of the group normalization layer measure the spatial pixel variance of each channel of the processed feature maps. Higher spatial pixel variance within a feature map signifies richer informational content and thus greater importance. Normalized correlation weights $W_\gamma$ are then employed to quantify the significance of each feature map, computed as follows:
$$W_\gamma = \{\omega_i\} = \frac{\gamma_i}{\sum_{j=1}^{C} \gamma_j}, \quad i = 1, 2, \ldots, C$$
The feature maps are reweighted by these values and mapped to the range (0, 1) via the sigmoid function. A gating threshold is then applied: weights exceeding the threshold are designated as informative weights $W_1$ and set to 1, whereas those below it are classified as non-informative weights $W_2$ and set to 0, achieving binary control. Element-wise multiplication between the feature maps and the weights $W_1$ and $W_2$ yields information-rich feature maps $X_1$ and less informative feature maps $X_2$. Finally, a cross-reconstruction strategy integrates these weighted features, producing spatially refined features $X^W$ that carry a richer information flow while keeping the convolutions lightweight.
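The SRU procedure described above (group normalization, γ-derived importance weights, sigmoid gating with a threshold, and cross-reconstruction) can be sketched in PyTorch roughly as follows; the group count, the 0.5 threshold, and the simplified cross-reconstruction are illustrative assumptions, not the exact SCConv code.

```python
import torch
import torch.nn as nn

class SRU(nn.Module):
    """Spatial Reconstruction Unit (sketch): group normalization, gamma-based
    importance weights, sigmoid + threshold gating, and cross-reconstruction."""
    def __init__(self, channels, groups=4, gate_threshold=0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.gate_threshold = gate_threshold

    def forward(self, x):
        gn_x = self.gn(x)
        # Normalized gamma weights: W_gamma = gamma_i / sum_j gamma_j
        w_gamma = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        weights = torch.sigmoid(gn_x * w_gamma)
        # Binary gating: informative (W1) vs. less informative (W2) positions
        w1 = (weights > self.gate_threshold).float()
        w2 = 1.0 - w1
        x1, x2 = w1 * x, w2 * x
        # Cross-reconstruction: exchange halves of the two streams and concatenate
        x11, x12 = torch.chunk(x1, 2, dim=1)
        x21, x22 = torch.chunk(x2, 2, dim=1)
        return torch.cat([x11 + x22, x12 + x21], dim=1)

out = SRU(64)(torch.randn(2, 64, 40, 40))   # shape preserved: (2, 64, 40, 40)
```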
The Channel Reconstruction Unit (CRU) employs a split-transform-fuse strategy as an alternative to conventional convolutions, with the aim of reducing channel redundancy. In the split stage, the channels of the spatially refined features are divided into $\alpha C$ and $(1-\alpha)C$ channels, and 1 × 1 convolutions then reduce the channel dimensionality of each part to improve computational efficiency. In the transform stage, one part acts as a rich feature extractor, using Group-wise Convolution (GWC) followed by Point-wise Convolution (PWC) in place of conventional convolution to yield a fused, representative feature map $Y_1$; this significantly reduces parameters and computation while preserving information interaction within the features. The other part uses PWC to extract mapped features $Y_2$ with shallow hidden details, complementing the rich feature extractor. In the fuse stage, inspired by SKNet's feature fusion and weight allocation approach, global average pooling first collects global spatial information as channel statistics, calculated as:
$$S_m = \mathrm{Pooling}(Y_m) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} Y_c(i, j), \quad m = 1, 2$$
Next, the dual-path global channel descriptors $S_1$ and $S_2$ are stacked to fuse global context with local information, after which a channel-wise soft attention mechanism derives the feature importance vectors $\beta_1, \beta_2 \in \mathbb{R}^{C}$, as shown below:
$$\beta_1 = \frac{e^{S_1}}{e^{S_1} + e^{S_2}}, \quad \beta_2 = \frac{e^{S_2}}{e^{S_1} + e^{S_2}}, \quad \beta_1 + \beta_2 = 1$$
Finally, the feature importance vectors $\beta_1$ and $\beta_2$ are multiplied channel-wise with $Y_1$ and $Y_2$, and the results are combined to obtain the channel-refined features $Y$.
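A rough PyTorch sketch of this split-transform-fuse strategy is given below; the split ratio α, squeeze ratio, group count, and the simplified fuse step (pooling followed by the soft-attention weighting above) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRU(nn.Module):
    """Channel Reconstruction Unit (sketch of split-transform-fuse)."""
    def __init__(self, channels, alpha=0.5, squeeze=2, groups=2):
        super().__init__()
        self.c_up = int(alpha * channels)
        self.c_low = channels - self.c_up
        up_sq, low_sq = self.c_up // squeeze, self.c_low // squeeze
        # Split stage: 1x1 convs reduce the channel dimensionality of each part
        self.squeeze_up = nn.Conv2d(self.c_up, up_sq, 1, bias=False)
        self.squeeze_low = nn.Conv2d(self.c_low, low_sq, 1, bias=False)
        # Transform stage: GWC + PWC replace a costly standard convolution (Y1);
        # a plain PWC provides the complementary shallow features (Y2)
        self.gwc = nn.Conv2d(up_sq, channels, 3, padding=1, groups=groups, bias=False)
        self.pwc_up = nn.Conv2d(up_sq, channels, 1, bias=False)
        self.pwc_low = nn.Conv2d(low_sq, channels, 1, bias=False)

    def forward(self, x):
        x_up, x_low = torch.split(x, [self.c_up, self.c_low], dim=1)
        x_up, x_low = self.squeeze_up(x_up), self.squeeze_low(x_low)
        y1 = self.gwc(x_up) + self.pwc_up(x_up)
        y2 = self.pwc_low(x_low)
        # Fuse stage: global pooling -> channel-wise soft attention over Y1, Y2
        s1 = F.adaptive_avg_pool2d(y1, 1)
        s2 = F.adaptive_avg_pool2d(y2, 1)
        beta = torch.softmax(torch.stack([s1, s2], dim=0), dim=0)
        return beta[0] * y1 + beta[1] * y2

out = CRU(64)(torch.randn(2, 64, 40, 40))   # (2, 64, 40, 40)
```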
In summary, the SRU separates informative and less informative features, reconstructing them to enhance representative features while reducing redundancy in the spatial dimension. The CRU employs a split-transform-fuse strategy to further minimize redundancy in the channel dimension of the spatially refined feature map X. By sequentially integrating SRU and CRU, SCConv compresses redundant spatial and channel information, promoting the learning of representative features. This approach ensures the effectiveness of feature extraction in SC-HGNetv2 while streamlining redundant computations, significantly reducing computational complexity and cost.

2.2. LSK Attention Mechanism

Occlusions caused by high-density target distributions and insufficient feature information for distant small-scale targets have become typical issues leading to low detection accuracy. These challenges require models to possess larger receptive fields to capture rich contextual information while controlling computational complexity to maintain real-time performance. Large convolution kernels [16], although capable of providing abundant semantic associations and suppressing noise, are often accompanied by high computational demands. Large Kernel Attention (LKA) [17] reduces computation by decoupling depthwise convolutions, depthwise dilated convolutions, and pointwise convolutions while preserving the same receptive field, but there remains room for optimization. To address these practical issues, we introduce Large Separable Kernel Attention (LSKA) [18] to replace the original PSA attention mechanism, further reducing complexity and enhancing adaptability to occlusions and small targets.
LSKA is based on the LKA architecture, as illustrated in Figure 2, consisting of a 1 × 1 convolution, a GLU activation function, and the LSKA module. The core idea of LSKA is to decompose the traditional 2D convolution kernel into two 1D convolution kernels in the horizontal and vertical directions and to cascade them during convolution, achieving an effect similar to the original large 2D kernel. Assuming the input and output feature-map dimensions of LSKA and LKA are identical (i.e., H × W × C), according to [18] the parameter count of LSKA is reduced by $(2d-1)(2d-3) + \lceil k/d \rceil^{2} - 2\lceil k/d \rceil$, where $k$ is the convolution kernel size and $d$ is the dilation rate.
Despite using smaller one-dimensional convolution kernels, LSKA preserves an extensive receptive field for capturing long-range dependencies, augments spatial information handling via separable convolutions, and bolsters inter-channel interactions through pointwise convolutions, thereby enabling the efficient acquisition of intricate feature representations. Through convolution decomposition and dilation rate design, LSKA achieves efficient computation while preserving a large receptive field, directly compensating for the deficiencies of the PSA mechanism in occlusion and small target detection. This mechanism reduces computational complexity and enhances model robustness in complex scenarios. The decomposition method of LSKA in Figure 2 can be represented as:
$$F_1 = \mathrm{DW\text{-}Conv}_{(2d-1) \times 1}\big(\mathrm{DW\text{-}Conv}_{1 \times (2d-1)}(F)\big)$$
$$F_2 = \mathrm{DW\text{-}D\text{-}Conv}_{\lceil k/d \rceil \times 1}\big(\mathrm{DW\text{-}D\text{-}Conv}_{1 \times \lceil k/d \rceil}(F_1)\big)$$
$$\tilde{F} = \mathrm{Conv}_{1 \times 1}(F_2) \otimes F$$
where $F$ denotes the input vehicle features, $F_1$ the output of the cascaded 1D depthwise convolutions, $F_2$ the output of the cascaded 1D depthwise dilated convolutions, and $\tilde{F}$ the attention-weighted output; DW-Conv denotes depthwise convolution, DW-D-Conv denotes depthwise dilated convolution, and $\otimes$ denotes element-wise multiplication.
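The cascaded 1D decomposition can be sketched as a small PyTorch module; the kernel size k = 21 and dilation d = 3 are illustrative values, and the class name and padding arithmetic (chosen only to keep the spatial size unchanged) are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Large Separable Kernel Attention (sketch): the 2D large kernel is split
    into cascaded 1D depthwise convs, 1D depthwise dilated convs and a 1x1 conv."""
    def __init__(self, dim, k=21, d=3):
        super().__init__()
        kd = k // d  # 1D dilated kernel length (= ceil(k/d) here, since 21/3 is exact)
        self.dw_h = nn.Conv2d(dim, dim, (1, 2 * d - 1), padding=(0, d - 1), groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, (2 * d - 1, 1), padding=(d - 1, 0), groups=dim)
        self.dwd_h = nn.Conv2d(dim, dim, (1, kd), padding=(0, (kd // 2) * d), dilation=d, groups=dim)
        self.dwd_v = nn.Conv2d(dim, dim, (kd, 1), padding=((kd // 2) * d, 0), dilation=d, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, f):
        f1 = self.dw_v(self.dw_h(f))       # cascaded 1D depthwise convolutions
        f2 = self.dwd_v(self.dwd_h(f1))    # cascaded 1D depthwise dilated convolutions
        return self.pw(f2) * f             # pointwise conv -> attention map applied to F

x = torch.randn(1, 64, 80, 80)
print(LSKA(64)(x).shape)                   # torch.Size([1, 64, 80, 80])
```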

2.3. RepGFPN

In real-world road scenarios, the pixel occupancy of pedestrians and vehicles differs by several orders of magnitude, leading to inefficiencies in multi-scale feature fusion. Although traditional Feature Pyramid Networks (FPN) [19] can aggregate features from different resolutions, Generalized FPN (GFPN) [20] employs uniform channel dimensions across scales, which may result in a surge in shallow-layer computations, parameter redundancy, and underutilization of deep-layer features. Additionally, cross-scale fusion introduces numerous upsampling and downsampling operations, increasing computational load. To address these issues, we introduce the Reparameterized Generalized Feature Pyramid Network (RepGFPN) [21] as the feature fusion layer. This network represents a more efficient variant of GFPN, capable of dynamically adjusting channel counts for features at different scales, optimizing information flow, and enhancing fusion efficiency. Consequently, it better handles scale variations and background interference in dense target scenarios.
Figure 3 illustrates the traditional feature pyramid structure with top-down and bottom-up bidirectional paths, as well as the RepGFPN structure. As depicted in Figure 3, the CSPStage module incorporated within RepGFPN partitions the input feature map into a primary path and a shortcut path. The main path employs a 1 × 1 convolution to extract basic features, followed by feature enhancement through N BasicBlock_3 × 3_Reverse modules. The shortcut path uses a 1 × 1 convolution for lightweight processing, with gradients from the main path acting solely on its own branch, avoiding redundant gradient computations inherent in traditional dense connections. The primary design of BasicBlock_3 × 3_Reverse involves introducing a reparameterization mechanism. During training, three branches operate in parallel; during inference, parameters are merged to implicitly fuse multi-branch information. The resulting single-branch structure enforces the model to learn more compact feature representations through parameter sharing, circumventing repetitive target focusing and background filtering operations in multi-branch structures, thereby substantially reducing the computational burden during inference. Furthermore, the integration of 1 × 1 convolution’s channel adjustment capabilities with 3 × 3 convolution’s spatial modeling abilities enables CSPStage to flexibly control channel dimensions across different-scale feature maps, yielding superior performance and higher accuracy. This design effectively resolves redundancies and efficiency issues in multi-scale fusion, allowing the network to more thoroughly exchange high-level semantic information and low-level spatial information, thereby meeting the practical demands of autonomous driving for real-time detection of multi-target and multi-scale scenarios.
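The reparameterization step can be illustrated with a minimal two-branch example: during training a 3 × 3 and a 1 × 1 branch run in parallel, and at inference the 1 × 1 kernel is zero-padded to 3 × 3 and folded into a single convolution. This is the generic RepVGG-style merge under simplified assumptions (no BatchNorm, two branches instead of three, an illustrative class name), not the authors' exact CSPStage code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepBranchBlock(nn.Module):
    """Train-time multi-branch block (3x3 conv + 1x1 conv) that can be folded
    into a single 3x3 conv for inference. BatchNorm is omitted to keep the
    algebra visible; real reparameterized blocks fold BN as well."""
    def __init__(self, c):
        super().__init__()
        self.conv3 = nn.Conv2d(c, c, 3, padding=1, bias=True)
        self.conv1 = nn.Conv2d(c, c, 1, bias=True)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x)            # multi-branch (training)

    def fuse(self):
        """Merge the 1x1 branch into the 3x3 kernel: pad the 1x1 weight to 3x3
        (centered) and add kernels and biases element-wise."""
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels, 3, padding=1, bias=True)
        w1_padded = F.pad(self.conv1.weight, [1, 1, 1, 1])
        fused.weight.data = self.conv3.weight.data + w1_padded
        fused.bias.data = self.conv3.bias.data + self.conv1.bias.data
        return fused                                     # single-branch (inference)

block = RepBranchBlock(32)
x = torch.randn(1, 32, 20, 20)
assert torch.allclose(block(x), block.fuse()(x), atol=1e-5)
```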

3. Experiments

The experiments in this study were conducted on an Ubuntu 20.04 system with an NVIDIA RTX 4070 Ti Super GPU with 16 GB of memory (NVIDIA, Santa Clara, CA, USA). We chose the KITTI dataset as the primary benchmark to demonstrate the enhanced performance of the improved model. The KITTI dataset is currently the most frequently used dataset in the field of autonomous driving object detection, established jointly by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago in the United States. It covers the complex road scenarios that vehicles may encounter during actual driving, fully illustrating the variety of authentic roadway environments and providing robust support for research and applications in autonomous driving technology.
In addition, the BDD100K and UA-DETRAC datasets were selected as supplements to verify the model’s generalization capability in diverse scenarios. The BDD100K dataset was created by the Berkeley Deep Drive team at the University of California, Berkeley, focusing on providing real-world driving scenario data for researchers in the autonomous driving domain. The dataset incorporates variations in day and night, weather conditions, and geographical locations, offering greater differences in illumination and complex scenes. The UA-DETRAC dataset was jointly created by Beijing Institute of Technology and the University of Alberta, aiming to advance research in vehicle detection and tracking in traffic scenes. The dataset features dense traffic flows in urban road scenarios, including a large number of distant small targets and instances of mutual occlusions among targets.
Due to the lack of annotation information in the test set portions of the datasets, this paper utilized only the training set portions, randomly dividing them into training, validation, and test sets in an 8:1:1 ratio. Table 1 presents a detailed description of the datasets.
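An 8:1:1 random split of the labelled images can be reproduced with a few lines of Python; the file-name pattern and fixed seed below are illustrative assumptions, and the rounding of the remainder between the validation and test portions may differ by one image from Table 1.

```python
import random

def split_dataset(image_ids: list[str], seed: int = 0):
    """Randomly split annotated images into train/val/test at an 8:1:1 ratio,
    as done with the labelled (training) portions of KITTI, BDD100K and UA-DETRAC."""
    ids = image_ids[:]
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# KITTI's labelled set has 7481 images, giving roughly 5984 / 748 / 749
train, val, test = split_dataset([f"{i:06d}.png" for i in range(7481)])
print(len(train), len(val), len(test))
```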

3.1. Experimental Settings and Evaluation Metrics

The experiments employ a cosine annealing learning-rate schedule, with an initial learning rate of 0.01 and a minimum learning rate of 0.0001; the cycle length equals the number of training epochs. The Adam optimizer is used, with the first-order momentum decay rate set to 0.9, the second-order momentum decay rate set to 0.999, the numerical stability constant ε set to 1 × 10⁻⁸, and the weight decay set to 5 × 10⁻⁴. The batch size was set to 32 to ensure that SDO-YOLO achieves efficient convergence and optimized lightweight object detection performance in complex road scenarios. The KITTI dataset exhibits relatively uniform scenes, predominantly under clear weather conditions, enabling optimal training with fewer epochs (260). In contrast, the BDD100K and UA-DETRAC datasets incorporate more complex environmental information, necessitating 300 training epochs to ensure comprehensive learning.
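For reference, the reported settings map onto a standard PyTorch optimizer/scheduler configuration roughly as follows; the placeholder model and the skeletal training loop are illustrative only.

```python
import torch

# Illustrative sketch of the reported settings (Adam + cosine annealing);
# "model" stands in for any YOLO-style network.
model = torch.nn.Conv2d(3, 16, 3)   # placeholder module for demonstration
epochs = 260                        # 300 for BDD100K / UA-DETRAC

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.01,                        # initial learning rate
    betas=(0.9, 0.999),             # first-/second-order momentum decay rates
    eps=1e-8,                       # numerical stability constant
    weight_decay=5e-4,
)
# Cosine annealing whose cycle length equals the number of training epochs,
# decaying from 0.01 down to the minimum learning rate 0.0001.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs, eta_min=1e-4
)

for epoch in range(epochs):
    # ... one pass over the training set with batch size 32 ...
    scheduler.step()
```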
To comprehensively evaluate the effectiveness of the improved algorithm, this study adopts precision, recall, mean average precision (mAP), frames per second (FPS), parameters, and FLOPs as principal evaluation metrics to examine the performance of the optimized object detection model. These metrics directly reflect the algorithm’s efficacy in addressing issues of low accuracy and excessive parameter counts: mAP and FPS emphasize detection accuracy and real-time performance, while parameter count and FLOPs highlight the degree of lightweighting. Detection results can be classified into four categories: True Positives (TP), which represent correctly identified positive instances; True Negatives (TN), which denote correctly identified negative instances; False Positives (FP), which indicate incorrectly identified positive instances; and False Negatives (FN), which refer to incorrectly identified negative instances.
The calculations for Precision and Recall are presented in (9) and (10). From the derived precision and recall values, a PR curve relating the two can be constructed; the area enclosed by this curve is the average precision (AP) for a single category. mAP denotes the mean of the AP values across all categories, as given in Formula (11). The parameter count serves as an indicator of model complexity; in (12), $K_h$ represents the convolution kernel height, $K_w$ its width, $C_{in}$ the number of input channels, and $C_{out}$ the number of output channels.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} AP_i$$
$$\mathrm{Params} = (K_h \times K_w \times C_{in}) \times C_{out} + C_{out}$$
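These formulas translate directly into code; the following minimal Python helpers use illustrative inputs and are provided only to make the definitions concrete.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def mean_ap(ap_per_class: list[float]) -> float:
    # mAP: mean of the per-class average precision values
    return sum(ap_per_class) / len(ap_per_class)

def conv_params(k_h: int, k_w: int, c_in: int, c_out: int) -> int:
    # Parameter count of one convolution layer: (Kh*Kw*Cin)*Cout + Cout biases
    return (k_h * k_w * c_in) * c_out + c_out

print(precision(90, 10))           # 0.9
print(conv_params(3, 3, 64, 128))  # 73856
```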

3.2. Performance Analysis

When selecting a baseline model for improvement, YOLOv11n is an ideal lightweight choice. As shown in Table 2, YOLOv11n has only 2.6 million parameters, a computational load of 6.8 GFLOPs, and a weight size of merely 5.97 MB, significantly lower than most of the compared models. Despite this, YOLOv11n maintains a high mAP of 86.5% at 121.4 FPS. Although its mAP is slightly lower than that of larger YOLO variants, its substantially reduced parameter count and computational cost make YOLOv11n well suited to lightweight scenarios such as embedded systems and real-time processing tasks. This balance underscores YOLOv11n's advantages in lightweight applications and provides a solid foundation for the improvements in SDO-YOLO.
To further substantiate this, comparative experiments were conducted between SDO-YOLO and mainstream single-stage and two-stage object detection models (YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9t, YOLOv10n, YOLOv11n, YOLOv12, ZZ-YOLOv11n, RT-DETR-L, and Faster R-CNN). The results are presented in Table 2. Compared to the other models, SDO-YOLO is markedly smaller while achieving higher accuracy. Its mAP surpasses that of the baseline model by 2.8%, and relative to YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9t, YOLOv10n, YOLOv12, ZZ-YOLOv11n, RT-DETR-L, and Faster R-CNN, the mAP improvements are 3.2%, 4.8%, 3.1%, 3.5%, 5.2%, 1.8%, 2.6%, 3.7%, and 7.3%, respectively.
In terms of lightweight design, compared to YOLOv11n, SDO-YOLO reduces the parameter count from 2.6 M to 2.39 M, a decrease of 7.9%, and the computational load from 6.8 GFLOPs to 6.37 GFLOPs, a reduction of 6.3%. The weight file shrinks from 5.97 MB to 5.41 MB. Among all compared models, SDO-YOLO has the smallest parameter count and computational volume. In terms of frame rate, SDO-YOLO attains 128.7 FPS, exceeding YOLOv11n's 121.4 FPS and demonstrating very high processing speed.
To further address the motivation for real-time detection in embedded systems, we analyze the inference latency implied by these FPS figures. SDO-YOLO achieves an average inference latency of approximately 7.77 ms per frame, down from the baseline YOLOv11n's 8.24 ms and far below heavier models such as YOLOv7-tiny (14.66 ms) or Faster R-CNN (84.75 ms). This low latency, combined with the model's minimal parameter count (2.39 M) and computational load (6.37 GFLOPs), suggests reduced power consumption when deployed on embedded devices, where lower FLOPs correlate directly with energy efficiency. For resource-constrained platforms such as mobile or in-vehicle hardware, these metrics indicate that real-time autonomous driving applications are feasible without exceeding power budgets, reinforcing the practical advantages of SDO-YOLO over its competitors.
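The quoted latencies follow directly from the FPS column of Table 2 via a one-line conversion:

```python
def latency_ms(fps: float) -> float:
    """Average per-frame inference latency implied by a throughput figure."""
    return 1000.0 / fps

print(round(latency_ms(128.7), 2))  # 7.77 ms  (SDO-YOLO)
print(round(latency_ms(121.4), 2))  # 8.24 ms  (YOLOv11n)
print(round(latency_ms(68.2), 2))   # 14.66 ms (YOLOv7-tiny)
print(round(latency_ms(11.8), 2))   # 84.75 ms (Faster R-CNN)
```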
To verify the generality of SDO-YOLO, model performance validation was conducted on the BDD100K and UA-DETRAC datasets. The BDD100K dataset contains numerous small-sized and occluded targets, with a portion of images captured in nighttime environments, and incorporates weather factors such as rain and snow, which pose significant challenges for object detection. As shown in Table 3, the baseline YOLOv11n achieves an mAP0.5 of 53.5%, while SDO-YOLO improves this to 55.4%, and mAP0.5:0.95 rises from 31.2% to 37.6%. These gains indicate that the redundancy refinement in SC-HGNetv2 effectively reduces noise interference in rainy and snowy weather, and that the multi-scale fusion method proposed in this paper improves the retention of semantic detail under low-light nighttime conditions. The UA-DETRAC dataset focuses on urban traffic monitoring, with images featuring a large number of dense, occluded, and distant targets. As shown in Table 4, YOLOv11n's mAP0.5 is 95.6%, with SDO-YOLO improving to 99.3%, and mAP0.5:0.95 increases from 89.6% to 94.7%. This demonstrates that SC-HGNetv2's filtering of background information improves the purity of feature representations, avoiding feature confusion caused by complex backgrounds in high-density traffic flows; additionally, the larger receptive field provided by the large-kernel convolutional attention mechanism enables vehicle contours to be associated with surrounding environmental semantics. The experiments indicate that the improved model is more robust under varied weather and lighting conditions and adapts well to urban monitoring scenarios, underscoring the strong generalization capability of the SDO-YOLO algorithm.

3.3. Ablation Experiments

To assess the individual contributions of the proposed modules and evaluate their performance, ablation studies were designed and implemented. The results of these experiments are presented in Table 5. Utilizing YOLOv11n as the baseline model, a series of ablation variants was systematically constructed through the progressive incorporation of distinct modules. Thereafter, a qualitative analysis was performed to evaluate the impact of each module on the detection performance of the algorithm. The findings demonstrate that SC-HGNetv2 significantly improves the model’s feature extraction performance, improving mAP0.5 by 0.8% while reducing the model parameters by 11%. LSKA expands the model’s receptive field while lowering computational complexity, yielding a 0.3% increase in mAP0.5 and a 3.4% reduction in computational load. RepGFPN bolsters the neck network’s performance, enhancing the model’s robustness and elevating mAP0.5 by 2.1%, albeit with increases in parameter count and computational load. These ablation experiments not only validate the rationale of the proposed modules but also substantiate their efficacy in enhancing the overall detection performance of the target algorithm.

3.4. Visualization Results

The changes in mAP0.5 and mAP0.5:0.95 for the improved model and the baseline model over 260 training epochs are plotted as line charts in Figure 4. During the first ten training epochs, the accuracy gap between the two models is negligible. As training progresses, however, the improved model shows a marked performance advantage, outperforming the baseline in both mAP0.5 and mAP0.5:0.95. After convergence, the improved algorithm retains a substantial lead, which substantiates the soundness and efficacy of the proposed enhancement method.
As shown in Figure 5, we introduce a confusion matrix difference heatmap to further analyze the model performance. Specifically, the SDO-YOLO and YOLOv11n models were tested on the KITTI dataset to generate their respective confusion matrices. Subsequently, a difference heatmap was obtained through element-wise subtraction. Within this heatmap, blue areas denote positive values, whereas red areas indicate negative values. It can be observed that the improved model achieves enhancements in correct recognition rates across all categories, with a significant reduction in the number of targets misclassified as background, confirming the overall superior performance of the model. Notably, the probability of correctly identifying pedestrians and cyclists has increased by 11%, while the probabilities of misclassifying pedestrians and cyclists as background have decreased by 23% and 13%, respectively. These categories typically involve small targets, and this substantial improvement fully demonstrates the greatly enhanced detection capability of the improved algorithm when confronting small targets.
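The difference heatmap described here is simply an element-wise subtraction of the two confusion matrices; the sketch below shows its construction, with random placeholder matrices standing in for the real per-model results and an illustrative class list.

```python
import numpy as np
import matplotlib.pyplot as plt

# Element-wise difference of two confusion matrices, as described above.
# The matrices here are placeholders; in practice they come from evaluating
# SDO-YOLO and YOLOv11n on the KITTI test split.
classes = ["Car", "Van", "Truck", "Pedestrian", "Person_sitting", "Cyclist", "Tram", "background"]
cm_sdo = np.random.rand(8, 8)      # confusion matrix of SDO-YOLO (placeholder)
cm_base = np.random.rand(8, 8)     # confusion matrix of YOLOv11n (placeholder)

diff = cm_sdo - cm_base            # positive: SDO-YOLO assigns more mass to this cell

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(diff, cmap="coolwarm_r", vmin=-1, vmax=1)  # blue = positive, red = negative
ax.set_xticks(range(len(classes)))
ax.set_xticklabels(classes, rotation=45, ha="right")
ax.set_yticks(range(len(classes)))
ax.set_yticklabels(classes)
fig.colorbar(im, ax=ax, label="difference in classification rate")
plt.tight_layout()
plt.show()
```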
To more intuitively demonstrate the practical effects of the improved algorithm, we selected several samples from the KITTI dataset exhibiting misdetections and missed detections due to occlusions and long distances for comparative analysis. As shown in Figure 6, the left side displays the detection results of YOLOv11n, while the right side presents those of the improved model. From the comparison, it is evident that the improved model comprehensively outperforms the baseline model in detection accuracy. Further analysis reveals that the baseline model is prone to misdetections and missed detections when processing small targets such as pedestrians with colors similar to the environment, as well as vehicles with large-area occlusions. In contrast, the improved model, through enhanced feature extraction and contextual association capabilities, successfully identifies these missed and occluded targets, thereby significantly improving detection reliability in complex road scenarios.

4. Conclusions

This paper addresses the challenges of precision and efficiency in object detection within autonomous driving road scenarios by proposing a lightweight SDO-YOLO algorithm. Through enhancements to YOLOv11n, incorporating the SC-HGNetv2 backbone network, LSKA attention mechanism, and RepGFPN feature fusion module, the model achieves significant reductions in parameters and improvements in detection performance. Empirical evaluations conducted on the KITTI dataset indicate that the algorithm elevates mAP0.5 to 89.3%, with parameter count and computational load reduced by 7.9% and 6.3%, respectively, and FPS reaching 128.7, outperforming various mainstream object detection models. Concurrently, generalization experiments on the BDD100K and UA-DETRAC datasets confirm the algorithm’s robustness and adaptability. Ablation studies further substantiate the contributions of each module: SC-HGNetv2 optimizes feature extraction efficiency, LSKA expands the receptive field, and RepGFPN enhances multi-scale fusion capabilities. Visualization results reveal that the improved model excels in handling occlusions, small targets, and complex backgrounds. This research provides practical value for real-time autonomous driving applications on embedded devices. Future work may further explore optimizations under more extreme weather conditions and dynamic scenarios to advance the practical deployment of object detection technologies.

Author Contributions

Conceptualization, Z.J.; Methodology, Z.J.; Software, Z.J.; Validation, Z.J.; Investigation, P.J.; Resources, P.J.; Writing—original draft, Z.J.; Writing—review & editing, P.J. All authors have read and agreed to the published version of the manuscript.

Funding

The present study received no financial sponsorship.

Data Availability Statement

The datasets generated and/or analyzed during the current study are available in the KITTI, BDD100K, and UA-DETRAC repositories: https://www.cvlibs.net/datasets/kitti/user_submit.php (accessed on 10 August 2025); https://bair.berkeley.edu/blog/2018/05/30/bdd/ (accessed on 16 July 2025); https://www.albany.edu/cnse/research/computer-vision-machine-learning-lab (accessed on 3 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xiao, Y.; Yang, H. A Review of Object Detection Algorithms Applied in Traffic Scenarios. J. Comput. Eng. Appl. 2021, 57. [Google Scholar]
  2. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158. [Google Scholar] [CrossRef] [PubMed]
  3. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  6. YOLOv5. Available online: https://github.com/ultralytics/Yolov5 (accessed on 18 June 2022).
  7. YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 January 2023).
  8. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  9. Yusof, N.I.M.; Sophian, A.; Zaki, H.F.M.; Bawono, A.A.; Embong, A.H.; Ashraf, A. Assessing the performance of YOLOv5, YOLOv6, and YOLOv7 in road defect detection and classification: A comparative study. Bull. Electr. Eng. Inform. 2024, 13, 350–360. [Google Scholar] [CrossRef]
  10. Zhu, K.; Lyu, H.; Qin, Y. Enhanced detection of small and occluded road vehicle targets using improved YOLOv5. Signal Image Video Process. 2025, 19, 168. [Google Scholar] [CrossRef]
  11. Jiao, B.; Wang, Y.; Wang, P.; Wang, H.; Yue, H. RS-YOLO: An efficient object detection algorithm for road scenes. Digit. Signal Process. 2025, 157, 104889. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Zhang, Z.; Li, G.; Xia, C. ZZ-YOLOv11: A Lightweight Vehicle Detection Model Based on Improved YOLOv11. Sensors 2025, 25, 3399. [Google Scholar] [CrossRef] [PubMed]
  13. Gong, X.; Yu, J.; Zhang, H.; Dong, X. AED-YOLO11: A Small Object Detection Model Based on YOLO11. Digit. Signal Process. 2025, 166, 105411. [Google Scholar] [CrossRef]
  14. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  15. Li, J.; Wen, Y.; He, L. Scconv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162. [Google Scholar]
  16. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar]
  17. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 2023, 123, 106442. [Google Scholar] [CrossRef]
  18. Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large separable kernel attention: Rethinking the large kernel attention design in cnn. Expert Syst. Appl. 2024, 236, 121352. [Google Scholar] [CrossRef]
  19. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  20. Jiang, Y.; Tan, Z.; Wang, J.; Sun, X.; Lin, M.; Li, H. GiraffeDet: A heavy-neck paradigm for object detection. arXiv 2022, arXiv:2202.04256. [Google Scholar]
  21. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. Damo-yolo: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444. [Google Scholar]
Figure 1. SDO-YOLO Network Architecture Diagram.
Figure 2. (Top-left) SC-HGNetv2 and its SC-HGBlock structure; (Top-right) LKA and LSKA structures; (Bottom) SCConv structure.
Figure 3. PAFPN, RepGFPN, and CSPStage Structures.
Figure 4. Training Convergence Analysis.
Figure 5. Confusion Matrix Differential Heatmap.
Figure 6. Effect Comparison Diagram.
Table 1. Details of the datasets.
Dataset | Image Size | Number of Images | Classes
KITTI | 1392 × 512 | Train: 5984; Val: 749; Test: 748 | Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, DontCare
BDD100K | 1280 × 562 | Train: 56,000; Val: 7000; Test: 7000 | Traffic_sign, Traffic_light, Car, Rider, Motor, Person, Bus, Truck, Bike, Train
UA-DETRAC | 960 × 540 | Train: 17,400; Val: 2175; Test: 2175 | Car, Bus, Vans, Others
Table 2. Comparison of detection performance of different methods on the KITTI dataset.
Model | mAP0.5 (%) | Parameters (M) | Calculation Amount (GFLOPs) | Weights (MB) | FPS
YOLOv5s | 86.1 | 7.2 | 16.5 | 14.3 | 43.8
YOLOv7-tiny | 84.5 | 6.5 | 103.4 | 74.5 | 68.2
YOLOv8n | 86.2 | 3.01 | 8.1 | 6.2 | 112.6
YOLOv9t | 85.8 | 4.62 | 10.3 | 6.4 | 95.2
YOLOv10n | 84.1 | 2.79 | 8.2 | 6.52 | 110.7
YOLOv11n (baseline) | 86.5 | 2.6 | 6.8 | 5.97 | 121.4
YOLOv12 | 87.5 | 2.56 | 6.7 | 5.8 | 124.7
ZZ-YOLOv11n | 86.7 | 2.42 | 9.6 | 7.12 | 117.3
RT-DETR-L | 85.6 | 32.8 | 81.3 | 13.6 | 56.9
Faster R-CNN | 82 | 41.7 | 170.3 | 317.0 | 11.8
SDO-YOLO | 89.3 | 2.39 | 6.37 | 5.41 | 128.7
Table 3. Comparison of detection performance of YOLOv11n and SDO-YOLO on the BDD100K dataset.
Model | mAP0.5 (%) | mAP0.5:0.95 (%) | Parameters (M) | Calculation Amount (GFLOPs) | FPS
YOLOv11n | 53.5 | 31.2 | 2.6 | 6.8 | 109.4
SDO-YOLO | 55.4 | 37.6 | 2.39 | 6.37 | 115.2
Table 4. Comparison of detection performance of YOLOv11n and SDO-YOLO on the UA-DETRAC dataset.
Model | mAP0.5 (%) | mAP0.5:0.95 (%) | Parameters (M) | Calculation Amount (GFLOPs) | FPS
YOLOv11n | 95.6 | 89.6 | 2.6 | 6.8 | 135.6
SDO-YOLO | 99.3 | 94.7 | 2.39 | 6.37 | 141.8
Table 5. Ablation Experiments.
SC-HGNetv2 | LSKA | RepGFPN | mAP0.5 (%) | mAP0.5:0.95 (%) | Parameters (M) | GFLOPs
– | – | – | 86.5 | 61.6 | 2.6 | 6.8
✓ | – | – | 87.3 | 63.7 | 2.31 | 6.13
– | ✓ | – | 86.8 | 61.6 | 2.53 | 6.46
– | – | ✓ | 88.6 | 64.2 | 2.76 | 7.39
✓ | ✓ | – | 87.4 | 63.8 | 2.23 | 5.79
✓ | – | ✓ | 89.1 | 65.1 | 2.46 | 6.71
✓ | ✓ | ✓ | 89.3 | 65.1 | 2.39 | 6.37
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

