In this research, the YOLOv11s model is adopted as the foundational framework for subsequent enhancement. YOLOv11, officially released by Ultralytics on 30 September 2024, is an advanced iteration of the YOLO series featuring notable structural improvements over earlier versions such as YOLOv5 and YOLOv8. These upgrades contribute to its superior performance in object detection tasks. Among its different versions, YOLOv11s was selected because it strikes a good balance between detection precision and computational efficiency. This characteristic makes it a fitting choice for situations with limited resources.
2.1. Collaborative Optimization Strategy of the C3K2DS Module and Focaler-WIoU Loss Function
In the YOLOv11 architecture, the C3K2 module includes a configurable C3K parameter that determines whether the C3K block is activated; when it is disabled, a Bottleneck module is used in its place. The C3K module supports adjustment of convolutional kernel sizes to enhance the capture of features at various scales. Its architectural details are visualized in Figure 2. Importantly, increasing the kernel size of the C3K module introduces a large number of parameters and may cause deep features to lose critical details from shallower levels. On the other hand, a smaller 1 × 1 convolutional kernel fails to expand the receptive field, making it more difficult to capture spatially adjacent features.
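To make this toggle concrete, the following minimal PyTorch sketch paraphrases the C3K2 selection logic described above; the class names, argument names, and internal layer counts are illustrative simplifications rather than the exact Ultralytics implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck with two 3x3 convolutions (simplified)."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c, 3, padding=1)
        self.cv2 = nn.Conv2d(c, c, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.cv2(self.act(self.cv1(x)))

class C3K(nn.Module):
    """C3K-style block whose kernel size k is configurable (simplified)."""
    def __init__(self, c, k=3):
        super().__init__()
        self.cv = nn.Conv2d(c, c, k, padding=k // 2)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.cv(x))

class C3K2(nn.Module):
    """Uses C3K sub-blocks when c3k=True, plain Bottlenecks otherwise."""
    def __init__(self, c, n=2, c3k=False, k=3):
        super().__init__()
        blocks = [C3K(c, k) if c3k else Bottleneck(c) for _ in range(n)]
        self.m = nn.Sequential(*blocks)

    def forward(self, x):
        return self.m(x)

x = torch.randn(1, 64, 80, 80)
print(C3K2(64, c3k=True, k=5)(x).shape)   # torch.Size([1, 64, 80, 80])
print(C3K2(64, c3k=False)(x).shape)       # torch.Size([1, 64, 80, 80])
```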
In the context of vehicle detection from UAV perspectives, the use of a fixed 3 × 3 convolution kernel within the C3K module results in a limited receptive field that impairs global context modeling capabilities. This constraint hampers the model’s ability to adapt to targets of diverse scales and intricate backgrounds. As a result, it proves inadequate for reliable vehicle detection in UAV-related scenarios.
To address the aforementioned limitations, this paper introduces the Dilation-Wise Residual (DWR) structure into the C3K2 module. The proposed structure enhances multi-scale receptive fields and improves gradient propagation in deep networks. The architecture of the DWR module is illustrated in Figure 3 [22].
The DWR module is designed with a two-stage architecture that integrates both regional and semantic residual enhancements. Initially, the input feature map undergoes a 3 × 3 convolution operation coupled with a nonlinear activation function, which serves to reduce feature complexity while maintaining essential spatial structure. Subsequently, multiple parallel dilated convolution layers with dilation rates set to d = 1, 3, and 5 are applied to perform morphological filtering across regions of different sizes. This allows the network to capture contextual information from both local and mid-range receptive fields, thereby improving its responsiveness to objects across a variety of scales. To further support stable gradient flow and ensure effective end-to-end learning, residual pathways are incorporated in order to mitigate issues such as gradient vanishing in deeper network layers.
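The PyTorch sketch below illustrates a DWR-style block along these lines; it is a simplified reading of [22], so channel splitting, normalization placement, and exact layer counts may differ from the original design.

```python
import torch
import torch.nn as nn

class DWRBlock(nn.Module):
    """Dilation-Wise Residual block (simplified sketch).

    Stage 1: a 3x3 conv + activation condenses the input features.
    Stage 2: parallel dilated 3x3 convs (d = 1, 3, 5) gather local and
             mid-range context; their outputs are fused by a 1x1 conv and
             added back to the input through a residual pathway.
    """
    def __init__(self, channels, dilations=(1, 3, 5)):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),  # the original DWR uses ReLU at this stage
        )
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False)
            for d in dilations
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * len(dilations), channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        y = self.reduce(x)
        y = torch.cat([branch(y) for branch in self.branches], dim=1)
        return x + self.fuse(y)  # residual connection stabilizes gradient flow

x = torch.randn(1, 64, 80, 80)
print(DWRBlock(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```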
In aerial images captured by UAVs over urban and suburban areas, different forms of background interference such as intricate building textures, shadow patterns, and inconsistent illumination can significantly disrupt the recognition of vehicle targets. Such interference often results in mis- or missed detections. To mitigate the adverse impact of such background elements on detection performance, this study incorporates the SE attention mechanism [23] into the C3K2 module. A visual representation of the SE attention structure is provided in Figure 4.
Through the Squeeze phase, the SE mechanism applies global average pooling to summarize spatial information along each channel, contributing to the suppression of unwanted high-frequency signals. Subsequently, fully connected layers (Excitation phase) are used to generate channel-wise attention weights, which emphasize detailed vehicle features while reducing the response to background noise.
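A minimal PyTorch sketch of this Squeeze-and-Excitation mechanism is given below, following the standard SE formulation of [23]; the reduction ratio of 16 is a commonly used default and is assumed here.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-Excitation channel attention (standard formulation)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # global average pooling
        self.excitation = nn.Sequential(              # two fully connected layers
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)            # per-channel spatial statistics
        w = self.excitation(w).view(b, c, 1, 1)   # channel-wise attention weights
        return x * w                              # reweight channels, damping background

x = torch.randn(2, 64, 40, 40)
print(SEAttention(64)(x).shape)  # torch.Size([2, 64, 40, 40])
```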
The improved model proposed in this paper is based on YOLOv11s. In the backbone network, the last two C3K2 modules and the C3K2 module preceding the large-object detection head have their C3K configuration enabled, while the remaining C3K2 layers use the default Bottleneck structure.
In this work, the Bottleneck components in all C3K2 modules of the original YOLOv11 architecture are replaced with the DWR structure. For those C3K2 modules that have C3K enabled, the Bottleneck units within the C3K submodules are also replaced by DWR. Furthermore, the SE attention mechanism is integrated into the C3K2 modules, forming the improved C3K2DS module.
Figure 5 presents the configurations of the updated modules, with (a) detailing the C3K2DS architecture and (b) showcasing the C3K variant incorporating the DWR structure.
The proposed C3K2DS module incorporates three parallel dilated convolutional branches, each utilizing a distinct dilation rate; the branch with d = 1 is designed to emphasize local feature extraction, effectively preserving fine-grained information critical for identifying small vehicle targets. In contrast, the branches with dilation rates of d = 3 and d = 5 serve to broaden the receptive field. This enables the network to capture the overall shapes and structural outlines of vehicles, strengthening its global feature representation capability. To further refine feature focus, the SE attention mechanism is embedded into the module, which aids in suppressing irrelevant background interference. This enhancement allows the network to concentrate more precisely on vehicle regions to alleviate the impact of uneven lighting and background complexity, ultimately contributing to a reduction in missed detections.
The original DWR structure uses the ReLU activation function following the first 3 × 3 convolution. However, characteristics of ReLU such as hard thresholding, abrupt gradient transitions, and static activation thresholds make it poorly suited to the challenges posed by complex aerial scenes in UAV-based vehicle detection tasks. In contrast, the SiLU (Sigmoid Linear Unit) activation function offers smoother activation, retains negative values, and provides dynamic responses, making it better suited for UAV imagery characterized by non-uniform lighting and large variations in object scale. A comparison between the ReLU and SiLU activation functions is shown in Figure 6. To further improve module performance, in this study we replace the ReLU activation function in the DWR module’s first 3 × 3 convolution layer with the SiLU function.
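For reference, the standard definitions of the two activation functions compared in Figure 6 are:

```latex
\mathrm{ReLU}(x) = \max(0, x), \qquad
\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
```

Unlike ReLU, SiLU is smooth everywhere and passes a small negative response for x < 0, which is the behavior exploited in the modified DWR module.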
In practical applications, UAV-captured images are affected by the combined influence of camera angle, flight altitude, and lighting conditions. Due to small gimbal angles and wide shooting ranges, certain vehicle targets may occupy only a small pixel proportion in the image, making them difficult to detect. Additionally, some vehicle targets may exhibit geometric distortions or deformations, further complicating target identification. Moreover, vehicles parked side by side may result in overlapping bounding boxes due to partial occlusion by adjacent vehicles, which can interfere with the model’s ability to accurately learn target features. These challenging low-quality samples increase the difficulty of feature learning for the model.
To tackle this challenge, the proposed model adopts an improved loss computation scheme by introducing the Focaler-WIoU loss in place of the conventional loss function. This change diminishes the detrimental effects of hard samples on detection performance.
In contrast to conventional IoU-based loss functions, Focaler-IoU [24] introduces a piecewise linear interval mapping strategy to reformulate the IoU loss, resulting in improved performance in bounding box regression. Computation of the Focaler-IoU loss is defined in Equations (1) and (2). The parameters d and u in the equations are tunable within the range [d, u] ∈ [0, 1].
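Following the Focaler-IoU formulation in [24], Equations (1) and (2) can be written in the form below; this is a reconstruction from the cited work, and the notation may differ slightly from the paper's typeset equations.

```latex
IoU^{Focaler} =
\begin{cases}
0, & IoU < d, \\[2pt]
\dfrac{IoU - d}{u - d}, & d \le IoU \le u, \\[6pt]
1, & IoU > u,
\end{cases}
\qquad
L_{Focaler\text{-}IoU} = L_{IoU} + IoU - IoU^{Focaler}
```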
For the dataset adopted in this study, the Focaler-IoU loss increases the contribution of low-quality samples to the loss computation, enhancing the model’s capacity to learn from extremely small vehicle targets and heavily degraded instances. Meanwhile, WIoU [25] introduces a dynamic scaling mechanism that adjusts the loss weight in accordance with the current IoU value. This adaptive strategy increases the penalty for hard samples exhibiting low IoU, effectively guiding the model to pay greater attention to challenging and low-quality targets. In this work, the final loss function is formulated by integrating the advantages of both Focaler-IoU and WIoU, with the mathematical definitions provided in Equations (3)–(5).
In the above equations, R represents the scaling factor of the loss function, the parameters α and δ are tunable hyperparameters, and the global average IoU is used to adaptively adjust the scaling factor. The global average IoU is obtained using Equation (6), where the parameter m is a momentum term.
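To illustrate how the Focaler mapping, the WIoU-style scaling factor, and the momentum-updated global average IoU interact, the sketch below gives one possible implementation; it is a simplified interpretation of Equations (3)–(6), and the exact scaling form, symbol choices, and default hyperparameter values here are assumptions rather than the paper's verified code.

```python
import torch

class FocalerWIoULoss:
    """Sketch of a Focaler-WIoU bounding-box regression loss.

    Simplified interpretation: the Focaler mapping linearly rescales IoU
    over [d, u], while a WIoU-style non-monotonic factor r reweights each
    sample based on how its IoU compares with a momentum-tracked global
    average IoU. All default values below are assumptions.
    """

    def __init__(self, d=0.0, u=0.95, alpha=1.9, delta=3.0, momentum=0.99):
        self.d, self.u = d, u
        self.alpha, self.delta = alpha, delta
        self.momentum = momentum
        self.iou_mean = 0.5  # running global average IoU (initial value assumed)

    def __call__(self, iou: torch.Tensor) -> torch.Tensor:
        # Focaler mapping of the raw IoU (piecewise-linear interval remapping)
        iou_focaler = ((iou - self.d) / (self.u - self.d)).clamp(0.0, 1.0)

        with torch.no_grad():
            # how much worse this sample is than the running average
            beta = (1.0 - iou) / (1.0 - self.iou_mean + 1e-7)
            # WIoU-style non-monotonic scaling factor R (assumed functional form)
            r = beta / (self.delta * self.alpha ** (beta - self.delta))
            # Eq. (6)-style momentum update of the global average IoU
            self.iou_mean = (self.momentum * self.iou_mean
                             + (1.0 - self.momentum) * iou.mean().item())

        # scaled, Focaler-remapped IoU loss averaged over the batch
        return (r * (1.0 - iou_focaler)).mean()

# usage with hypothetical per-box IoU values
loss_fn = FocalerWIoULoss()
print(loss_fn(torch.tensor([0.35, 0.62, 0.88])))
```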
First, the Focaler-IoU adjusts the IoU distribution to reshape the loss landscape; then, the scaling factor from the WIoU further refines the loss weight based on the ratio between the current sample’s IoU and the global average IoU. This dual-stage strategy enables more flexible learning, allowing the model to better adapt to the challenges posed by complex urban backgrounds.
In UAV-based vehicle detection tasks within urban environments, the YOLOv11s model suffers from insufficient ability to extract fine-grained features of small objects and limited robustness to challenging low-quality samples. These deficiencies often result in false detections and missed targets. To address these issues, in this study we propose a collaborative optimization strategy that combines the C3K2DS module with the Focaler-WIoU loss function.
Specifically, the C3K2DS module enhances the model’s ability to represent vehicle features by integrating cross-level feature fusion and channel-wise attention mechanisms, thereby improving the representation of fine details. Meanwhile, Focaler-WIoU dynamically adjusts the loss weight based on classification difficulty and localization quality, strengthening the model’s ability to learn hard samples.
This strategy effectively reduces false positives and missed detections, resulting in significantly enhanced vehicle detection accuracy in UAV imagery.
2.2. Coordinate Convolution (CoordConv)
Traditional convolution operations extract features using a sliding window mechanism based on local receptive fields, which inherently ignores both the absolute and relative positional information of the pixels. This spatial invariance poses limitations when detecting small or occluded vehicle targets in UAV-captured imagery. In the YOLOv11s model, the Neck component employs conventional convolution layers that lack explicit spatial awareness. This may result in insufficient feature extraction for vehicle targets, potentially affecting detection accuracy.
In contrast to standard convolution, Coordinate Convolution (CoordConv) [26] introduces additional x-coordinate and y-coordinate channels appended to the input feature map. The structure of the CoordConv operation is illustrated in Figure 7. As shown in the figure, during the forward pass, the input feature map is augmented with coordinate values i and j, which are stacked alongside the original feature map and then processed by convolutional filters. This explicit injection of positional information enables the model to better capture spatial relationships, improving localization precision, especially for small and irregularly distributed vehicle targets in UAV scenes.
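The core of this operation can be sketched in PyTorch as follows; this is a simplified version of CoordConv [26], and normalizing the coordinate channels to [−1, 1] is a common choice assumed here rather than a detail taken from the paper.

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Convolution that first appends x- and y-coordinate channels (sketch)."""
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # +2 input channels for the appended i and j coordinate maps
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size, padding=padding)

    def forward(self, x):
        b, _, h, w = x.shape
        # coordinate grids normalized to [-1, 1] (normalization is an assumed detail)
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

feat = torch.randn(1, 128, 40, 40)
print(CoordConv2d(128, 128)(feat).shape)  # torch.Size([1, 128, 40, 40])
```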
Replacing the traditional convolution in the Neck part of the YOLOv11s model with Coordinate Convolution (CoordConv) allows the network to explicitly perceive spatial information by introducing coordinate channels. This enhances the model’s ability to learn both the absolute and relative positional information of vehicle targets. With improved spatial awareness, the model can more accurately localize small or partially occluded vehicles, effectively reducing missed detections.
Moreover, the spatial distribution of vehicles in urban environments often follows predictable patterns; for example, vehicles tend to cluster in areas such as parking lots or roadways, while appearing less frequently in green spaces or rural farmland. By incorporating CoordConv, the model is able to learn the correlation between spatial priors and target distributions, improving its efficiency in capturing context-dependent spatial patterns and enhancing its overall detection performance.