3.1. Framework of PS-YOLO
To address the challenges of UAV remote sensing object detection, including low model accuracy, the conflict between real-time performance and detection accuracy, and the difficulty of deployment on edge devices, this paper introduces PS-YOLO, a lightweight network tailored for UAV image object detection. The PS-YOLO model is built on the YOLOv11-s [38] framework, and we improve the backbone network, neck, and detection head to achieve faster and more accurate UAV object detection.
Figure 2 presents the framework of PS-YOLO.
Specifically, in the backbone network, we integrate partial convolution operators into the original C3k2 structure, yielding the Faster_C3k2 module, which reduces the number of parameters. By incorporating Faster_C3k2 into the backbone, we effectively decrease the overall model complexity. Additionally, we reevaluate the C2PSA attention module in YOLOv11. While C2PSA can enhance detection accuracy in UAV-based object detection tasks, it significantly increases the parameter count, and this added complexity outweighs its benefits when the network is deployed on UAV platforms. Therefore, we remove the original C2PSA module to improve deployment efficiency. In the neck network, we redesign the original structure by incorporating bidirectional information flow through cross-scale connections and learnable weights, and we embed Faster_C3k2 into the neck, producing FasterBIFFPN. This lightweight and efficient neck enhances information interaction between features of different scales at minimal computational cost, improving detection accuracy while reducing the parameter count. For the detection head, we design GSCD, a shared-convolution-based detection head in which shared convolution strengthens multi-scale detection capability while keeping the parameter count relatively low. Finally, we observe that IoU-based loss functions perform poorly on small objects; to address this, we introduce NWDLoss, which improves detection accuracy without increasing the parameter count.
3.2. Partial Convolution and Faster_C3k2
Fasternet [
29] is a novel lightweight convolutional neural network architecture that has garnered significant attention due to its simple design concept and compact network structure. The innovation of Fasternet lies in its introduction of partial convolution (PConv). Unlike previous methods, such as depthwise separable convolution and group convolution, PConv does not primarily aim for extreme parameter reduction. Instead, it focuses on minimizing both computational redundancy and memory access overhead, resulting in improved inference speed. The operation of PConv is illustrated in
Figure 3. Given a set of input feature maps, PConv performs a convolution operation on a subset of the channels, leaving the remaining channels unprocessed, and then combines the convolution-generated feature maps with the original feature maps. In fact, some studies [
28,
39] have demonstrated that the feature maps of different channels exhibit a high degree of similarity, i.e., feature maps contain a large amount of redundancy. Therefore, PConv convolves the feature maps of only a subset of the channels, which not only reduces the computational cost but also readily achieves comparable accuracy. For convenience, PConv usually selects a set of $c_p$ consecutive channels for feature extraction.
For the purpose of analysis, it is assumed that the input and output feature maps have the same number of channels. For an input feature map of size $h \times w \times c$, the computational cost (FLOPs) of a standard convolution is given by

$$\mathrm{FLOPs}_{\mathrm{Conv}} = h \times w \times k^2 \times c^2 \qquad (1)$$

Using PConv, the computational cost is significantly reduced to

$$\mathrm{FLOPs}_{\mathrm{PConv}} = h \times w \times k^2 \times c_p^2 \qquad (2)$$

The memory access count of a standard convolution (Conv) is given by

$$\mathrm{MAC}_{\mathrm{Conv}} = h \times w \times 2c + k^2 \times c^2 \approx h \times w \times 2c \qquad (3)$$

For PConv, the memory access count is significantly reduced to

$$\mathrm{MAC}_{\mathrm{PConv}} = h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p \qquad (4)$$

In Equations (1)–(4), $h$ denotes the height of the feature map, $w$ denotes the width of the feature map, $c$ denotes the number of channels in the input (or output) feature map, $k$ is the kernel size, and $c_p$ represents the number of channels on which the convolution operation is performed. When the ratio $r = c_p / c$ is set to $1/4$, the number of FLOPs of PConv is reduced to $1/16$ of that of a standard convolution, and its memory access count is reduced to $1/4$ of the standard convolution's memory access.
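To make the mechanism concrete, the following is a minimal PyTorch sketch of a partial convolution layer, assuming the slicing variant in which only the first $c_p = c/4$ channels are convolved and the remaining channels are passed through unchanged; the class and argument names are illustrative rather than taken from the official Fasternet code.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Convolve only the first c_p = channels // n_div channels; pass the rest through untouched."""
    def __init__(self, channels: int, n_div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.c_p = channels // n_div                 # channels that are convolved
        self.c_rest = channels - self.c_p            # channels left unprocessed
        self.conv = nn.Conv2d(self.c_p, self.c_p, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.c_p, self.c_rest], dim=1)
        x1 = self.conv(x1)                           # spatial mixing on a channel subset only
        return torch.cat((x1, x2), dim=1)            # recombine with the untouched channels
```

With n_div = 4 (i.e., $r = c_p / c = 1/4$), only a quarter of the channels are convolved, which corresponds to the $1/16$ FLOPs and roughly $1/4$ memory-access ratios in Equations (1)–(4).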
Fasternet is primarily composed of stacked Fasternet Blocks, and the structure of a Fasternet Block is shown in
Figure 4. Each Fasternet Block consists of one PConv module and two pointwise convolutions (PWConvs). The first PWConv expands the channel dimension and is followed by an activation function to enhance feature representation, while the second PWConv compresses the channel dimension to retain the most important features. This structure maximizes the utilization of the original feature information while maintaining high accuracy.
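As an illustration, a Fasternet Block can be sketched on top of the PartialConv module above; the expand-then-compress pointwise path and the residual connection follow the published design, while the expansion ratio, normalization, and activation choices here are assumptions.

```python
import torch.nn as nn

class FasterNetBlock(nn.Module):
    """PConv for spatial mixing, then two pointwise convs (expand -> compress) with a residual path."""
    def __init__(self, channels: int, expansion: int = 2, n_div: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PartialConv(channels, n_div=n_div)         # from the previous sketch
        self.pw1 = nn.Conv2d(channels, hidden, 1, bias=False)   # expand the channel dimension
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU(inplace=True)
        self.pw2 = nn.Conv2d(hidden, channels, 1, bias=False)   # compress back to the input width

    def forward(self, x):
        # the residual connection preserves the original feature information
        return x + self.pw2(self.act(self.bn(self.pw1(self.pconv(x)))))
```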
It is worth noting that although partial convolution is ingeniously designed, its accuracy is still inferior to that of conventional convolution. Partial convolution is intended for lightweight models: it leaves a portion of the original features unprocessed and convolves only a subset of the channels to improve real-time performance. Consequently, replacing all conventional convolutions with partial convolutions would significantly degrade accuracy. In UAV object detection, prioritizing real-time performance often compromises model accuracy, making it difficult to provide reliable detection services; on the other hand, prioritizing accuracy can overwhelm the UAV platform with a large number of parameters and high computational demands. To address this challenge and better balance accuracy with computational load, we replace the Bottleneck in C3k2 with the Fasternet Block to create the Faster_C3k2 module, and we then replace the original C3k2 modules in the network with Faster_C3k2. This approach reduces network complexity while maintaining sufficiently high accuracy.
Figure 5 shows the structure of Faster_C3k2.
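A simplified sketch of Faster_C3k2 is given below, modeled on the C2f/C3k2 split-transform-merge pattern with the Bottleneck replaced by the FasterNetBlock above; the real YOLOv11 C3k2 includes additional options (such as the inner C3k sub-block) that are omitted in this sketch.

```python
import torch
import torch.nn as nn

class FasterC3k2(nn.Module):
    """C3k2-style block: split the features, refine one branch with n FasterNetBlocks, then merge."""
    def __init__(self, c_in: int, c_out: int, n: int = 2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1, bias=False)          # split projection
        self.blocks = nn.ModuleList(FasterNetBlock(self.c) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1, bias=False)   # merge projection

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for block in self.blocks:
            y.append(block(y[-1]))          # each FasterNetBlock refines the latest branch
        return self.cv2(torch.cat(y, dim=1))
```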
3.3. Faster Bidirectional Feature Fusion Pyramid Network
The neck network of YOLOv11 retains the classic Path Aggregation Feature Pyramid Network (PAFPN) [
40]. PAFPN builds on the Feature Pyramid Network (FPN) [
41] by incorporating a bottom-up path. The original top-down path uses upsampling operations to propagate deep semantic information to shallower layers, thereby enhancing the semantic representation of shallow features. In contrast, the bottom-up path utilizes convolutional operations to pass shallow detailed features to deeper layers. This bi-directional structure enhances the neck network’s ability to capture and integrate both semantic and detailed information. For small objects, detailed information from shallow layers is more important, while for large objects, semantic information from deeper layers plays a more crucial role. However, PAFPN directly fuses feature maps of different resolutions, which can result in information conflicts. This issue becomes even more pronounced in UAV imagery, where the features of small objects are often overshadowed by those of larger objects. Furthermore, the computational cost associated with the PAFPN structure is relatively high for a lightweight network, prompting the need to reconsider YOLOv11’s neck design. Drawing inspiration from BIFPN [
42], we introduce cross-scale connectivity and feature-weighted fusion operations into the neck without increasing the parameter count. Additionally, we integrate the more lightweight Faster_C3k2 into the neck network, resulting in the Faster Bidirectional Feature Fusion Pyramid Network (FasterBIFFPN). Cross-scale connectivity enables feature maps from different layers to exchange information more effectively, enhancing the semantic representation of small objects and detailed information of large objects. Meanwhile, feature-weighted fusion adaptively adjusts the importance of features based on the characteristics of objects at different scales. This is particularly crucial for small object detection in UAV images, as the features of small objects often do not reach the deeper layers of the network, while deeper feature maps still contain residual background information. By increasing the weight of shallow features and decreasing the weight of deeper features, the complexity of the background is effectively suppressed, thereby improving model detection performance. This approach further reduces parameter count in the neck while achieving more efficient feature fusion, all without relying on any attentional mechanisms.
Figure 6 illustrates the structure of FasterBIFFPN.
In the figure, B1 to B5 represent feature maps at varying resolutions within the backbone network, P3 and P4 denote feature maps at different resolutions in the FPN, and N3 to N5 indicate feature maps at distinct resolutions in the PAN. The B1 feature map has the highest resolution, and the resolutions of the feature maps from B1 to B5 are progressively halved; the feature maps in the FPN and PAN follow the same resolution pattern. The B2-B4 feature maps in the backbone network are fed into FasterBIFFPN, while nodes that have only a single input path are removed and not processed. Such feature maps, typically obtained through upsampling or downsampling alone, contribute little to feature fusion, whereas feature maps with two input paths participate in the fusion and therefore contribute more significantly to it. We introduce connections at the nodes where the backbone feature maps match the resolution of the PAN's feature maps, enabling the fusion of richer features without additional modules. The connections between B4-N4 and B3-N3 are supplementary connections, indicated by red arrows in the figure. Notably, the N3 feature map in the PAN is initially formed by concatenating B3 from the backbone network with P3 from the FPN. In FasterBIFFPN, we also introduce an additional downsampling path to enhance feature fusion for small object detection; this increases the proportion of original detailed features within the feature map, thereby strengthening the representation of small objects. The additional downsampling paths are indicated by blue arrows in the figure. The feature map fusion process can be represented as follows:
In Equations (5) and (6), Faster_C3k2 and Conv refer to the corresponding module operations. The weighted fusion operation multiplies the feature maps at the same resolution by their respective learnable weights and then sums them. The weight of each feature map is initially set to 1, and as training progresses, the weights are automatically updated according to the loss function. Here, we adopt the fast normalized fusion method from BIFPN [42], which is expressed as

$$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j} \cdot I_i \qquad (7)$$

In Equation (7), $O$ represents the fused feature map, and $I_i$ denotes the input feature maps of the same size. Each learnable weight $w_i$ is passed through a ReLU function to keep it non-negative, and dividing by the sum of the weights normalizes them. The constant $\epsilon$ is a small value introduced to guarantee the numerical stability of the normalization.
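The fusion at each FasterBIFFPN node can be sketched as follows, assuming the incoming feature maps have already been resized to a common resolution and channel width; the module name and interface are illustrative.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion (Equation (7)): ReLU-clipped learnable weights normalized by their sum."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))   # each weight initialized to 1
        self.eps = eps

    def forward(self, inputs):
        # inputs: list/tuple of feature maps with identical shapes
        w = torch.relu(self.weights)                           # keep the weights non-negative
        w = w / (w.sum() + self.eps)                           # fast normalization
        return sum(wi * xi for wi, xi in zip(w, inputs))
```

In FasterBIFFPN, the fused map is then processed by Faster_C3k2 or a downsampling Conv, the module operations referred to in Equations (5) and (6).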
3.4. Gaussian-Shared Convolutional Detection Head
Similar to YOLOv8, YOLOv11 employs a decoupled detection head to separate localization and classification tasks. This separation minimizes interference between the two tasks, resulting in improved performance, especially in complex scenes. The detection head structures of YOLOv8 and YOLOv11 are illustrated in
Figure 7.
However, UAV images often contain not only complex scenes but also a large number of small objects. This combination significantly reduces both the inference speed and the detection accuracy of the model, posing a considerable challenge for UAV object detection. To make matters worse, the detection head of YOLOv11 employs depthwise convolution (DWConv) to reduce the parameter count. Although DWConv effectively lowers the number of parameters, it incurs a large volume of memory access operations, and its independent channel-wise convolutions are difficult to parallelize efficiently, so the actual inference speed is further diminished. Inspired by Faster R-CNN [
43] and Normalized Gaussian Wasserstein Distance Loss (NWDLoss) [
44], we designed a detection head called the Gaussian-Shared Convolutional Detection (GSCD) head. On one hand, GSCD incorporates shared convolution to enhance the model’s efficiency in feature extraction and to improve its ability to generalize across feature maps of varying sizes. This contributes to improved overall performance and increased inference speed. On the other hand, GSCD integrates NWDLoss to enhance the model’s detection capability, particularly for small objects.
Figure 8 illustrates the structure of the GSCD head.
In the figure, N3 to N5 represent the three feature maps with varying resolutions produced by the neck network. The Share_Conv module consists of a 3 × 3 convolution layer and a 1 × 1 convolution layer. Both the Conv_Reg and Conv_Cls components are convolutional layers without normalization or activation functions. Initially, the three feature maps are passed through a 3 × 3 convolution to unify their channel dimensions. They are then simultaneously input into the Share_Conv module to extract shared features across different scales. Finally, the extracted features are fed into the respective regression and classification branches.
In the regression branch, to detect objects at different scales, the three feature maps are scaled differently. We introduce three learnable scale factors (Scale), which adjust the regression predictions through element-wise multiplication without changing the feature map size. The scaled regression features are concatenated with the features from the classification branch, and the final detection results are output. All of the aforementioned convolutions are standard convolutions, which significantly improve feature extraction efficiency without compromising accuracy. This design increases inference speed while maintaining a low parameter count, thereby ensuring efficient real-time performance in UAV object detection.
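A minimal sketch of the GSCD layout described above is shown below, assuming per-level 3 × 3 convolutions for channel alignment, a weight-shared 3 × 3 + 1 × 1 stack, and 1 × 1 predictors without normalization or activation; the channel widths, the DFL-style regression output (4 × 16 channels), and all names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable per-level scalar applied element-wise to the regression output."""
    def __init__(self, init: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return x * self.scale

class SharedConvHead(nn.Module):
    """Per-level channel alignment, then convolutions shared across all scales, then per-branch predictors."""
    def __init__(self, in_channels, hidden: int = 128, num_classes: int = 10, reg_ch: int = 64):
        super().__init__()
        # per-level 3x3 convs that unify the channel dimension of N3-N5
        self.align = nn.ModuleList(nn.Conv2d(c, hidden, 3, padding=1, bias=False) for c in in_channels)
        # Share_Conv: a 3x3 and a 1x1 convolution whose weights are shared across scales
        self.share = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, bias=False),
            nn.Conv2d(hidden, hidden, 1, bias=False),
        )
        self.conv_reg = nn.Conv2d(hidden, reg_ch, 1)        # regression predictor (no BN / activation)
        self.conv_cls = nn.Conv2d(hidden, num_classes, 1)   # classification predictor (no BN / activation)
        self.scales = nn.ModuleList(Scale() for _ in in_channels)

    def forward(self, feats):
        outputs = []
        for i, x in enumerate(feats):                 # feats: [N3, N4, N5]
            x = self.share(self.align[i](x))
            reg = self.scales[i](self.conv_reg(x))    # scale the regression predictions per level
            cls = self.conv_cls(x)
            outputs.append(torch.cat((reg, cls), dim=1))
        return outputs
```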
Traditional IoU-based loss functions primarily focus on the overlap area between the predicted and ground truth bounding boxes. This makes the detection head overly sensitive to positional deviations, especially for small objects. Even a slight shift in the predicted bounding box can lead to significant fluctuations in the IoU-based loss value. Therefore, it is necessary to improve the loss function to address this limitation. NWDLoss models bounding boxes as two-dimensional (2D) Gaussian distributions and measures the similarity between them by calculating the distance between their respective distributions. Unlike IoU-based losses, NWDLoss is not significantly affected by the degree of overlap between the predicted and ground truth boxes. This property alleviates gradient optimization difficulties when the boxes do not overlap and enhances robustness in detecting small objects. Therefore, we incorporate NWDLoss into the regression loss function to improve performance, particularly for small object detection.
The principle of NWDLoss is as follows:
For a bounding box $R = (c_x, c_y, w, h)$ with center coordinates $(c_x, c_y)$, width $w$, and height $h$, the inscribed ellipse of $R$ can be represented as

$$\frac{(x - c_x)^2}{(w/2)^2} + \frac{(y - c_y)^2}{(h/2)^2} = 1 \qquad (8)$$

The probability density function (PDF) of a 2D Gaussian distribution is given by

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{\exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\mathrm T}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)}{2\pi \lvert \boldsymbol{\Sigma} \rvert^{1/2}} \qquad (9)$$

where $\mathbf{x}$ represents the coordinates $(x, y)$, $\boldsymbol{\mu}$ represents the mean vector, and $\boldsymbol{\Sigma}$ represents the covariance matrix. When

$$(\mathbf{x} - \boldsymbol{\mu})^{\mathrm T}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = 1 \qquad (10)$$

the ellipse in Equation (8) becomes a density contour of the 2D Gaussian distribution, and the horizontal box $R = (c_x, c_y, w, h)$ can therefore be modeled as a 2D Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where

$$\boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \dfrac{w^2}{4} & 0 \\ 0 & \dfrac{h^2}{4} \end{bmatrix} \qquad (11)$$

For 2D Gaussian distributions $\mu_1 = \mathcal{N}(\mathbf{m}_1, \boldsymbol{\Sigma}_1)$ and $\mu_2 = \mathcal{N}(\mathbf{m}_2, \boldsymbol{\Sigma}_2)$, the second-order Wasserstein distance between $\mu_1$ and $\mu_2$ is defined as

$$W_2^2(\mu_1, \mu_2) = \lVert \mathbf{m}_1 - \mathbf{m}_2 \rVert_2^2 + \lVert \boldsymbol{\Sigma}_1^{1/2} - \boldsymbol{\Sigma}_2^{1/2} \rVert_F^2 \qquad (12)$$

where $\lVert \cdot \rVert_F$ is the Frobenius norm. For the Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$ modeled from the bounding boxes $A = (c_{x_a}, c_{y_a}, w_a, h_a)$ and $B = (c_{x_b}, c_{y_b}, w_b, h_b)$, Equation (12) can be simplified as

$$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\lVert \left[ c_{x_a}, c_{y_a}, \frac{w_a}{2}, \frac{h_a}{2} \right]^{\mathrm T} - \left[ c_{x_b}, c_{y_b}, \frac{w_b}{2}, \frac{h_b}{2} \right]^{\mathrm T} \right\rVert_2^2 \qquad (13)$$

Normalizing in exponential form yields the normalized Wasserstein distance

$$NWD(\mathcal{N}_a, \mathcal{N}_b) = \exp\!\left( -\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C} \right) \qquad (14)$$

where $C$ is the average absolute size of the objects, used to normalize the Wasserstein distance, and is calculated as

$$C = \frac{1}{N} \sum_{i=1}^{N} \sqrt{w_i h_i} \qquad (15)$$

where $N$ is the total number of objects in the dataset, and $w_i$ and $h_i$ are the width and height of the $i$-th object box, respectively.
We fuse the CIoU loss with NWDLoss in a weighted manner to obtain the improved regression loss, which is expressed as

$$L_{reg} = \beta \left( 1 - NWD(\mathcal{N}_p, \mathcal{N}_g) \right) + (1 - \beta)\, L_{CIoU} \qquad (16)$$

where $\mathcal{N}_p$ is the Gaussian distribution of the predicted box, $\mathcal{N}_g$ is the Gaussian distribution of the ground-truth box, and $\beta$ is a manually set weighting coefficient.
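For reference, a hedged PyTorch sketch of the NWD term and its weighted fusion with the CIoU loss is given below; the normalization constant C, the default value and placement of the coefficient beta, and the function names are illustrative assumptions rather than the exact implementation.

```python
import torch

def nwd(pred, target, C: float = 12.8, eps: float = 1e-7):
    """Normalized Wasserstein distance for boxes in (cx, cy, w, h) format, shape (..., 4).
    C is the dataset-dependent average object size (12.8 is only a placeholder value)."""
    dx = pred[..., 0] - target[..., 0]
    dy = pred[..., 1] - target[..., 1]
    dw = (pred[..., 2] - target[..., 2]) / 2.0
    dh = (pred[..., 3] - target[..., 3]) / 2.0
    w2 = dx ** 2 + dy ** 2 + dw ** 2 + dh ** 2        # squared second-order Wasserstein distance, Eq. (13)
    return torch.exp(-torch.sqrt(w2 + eps) / C)        # normalized Wasserstein distance, Eq. (14)

def fused_reg_loss(ciou_loss, pred, target, beta: float = 0.5):
    """Weighted fusion of the CIoU loss and the NWD loss, following Equation (16)."""
    nwd_loss = 1.0 - nwd(pred, target)
    return beta * nwd_loss + (1.0 - beta) * ciou_loss
```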