3.1. Review of YOLOv5s
YOLOv5 is a one-stage object detection algorithm developed by Ultralytics. It has multiple versions based on model size, including n, s, m, l, and x. YOLOv5s is a relatively lightweight model suitable for deployment on lightweight devices. The overall network structure of YOLOv5s is shown in
Figure 4. It consists of four parts: input, backbone, neck, and head. The default input image size of the input part is 640 × 640. Multiple data augmentation strategies are adopted to increase the diversity of the training data, thereby improving the robustness and generalization ability of the model. The backbone of YOLOv5s uses CSPDarknet53, which consists of C3 modules and SPP (Spatial Pyramid Pooling) modules. The C3 module adopts the CSP (Cross Stage Partial) structure, which separates the information flow of the network into two branches for processing. The feature tensor of the main branch is divided into two parts. One part is concatenated with the output of the branch after the convolutional operation and residual connection, whereas the other part is directly concatenated with the output of the branch. This design can prevent the network from losing too much information during information transmission, thus improving the utilization of features. In addition, the residual connection can effectively prevent the gradient disappearance problem in deep neural networks. The SPP module is a pooling operation that processes the feature maps of the backbone with different sizes of pooling kernels to fuse spatial information of different receptive field sizes. The primary function of the SPP module is to improve the ability of the model to detect objects of various scales without changing the input size. The neck part of the YOLOv5s network uses the PANet (Path Aggregation Network) structure, which can effectively fuse feature maps of different scales to improve detection accuracy. PANet consists of two parts: FPN (Feature Pyramid Network) [
40] and PAN. FPN is used to generate feature pyramids of different scales, and PAN is used to fuse these feature pyramids. PAN not only performs feature fusion by upsampling the feature maps but also performs another round of feature fusion by downsampling to enrich semantic information, which is beneficial for detecting objects of multiple scales. The head of YOLOv5s adopts the structure of YOLOv3, including three output layers of different scales. Each output layer is responsible for classifying and bounding box regression of feature maps of different sizes to obtain the final detection results.
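For concreteness, the following is a minimal PyTorch sketch of the SPP module described above; the pooling kernel sizes (5, 9, and 13) follow the common YOLO configuration and are assumptions rather than values specified in this paper.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Minimal Spatial Pyramid Pooling sketch; kernel sizes are assumed, not taken from the paper."""
    def __init__(self, c_in, c_out, pool_sizes=(5, 9, 13)):
        super().__init__()
        c_hidden = c_in // 2
        self.conv1 = nn.Conv2d(c_in, c_hidden, kernel_size=1)  # reduce channels before pooling
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )
        self.conv2 = nn.Conv2d(c_hidden * (len(pool_sizes) + 1), c_out, kernel_size=1)

    def forward(self, x):
        x = self.conv1(x)
        # Pooling with different kernel sizes fuses information from different receptive
        # fields without changing the spatial size of the feature map.
        pooled = [x] + [pool(x) for pool in self.pools]
        return self.conv2(torch.cat(pooled, dim=1))
```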
The YOLOv5s model has gained popularity across different industries, such as healthcare, transportation, and manufacturing. However, there are significant differences between remote sensing images and natural images, and YOLOv5s does not perform well in object detection tasks on remote sensing images. This article takes YOLOv5s as the baseline model and improves it in multiple aspects to make it better suited for remote sensing image object detection tasks.
3.2. Multi-Scale Strip Convolution Attention Mechanism
The extraction and recognition of foreground features can be hindered by complex background noise, which is an important reason for the poor performance of remote sensing image object detection. The attention mechanism can improve the focus of the model on foreground objects in the feature map, thereby reducing the influence of background noise to a certain extent. Through the observation of remote sensing image data, we found that some objects, like airports, dams, and bridges, may appear as narrow stripes due to the specific shooting height and angle. As shown in
Figure 5, when a regular square convolution is used to extract features of a long strip-shaped airport, the background information surrounding the object is inevitably fused with the object features, introducing a large amount of background noise and reducing detection performance. In addition, the significant differences in object scales in remote sensing images seriously reduce the performance of the detector, as the detector struggles to handle objects with such large scale differences simultaneously.
In computer vision tasks, strip convolution can cover long strip-shaped object areas and capture long-range contextual information along a single direction while avoiding the introduction of additional background noise. Therefore, based on the above observations, we propose a multi-scale strip convolution attention mechanism called MSCAM. This method uses strip convolutions in different directions to focus on the horizontal and vertical information in the feature map, and it uses convolutions of different sizes to fuse features of different scales. It can therefore simultaneously address the problems of complex background noise and significant differences in object scales in remote sensing images. Next, we provide a detailed introduction to the implementation details of this method.
First, we split the input feature map into two sub-feature maps in the channel dimension to perform different operations on the two while reducing the parameters. Then, to avoid the problem of introducing additional background noise with square convolution, we use 1 × n and n × 1 strip convolutions for the two sub-feature maps, respectively, to focus on spatial information in the horizontal and vertical directions of the feature map. Furthermore, to address the problem of object scale differences in remote sensing images, we use three different convolution kernel sizes for each sub-feature map of strip convolutions. This method allows us to focus on multiple different scales of objects and fuse multi-scale features.
Specifically, for an input feature map X with C channels, we split X evenly into two parts along the channel dimension, as shown in Equation (1):

$$A, B = \mathrm{Split}(X) \quad (1)$$

where $\mathrm{Split}(\cdot)$ denotes the split operation applied to the input feature map X, which equally divides X into two sub-feature maps, each having half the number of channels of the original input feature map; A and B represent the sub-feature maps obtained after the split operation.
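As a concrete illustration of Equation (1), the channel split can be implemented with a single tensor operation; the tensor shape below is only an example.

```python
import torch

x = torch.randn(1, 64, 80, 80)          # example feature map with C = 64 channels
a, b = torch.chunk(x, chunks=2, dim=1)  # two sub-feature maps with C/2 = 32 channels each
print(a.shape, b.shape)                 # torch.Size([1, 32, 80, 80]) for both
```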
Next, we perform operations on the two sub-feature maps obtained from the split operation. Both sub-feature maps A and B undergo three parallel strip convolutions with different kernel sizes, focusing on spatial information in the width and height directions of the feature map. Then, the results of the three convolutions are added and undergo a nonlinear transformation through an activation function, which can be represented as Equations (2) and (3):

$$A' = S\left(\sum_{i=1}^{3} F_{1 \times k_i}(A)\right) \quad (2)$$

$$B' = S\left(\sum_{i=1}^{3} G_{k_i \times 1}(B)\right) \quad (3)$$

where $F_{1 \times k_i}$ represents the convolution of feature map A with a strip-shaped kernel of size $1 \times k_i$, $G_{k_i \times 1}$ represents the convolution of feature map B with a strip-shaped kernel of size $k_i \times 1$, S denotes the SiLU activation function, and $A'$ and $B'$ represent the feature maps obtained by adding the convolution results from the different kernel sizes. We use strip convolution for two reasons. First, it allows horizontal and vertical information to be extracted separately along different directions, avoiding the large amount of background noise that square convolution would introduce. Second, a strip convolution is relatively lightweight compared to a standard convolution.
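To make the directional convolutions in Equations (2) and (3) concrete, a brief PyTorch sketch is given below; the kernel size of 11 and the channel count are illustrative assumptions, since the actual kernel sizes come from the mapping function in Equation (4).

```python
import torch
import torch.nn as nn

k = 11  # example kernel size; the actual k_i values are derived from the feature-map size
horizontal = nn.Conv2d(32, 32, kernel_size=(1, k), padding=(0, k // 2))  # 1 x k strip convolution
vertical = nn.Conv2d(32, 32, kernel_size=(k, 1), padding=(k // 2, 0))    # k x 1 strip convolution

a = torch.randn(1, 32, 80, 80)
print(horizontal(a).shape, vertical(a).shape)  # spatial size is preserved by the padding
```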
The kernel size $k_i$ determines the receptive field of a feature map and significantly impacts detection performance, so using convolution kernels of varying sizes is vital for computing features at different scales. To determine the kernel size of each strip convolution, we propose a mapping function represented by Equation (4). In Equation (4), w and h represent the width and height of the input feature map, ⌊·⌋ represents the floor operation, and $k_i$ represents the size of the i-th strip convolution kernel. Equation (4) calculates three different convolution kernel sizes according to the size of the input feature map. Using three strip convolutions with different kernel sizes extracts multi-scale features, thereby focusing on objects of different scales in the image and addressing the problem of significant scale differences among objects in remote sensing images.
Afterward, $A'$ and $B'$ are concatenated along the channel dimension to obtain a feature map with C channels. Then, a 1 × 1 convolution is employed to model the relationship between different channels. The convolution result is used as attention weights and multiplied with the original input, thereby re-weighting the input of the module, as shown in Equation (5):

$$Y = f\big(\mathrm{Concat}(A', B')\big) * X \quad (5)$$

where f represents a convolution operation of size 1 × 1, $\mathrm{Concat}(\cdot)$ represents a concatenation operation along the channel dimension, ∗ represents the element-wise product between the attention weights and the input feature map X, which applies the attention weights to the original input feature map, and Y represents the output feature map filtered by the attention weights.
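Putting Equations (1)-(5) together, a compact sketch of the MSCAM forward pass might look as follows; the three kernel sizes are hard-coded for brevity, whereas in practice they are derived from the feature-map size, so this should be read as an illustrative sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Illustrative sketch of the multi-scale strip convolution attention mechanism."""
    def __init__(self, channels, kernel_sizes=(7, 11, 21)):  # kernel sizes assumed for illustration
        super().__init__()
        half = channels // 2
        # Horizontal (1 x k) strip convolutions applied to sub-feature map A.
        self.h_convs = nn.ModuleList(
            nn.Conv2d(half, half, (1, k), padding=(0, k // 2)) for k in kernel_sizes
        )
        # Vertical (k x 1) strip convolutions applied to sub-feature map B.
        self.v_convs = nn.ModuleList(
            nn.Conv2d(half, half, (k, 1), padding=(k // 2, 0)) for k in kernel_sizes
        )
        self.act = nn.SiLU()
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)  # models channel relationships

    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)                      # Equation (1): channel split
        a = self.act(sum(conv(a) for conv in self.h_convs))  # Equation (2)
        b = self.act(sum(conv(b) for conv in self.v_convs))  # Equation (3)
        weights = self.fuse(torch.cat([a, b], dim=1))        # 1 x 1 convolution -> attention weights
        return weights * x                                   # Equation (5): re-weight the input
```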
Integration strategy: The MSCAM proposed in this paper is a plug-and-play module, and its overall structure is shown in
Figure 6. In the experimental part, we applied the attention module to the last convolutional block of the network backbone. In the comparative experiment, other attention modules were added to the same position, as shown in
Figure 7.
3.3. Improved PANet by GSConv
In practical applications of remote sensing image object detection tasks, models often need to be deployed and run in resource-limited environments, such as drones and satellites. Therefore, we need to adopt a lightweight design strategy for remote sensing image object detection models to reduce their size and complexity, making them more suitable for practical application scenarios. In the YOLOv5s network architecture, the feature fusion part adopts the Path Aggregation Network (PANet) structure, which performs another bottom-up fusion based on the top-down feature fusion of the Feature Pyramid Network (FPN), enabling shallow information to be better utilized in the deep layers of the network. However, this approach introduces a longer propagation path and more convolutional operations, resulting in an increase in the number of model parameters, which is not conducive to a lightweight model.
In the feature fusion part of the network, the feature map usually contains numerous channels but has smaller width and height dimensions. This elongated tensor shape is suitable for lightweight processing with DWConv, but the sparse calculation of DWConv can lead to a decrease in accuracy. The dense calculation of PWConv can avoid the loss of semantic information caused by sparse connections. GSConv combines the calculation results of PWConv and DWConv through shuffle, mixing the precise results of PWConv dense calculation into the calculation results of DWConv, thus achieving a lightweight and efficient convolution method. Therefore, in this work, we introduce GSConv into the feature fusion part of the network to achieve a new lightweight feature fusion network called GS-PANet.
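A minimal sketch of GSConv as described above (a dense point-wise convolution, a depth-wise convolution on its result, and a channel shuffle that mixes the two) might look like the following; the exact layer ordering, normalization, and activations of the original GSConv implementation may differ.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of GSConv: mix dense (PWConv) and sparse (DWConv) results via a channel shuffle."""
    def __init__(self, c_in, c_out):
        super().__init__()
        half = c_out // 2
        self.pw = nn.Conv2d(c_in, half, kernel_size=1)                           # dense point-wise conv
        self.dw = nn.Conv2d(half, half, kernel_size=3, padding=1, groups=half)   # sparse depth-wise conv

    def forward(self, x):
        dense = self.pw(x)
        sparse = self.dw(dense)
        y = torch.cat([dense, sparse], dim=1)
        # Channel shuffle: interleave the dense and sparse halves so that the precise
        # PWConv information is mixed into the DWConv result.
        n, c, h, w = y.shape
        return y.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)
```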
First, based on the superiority of GSConv, we replace the ordinary convolution in the original PANet with a GSConv module. Then, by introducing GSConv to improve the C3 module, we design a more efficient GSC3 module, whose structure is shown in
Figure 8. The input feature map first undergoes feature extraction in two parts through the CSP structure, and then the merged result is input into GSConv for mixing the results of PWConv and DWConv.
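Based on that description, the GSC3 module could be sketched roughly as follows, reusing the GSConv sketch and imports above; the branch widths and the omission of internal bottlenecks are simplifying assumptions rather than details taken from the paper.

```python
class GSC3(nn.Module):
    """Rough sketch of GSC3: a CSP-style two-branch block whose merged result passes through GSConv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        half = c_in // 2
        self.branch1 = nn.Conv2d(c_in, half, kernel_size=1)  # main branch (internal bottlenecks omitted)
        self.branch2 = nn.Conv2d(c_in, half, kernel_size=1)  # shortcut branch
        self.gsconv = GSConv(c_in, c_out)                     # mixes the merged result (see sketch above)

    def forward(self, x):
        merged = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return self.gsconv(merged)
```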
The GSC3 module can replace any C3 module in the feature fusion layer of the network. To explore the best embedding scheme for the GSC3 module, we provide three design schemes for the GS-PANet, whose structures are shown in
Figure 9. Among them, GS-PANet1 only introduces GSConv, GS-PANet2 replaces the first C3 module in the original PANet with the GSC3 module while introducing GSConv, and GS-PANet3 replaces all C3 modules with GSC3 modules.
Table 1 shows the comparative experimental results of the three structures GS-PANet1, 2, and 3. It can be seen that although GS-PANet3, which uses more GSC3 modules, reduces the number of parameters further, it also loses some detection accuracy. We attribute this to the fact that the three additional GSC3 modules in GS-PANet3 are closer to the detection head than in GS-PANet2; these layers typically require a larger receptive field to capture the overall features of an object, so the lighter GSC3 module is not suitable there. Therefore, we adopt GS-PANet2, which performs better, as the final design scheme.
3.4. Wise-Focal CIoU Loss
The loss function in object detection is an indicator used to measure the difference between the predicted results of the model and the ground-truth labels. By minimizing the loss function, the model is optimized so that its output becomes closer to the ground-truth labels. Therefore, designing a suitable loss function has a crucial influence on the final detection performance of the model. In the YOLOv5 algorithm, the loss function consists of three parts: confidence loss, classification loss, and localization loss. Among them, the localization loss has a direct impact on the localization quality of the predicted box. YOLOv5 uses the CIoU loss for localization to guide the bounding box regression process. It considers three aspects: the overlap between the predicted and ground-truth boxes, the distance between their center points, and the aspect ratio. The calculation of the CIoU loss is shown in Equations (6) and (7):
$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \quad (6)$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2} \quad (7)$$

where IoU represents the intersection over union of the predicted box and the ground-truth box areas, b and $b^{gt}$ represent the center point coordinates of the predicted box and the ground-truth box, respectively, $\rho(\cdot)$ represents the Euclidean distance, c represents the diagonal distance of the minimum bounding rectangle of the predicted box and the ground-truth box, w, h and $w^{gt}$, $h^{gt}$ denote the width and height of the predicted box and the ground-truth box, v is the weight function used to measure the similarity of the aspect ratio, and $\alpha$ is the trade-off coefficient that weights v.
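For reference, a self-contained PyTorch sketch of the standard CIoU loss in Equations (6) and (7) is given below; it assumes axis-aligned boxes in (x1, y1, x2, y2) format and follows the widely used formulation rather than YOLOv5's exact implementation.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes in (x1, y1, x2, y2) format; a sketch of Equations (6) and (7)."""
    # Intersection over union.
    ix1, iy1 = torch.max(pred[..., 0], target[..., 0]), torch.max(pred[..., 1], target[..., 1])
    ix2, iy2 = torch.min(pred[..., 2], target[..., 2]), torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance rho^2 and squared diagonal c^2 of the smallest enclosing box.
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2
            + (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v (Equation (7)) and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (
        torch.atan((target[..., 2] - target[..., 0]) / (target[..., 3] - target[..., 1] + eps))
        - torch.atan((pred[..., 2] - pred[..., 0]) / (pred[..., 3] - pred[..., 1] + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v  # Equation (6)
```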
In remote sensing image object detection, due to the complexity of the image background and the differences in object size and shape, the same image usually contains samples of different qualities, and the importance of varying quality samples to the regression process is different. Since the model only performs the bounding box regression process on the positive samples containing the foreground objects, we only focus on the effect of the quality of the positive samples here. For positive samples, the higher the IoU, the higher the quality of the sample. High-quality samples with high IoU provide more accurate and clearer object information, which can guide the training of the model more accurately than low-quality samples. This conclusion has been demonstrated in Prime Sample Attention (PISA) [
41]. Therefore, for the same consideration, we expect high-quality samples to be given more weight than low-quality samples in remote sensing images.
Through the study of the CIoU loss, we found that although it considers the IoU, center point distance, and aspect ratio between the predicted box and the ground-truth box, it does not consider the difference in importance of samples of different quality to the regression process. The CIoU loss optimizes samples of different qualities with equal weights, which is unfair to the more important high-quality samples. At the same time, low-quality samples with low IoU carry too much weight and interfere with the regression of high-quality samples with high IoU, leading to a decrease in regression accuracy. As shown in Figure 10, the network mistakenly predicted a ship object with a confidence of only 0.3, which interfered with the regression of other samples and resulted in a missed detection of a storage tank at the same location.
Libra R-CNN [
42] and Dynamic R-CNN [
43] propose that high-quality samples should have more gradient contribution in the model optimization process. They revised the SmoothL1 loss [
44] to reweight these predicted bounding boxes. Therefore, the CIoU loss also needs to be optimized to address this issue. We first introduce a power transformation $\beta$ greater than 1 for the regression penalty terms of the CIoU loss, thereby enhancing the weight contribution of high-quality samples in regression and strengthening their impact on model training. With this power transformation, we propose the Wise CIoU loss, whose calculation is shown in Equation (8):

$$L_{Wise\text{-}CIoU} = 1 - IoU + \left(\frac{\rho^{2}(b, b^{gt})}{c^{2}}\right)^{\beta} + (\alpha v)^{\beta} \quad (8)$$
The Wise CIoU loss introduces the power transformation $\beta$ into the Euclidean distance term between the predicted box and the ground-truth box and into the aspect ratio regression term. Since $\beta$ is greater than 1, the Wise CIoU loss further amplifies the penalty terms for the Euclidean distance and the aspect ratio between the predicted box and the ground-truth box, thereby amplifying the weights of high-quality samples during training. It can adjust the weight of the loss function more effectively according to the current sample and increase the loss contribution of high-quality samples. In addition, since the model pays more attention to high-quality samples with high IoU, it can also alleviate the interference of complex background noise in remote sensing images. $\beta$ is a hyperparameter in the training process, and we discuss its value later.
The Wise CIoU loss increases the loss weight of high-quality samples with high IoU. However, in remote sensing image object detection, the vast majority of predicted boxes obtained based on anchors have small IoU between them and the ground truth boxes, implying the existence of a large number of low-quality samples (outliers). EIoU loss [
45] suggests that this large number of low-quality samples with low IoU tends to cause drastic fluctuations in the regression loss during training. Therefore, to further reduce the negative impact of low-quality samples in remote sensing images on the regression loss, we improve upon the Wise CIoU loss. Inspired by the EIoU loss, we propose a new loss function, the Wise-Focal CIoU loss, which is calculated as shown in Equation (9):

$$L_{Wise\text{-}Focal\ CIoU} = IoU^{\gamma} \cdot L_{Wise\text{-}CIoU} \quad (9)$$
Specifically, we multiply the Wise CIoU loss by a balance term $IoU^{\gamma}$, which is used to suppress the weight of low-quality samples, to obtain the final form of the Wise-Focal CIoU loss. $IoU^{\gamma}$ reweights the Wise CIoU loss, where $\gamma$ is a hyperparameter that controls the extent of outlier suppression; we keep the same value as used in the EIoU loss. The introduction of the balance term suppresses the gradient contribution of low-quality samples with low IoU to the regression loss, preventing the large number of low-quality samples in remote sensing images from dominating and skewing the loss function, thereby further enhancing the effect of model regression.
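Under the reconstruction above, a hedged sketch of the Wise-Focal CIoU loss could look as follows; the placement of the exponent $\beta$ and the balance term $IoU^{\gamma}$ is inferred from the text, and `ciou_terms` is a hypothetical helper returning the IoU, distance, and aspect-ratio components of the CIoU loss.

```python
def wise_focal_ciou_loss(pred, target, gamma, beta=2.0):
    """Hedged sketch of the Wise-Focal CIoU loss as reconstructed in Equations (8) and (9)."""
    # beta: power transformation (> 1) amplifying the penalty terms; the paper finds beta = 2 works best.
    # gamma: outlier-suppression exponent of the balance term, kept the same as in the EIoU loss.
    iou, dist_penalty, aspect_penalty = ciou_terms(pred, target)  # hypothetical helper returning
                                                                  # IoU, rho^2 / c^2, and alpha * v
    wise_ciou = 1 - iou + dist_penalty ** beta + aspect_penalty ** beta  # Equation (8), as reconstructed
    return iou ** gamma * wise_ciou                                      # Equation (9), as reconstructed
```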
In summary, our proposed Wise-Focal CIoU loss provides a reasonable re-weighting of samples of different quality in the regression process by introducing a power transformation and a balance term. In this way, the contribution of samples of different quality to the regression loss can be flexibly adjusted, thus effectively improving the accuracy and robustness of remote sensing image object detection.
Hyperparameter discussion: a power transformation $\beta$ is introduced in the Wise-Focal CIoU loss. To explore the effect of different values of $\beta$, we conducted a series of comparative experiments, with $\beta$ set to 2, 3, 4, 5, and 6, respectively. The experimental results are shown in Table 2.
The experimental results in Table 2 show that the detection accuracy gradually decreases as the power transformation $\beta$ increases, and the best result is obtained when $\beta$ is set to 2. In the subsequent experiments, we uniformly set the value of $\beta$ to 2.