1. Introduction
With the continuous advancement of technology, images captured by drones are now widely used in remote sensing, agriculture, wildlife conservation [1,2], and disaster surveillance. Although existing object detectors have made significant advances in object detection for natural scenes, the following problems remain when such general-purpose detectors are applied directly to remote sensing images: (i) The varying flight altitudes of UAVs lead to large scale differences for objects of the same class in the captured images; this is the case, for example, for the images in the VisDrone2021 dataset. (ii) Small objects have few effective pixels, limited feature expression, and a susceptibility to background interference; we also refer to them as spectral mixtures. (iii) Loss functions based on Intersection over Union (IoU) variants are more sensitive to positional offsets of small objects than of larger objects, so small objects are difficult to localize.
In practical applications, the large differences in scale between the various objects in remote sensing imagery present a greater challenge for object detectors. It is therefore vital to build a detection network that can detect objects at different scales. A prevalent approach to handling varying object scales is to construct multi-layer feature fusion, such as the Feature Pyramid Network (FPN) [3] and the feature fusion modules improved on its basis: the Path Aggregation Network (PANet) [4], the Bi-directional Feature Pyramid Network (Bi-FPN) [5], Adaptively Spatial Feature Fusion (ASFF) [6], and the Neural Architecture Search Feature Pyramid Network (NAS-FPN) [7]. However, small objects have fewer effective pixels, and more of their feature information is lost after passing through the backbone network, so the model fails to learn the important spatial and semantic features of small objects. It is therefore necessary to add shallow branches and to increase the feature map resolution of the detection head in order to mitigate the loss of small-object information.
In object detection, the regression loss function characterizes how well the size and position of the predicted box agree with the size and position of the ground-truth box. The regression loss function has evolved from the L1/L2 loss and the smooth L1 loss [8] to the IoU-based variants [9,10,11,12] commonly used today. YOLOv5 [13] uses GIoU as its position regression loss. GIoU is an improved version of IoU: unlike IoU, which considers only the overlapping region, GIoU also accounts for the non-overlapping regions and can therefore better reflect how well two boxes coincide. However, this loss function is very sensitive to the positional deviation of small objects: a slight positional offset of a small object causes a significant increase or decrease in the IoU value, which makes it unfriendly to small objects. Although other scholars have partially solved the regression problem for small objects using IoU variants, this type of loss function remains unfriendly to small objects. Wang et al. [14] designed the Normalized Wasserstein Distance (NWD) based on a two-dimensional Gaussian distribution, which effectively alleviates the low detection accuracy of common object detection networks on small objects, but they did not consider the advantage of IoU-based loss functions for detecting large and medium objects.
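To make this sensitivity concrete, the following toy calculation (not taken from the original text; the 12 × 12 and 120 × 120 boxes and the 4-pixel offset are arbitrary illustrative choices) shows that the same offset roughly halves the IoU of a small box while barely changing that of a large box.

```python
def iou(box_a, box_b):
    """Compute IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A 12 x 12 "small" object and a 120 x 120 "large" object, each predicted with a 4-pixel shift.
small_gt, small_pred = (0, 0, 12, 12), (4, 0, 16, 12)
large_gt, large_pred = (0, 0, 120, 120), (4, 0, 124, 120)
print(iou(small_gt, small_pred))   # ~0.50: a 4-pixel shift halves the IoU of the small box
print(iou(large_gt, large_pred))   # ~0.94: the same shift barely changes the large box
```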
Regarding the problems above, this paper proposes a remote sensing small-object detection network based on the attention mechanism and multi-scale feature fusion (AMMFN), which can effectively improve the detection accuracy for small objects in remote sensing images while adding only a small number of model parameters. First, the detection head contains the information used for the final classification and regression of objects; to make effective use of the feature information in each detection head, we propose a detection head enhancement module. Second, after multiple convolutions of the input image, feature redundancy occurs in the feature layers; to prevent this redundant information from interfering with small objects, we use an attention mechanism to design a channel cascade module. Then, to address the difficulty that the three detection heads of a generic detector have in detecting small objects, we add a detection head with a higher feature map resolution. Finally, we introduce the NWD loss function, which calculates the similarity between two objects using Gaussian distributions.
The main contributions of this paper can be summarized as follows:
We propose a detection head enhancement module (DHEM) that combines a multi-scale feature fusion module and an attention mechanism module to enhance feature representation and achieve more accurate small-object detection, at the cost of a slight increase in model parameters.
We design an attention-mechanism-based channel cascade module (AMCC) to help the model remove redundant information in the feature layers, highlight small-object feature information, and learn small-object features more efficiently.
We introduce the NWD loss function and combine it with GIoU as the location regression loss function to improve the optimization weight of the model for small objects and the accuracy of the regression boxes. Additionally, an object detection layer is added to improve the object feature extraction ability at different scales.
AMMFN is compared with YOLOv5s and other advanced models on a self-built remote sensing dataset and the publicly available VisDrone2021 dataset, with significant improvements in detection accuracy (mAP).
The remaining sections are organized as follows: In Section 2, we summarize the literature on small-object detection; in Section 3, we describe the improved modules and the reasons for the improvements in detail, including the detection head enhancement module, the attention-based channel cascade module, the regression loss function, and the added detection head; in Section 4, we describe the relevant steps of the experiments and analyze the results; in Section 5, we discuss the advantages and disadvantages of the proposed model; and in Section 6, we conclude this work and put forward directions for optimizing the model.
2. Related Works
Object detection has tremendous practical value and application promise, and it is the cornerstone of many vision tasks, such as face recognition and target tracking [15]. Existing object detection networks can be broadly divided into two categories. One class is the two-stage object detection networks based on the Region-CNN (RCNN) [16], the Fast Region-based Convolutional Network (Fast-RCNN), Faster-RCNN [17], and the Region-based Fully Convolutional Network (R-FCN) [18], which first perform feature extraction and generate a large number of candidate boxes through a backbone network, and then perform classification and regression for the objects. The detection accuracy of this type of network is high, but its real-time performance is low. The other class is the one-stage object detection networks represented by the Single Shot Multi-Box Detector (SSD) [19], the You Only Look Once (YOLO) series [20,21,22,23], the Fully Convolutional One-Stage Object Detector (FCOS) [24], and RetinaNet [25], which directly extract semantic and spatial features from objects and then complete the classification and regression. Although the detection accuracy of such networks is generally lower, their real-time performance is high and they have been broadly used in various scenarios.
Academically, there are two ways to define a small object: relative size and absolute size. The relative size approach considers an object to be small if its size is less than 0.1 of the original image size [26], and the absolute size approach considers an object smaller than a fixed pixel area (commonly 32 × 32 pixels) to be small. This paper uses the absolute size definition. Remote sensing images suffer from complex backgrounds, few effective object pixels, varying scales, and diverse morphologies, which make it difficult for existing general-purpose object detectors to extract accurate and effective feature information to classify and localize small objects. To address the challenge of inaccurate detection of small objects in the field of remote sensing, this paper reviews the work of other scholars from two aspects: multi-scale fusion and the attention mechanism.
Since deep feature layers contain rich semantic information about objects and have a large receptive field, while shallow features contain more fine-grained information, deep and shallow feature information can be combined through multi-scale fusion to increase the accuracy of the model for small-object detection. Qu et al. [27] proposed a small-object detection model called the Dilated Convolution and Feature Fusion Single Shot Multi-box Detector (DFSSD), which improved the detection of small objects in remote sensing to some extent by expanding the receptive field of features, obtaining contextual information at different scales, and enhancing the semantic information of shallow features. Deng et al. [28] designed an Extended Feature Pyramid Network (EFPN) specifically for small-object detection, which contains a Feature Texture Transfer (FTT) module acting on a super-resolution feature map by extracting semantic information and texture features from the FPN feature maps, thereby effectively improving the representation of small-object features while remaining efficient in both computation and storage. Deng et al. [29] proposed a multi-scale dynamic weighted feature fusion network that adaptively assigns different weights to feature layers at different scales during training, increasing the contribution of shallow feature information to the whole network and guiding the model toward small-object detection tasks.
Small objects have few effective pixels and are easily confused with the background, so highlighting the feature information of small objects is essential. An attention module helps the network focus on task-relevant foreground object features within a large amount of background information; thus, attention mechanisms can effectively improve the representation of small-object features. Zhu et al. [30] designed a small-object detection model called Transformer Prediction Heads-YOLOv5 (TPH-YOLOv5), which combines YOLOv5 with the Transformer [31] and integrates the Convolutional Block Attention Module (CBAM) [32] and the self-attention mechanism [33] into the YOLOv5 model to help the network extract small-object feature information, effectively improving its ability to detect small objects from the perspective of low-altitude UAV flight. Shi et al. [34] proposed a feature enhancement module using a location attention mechanism, which, after extracting feature information from different receptive fields, uses channel attention to highlight the contribution of important features and suppress the influence of irrelevant features on the overall feature information, thereby improving the efficiency of the model in detecting small objects. Zhao et al. [35] proposed a feature fusion strategy based on the Efficient Channel Attention (ECA) module to enhance the semantic information of shallow features by fusing object information at different scales, thus improving the performance of the model for small-object detection. Zhang et al. [36] designed a multi-resolution attention detector that captures useful location and contextual information through adaptive learning; attention weights are obtained by calculating the cosine similarity between the other output layers and a template layer and are then used to fuse the first three feature layers of the backbone network into an attention map that highlights the feature information of small objects.
3. Our Work
In this paper, we build on the YOLOv5s model to propose AMMFN. Firstly, the prediction layer is optimized by the detection head enhancement module. Secondly, a channel cascade module based on an attention mechanism is designed to replace the generic concatenation operation in the neck. Then, the NWD and GIoU losses are fused to increase the loss weight of small objects and improve the accuracy of the regression boxes. Finally, a detection head on a shallower feature layer is added for detecting small objects. These four improvements effectively improve the ability of the model to detect small objects.
Figure 1 shows the overall structure of the proposed network model.
3.1. Detection Head Enhancement Module
The detection head part of YOLOv5 contains the final object classification information as well as the regression information of the object box; therefore, the detection head has a large impact on small-object detection. During training, the detection head detects too few small objects because of their weak feature representation and limited pixels, which interferes with the optimization of the weights for small objects. For this reason, it is necessary to significantly enhance the feature-expression capability of the foreground.
Using multi-scale feature fusion, features of different sizes can be obtained and the receptive field can be expanded to strengthen the description of small-object features, thus improving the detection performance of the model for small objects. Based on this idea, this paper proposes a detection head enhancement module, as shown in Figure 2.
The DHEM adopts a multi-branch structure, wherein each branch uses different numbers and sizes of convolution kernels to obtain receptive fields at different scales, and it also uses the idea of residual connectivity. This approach enlarges the range of receptive fields without adding too much computation and enables the model to obtain highly discriminative features while remaining lightweight. However, there are semantic differences between the feature maps at different scales, and the fused feature layers may suffer from a confounding effect that causes the network to confuse the localization and recognition tasks. To mitigate this negative impact, a lightweight channel attention module is used in this paper; it not only reduces the confusion between features but also significantly enhances the feature information of small objects.
Specifically, a 1 × 1 convolution is first used to reduce the number of feature channels and thus the computational cost; then information at different scales is extracted by the three branches and concatenated to obtain a feature map with multi-scale information; a second 1 × 1 convolution is used to organize the information in this feature map and reduce the number of channels; and finally, the model obtains accurate, non-redundant feature information for the final object detection. The formulae are shown in (1) and (2).
where Cat(·) represents the splicing (concatenation) operation, Conv1×1(1) and Conv1×1(2) represent the two convolution operations with 1 × 1 kernels, φ1, φ2, and φ3 are the convolution operations of the three branches shown in Figure 2, and σ represents the Sigmoid activation function.
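As a rough illustration (not the authors' implementation), the PyTorch-style sketch below shows how such a multi-branch enhancement block with a lightweight channel attention and a residual connection could be organized; the branch kernel sizes, channel reduction ratio, and SE-style form of the attention are assumptions.

```python
import torch
import torch.nn as nn

class DHEMSketch(nn.Module):
    """Illustrative multi-branch detection-head enhancement block (assumed layout)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)          # shrink channels first
        # Three branches with different receptive fields (kernel sizes are assumptions).
        self.branch1 = nn.Conv2d(mid, mid, kernel_size=1)
        self.branch2 = nn.Conv2d(mid, mid, kernel_size=3, padding=1)
        self.branch3 = nn.Conv2d(mid, mid, kernel_size=3, padding=2, dilation=2)
        self.fuse = nn.Conv2d(3 * mid, channels, kernel_size=1)        # reorganize and restore channels
        # Lightweight SE-style channel attention (assumed form of the attention module).
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.reduce(x)
        y = torch.cat([self.branch1(y), self.branch2(y), self.branch3(y)], dim=1)
        y = self.fuse(y)
        y = y * self.attn(y)       # channel attention reduces confusion between fused scales
        return y + x               # residual connection keeps the block lightweight

x = torch.randn(1, 128, 40, 40)
print(DHEMSketch(128)(x).shape)    # torch.Size([1, 128, 40, 40])
```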
3.2. Channel Cascade Module Based on Attention Mechanism
Feature fusion is the combination of information from different scales or branches and is an essential part of the object detection network structure. A common method of feature fusion is to merge features by concatenating the channels of the feature maps or by adding them element by element. Element-wise addition keeps the dimensionality unchanged while making the feature map more informative, and it is less computationally intensive than the concatenation approach. However, when the input feature maps are semantically inconsistent or have inconsistent receptive fields, this method may not be the best choice. To prevent the imbalance caused by object scale variation and weak small-object feature information from affecting the detection model in remote sensing images, this paper designs a channel cascade module based on the attention mechanism (AMCC), following the literature [37], as seen in Figure 3.
The module consists mainly of a channel attention mechanism and a spatial attention mechanism. The channel attention mechanism adaptively learns and focuses on the channel weights that are more important for the task, enabling adaptive object selection and directing the network to focus more on important objects. The spatial attention mechanism guides the model to learn and highlight task-relevant foreground objects according to the spatial information of the feature map. The advantages of both are thus combined to direct the network's attention to regions containing small objects, as shown in Equations (3)–(7).
where AvgPool(·) and MaxPool(·) represent the average and maximum pooling operations, respectively, and f_conv indicates all convolution operations in the spatial attention mechanism.
Specifically, given two feature maps X1 and X2, firstly, X1 and X2 are summed element-wise for an initial feature fusion to obtain the feature map Z. Secondly, the feature map Z is input into the channel attention module, which produces, through pooling and convolution operations, channel weights that help the network focus on small-object information; these weights are applied to X1 and X2 to obtain the weighted features X1′ and X2′, respectively. Then, X1′ and X2′ are summed element-wise a second time to obtain the feature map Z′, and Z′ is input into the spatial attention module to obtain the spatial weights associated with the object task. The spatial attention mechanism uses dilated convolution to expand the receptive field and aggregate contextual information. The spatial weights are then applied to X1′ and X2′ to obtain X1″ and X2″, respectively. Finally, the channels of X1″ and X2″ are concatenated to obtain effective feature information at different scales.
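The following PyTorch-style sketch illustrates one possible realization of this attention-guided cascade of two feature maps; the complementary (W, 1 − W) weighting, the pooling choices, and the kernel sizes are assumptions for illustration rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class AMCCSketch(nn.Module):
    """Illustrative attention-guided channel cascade of two feature maps (assumed layout)."""
    def __init__(self, channels, reduction=8, dilation=2):
        super().__init__()
        # Channel attention: global average + max pooling followed by a shared MLP.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Spatial attention: dilated convolution to enlarge the receptive field.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3 * dilation, dilation=dilation)

    def forward(self, x1, x2):
        z = x1 + x2                                             # first element-wise fusion
        avg = self.mlp(torch.mean(z, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(z, dim=(2, 3), keepdim=True))
        wc = torch.sigmoid(avg + mx)                            # channel weights
        x1c, x2c = x1 * wc, x2 * (1 - wc)                       # complementary weighting (assumption)
        z2 = x1c + x2c                                          # second element-wise fusion
        s = torch.cat([torch.mean(z2, 1, keepdim=True),
                       torch.amax(z2, 1, keepdim=True)], dim=1)
        ws = torch.sigmoid(self.spatial(s))                     # spatial weights
        x1s, x2s = x1c * ws, x2c * (1 - ws)
        return torch.cat([x1s, x2s], dim=1)                     # channel concatenation of both inputs

a = torch.randn(1, 64, 80, 80)
b = torch.randn(1, 64, 80, 80)
print(AMCCSketch(64)(a, b).shape)   # torch.Size([1, 128, 80, 80])
```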
3.3. Optimization of the Loss Function
The position regression loss function of YOLOv5s is the GIoU loss, as seen in Equation (8). For small objects, a small positional offset causes a sharp decrease in the IoU value, whereas for large objects the same positional offset produces only a small change in the IoU value. The IoU-based loss function is therefore very sensitive to the positional shift of small objects, which reduces the overall detection accuracy of the object detector, as seen in Figure 4. To solve this problem, we introduce a position regression loss function based on the normalized Wasserstein distance (NWD), which has become a new method for small-object detection and optimization in recent years. NWD models the bounding box with a two-dimensional Gaussian distribution and calculates the similarity between the predicted box and the labelled box through their corresponding Gaussian distributions, i.e., the normalized Wasserstein distance between them, as in Equation (10). This measure consistently reflects the distance between the distributions regardless of whether the boxes overlap. NWD is also insensitive to object scale, making it more appropriate for measuring the similarity between predicted boxes and labelled boxes in remote sensing images. However, this paper does not simply replace the GIoU with the NWD loss function, because GIoU performs better for large- and medium-sized objects. Instead, NWD is fused with GIoU by scaling, according to Equation (11), so that the model can improve the optimization weights for small objects and the accuracy of the regression boxes; the fused term serves as the location regression loss function of AMMFN, and the coefficients a and b of GIoU and NWD are chosen as shown in the ablation experiment section.
where IoU represents the ratio of the intersection of the two rectangular boxes to their union, Ap indicates the area enclosed by the prediction box, Ag indicates the area enclosed by the label box, Ac represents the area of the smallest enclosing rectangle of the prediction box and the label box, and C denotes a constant associated with the dataset; in this paper, the value of C is the number of categories in the dataset. W2 denotes the Wasserstein distance measure, and Np and Ng denote the Gaussian distributions modeled from the prediction box and the label box, respectively.
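For reference, the standard GIoU and NWD formulations that the description above follows are sketched below in LaTeX (based on the original GIoU and NWD papers; the exact form and numbering of Equations (8)–(11) in this paper may differ slightly).

```latex
% GIoU loss, with A_c the smallest enclosing rectangle of prediction box A_p and label box A_g:
\mathrm{GIoU} = \mathrm{IoU} - \frac{\left|A_c \setminus (A_p \cup A_g)\right|}{|A_c|},
\qquad \mathcal{L}_{\mathrm{GIoU}} = 1 - \mathrm{GIoU}

% NWD: each box (cx, cy, w, h) is modeled as a 2D Gaussian with mean (cx, cy)
% and covariance diag(w^2/4, h^2/4); the squared 2-Wasserstein distance is then
W_2^2(\mathcal{N}_p, \mathcal{N}_g) =
  \left\| \left[cx_p,\; cy_p,\; \tfrac{w_p}{2},\; \tfrac{h_p}{2}\right]^{\mathsf{T}}
        - \left[cx_g,\; cy_g,\; \tfrac{w_g}{2},\; \tfrac{h_g}{2}\right]^{\mathsf{T}} \right\|_2^2

\mathrm{NWD}(\mathcal{N}_p, \mathcal{N}_g) =
  \exp\!\left(-\frac{\sqrt{W_2^2(\mathcal{N}_p, \mathcal{N}_g)}}{C}\right),
\qquad \mathcal{L}_{\mathrm{NWD}} = 1 - \mathrm{NWD}

% Fused regression loss with scaling coefficients a and b (cf. Equation (11)):
\mathcal{L}_{\mathrm{reg}} = a\,\mathcal{L}_{\mathrm{GIoU}} + b\,\mathcal{L}_{\mathrm{NWD}}
```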
3.4. Optimization of the Prediction Feature Layer
In remote sensing images, small objects have the drawbacks of small size and insufficient effective information. In the YOLOv5s object detector, the effective pixels of small objects gradually decrease in the process of mapping the input image into feature maps of different scales after repeated down-sampling operations, which makes the network unable to learn the important spatial feature information of small objects well, thus influencing the detection accuracy of the detectors for small objects.
As seen in Figure 1, compared with the original three detection layers of YOLOv5s, detection layer P2 contains richer texture and more detailed information due to fewer down-sampling operations, which can help the model detect small objects in remote sensing images more effectively, so we propose the addition of detection layer P2 to detect small objects on feature maps.
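As a quick illustration of the resolution gained by the extra head (assuming a 640 × 640 input, which is common for YOLOv5 but not stated here), the added P2 head operates on the stride-4 feature map:

```python
# Feature map sizes for an assumed 640 x 640 input at each YOLOv5-style detection level.
input_size = 640
for name, stride in [("P2 (added)", 4), ("P3", 8), ("P4", 16), ("P5", 32)]:
    side = input_size // stride
    print(f"{name}: stride {stride:2d} -> {side} x {side} feature map "
          f"({stride} x {stride} input pixels per cell)")
# A 16 x 16-pixel object covers about 4 x 4 cells on P2 but only ~0.5 x 0.5 cells on P5,
# which is why the shallow head retains far more small-object detail.
```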
6. Conclusions
In this paper, we proposed a remote sensing small-object detection network based on the attention mechanism and multi-scale feature fusion to address the problem that existing object detectors are poor at detecting small objects because of their small size, unsatisfactory feature extraction, and large scale variation in UAV images.
In terms of the network structure, we first designed a detection head enhancement module (DHEM) to enhance the weight information of foreground objects. Secondly, we proposed a channel cascade module based on a multi-scale attention mechanism (AMCC) to reduce redundant information in the feature layers and enhance the feature representation of small objects. Finally, a new detection head was added to predict small objects using shallow fine-grained information. In terms of loss functions, we introduced the NWD loss function to address the problems of small-object optimization weights and inaccurate small-object prediction boxes.
Although the network described in this paper improves the detection of small objects in remote sensing images, there is still room for improvement: the model requires considerable computing power, some objects are still missed, and its detection performance for large objects is relatively poor. The next step is to investigate ways to improve the overall performance of the object detector in a more efficient and lightweight manner.