To address the many challenges of small object detection in aerial images, this paper proposes the EMFE-YOLO method. This section provides a detailed introduction to its core structure and key improvements from four perspectives: Overview, EALF Structure, EMFE Module, and DySample.
3.2. EALF Structure
The YOLOv8 backbone employs five downsampling operations to extract features. The feature maps from the $P_3$, $P_4$, and $P_5$ layers are fed into the neck for multi-scale feature fusion. With an input image size of $640\times640$, the sizes of the feature fusion layers at the neck and the final detection feature maps are $80\times80$, $40\times40$, and $20\times20$, respectively. However, most targets in UAV aerial images are small objects. Large-scale (high-resolution) features contain rich spatial details and are essential for detecting them. Consequently, enhancing the utilization of large-scale features can improve the ability to locate and recognize small objects. Therefore, we optimize the network structure of YOLOv8 and propose the EALF structure to better adapt to the small object detection task in UAV aerial images. The specific design of the EALF structure is shown in
Figure 2.
In the neck, feature fusion layers at the $160\times160$ and $80\times80$ scales are added in both the top-down and bottom-up paths to efficiently extract and fuse large-scale feature maps, which enhances the representation of large-scale features. In the top-down path, the $160\times160$ fusion layer receives feature information from the $P_2$ layer of the backbone and concatenates it with the upsampled output of the $80\times80$ fusion layer (Equation (1)). This design ensures that this layer retains the highest resolution, enhancing its ability to capture spatial detail information. The $80\times80$ fusion layer accepts feature information from the $P_3$ layer of the backbone and concatenates it with the upsampled output of the deeper $40\times40$ fusion layer (Equation (2)). In contrast to the $160\times160$ layer, the $80\times80$ layer introduces some semantic features, which strengthens the ability to differentiate between small object categories and compensates for the lack of semantic information in large-scale features. In the bottom-up path, the $80\times80$ fusion layer accepts feature information from its top-down counterpart and concatenates it with the downsampled output of the $160\times160$ fusion layer (Equation (3)); likewise, the $40\times40$ fusion layer accepts feature information from its top-down counterpart and concatenates it with the downsampled output of the bottom-up $80\times80$ fusion layer (Equation (4)). By fully utilizing large-scale features, the model's sensitivity to spatial detail information is enhanced, which mitigates the loss of fine-grained information in deeper layers.
In the detection head, a detection head at the $160\times160$ scale is introduced and connected to the $160\times160$ fusion layer, which improves the ability to locate and recognize small objects. At the same time, the detection head at the $20\times20$ scale and its redundant neck layers are removed to decrease the number of parameters. The neck structure is thus simplified to achieve lightweight feature fusion without reducing detection accuracy.
3.3. EMFE Module
Small objects in UAV aerial images usually have a small pixel area and limited feature representation, and are easily affected by complex backgrounds. Although the EALF structure can improve the detection accuracy of small objects, its ability to distinguish features is still limited in the presence of background noise. To address this issue, this paper proposes the EMFE module, which is centered around the EMFEBlock and combines convolutional layers with residual connections. The EMFE structure is shown in
Figure 3.
Start with an input feature map $X \in \mathbb{R}^{H\times W\times C}$ (where $H$ and $W$ are the height and width of the feature map, and $C$ is the number of channels). Firstly, a $1\times1$ convolution is used to transform the features of $X$. Then, the feature map is split along the channel dimension into two equal parts, $Y$ and $Z$. $Y$ is fed into the EMFEBlock module for feature extraction and enhancement, while $Z$ is fed into the Concat module for concatenation with the enhanced output $Y'$. Finally, the concatenated result is passed through a $1\times1$ convolution again to obtain the output. The computational process can be formulated as follows:
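In the notation assumed here (with $\mathrm{Split}$ denoting the channel split, $\mathrm{Concat}$ channel concatenation, and $Y'$ the EMFEBlock output):
$$
\begin{aligned}
[Y, Z] &= \mathrm{Split}\bigl(\mathrm{Conv}_{1\times1}(X)\bigr),\\
Y' &= \mathrm{EMFEBlock}(Y),\\
X_{\mathrm{out}} &= \mathrm{Conv}_{1\times1}\bigl(\mathrm{Concat}(Y', Z)\bigr).
\end{aligned}
$$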
Figure 3 also shows the working process of the EMFEBlock. Begin with the input $Y$. Firstly, a depthwise convolution (DWConv) is used to capture a wide range of contextual information in the low-dimensional feature space, yielding $F_1$. Secondly, a $1\times1$ pointwise convolution (PWConv) is applied to expand the number of feature channels to twice that of the input, yielding $F_2$ and enhancing the richness of feature expression. Then, a DWConv is applied again to further extract deep features in the high-dimensional feature space, yielding $F_3$ and enhancing the expression of fine-grained information. Subsequently, $F_3$ is fed into the SCSA module for feature enhancement to obtain $F_4$, which suppresses background noise through the synergy of spatial and channel attention. Finally, $F_4$ is concatenated with the low-dimensional feature maps $Y$ and $F_1$ along the channel dimension, and the block output $Y'$ is obtained by integrating and reducing the dimensionality of the multi-scale features through a $1\times1$ PWConv, which achieves a unified representation and effective fusion of multi-scale information. The process can be formulated as follows:
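In the notation assumed here (with $F_1$–$F_4$ the intermediate features described above; the DWConv kernel sizes are design choices of the module and are omitted):
$$
\begin{aligned}
F_1 &= \mathrm{DWConv}(Y), \qquad F_2 = \mathrm{PWConv}_{C\rightarrow 2C}(F_1), \qquad F_3 = \mathrm{DWConv}(F_2),\\
F_4 &= \mathrm{SCSA}(F_3), \qquad Y' = \mathrm{PWConv}\bigl(\mathrm{Concat}(Y, F_1, F_4)\bigr).
\end{aligned}
$$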
SCSA enhances feature representation at both the spatial and channel levels to provide more comprehensive and detailed support for subsequent detection. SCSA consists of Shareable Multi-Semantic Spatial Attention (SMSA) and Progressive Channel-wise Self-Attention (PCSA). SMSA integrates multi-semantic information to generate comprehensive spatial feature representations, which provide valuable spatial prior information to PCSA and guide it to adjust channel feature weights more accurately. PCSA implements feature interaction at the channel level through a single-head self-attention mechanism, alleviating conflicts between multi-semantic features. The SCSA computation process is shown in
Figure 4. SMSA decomposes the input feature map $X$ into two unidirectional one-dimensional sequence structures $X_H$ and $X_W$ along the height ($H$) and width ($W$) dimensions, and uniformly divides them into $n$ equally sized sub-features. Subsequently, the MS-DWConv1d module is used to extract the semantic information of the sub-features. Finally, the spatial attention maps $\mathrm{Attn}_H$ and $\mathrm{Attn}_W$ are generated by concatenating the outputs of MS-DWConv1d, applying Group Normalization (GN), and using the Sigmoid activation function. The process can be formulated as follows:
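In the notation assumed here (with $\sigma$ denoting the Sigmoid activation and the concatenation taken over the outputs of MS-DWConv1d on the $n$ sub-features):
$$
\mathrm{Attn}_H = \sigma\Bigl(\mathrm{GN}\bigl(\mathrm{Concat}\bigl(\mathrm{MSDWConv1d}(X_H)\bigr)\bigr)\Bigr),\qquad
\mathrm{Attn}_W = \sigma\Bigl(\mathrm{GN}\bigl(\mathrm{Concat}\bigl(\mathrm{MSDWConv1d}(X_W)\bigr)\bigr)\Bigr),
$$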
where $\mathrm{Attn}_H$ and $\mathrm{Attn}_W$ denote the attention in the height and width dimensions, respectively. Then, the output of SMSA is taken as the input of PCSA, where average pooling is applied to compress the spatial dimensions. Subsequently, the query ($Q$), key ($K$), and value ($V$) vectors are generated by using DWConv, and the channel attention is obtained through Channel-wise Single-Head Self-Attention (CA-SHSA), which can be calculated by the following equation:
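In standard scaled dot-product form, with the attention computed along the channel dimension and $d$ denoting the key dimension (notation assumed here):
$$
X_{\mathrm{attn}} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V.
$$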
The final channel attention is then obtained by an average pooling operation followed by the Sigmoid activation function:
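Again in the assumed notation:
$$
\mathrm{Attn}_C = \sigma\bigl(\mathrm{AvgPool}(X_{\mathrm{attn}})\bigr).
$$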
In summary, EMFE achieves efficient feature extraction and fusion through depthwise separable convolutions (DSCs) at different scales, and incorporates the SCSA module to enhance the expression of contextual information. In addition, it introduces a residual connection to improve gradient flow, enhancing the stability of feature representation and the utilization of multi-scale information.
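As an illustration only, a minimal PyTorch-style sketch of this structure is given below. The DWConv kernel sizes, the channel widths of the fusion step, and the placement of the residual connection are assumptions made for the sketch, and SCSA is passed in as an external, channel-preserving attention module rather than re-implemented.

```python
import torch
import torch.nn as nn


class EMFEBlockSketch(nn.Module):
    """Sketch of the EMFEBlock: DWConv -> PWConv (channels x2) -> DWConv -> SCSA -> fuse."""

    def __init__(self, c, scsa: nn.Module, k1=3, k2=3):
        # k1, k2 are assumed DWConv kernel sizes; scsa must preserve its input channel count (2c)
        super().__init__()
        self.dw1 = nn.Conv2d(c, c, k1, padding=k1 // 2, groups=c)              # low-dimensional DWConv -> F1
        self.pw1 = nn.Conv2d(c, 2 * c, 1)                                      # PWConv expanding channels -> F2
        self.dw2 = nn.Conv2d(2 * c, 2 * c, k2, padding=k2 // 2, groups=2 * c)  # high-dimensional DWConv -> F3
        self.scsa = scsa                                                       # spatial/channel attention -> F4
        self.pw2 = nn.Conv2d(4 * c, c, 1)                                      # fuse Concat(Y, F1, F4) back to c

    def forward(self, y):
        f1 = self.dw1(y)
        f4 = self.scsa(self.dw2(self.pw1(f1)))
        return self.pw2(torch.cat((y, f1, f4), dim=1))


class EMFESketch(nn.Module):
    """Sketch of the EMFE module: 1x1 conv, channel split, EMFEBlock on one half, concat, 1x1 conv."""

    def __init__(self, c, scsa: nn.Module):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c, 1)
        self.block = EMFEBlockSketch(c // 2, scsa)
        self.cv2 = nn.Conv2d(c, c, 1)

    def forward(self, x):
        y, z = self.cv1(x).chunk(2, dim=1)                    # split into Y and Z along the channel dimension
        out = self.cv2(torch.cat((self.block(y), z), dim=1))  # Concat(Y', Z) followed by a 1x1 conv
        return out + x                                        # residual connection (assumed placement)
```

Passing `nn.Identity()` in place of SCSA gives a quick way to sanity-check tensor shapes (for example, a $1\times64\times80\times80$ input is preserved), while the full module would plug in the actual SCSA implementation.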
3.4. DySample
Upsampling restores spatial information by reconstructing the features. YOLOv8 constructs a bidirectional cross-scale feature fusion mechanism through a cascade architecture of Feature Pyramid Network (FPN) and Path Aggregation Network (PAN), where the upsampling module is responsible for gradually aligning spatial features with semantic features. The default upsampling method in YOLOv8 is nearest neighbor interpolation. Although this method is simple, it lacks adaptability to input content and may lose details when dealing with small objects or complex backgrounds.
Therefore, the DySample module is used for dynamic upsampling in the neck. DySample makes upsampling more flexible and accurate by formulating upsampling from the perspective of point sampling. Compared with kernel-based upsamplers (e.g., CARAFE [35] and FADE [36]), DySample avoids their higher computational cost. Its sampling process is shown in
Figure 5. Given an input feature map $X \in \mathbb{R}^{C\times H\times W}$ (where $C$ denotes the channel dimension of $X$, and $H$ and $W$ denote its height and width, respectively), an upsampling scale factor $r$, and a static range factor of 0.25, pixel shuffle is first applied to $X$ to obtain an output of size $\frac{C}{r^{2}}\times rH\times rW$. (We conducted preliminary experiments by setting the static range factor to 0.1, 0.25, 0.5, and 1 for comparison. The results showed that the model achieves the best performance with 0.25, which is therefore adopted in this study.) This output is then passed through a linear layer and multiplied by the static range factor 0.25 to obtain the offset $O \in \mathbb{R}^{d\times rH\times rW}$ (where $d = 2$ represents the $x$ and $y$ coordinates of the sampling points). Finally, $O$ is added to the original sampling grid $P$ to obtain the sampling set $S$. This process can be formulated as follows:
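In the notation assumed here (with $\mathcal{PS}$ denoting pixel shuffle and $\mathrm{Linear}$ the linear layer):
$$
O = 0.25\cdot\mathrm{Linear}\bigl(\mathcal{PS}(X)\bigr),\qquad S = P + O.
$$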
Finally, $X$ and $S$ are fed into the grid_sample function, and $X$ is resampled by bilinear interpolation at the sampling points in $S$ to obtain the upsampled feature map $X' \in \mathbb{R}^{C\times rH\times rW}$. This process can be expressed as follows:
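In the same assumed notation:
$$
X' = \mathrm{grid\_sample}(X, S).
$$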
DySample can adaptively select key information for sampling, which generates higher-resolution features with greater expressive power. This helps mitigate the detection difficulties caused by low resolution and limited feature information in small objects.
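As an illustration of the point-sampling formulation only, the following PyTorch-style sketch generates offsets from the pixel-shuffled feature, adds them to a static sampling grid, and resamples with grid_sample. The offset head, grid construction, and coordinate normalization are assumptions made for the sketch and may differ from the official DySample implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DySampleSketch(nn.Module):
    """Sketch of dynamic point-sampling upsampling with a static range factor."""

    def __init__(self, channels, scale=2, range_factor=0.25):
        super().__init__()
        assert channels % (scale * scale) == 0
        self.scale = scale
        self.range_factor = range_factor
        # linear (1x1) layer applied after pixel shuffle: C/r^2 channels -> 2 offset channels (x, y)
        self.linear = nn.Conv2d(channels // (scale * scale), 2, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        r = self.scale
        # pixel shuffle: (b, C, H, W) -> (b, C/r^2, rH, rW); offsets O attenuated by the range factor
        offset = self.range_factor * self.linear(F.pixel_shuffle(x, r))      # (b, 2, rH, rW)
        # static sampling grid P: each output position points back to its source location in X
        ys, xs = torch.meshgrid(
            torch.arange(h * r, device=x.device, dtype=x.dtype),
            torch.arange(w * r, device=x.device, dtype=x.dtype),
            indexing="ij",
        )
        base = torch.stack(((xs + 0.5) / r - 0.5, (ys + 0.5) / r - 0.5))     # (2, rH, rW), in X's pixel coords
        # sampling set S = P + O, normalized to [-1, 1] for grid_sample
        coords = base.unsqueeze(0) + offset                                  # (b, 2, rH, rW)
        norm = torch.tensor([w, h], device=x.device, dtype=x.dtype).view(1, 2, 1, 1)
        grid = (2.0 * (coords + 0.5) / norm - 1.0).permute(0, 2, 3, 1)       # (b, rH, rW, 2), (x, y) order
        # bilinear resampling of X at the dynamic sampling points
        return F.grid_sample(x, grid, mode="bilinear", align_corners=False, padding_mode="border")
```

Setting range_factor to 0 reduces the module to sampling at the static grid positions, which makes the contribution of the learned offsets explicit; the value 0.25 attenuates the offsets so that sampling stays close to the static grid, consistent with the static range factor described above.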