Article

Multi-Level Contextual and Semantic Information Aggregation Network for Small Object Detection in UAV Aerial Images

School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(9), 610; https://doi.org/10.3390/drones9090610
Submission received: 21 July 2025 / Revised: 25 August 2025 / Accepted: 27 August 2025 / Published: 29 August 2025
(This article belongs to the Special Issue Intelligent Image Processing and Sensing for Drones, 2nd Edition)

Abstract

In recent years, detection methods for generic object detection have achieved significant progress. However, due to the large number of small objects in aerial images, mainstream detectors struggle to achieve a satisfactory detection performance. The challenges of small object detection in aerial images are primarily twofold: (1) Insufficient feature representation: The limited visual information for small objects makes it difficult for models to learn discriminative feature representations. (2) Background confusion: Abundant background information introduces more noise and interference, causing the features of small objects to easily be confused with the background. To address these issues, we propose a Multi-Level Contextual and Semantic Information Aggregation Network (MCSA-Net). MCSA-Net includes three key components: a Spatial-Aware Feature Selection Module (SAFM), a Multi-Level Joint Feature Pyramid Network (MJFPN), and an Attention-Enhanced Head (AEHead). The SAFM employs a sequence of dilated convolutions to extract multi-scale local context features and combines a spatial selection mechanism to adaptively merge these features, thereby obtaining the critical local context required for the objects, which enriches the feature representation of small objects. The MJFPN introduces multi-level connections and weighted fusion to fully leverage the spatial detail features of small objects in feature fusion and enhances the fused features further through a feature aggregation network. Finally, the AEHead is constructed by incorporating a sparse attention mechanism into the detection head. The sparse attention mechanism efficiently models long-range dependencies by computing the attention between the most relevant regions in the image while suppressing background interference, thereby enhancing the model’s ability to perceive targets and effectively improving the detection performance. Extensive experiments on four datasets, VisDrone, UAVDT, MS COCO, and DOTA, demonstrate that the proposed MCSA-Net achieves an excellent detection performance, particularly in small object detection, surpassing several state-of-the-art methods.

1. Introduction

In recent years, with the rapid development of unmanned aerial vehicle (UAV) technology [1,2], aerial image object detection has become an important research direction in the fields of computer vision and remote sensing [3,4,5]. By virtue of their unique advantages, including high mobility, low cost, and flexible deployment, UAVs can effectively overcome the limitations of traditional remote sensing platforms in terms of timeliness and spatial resolution. Consequently, they have been widely applied in numerous domains, such as remote sensing and mapping [6], ecological conservation [7], and smart cities [8]. Although existing mainstream object detectors have achieved significant progress in general object detection, their performance on aerial images has yet to reach satisfactory levels of accuracy and efficiency. This is primarily due to the inherent characteristics of aerial imagery, which is characterized by small-scale objects and complex backgrounds.
Currently, the majority of object detection methods for aerial images are built upon Convolutional Neural Network (CNN) architectures and can be primarily classified into two categories: two-stage and one-stage detectors. Representative examples of two-stage methods include Faster R-CNN [9], Mask R-CNN [10], and Cascade R-CNN [11]. These methods first generate a set of candidate object proposals through a Region Proposal Network (RPN). Subsequently, they perform fine-grained classification and precise bounding box regression on these proposals to achieve accurate detection. However, two-stage methods are often computationally expensive, which makes them challenging to deploy on resource-constrained UAV platforms. To achieve a higher detection efficiency, one-stage detectors, such as the YOLO series [12,13,14,15] and FCOS [16], formulate object detection as a regression problem. They pre-define a series of prior anchors over the feature maps and then directly predict the object class and regress the bounding box coordinates based on these anchors, thus eliminating the proposal generation stage. By integrating the object localization and classification tasks into an end-to-end network, these methods significantly improve both the computational efficiency and inference speed.
As shown in Figure 1, compared to natural images, aerial images captured using UAVs provide a high-altitude perspective [17], which results in a wide field of view [18]. In addition, the key targets in these images are relatively small in size [19,20]. These factors bring significant challenges to UAV-based object detection tasks. Specifically, the difficulties of small object detection in aerial images can be attributed to the following two points:
(1) Insufficient feature representation: Due to their inherently small scale and limited pixel coverage, small targets in aerial images suffer from a scarcity of visual information, such as texture and contours. This makes it difficult for models to construct robust and discriminative feature representations.
(2) Background confusion: The wide field of view of UAVs captures vast backgrounds, and some objects are occluded. This introduces more noise interference, causing the features of small objects to be easily overwhelmed by or confused with the surrounding background.
To address the issues of insufficient feature representation and background confusion in small object detection in aerial images, we propose a Multi-Level Contextual and Semantic Information Aggregation Network (MCSA-Net), which can effectively improve the detection accuracy for small objects. Specifically, MCSA-Net includes three key components: the Spatial-Aware Feature Selection Module (SAFM), the Multi-Level Joint Feature Pyramid Network (MJFPN), and the Attention-Enhanced Head (AEHead). The SAFM first employs a sequence of dilated convolutions to capture multi-scale local contextual information and then integrates a dynamic spatial selection mechanism to adaptively weight and merge these multi-scale features along the spatial dimension. This allows for key local context to be precisely captured, enabling the model to obtain a richer feature representation for small objects. The MJFPN incorporates multi-level connections to facilitate dense information exchange among latent semantic features at various scales. Furthermore, a weighted fusion method is employed, allowing the MJFPN to dynamically adjust the fusion weights based on the importance of different features. The fused features are subsequently enhanced by a feature aggregation network, which is constructed from multi-branch reparameterized convolutional blocks. Finally, the AEHead is proposed by integrating a sparse attention mechanism into the detection head. The sparse attention mechanism calculates the attention among the most relevant spatial regions, achieving efficient extraction of the fine-grained global contextual features while suppressing interference from irrelevant background information. As a result, the AEHead enhances the model’s perceptual ability for small objects, thereby effectively improving the detection performance. Our contributions are as follows:
(1) We propose an MCSA-Net that effectively addresses the issues of insufficient feature representation and background confusion, thereby significantly improving the detection accuracy for small objects in aerial images;
(2) We propose an SAFM that first extracts multi-scale local context features through a dilated convolution sequence and then adaptively merges these features using a dynamic spatial selection mechanism to obtain the key local context features required for the objects, effectively enriching the feature representation of small objects;
(3) We propose an MJFPN that introduces multi-level connections and weighted fusion to fully leverage the detailed features of small objects in feature fusion and further enhances the fused features through a feature aggregation network;
(4) We propose an AEHead that leverages sparse attention to efficiently extract the fine-grained global context while suppressing background noise, thereby strengthening the model’s perceptual capacity for small objects.

2. Related Works

2.1. The Attention/Selective Mechanism

The attention mechanism is a widely adopted technique in deep learning that enhances the model’s performance and efficiency by selectively focusing on salient features while suppressing irrelevant ones [21,22,23,24]. SENet [25] introduced a channel attention mechanism that performed feature recalibration by reweighting the channel-wise features based on the global average information. CBAM [26] combined channel and spatial attention, where the spatial attention module generated a spatial mask to emphasize the feature representations at crucial locations, thereby improving the network’s ability to model contextual information. RLRD-YOLO [27] refines the spatial attention mechanism further by narrowing its scope from the entire feature map to within the receptive field of the convolutional kernels. This allows the convolutional weights to be dynamically adjusted according to the local context, thereby improving the model’s capacity to extract fine-grained features of small objects. In addition to channel and spatial attention mechanisms, the kernel selection mechanism is another effective dynamic context modeling strategy. SKNet [28] employed multiple convolutional branches with different kernel sizes within the same layer and performed adaptive aggregation across the channel dimension, enabling the model to dynamically adjust its feature representations. Building on this idea, ResNeSt [29] enhanced the feature extraction capabilities by splitting the input feature map into several groups, with each undergoing independent convolution operations, thereby capturing richer and more diverse features. Similarly to SKNet, SCNet [30] utilized multi-scale convolutional branches to extract more comprehensive contextual information and incorporated spatial attention to highlight key spatial locations, thereby improving the model’s localization performance.
The transformer model [31,32,33,34,35] has demonstrated significant potential in object detection due to its remarkable performance and broad applicability. The core strength of transformer models lies in the self-attention mechanism, which effectively captures global contextual dependencies and overcomes the limitations imposed by the fixed receptive fields of CNNs. Swin Transformer [36] introduced a shifted-window-based self-attention mechanism which improved the computational efficiency by confining the self-attention calculations to local windows and enabling information interactions across them. However, this approach leads to an insufficient capability to model global information. DETR [37] reframes the object detection task as a direct set prediction problem and constructs an end-to-end, transformer-based object detection architecture, but its large number of learnable parameters results in an unstable training process. RT-DETR [38] significantly improves the detection speed and accuracy by designing a hybrid encoder and implementing IoU-aware query selection. Nevertheless, the computational complexity of the self-attention mechanism scales quadratically with the input image size. Applying the aforementioned transformer-based detectors directly to high-resolution UAV images would result in an enormous computational and memory overhead. To address this, we have introduced a sparse attention mechanism into our detection head. It selectively computes the attention only among a few of the most relevant regions, rather than performing dense computations across the entire image. This design allows for effective modeling of the key global long-range dependencies necessary to distinguish small objects from complex backgrounds while also greatly reducing the model’s computational complexity.

2.2. Multi-Scale Feature Fusion

Deep-learning-based object detection methods rely on backbone networks to extract high-level features rich in semantic information. However, in aerial imagery, the features of small objects may occupy only a single pixel in the output feature map after successive downsampling by the network. This results in insufficient feature information and compromises the detection performance. Therefore, the introduction of a multi-scale feature fusion mechanism is beneficial for more effective representations of small objects. The Feature Pyramid Network (FPN) [39] is a milestone work that constructs a feature pyramid structure based on multi-scale features. Through a top-down pathway, it combines high-level semantic features with low-level detailed features, thereby effectively fusing features from different scales. Building on the FPN framework, to enhance the feature interactions further, PANet [40] added an additional bottom-up path augmentation network to the FPN, establishing a bidirectional feature information flow. This allowed fine-grained localization information from lower layers to be propagated to the higher-level features better. BiFPN [41] further introduced learnable weights to adjust the importance of different input features and optimized the feature pyramid architecture with efficient connection strategies, including cross-layer skip connections and redundant node removal, achieving a higher fusion efficiency while effectively improving the accuracy. NAS-FPN [42] utilizes Neural Architecture Search (NAS) technology to automatically learn and construct an irregular yet efficient feature pyramid structure within a large search space, thus eliminating the reliance on traditional hand-designed fixed fusion paths.
In addition to optimizing the feature propagation path, several methods focus on enhancing feature discriminability. In the feature fusion process of a conventional FPN, operations like upsampling and channel reduction may introduce noise or lead to a decrease in representation robustness. To address this issue, DN-FPN [43] employs contrastive learning to supervise the feature fusion process, effectively suppressing noise within the FPN and ensuring that the fused features maximally retain semantic information from high-level features. However, this may cause the spatial details of low-level features to be dominated and weakened by the semantic information on the high-level features, thereby obscuring the detailed features of small objects. To enhance the feature representation of small objects, EA-PANet [44] and ACFPN [45] introduce context extraction modules at different levels of the FPN to extract local contextual information and further combine it with attention enhancement modules to focus the model on foreground objects and extract discriminative multi-scale features. YOLO-SOD [46] concatenates and fuses multi-scale features and then applies multi-scale convolutional layers combined with channel and spatial attention to enhance the expressive power of the fused features. However, these methods adopt indiscriminate fusion strategies when merging features from different paths, ignoring the varying contributions of feature maps from different levels to the final detection result. Furthermore, they often have complex network structures. Our MJFPN first introduces multi-level connections to promote dense information exchange between different levels of latent semantics. Secondly, it re-weights each feature map along the channel dimension through a weighted fusion approach. Finally, the fused features are fed into a feature aggregation network, which enhances the expressive power of the multi-scale fused features further without increasing the inference cost.

3. The Method

The framework of the proposed MCSA-Net is illustrated in Figure 2. First, the SAFM is incorporated into the backbone. This module enables the model to focus on the most relevant spatial contextual regions of the target, which facilitates fine-grained local context information extraction and thereby effectively enriches the feature representation of small objects. Subsequently, feature maps from different stages of the backbone are fed into the proposed MJFPN for efficient multi-scale feature fusion. During the fusion process, the spatial detailed information of small objects is fully utilized, and the overall expressive capability of the fused features is enhanced. Finally, the sparse attention mechanism performs attention calculation on the most relevant regions within the fused feature map, effectively modeling the global contextual relationships and suppressing interference from irrelevant background information. Consequently, the model’s perceptual capability for small objects is enhanced, ultimately yielding precise detection results.

3.1. The Spatial-Aware Feature Selection Module (SAFM)

Due to the limited visual characteristics of these small objects, the backbone network requires more detailed feature extraction. When the intrinsic features of the target are insufficient to extract comprehensive information, leveraging the contextual relationships between the target and its surrounding environment can effectively enhance the performance of small object detection. Therefore, we propose a Spatial-Aware Feature Selection Module (SAFM) capable of extracting detailed local contextual information, thereby enriching the feature representation of small objects. The architecture of the proposed SAFM is illustrated in Figure 3. Specifically, a sequence of dilated depthwise convolutions is first employed to capture multi-scale local contextual features. Furthermore, these features are adaptively weighted and merged along the spatial dimension using a dynamic spatial selection mechanism to achieve precise spatial feature extraction, thereby enhancing the model’s ability to focus on the most relevant regions of the spatial context of the target.
The dilated convolution sequence: To obtain rich local contextual information, a sequence of depthwise convolutions with increasing dilation rates is employed to extract features from the input X. The dilation rate is set to increase as a power of 2, which allows the receptive field to expand sufficiently fast and ensures that no gaps are introduced between feature maps. The receptive field $f_n$ after the n-th dilated convolution in the sequence can be summarized by the following formula:
$f_n = (2^{n+1} - 1) \times (2^{n+1} - 1)$
By cascading these dilated convolutions, the same effective receptive field as that in standard large-kernel convolution can be achieved. This enables extraction of an object’s features under large receptive fields of different scales, thereby capturing the local contextual features $U_n$ from various ranges. The process of feature extraction using the dilated convolution sequence is as follows:
$U_0 = X, \quad U_{n+1}(i, j) = \sum_{u=1}^{K} \sum_{v=1}^{K} U_n(i + d_n \cdot u,\; j + d_n \cdot v) \cdot W_n^K(u, v)$
where $W_n^K$ is the weight matrix of the n-th convolutional kernel with a size of $K \times K$, and $d_n$ is its corresponding dilation rate. Assuming there are N cascaded convolutional kernels, each extracted spatial feature is processed further through a point-wise convolution layer $\mathrm{PWConv}_n(\cdot)$ for interaction and fusion across the channel dimension, forming a more discriminative representation.
$\tilde{U}_n = \mathrm{PWConv}_n(U_n), \quad n \in [1, N]$
The proposed design of the dilated convolution sequence has two main advantages: First, it can explicitly construct multi-scale receptive fields to encode features with diverse local contexts, which benefits subsequent fine-grained spatial feature extraction. Second, applying a sequence of convolutions with increasing dilation rates is more efficient than directly applying a single large-scale standard convolution operation. With a theoretical receptive field of the same size, this design not only significantly reduces the number of parameters but also enables the extraction of richer features.
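For concreteness, the cascade described above can be sketched in PyTorch as follows, assuming 3 × 3 depthwise kernels with dilation rates 1, 2, and 4 (the configuration examined in Section 4.4.2); the class and variable names are illustrative and do not correspond to the authors’ released code.

```python
import torch
import torch.nn as nn

class DilatedDWSequence(nn.Module):
    """Sketch of the SAFM dilated depthwise convolution sequence: each stage
    is a 3x3 depthwise convolution whose dilation rate doubles, followed by a
    pointwise (1x1) convolution for channel interaction."""
    def __init__(self, channels: int, num_stages: int = 3):
        super().__init__()
        self.dw_convs = nn.ModuleList()
        self.pw_convs = nn.ModuleList()
        for n in range(num_stages):
            dilation = 2 ** n  # 1, 2, 4, ...
            self.dw_convs.append(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=dilation, dilation=dilation,
                          groups=channels, bias=False))  # depthwise
            self.pw_convs.append(nn.Conv2d(channels, channels, kernel_size=1))

    def forward(self, x: torch.Tensor):
        features = []
        u = x
        for dw, pw in zip(self.dw_convs, self.pw_convs):
            u = dw(u)               # cascaded receptive field: 3, 7, 15 (= 2^{n+1} - 1)
            features.append(pw(u))  # multi-scale local context feature after channel mixing
        return features


# e.g., three 64-channel context maps from an 80x80 feature map
feats = DilatedDWSequence(64)(torch.randn(1, 64, 80, 80))
```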
The dynamic spatial selection mechanism: This mechanism is used to adaptively weight and merge the feature maps extracted by convolution kernels with different receptive fields to achieve precise spatial feature extraction, thereby enhancing the model’s ability to focus on the most relevant regions of the spatial context for different types of objects. Specifically, the features obtained from the dilated convolution sequence are first concatenated:
$\hat{U} = \mathrm{Concat}(\tilde{U}_1; \tilde{U}_2; \ldots; \tilde{U}_N)$
Then, channel-based max pooling and average pooling operations are applied to the concatenated feature $\hat{U}$. At each spatial location, channel-based max pooling extracts the most prominent feature, while average pooling captures the overall contextual importance. By combining these two pooling operations, spatial relationships can be efficiently extracted. To enable information interaction between the two spatial descriptors and allow the network to effectively utilize both types of spatial information simultaneously, the two pooled features are first concatenated. Then, a pointwise convolution layer is applied to transform the two-channel pooled features into N spatial attention maps for further feature selection.
$Z = \mathrm{PWConv}_{2 \to N}([\mathrm{MaxPool}(\hat{U}); \mathrm{AvgPool}(\hat{U})])$
where $\mathrm{MaxPool}(\cdot)$ and $\mathrm{AvgPool}(\cdot)$ represent the max pooling and average pooling, respectively. The sigmoid activation function is applied to each spatial feature map $Z_i$ to obtain the spatial selection masks $\hat{Z}_i$ corresponding to the convolution kernels with different receptive fields. These masks are then used to weight the features extracted from the dilated convolution sequence, and the weighted features are subsequently fused to obtain the final output feature Y.
$\hat{Z}_i = \mathrm{Sigmoid}(Z_i)$
$Y = \hat{Z}_1 \cdot \tilde{U}_1 + \hat{Z}_2 \cdot \tilde{U}_2 + \cdots + \hat{Z}_N \cdot \tilde{U}_N$
Through the above operations, the model can dynamically select the optimal receptive field for each object, thereby capturing local contextual information more precisely. This effectively enriches the feature representation of small objects and significantly improves the detection accuracy.
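The dynamic spatial selection step can likewise be sketched as below, assuming the branch features produced by the dilated convolution sequence are given as a list; the single 1 × 1 convolution mapping the two pooled descriptors to N masks follows the description above, while the remaining details are assumptions.

```python
import torch
import torch.nn as nn

class SpatialSelection(nn.Module):
    """Sketch of the SAFM dynamic spatial selection: channel-wise max/avg
    pooling of the concatenated branches, a pointwise convolution producing
    one sigmoid mask per branch, and a mask-weighted sum of the branches."""
    def __init__(self, num_branches: int):
        super().__init__()
        self.pw = nn.Conv2d(2, num_branches, kernel_size=1)  # 2 descriptors -> N masks

    def forward(self, branch_feats):                  # list of N tensors, B x C x H x W
        u_hat = torch.cat(branch_feats, dim=1)        # concatenate along channels
        max_pool, _ = u_hat.max(dim=1, keepdim=True)  # channel-based max pooling
        avg_pool = u_hat.mean(dim=1, keepdim=True)    # channel-based average pooling
        z = self.pw(torch.cat([max_pool, avg_pool], dim=1))
        masks = torch.sigmoid(z)                      # B x N x H x W selection masks
        return sum(masks[:, i:i + 1] * f for i, f in enumerate(branch_feats))
```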

3.2. The Multi-Level Joint Feature Pyramid Network (MJFPN)

High-level features encode rich abstract semantics, while low-level features retain abundant spatial details. Effectively aggregating features from multi-scale feature maps can construct more robust semantic representations for small objects. The Feature Pyramid Network (FPN) [39] builds a top-down pathway to propagate high-level semantic information to lower-level features. On this basis, PANet [40] further introduces a bottom-up pathway to enhance the flow of low-level localization information toward higher levels, thereby establishing an efficient bidirectional fusion framework. However, although the top-down information flow strengthens the semantics of the lower-level features, it may weaken spatial details due to dominance of the high-level semantic features, resulting in the loss of fine details of small objects. Moreover, the existing methods typically apply indiscriminate fusion strategies when combining features from different paths, neglecting the varying contributions of feature maps at different levels to the effective detection of small objects. Therefore, we propose a Multi-Level Joint Feature Pyramid Network (MJFPN), as illustrated in Figure 4, aiming to achieve better fusion of the features of small objects and enhance the representation capability for multi-scale fused features.
Firstly, unlike existing bidirectional feature pyramid structures, we introduce multi-level connections that fully utilize the detailed information on small objects in the low-level features and simultaneously promote dense information exchange across different hierarchical semantic levels. Secondly, a weighted fusion approach is employed to adaptively adjust the fusion weights of features from different scales based on their importance during fusion. Finally, based on the idea of ELAN and multiple reparameterized convolutional blocks, a feature aggregation network is constructed to enhance the fused multi-scale features further.
The input to the MJFPN consists of feature maps extracted from different stages of the backbone network. These include low-level features from earlier stages, denoted as $X_1$ (160 × 160) and $X_2$ (80 × 80); a mid-level feature, $X_3$ (40 × 40); and a high-level feature from a deeper stage, $X_4$ (20 × 20). The top-down strategy of the MJFPN is as follows: First, $X_3$ is downsampled to match the size of $X_4$, and then they are fused to obtain $X_4'$. Next, $X_4'$ is upsampled while $X_2$ is downsampled; $X_4'$, $X_2$, and $X_3$ are then fused to produce $X_3'$. Differing from the standard top-down fusion strategy, due to the introduction of multi-level connections, the fused feature $X_3'$ incorporates not only features from the same-level shallow feature $X_3$ and the preceding-layer feature $X_4'$ but also integrates the high-resolution low-level feature $X_2$. A similar procedure is applied to obtain the fused feature $X_2'$, where the feature $X_1$ is additionally incorporated. This design facilitates dense information exchange among multi-level semantic features, enabling the network to fully exploit the details of small objects present in low-level high-resolution features. The above process can be expressed as follows:
$X_4' = \mathrm{Fusion}\left(\delta\left(BN\left(\mathrm{Conv}(X_3)\right)\right), X_4\right)$
$X_3' = \mathrm{Fusion}\left(f_{up2}(X_4'), X_3, \delta\left(BN\left(\mathrm{Conv}(X_2)\right)\right)\right)$
$X_2' = \mathrm{Fusion}\left(f_{up2}(X_3'), X_2, \delta\left(BN\left(\mathrm{Conv}(X_1)\right)\right)\right)$
where $f_{up2}$ denotes the upsampling operation, $\delta$ represents the SiLU activation function, and $BN$ denotes batch normalization. The bottom-up process in the MJFPN is similar to the top-down pathway, with the main difference being that all downsampling operations are performed using convolutions with a stride of 2. For example, $X_3$ and $X_3'$ are downsampled and then fused with $X_4'$ to produce $X_4''$.
$X_4'' = \mathrm{Fusion}\left(\delta\left(BN\left(\mathrm{Conv}(X_3', \mathrm{stride}=2)\right)\right), X_4', \delta\left(BN\left(\mathrm{Conv}(X_3, \mathrm{stride}=2)\right)\right)\right)$
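A simplified sketch of one top-down fusion step (the computation of $X_3'$ or $X_2'$ above) is given below; for brevity the Fusion(·) operator is reduced to a 1 × 1 convolution, whereas the network uses the channel-weighted fusion described next, and the layer choices here are assumptions rather than the authors’ exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownStep(nn.Module):
    """Sketch of one MJFPN top-down step: upsampled deeper fused feature +
    same-level feature + stride-2 projection of the shallower feature."""
    def __init__(self, c_deep, c_same, c_shallow, c_out):
        super().__init__()
        # delta(BN(Conv(.))) with stride 2 brings the shallow, high-resolution
        # map down to the current level's spatial size
        self.down = nn.Sequential(
            nn.Conv2d(c_shallow, c_out, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())
        self.fuse = nn.Conv2d(c_deep + c_same + c_out, c_out, 1)  # stand-in for Fusion(.)

    def forward(self, x_deep, x_same, x_shallow):
        up = F.interpolate(x_deep, scale_factor=2, mode="nearest")  # f_up2
        return self.fuse(torch.cat([up, x_same, self.down(x_shallow)], dim=1))
```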
The weighted fusion approach: Conventional weighted fusion strategies operate at the feature map level, assigning a uniform weight to all channels within a given feature map, which overlooks the varying importance of individual channels. To enhance the representation of small targets in multi-scale features and to fully leverage the distinct information encoded in different channels, we introduce a weighted fusion approach that weights the feature maps along the channel dimension to signify the importance of each channel. Specifically, the feature maps are first concatenated. Then, they are multiplied by a set of learnable, normalized weights, where the number of weights is equal to the total number of concatenated channels, yielding the weighted fused feature Y.
$Y = \sum_{j} \frac{w_j}{\sum_{m} w_m + \varepsilon} \cdot x_j$
where $w_j$ represents the j-th learnable weight, m is the total number of channels after concatenation, and $\varepsilon$ is set to a small value of 0.0001 to avoid numerical instability. Through this adaptive weighted fusion operation, the network obtains more discriminative fused features.
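The channel-wise weighted fusion itself admits a short sketch; constraining the learnable weights to be non-negative before normalization is our assumption (in the spirit of BiFPN’s fast normalized fusion) and is not stated explicitly above.

```python
import torch
import torch.nn as nn

class ChannelWeightedFusion(nn.Module):
    """Sketch of the MJFPN weighted fusion: one learnable, normalized weight
    per concatenated channel rather than one weight per feature map."""
    def __init__(self, total_channels: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(total_channels))
        self.eps = eps

    def forward(self, feats):                 # list of B x C_i x H x W tensors
        x = torch.cat(feats, dim=1)           # B x (sum C_i) x H x W
        w = torch.relu(self.weights)          # keep the weights non-negative (assumption)
        w = w / (w.sum() + self.eps)          # normalize as in the equation above
        return x * w.view(1, -1, 1, 1)        # per-channel re-weighting
```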
The feature aggregation network: After being processed by the weighted fusion approach, the fused features are sent into two separate branches. One branch functions as a shortcut connection, passing the original information directly to the final concatenation layer. The other branch is processed through N stacked reparameterized convolution (RepConv) blocks. Following the mechanism of ELAN [47], the output of each RepConv block is preserved and ultimately concatenated to form the final output feature.
The structure of the RepConv block first implements a k × k RepConv based on the concept of reparameterization and then uses a 3 × 3 convolution to refine the features. It is noteworthy that RepConv operates differently during the training and inference phases. During training, the network employs a multi-branch parallel structure, where each branch consists of a convolutional layer of a specific size, followed by a batch normalization (BN) layer. During inference, the parameters of the convolutional layer within each branch are fused with its corresponding BN layer. Subsequently, the fused weights and biases from all branches are accumulated to form a single convolutional layer, thereby ensuring that the inference speed is not compromised. As a specific example, the right-hand side of Figure 4 shows the reparameterization steps for a 3 × 3 RepConv operation. Let $K_n$ and $B_n$ denote the weights and biases of the $n \times n$ convolutional kernel, respectively. The output $Y'$ of the reparameterized block for an input Y can be expressed as
$Y' = Y \circledast \left(K_{2n-1} + \sum_{i=1}^{m} K_{2n-(2i+1)}\right) + B_{2n-1} + \sum_{i=1}^{m} B_{2n-(2i+1)}$
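The training-to-inference conversion amounts to folding each branch’s BN layer into its convolution and summing the resulting kernels after zero-padding them to the largest kernel size. The sketch below illustrates that folding step under these assumptions; it is not the authors’ implementation.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold a BatchNorm layer into the preceding convolution."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                              # per-output-channel scale
    fused_w = conv.weight * scale.view(-1, 1, 1, 1)
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused_b = (bias - bn.running_mean) * scale + bn.bias
    return fused_w, fused_b

def merge_branches(branches, target_k: int):
    """Sum the per-branch fused kernels/biases into one target_k x target_k conv;
    smaller kernels are zero-padded to the target size before summation."""
    total_w, total_b = None, None
    for conv, bn in branches:                            # list of (Conv2d, BatchNorm2d)
        w, b = fuse_conv_bn(conv, bn)
        pad = (target_k - conv.kernel_size[0]) // 2
        w = nn.functional.pad(w, [pad] * 4)              # pad to target_k x target_k
        total_w = w if total_w is None else total_w + w
        total_b = b if total_b is None else total_b + b
    return total_w, total_b
```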

3.3. The Attention-Enhanced Head (AEHead)

The global context can represent the relationships between pixels across spatial dimensions, providing the model with a holistic understanding of image semantics. Therefore, effectively extracting the global context is crucial for accurately distinguishing small objects from complex backgrounds. The self-attention mechanism models long-range dependencies by computing the correlations between different spatial positions on the feature map, thereby capturing the global context. However, due to the high resolution and abundant background in aerial images, directly applying self-attention results in high computational complexity and redundancy.
After the SAFM and the MJFPN, the feature maps have already encoded key local contextual information and provide good representations of the features of the small objects. In order to model the global contextual relationships, we introduce a sparse attention mechanism in the detection head, forming the Attention-Enhanced Head (AEHead). The architecture of the AEHead is illustrated in Figure 5a. Specifically, a sparse attention mechanism is incorporated into the early stages of the detection head, with the number of attention heads set to 8 by default. The output features from each level of the feature pyramid are first processed by the sparse attention mechanism to model the global contextual information. They are then refined further through convolution layers to ultimately yield the final predictions for classification and regression. As illustrated in Figure 5b, the sparse attention mechanism first selects a few of the most relevant regions for each query at a coarse region level. Subsequently, attention computation is performed exclusively on the set of tokens contained within these selected regions, thereby enabling efficient extraction of the key contextual information from the global scope.
Given a fused feature map $X \in \mathbb{R}^{H \times W \times C}$ processed by the MJFPN, we first divide it into $N \times N$ non-overlapping regions, where each region contains $\frac{HW}{N^2}$ feature vectors. The corresponding query (Q), key (K), and value (V) vectors are then obtained via linear projections:
$Q = X^{rs} \cdot W^q, \quad K = X^{rs} \cdot W^k, \quad V = X^{rs} \cdot W^v$
where $X^{rs}$ denotes the feature map reshaped according to the region partition, and $W^q$, $W^k$, and $W^v$ are the respective learnable weight matrices for the projections. For the $N \times N$ partitioned regions, we construct an affinity graph to represent the relationships among different regions, indicating which regions have a stronger association with each other. Specifically, we first obtain the region-level queries and keys, $Q^r, K^r \in \mathbb{R}^{N^2 \times C}$, by averaging the Q and K vectors within each region. Then, the region-level query matrix $Q^r$ is multiplied by the transpose of the region-level key matrix $K^r$ to obtain an adjacency matrix $A^M$, which represents the affinity graph between regions.
$A^M = Q^r \cdot (K^r)^T$
In the adjacency matrix $A^M$, each row represents the current region, and each column represents the other regions. The elements in the matrix indicate the semantic relevance between two regions. To focus on the most relevant global contextual information for the target and suppress interference from the irrelevant background, we identify the top-k most related regions for each region based on the elements in the adjacency matrix and prune the remaining regions. The indices of these retained k most relevant regions are stored in a region-to-region routing index matrix $R^{Index}$.
$R^{Index} = \begin{bmatrix} \max_k(A^M[1]) \\ \max_k(A^M[2]) \\ \vdots \\ \max_k(A^M[N^2]) \end{bmatrix}$
where $A^M[i]$ represents the i-th row in the adjacency matrix $A^M$, and the i-th row in $R^{Index}$ contains the indices of the top k most relevant regions to the i-th region. After obtaining the region-to-region routing index information, attention will be applied among the associated regions. Since the computational efficiency of modern GPUs is highly dependent on contiguous memory access, directly computing the query against these scattered key–value pairs becomes inefficient when the indexed regions are dispersed throughout the feature map. We gather the relevant keys and values and then apply the attention mechanism to these gathered keys and values.
$K^g = \mathrm{Gather}(K, R^{Index}), \quad V^g = \mathrm{Gather}(V, R^{Index})$
$\mathrm{Output} = \mathrm{softmax}\!\left(\frac{Q (K^g)^T}{\sqrt{C}}\right) V^g + \mathrm{LE}(V)$
where $K^g$ and $V^g$ represent the gathered keys and values, and $\mathrm{LE}(\cdot)$ is a local enhancement component, as in [48], which is parameterized using depthwise convolution. Through this operation, we sparsely model the long-range dependencies between different regions, thus enhancing the feature representation of small targets while suppressing interference from irrelevant background information.
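The routing-and-attention procedure can be summarized in a single-head sketch, written for clarity rather than speed; the multi-head handling and the depthwise local-enhancement branch LE(·) are omitted, square inputs divisible by the region count are assumed, and all names are ours.

```python
import torch
import torch.nn.functional as F

def sparse_region_attention(x, wq, wk, wv, num_regions=7, topk=4):
    """Sketch of region-routed sparse attention: partition into regions,
    build a region-level affinity matrix, keep the top-k related regions per
    query region, gather their keys/values, and attend at the token level.

    x: B x H x W x C feature map; wq/wk/wv: C x C projection matrices.
    """
    B, H, W, C = x.shape
    S = num_regions
    rh, rw = H // S, W // S
    xr = x.view(B, S, rh, S, rw, C).permute(0, 1, 3, 2, 4, 5)
    xr = xr.reshape(B, S * S, rh * rw, C)                # B x S^2 x tokens x C

    q, k, v = xr @ wq, xr @ wk, xr @ wv                  # token-level projections
    qr, kr = q.mean(dim=2), k.mean(dim=2)                # region-level queries/keys
    affinity = qr @ kr.transpose(-1, -2)                 # B x S^2 x S^2 adjacency
    idx = affinity.topk(topk, dim=-1).indices            # routing index matrix

    # gather the keys/values of the routed regions for every query region
    idx_exp = idx[..., None, None].expand(-1, -1, -1, rh * rw, C)
    kg = torch.gather(k.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx_exp)
    vg = torch.gather(v.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx_exp)
    kg = kg.reshape(B, S * S, topk * rh * rw, C)
    vg = vg.reshape(B, S * S, topk * rh * rw, C)

    attn = F.softmax(q @ kg.transpose(-1, -2) / C ** 0.5, dim=-1)
    out = attn @ vg                                      # B x S^2 x tokens x C
    return out.view(B, S, S, rh, rw, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```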

4. Experiments

4.1. The Datasets

The VisDrone dataset [49] is a large-scale benchmark specifically designed for object detection in aerial images. The dataset comprises images collected from various UAV platforms under diverse weather conditions, lighting environments, and flight altitudes. It extensively covers urban and suburban scenes across 14 different cities, thereby reflecting the complexity of and variability in real-world scenarios. The dataset contains 10,209 high-resolution images, including 6471 training images, 548 validation images, and 1610 test images. The image resolutions range from 960 × 540 to 1920 × 1080 pixels. There are 10 categories of labeled objects, including people, pedestrians, bicycles, cars, vans, buses, trucks, tricycles, awning-tricycles, and motors. We use the training set for training and the validation set for testing.
The UAVDT dataset [50] is a comprehensive benchmark dataset designed for object detection in aerial imagery. The images were captured by various aerial platforms under diverse conditions, including different flight altitudes, weather scenarios, and lighting environments. The dataset covers a wide range of common urban scenes, such as main roads, highways, and crossings. UAVDT consists of 38,327 frames extracted from 100 video sequences, with an average image resolution of 1080 × 540 pixels. It contains three annotated object categories: bus, car, and truck. The dataset is split into a training set of 23,258 images and a testing set of 15,069 images.
The COCO dataset [51] is a large-scale benchmark widely used for general-purpose object detection. It consists of images collected from everyday scenes, covering diverse lighting conditions, viewpoints, and backgrounds. The dataset contains approximately 200,000 annotated images, including 118,287 images in the training set, 5000 images in the validation set, and 40,670 images in the test set. The annotations cover 80 object categories, as well as 91 stuff categories.
The DOTA dataset [52] is a large-scale benchmark specifically designed for object detection in remote sensing imagery. It contains high-resolution images captured from platforms such as UAVs and satellites, comprising approximately 2806 large images with resolutions ranging from 800 × 800 to 4000 × 4000 pixels. The dataset is annotated with 15 object categories, including plane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer ball field, and swimming pool.

4.2. The Evaluation Metrics

In aerial image object detection tasks, accurately evaluating the performance of the detection models is of paramount importance. The mean average precision (mAP) has become a prevalent evaluation metric in this field, serving as a comprehensive measure of both the precision and recall of detection algorithms. Specifically, precision (P) and recall (R) are defined as follows:
$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$
where TP, FP, and FN represent true positives, false positives, and false negatives, respectively. An efficient detector is expected to achieve a high performance in both precision and recall; thus, both metrics warrant close consideration. By constructing the precision–recall curve and computing the area under this curve, the average precision (AP) is obtained:
$AP = \int_{0}^{1} P(R)\, dR$
where P(R) denotes the precision value at a given recall R. The AP ranges from 0 to 1, with higher values indicating a better detection performance. In scenarios involving N object categories, the mean average precision (mAP) is calculated by averaging the AP values across all categories:
$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i$
Additionally, evaluation metrics such as mAP50 and mAP75 are employed, where mAP50 and mAP75 represent the mean average precision computed at Intersection over Union (IoU) thresholds of 0.5 and 0.75, respectively. To evaluate the detector performance across different object scales, we utilize the COCO evaluation metrics [51], which include mAPsmall, mAPmedium, and mAPlarge. mAPsmall denotes the mAP for small objects with sizes less than 32 × 32 pixels, mAPmedium represents the mAP for medium objects with sizes between 32 × 32 and 96 × 96 pixels, and mAPlarge corresponds to the mAP for large objects with sizes greater than 96 × 96 pixels.
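For reference, a simplified AP computation from a discrete precision-recall curve is sketched below; the official COCO evaluation additionally averages over 101 recall points and a range of IoU thresholds, which this toy version omits.

```python
import numpy as np

def average_precision(recall, precision):
    """All-point-interpolated AP: area under the precision-recall curve,
    i.e., a discrete version of the integral definition above."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):          # make precision non-increasing
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]           # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# toy example for one class; mAP is the mean of the per-class AP values
ap = average_precision(np.array([0.2, 0.4, 0.4, 0.8]),
                       np.array([1.0, 0.8, 0.6, 0.5]))
print(round(ap, 3))  # 0.56
```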

4.3. The Implementation Details

We trained the model for 100 epochs on both the VisDrone and UAVDT datasets. The AdamW optimizer was used for model optimization, with the initial learning rate set to 0.01, a momentum of 0.937, and a weight decay of 0.0005. We implemented a warm-up phase lasting 3 epochs, during which the momentum was set to 0.8. After the warm-up, the learning rate linearly decayed from 0.01 to a final value of 0.0001 by the end of training. To ensure the deterministic and reproducible nature of our results, we set a fixed random seed (seed = 0). During the preprocessing stage, the pixel values of all input images were linearly normalized to the range of [0.0, 1.0]. YOLOv8s was employed as the baseline with CSPDarknet as the backbone, and multiple configuration experiments were conducted on this basis. All training was performed on 4 NVIDIA RTX 4090 GPUs, each with 24 GB of memory, using a total batch size of 16. During inference, a single GPU was used by default. For data augmentation, only simple mosaic and random horizontal flipping were applied to increase the diversity of the training data and reduce overfitting, with probabilities of 1.0 and 0.5, respectively.
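The learning-rate schedule described above can be sketched as a simple function of the epoch index; the exact shape of the warm-up ramp is our assumption, since only its duration and momentum setting are specified.

```python
def learning_rate(epoch, total_epochs=100, warmup_epochs=3, lr0=0.01, lrf=0.0001):
    """Warm-up for the first epochs, then linear decay from lr0 to lrf."""
    if epoch < warmup_epochs:
        return lr0 * (epoch + 1) / warmup_epochs             # assumed linear warm-up
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    return lr0 + (lrf - lr0) * t                             # reaches lrf at the last epoch
```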

4.4. The Ablation Study

4.4.1. Evaluation of Different Components

To validate the effectiveness of each component proposed in MCSA-Net, we conducted ablation experiments on the VisDrone dataset. The results are shown in Table 1, where “✓” and “✗” indicate the presence and absence of a specific module, respectively. Aerial images contain a large number of small objects, and the baseline exhibits a limited representation capability for these small targets, resulting in a mAP of only 29.4%. When the Spatial-Aware Feature Selection Module (SAFM) is added, which dynamically adjusts the spatial receptive field based on the object types during feature extraction, high-quality local contextual features are captured. As a result, the overall detection performance improves by 1.7%. Furthermore, the integration of the Multi-level Joint Feature Pyramid Network (MJFPN) results in an additional 0.8% improvement in the mAP. This suggests that the MJFPN effectively performs multi-scale feature fusion and enhancement, enabling the model to learn richer and more discriminative feature representations, thereby significantly improving the detection performance. Finally, by incorporating the Attention-Enhanced Head (AEHead), the model achieves a further gain of 0.9% in the mAP. This improvement is mainly attributed to the sparse attention mechanism employed in the AEHead, which effectively models the long-range dependencies at a global scale, enhancing the model’s perception of targets while suppressing irrelevant background interference and ultimately improving both classification and the regression accuracy. When only the AEHead is added, the mAP improves by 0.4%, indicating that by extracting key global contextual information in a fine-grained manner, the AEHead can enhance the model’s discriminative ability between small objects and the background, thereby effectively improving the detection accuracy. These experimental results also highlight the compatibility of the proposed components. When all three components are used together, the detector achieves its best performance with a 32.8% mAP. Furthermore, consistent improvements are observed in the evaluation metrics mAP50 and mAP75, further validating that the proposed method effectively enhances the aerial image object detection performance across various precision requirements.

4.4.2. Evaluation of the Spatial-Aware Feature Selection Module

We conducted ablation experiments to investigate the design of the dilated convolution sequence in the SAFM, as shown in Table 2. It can be observed that cascading only two dilated convolutions does not achieve the highest detection accuracy. This is likely due to the insufficient diversity of local context information, which makes it difficult for the subsequent dynamic spatial selection mechanism to effectively capture the critical local context features of small objects. When three dilated convolutions are cascaded, enlarging the convolutional kernel size significantly increases the receptive field; however, the detection accuracy continuously decreases. This phenomenon may be attributed to the excessively large receptive field, which introduces considerable irrelevant background information, thereby interfering with the feature representation and leading to a decline in the detection performance. When the sequence of depthwise dilated convolutions is configured as (3, 1) → (3, 2) → (3, 4) (kernel size, dilation rate), the module extracts multi-scale features with varying local receptive fields while avoiding the background interference caused by excessively large receptive fields, thus achieving the optimal detection accuracy.

4.4.3. Evaluation of the Multi-Level Joint Feature Pyramid Network

We conducted a series of ablation experiments on the VisDrone dataset to further validate the effectiveness of the proposed MJFPN. The results are shown in Table 3. When multi-level connections were introduced, this facilitated dense information exchange between different levels of semantic features and effectively leveraged the information on the spatial details in the high-resolution feature maps, resulting in a 0.3% improvement in the mAP. With the addition of the weighted fusion approach, the model’s detection performance improved by 0.5%. This indicates that assigning fusion weights to different features based on their importance contributes to the extraction of more expressive fused features. Building upon this, by incorporating the Feature Aggregation Network (FAN), the multi-scale fused features were enhanced further, which contributed to a 0.4% increase in the mAP. Finally, when the multi-level connections, the weighted fusion approach, and the FAN were all integrated, the detector achieved a notable 1.3% gain, demonstrating that the proposed MJFPN effectively strengthened the overall expressiveness of the multi-scale fused features and enabled the model to learn more expressive feature representations. These experimental results comprehensively verify that the MJFPN effectively improves the detection performance.

4.4.4. Evaluation of the Sparse Attention Mechanism

We conducted a series of ablation experiments, as shown in Table 4, to determine two hyperparameters in the sparse attention mechanism, namely the number of image divisions into regions ($N \times N$) and the number of the top k most relevant regions selected. First, by fixing the value of k, we observed that as N increases, the area of each region becomes smaller, leading to a reduction in the model’s computation time. When N increased from 5 to 7, the mAP improved by 0.4%, which may be attributed to the finer-grained attention calculation enabled by more region divisions within the range of the most relevant areas. However, when N increased from 7 to 9, the detection performance slightly decreased, which may be attributed to excessive partitioning causing the partial loss of feature information. Next, we fixed N and varied the value of k. With an increase in k, the attention was computed over a larger number of regions, resulting in an increased computation time. Moreover, an excessively large k led to performance degradation, presumably as a result of increased background noise. To achieve a balance between detection accuracy and computational efficiency, we selected N = 7 and k = 4.
We conducted ablation studies on the number of attention heads and the kernel size of the convolutional layers, with the results presented in Table 5. The experimental results show that the detection accuracy of the model improves as the number of attention heads increases. Specifically, when the number of attention heads increases from 2 to 8, the mAP improves by 0.6%, reaching a peak performance of 32.8%. This indicates that increasing the number of attention heads helps the model capture richer global contextual information, thereby enhancing the detection performance. However, when the number of attention heads is increased to 16, the model’s performance slightly declines, which may be attributed to feature redundancy and optimization difficulties caused by the excessive number of attention heads.

4.5. A Comparison of the Results for Different Datasets

4.5.1. The Results on VisDrone

We compared the proposed MCSA-Net with other state-of-the-art (SOTA) methods on the VisDrone dataset, and the experimental results are shown in Table 6. Compared to existing detection methods, MCSA-Net achieves the highest detection accuracy, with an optimal mAP of 32.8%, and significantly outperforms the other methods in the evaluation metrics mAP50 and mAP75, with values of 52.2% and 34.1%, respectively. This demonstrates that our method delivers a high-quality detection performance. Additionally, in terms of the detection performance for objects of different sizes, MCSA-Net achieves the best detection performance for both small and medium target detection, with mAP values of 21.4% and 43.7%, respectively. The classic aerial image detection method DMNet introduces an image cropping strategy guided by an object density map. By leveraging density information to adaptively adjust the cropping region, it enables the model to focus on potential target areas, thus improving efficient detection of small objects. However, DMNet fails to capture the fine-grained details of the targets during feature extraction, resulting in a 4.3% lower overall mAP and a 1.4% lower mAPsmall compared to those of our method. The current SOTA method, YOLC, employs deformable convolution in the regression branch. This allows the convolution operation to adaptively adjust the sampling positions based on the shape of the object, thereby capturing more details of small objects and improving the detection accuracy. However, it does not account for modeling long-range contextual information, which can make it challenging to distinguish small objects from complex backgrounds, resulting in missed or incorrect detections. Our MCSA-Net outperforms YOLC by 3.9% in the overall mAP and 1.3% in small object detection. The proposed MCSA-Net extracts fine-grained local and global contextual information through the Spatial-Aware Feature Selection Module and the sparse attention mechanism, enriching the feature representation of small objects and suppressing background interference. Additionally, the Multi-Level Joint Feature Pyramid Network efficiently aggregates and enhances the multi-level semantic features, enabling the model to learn discriminative features, ultimately achieving the optimal detection performance.
We visualize the detection results on the VisDrone dataset in Figure 6. As illustrated in the visualization, our method demonstrates a superior detection performance. Specifically, as shown in the first and second rows of Figure 6, our method is able to accurately localize and identify small objects in aerial images, including those partially occluded by trees or shadows. Moreover, as depicted in the third and fourth rows, even under challenging conditions such as low-light conditions at night or image degradation, our method is still able to successfully capture small objects (e.g., cars, pedestrians) with blurred texture details and low contrast against the background. This highlights the robustness and generalization capability of our approach.
Furthermore, we compared the detection performance results of the proposed MCSA-Net with the baseline, as shown in Figure 7. Compared to the baseline, MCSA-Net exhibits more accurate detection of small objects. Specifically, as illustrated in the first and second rows of Figure 7, the baseline fails to detect pedestrians when they are partially occluded by complex surroundings or shadows. In contrast, our method effectively captures critical features of these small targets, enabling more accurate localization and recognition. In the third row, the vehicle in the image shares a high degree of similarity with the background. MCSA-Net enhances the ability to distinguish between the target and the background by capturing global contextual features, effectively avoiding false detections. Furthermore, in the fourth and fifth rows, where vehicles and pedestrians are densely distributed, our method also demonstrates a superior detection performance.
Figure 8 presents confusion matrices for both the baseline and MCSA-Net on the ten categories of the VisDrone dataset. The comparative results indicate that our method surpasses the baseline in its recognition accuracy across all categories. Notably, it achieves significant improvements for the typical small object classes of person and bicycle, with accuracy gains of 3% and 7%, respectively. Furthermore, MCSA-Net effectively reduces the probability of misclassifying objects as the background, thereby achieving a more precise detection performance.
We compared MCSA-Net with fully transformer-based detectors in terms of both accuracy and efficiency, and the results are summarized in Table 7. Compared with lightweight DETR variants, our method requires 35.5 fewer GFLOPs while achieving a significantly higher detection accuracy, with improvements of 5.5% in the mAP and 9.7% in mAP50. When compared with UAV-DETR-R50, a method specifically designed for UAV object detection, our model demonstrates a clear advantage in computational complexity, reducing it by 84.5 GFLOPs and approximately halving the computational cost while also achieving a higher accuracy. Although UAV-DETR-R18 requires 8.5 fewer GFLOPs than MCSA-Net, it exhibits 3% and 3.4% lower mAP and mAP50 values, respectively. Among all of the models compared, our method achieves the highest detection accuracy, with the fewest parameters (only 14.1 M) and relatively low computational complexity. These results strongly validate the effectiveness and deployability of our approach.
To evaluate the deployability of the proposed MCSA-Net model on edge computing devices, we deployed the trained model on a typical UAV hardware platform, the Jetson Xavier NX, and the results are shown in Table 8. The Jetson Xavier NX is compact, featuring a six-core ARM CPU and a GPU with 384 CUDA cores, achieving a peak computational power of 21 TOPS under a full load. Our proposed MCSA-Net achieved the highest detection accuracy in both the mAP and mAP50. Compared to the baseline YOLOv8s, there was a significant improvement in the detection accuracy, although the inference speed decreased. This may be because despite employing a lightweight sparse self-attention mechanism, extracting global context features from high-resolution UAV images remains a computationally intensive task, which leads to an increase in the model’s computational complexity. Nevertheless, the frame rate of MCSA-Net remains sufficiently high, and the model’s parameter count has not significantly increased, staying at 14.1 M. As a result, the model still meets the real-time detection requirements for most application scenarios, such as routine drone inspections and target monitoring.

4.5.2. Results on UAVDT

On the UAVDT dataset, we compared MCSA-Net with existing mainstream detectors, and the experimental results are shown in Table 9. Conventional general object detectors, such as Faster R-CNN and CenterNet, struggle to effectively capture the key features of small objects. Consequently, their performance on aerial images containing a large number of small targets is suboptimal, achieving only an 11.0% and 13.2% mAP, respectively. Compared to the specialized aerial image object detector CDMNet, the proposed MCSA-Net achieves significant improvements of 1.5% and 6.5% in the mAP and mAP75, respectively. Furthermore, it is noteworthy that our method outperforms CDMNet by 5.9% mAPsmall in its small object detection performance. By integrating the proposed components, our MCSA-Net demonstrates an excellent detection performance in the experiments, achieving the highest mAP of 23.6%. Meanwhile, it attains 28.0% in the high-precision metric mAP75 and 20.2% in mAPsmall, significantly surpassing all of the other one-stage and two-stage detectors, thereby fully validating the effectiveness of our method.
In addition, we compared MCSA-Net with the classical aerial image object detector QueryDet. As illustrated in Figure 9, MCSA-Net demonstrates a superior detection performance for small objects in aerial images, accurately detecting objects of a minute size under both high-resolution and low-light nighttime conditions. However, QueryDet tends to exhibit missed detections when handling distant small vehicle targets in aerial images, as well as under low-light nighttime conditions, where the vehicle targets share similar color and texture features to those of the background. The experimental results demonstrate that our method enhances the feature representation of small objects by extracting crucial local contextual information and fusing the features of small objects better, thereby significantly improving the localization accuracy. Moreover, by incorporating a sparse attention mechanism to efficiently model long-range dependencies, the model captures fine-grained global semantic information and suppresses background interference, thereby strengthening the discrimination between foreground objects and the background and ultimately achieving precise classification and regression results.
To evaluate the model’s detection performance under low-quality data conditions, we conducted a quantitative comparison in nighttime and blurred scenarios, providing a detailed analysis of the results. As shown in Table 10, our method demonstrates a significant increase in the detection accuracy in both night and blurred scenes compared with the baseline, with improvements of 6.9% and 6.7% in the mAP, respectively. While both models experienced some performance degradation on low-quality data, our proposed method exhibited greater robustness. Specifically, our model’s performance decreased by 3.6% in the mAP in the night scenario, which was notably better than the baseline’s decrease of 4.2% in the mAP. Similarly, our method maintained a relatively superior detection performance in the blurred scenario.

4.5.3. Results on MS COCO

We compared the proposed MCSA-Net with other state-of-the-art detectors on the MS COCO dataset. As shown in Table 11, MCSA-Net achieved the highest accuracy, obtaining a 47.9% mAP and a 68.7% mAP50. In addition, on the small object subset of MS COCO, our method significantly outperformed the other methods, reaching a 34.5% mAPsmall while remaining highly competitive on the large object subset. Compared with the transformer-based detectors SMCA-DETR and Deformable DETR, MCSA-Net achieved substantial improvements, with mAP increases of 2.3% and 1.7% and mAPsmall gains of 8.6% and 5.7%, respectively. The specialized small object detector HRDNet maintains multiple parallel feature streams at different resolutions and densely exchanges and fuses information across branches, which strengthens the representation of small targets and yields a clear boost in their detection accuracy. However, HRDNet is limited in capturing global features. In contrast, our method models both local and global contextual information; this enhanced context awareness facilitates the accurate recognition of small objects that are occluded or appear in complex environments. Compared with HRDNet, MCSA-Net further improves the mAP and mAPsmall by 0.9% and 2.4%, respectively. Some visualized detection results are shown in Figure 10.

4.5.4. Results on DOTA

The DOTA dataset is a large-scale benchmark specifically designed for object detection in remote sensing imagery. It contains high-resolution images captured from platforms such as UAVs and satellites, comprising 2806 large images with resolutions ranging from 800 × 800 to 4000 × 4000 pixels. The images are annotated with 15 object categories: plane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer-ball field, and swimming pool.
We compared the proposed MCSA-Net with other methods on the large-scale remote sensing dataset DOTA. Table 12 presents the detection accuracy for each category and the mAP across all categories for the different methods. Despite the complex environments and diverse object scales in DOTA, the proposed MCSA-Net demonstrated a superior detection performance, achieving a 77.35% mAP and surpassing the existing mainstream methods. For the typical small object categories small vehicle (SV) and storage tank (ST), our method achieved the highest detection accuracies of 79.57% and 88.64%, respectively. It also attained the highest accuracy for basketball court (BC), plane (PL), and swimming pool (SP), with 88.19%, 90.55%, and 81.81%, respectively. Compared to the classic remote sensing object detection model SCRDet++, our method achieves a gain of 1.11% in the mAP and shows a clear advantage in small object detection, with accuracy improvements of 9.65% and 3.54% for the small vehicle (SV) and storage tank (ST) categories, respectively. These results fully validate the effectiveness and robustness of our method.

4.5.5. Results on Simulated Degradation Phenomena

During flight, the image quality of unmanned aerial vehicles (UAVs) is often degraded by environmental interference, sensor limitations, and unstable data transmission. To simulate these typical degradation phenomena, we introduced Gaussian noise, entire-row/column pixel loss, and occlusion patches into our dataset. Specifically, Gaussian noise simulates the continuous or local random offsets generated by CMOS/CCD sensors during image acquisition. The loss of entire rows/columns of pixels (with widths randomly drawn from 0–7 pixels) and occlusion patches simulate pixel corruption caused by intermittent frame drops or packet loss during frame-by-frame or macroblock-based data transmission.
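The sketch below shows how such degradations can be synthesized with NumPy. Only the 0–7 pixel stripe width is taken from the protocol above; the noise standard deviation, occlusion patch size, and single-stripe placement are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma=15.0):
    """Additive Gaussian noise approximating sensor read-out offsets (sigma assumed)."""
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def drop_rows_and_cols(img, max_width=7):
    """Zero out a horizontal and a vertical stripe to mimic lost transmission lines.

    The stripe width is drawn from 0-max_width pixels, matching the 0-7 pixel
    range described in the text.
    """
    out = img.copy()
    h, w = out.shape[:2]
    rw = rng.integers(0, max_width + 1)        # row-stripe width
    cw = rng.integers(0, max_width + 1)        # column-stripe width
    r0 = rng.integers(0, max(1, h - rw))
    c0 = rng.integers(0, max(1, w - cw))
    out[r0:r0 + rw, :] = 0
    out[:, c0:c0 + cw] = 0
    return out

def add_occlusion_patch(img, patch_size=(64, 64)):
    """Paste a black patch to mimic corrupted macroblocks (patch size is an assumption)."""
    out = img.copy()
    h, w = out.shape[:2]
    ph, pw = patch_size
    y = rng.integers(0, max(1, h - ph))
    x = rng.integers(0, max(1, w - pw))
    out[y:y + ph, x:x + pw] = 0
    return out
```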
The experimental results are presented in Table 13. Compared to YOLOv8s, our method maintains a clear accuracy advantage under both degradation types. After introducing Gaussian noise, the mAP of MCSA-Net decreased by 11.0%, whereas YOLOv8s dropped by 12.0%. Under the simulated pixel-loss condition, MCSA-Net and YOLOv8s declined by the same magnitude, yet our model's accuracy remained over 3% higher. These experiments validate that the proposed MCSA-Net offers superior robustness and accuracy in complex scenarios, meeting the demands of real-world applications. A visual comparison of the detection results under different complex conditions is shown in Figure 11.

5. Visual Analysis

We visualized the feature maps from the detection heads of MCSA-Net and the baseline, as shown in Figure 12. Compared to the baseline, MCSA-Net focuses effectively on foreground objects and significantly strengthens the response to small objects. In the heatmaps of our model, more small-object regions are prominently activated, indicating that MCSA-Net captures the key features of these targets and thereby achieves precise localization, detecting small objects that the baseline misses. Moreover, our model attends more to the contextual information surrounding small objects, exhibiting larger activation regions around them, especially when targets are densely distributed or occluded. These observations suggest that MCSA-Net effectively integrates local and global contextual information through the Spatial-Aware Feature Selection Module and the sparse attention mechanism, and that this enhanced context awareness enables the accurate detection of small objects. The results confirm that our approach significantly improves small object detection in aerial images and exhibits strong robustness.
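As a reference for reproducing this kind of visualization, the snippet below overlays a detection-head feature map on the input image as a heatmap; the feature map can be captured, for example, with a forward hook on the head module. Channel-wise mean pooling, min–max normalization, and JET coloring are common but assumed choices, not necessarily the exact procedure behind Figure 12.

```python
import cv2
import numpy as np
import torch

def overlay_feature_heatmap(feature_map, image_bgr, alpha=0.5):
    """Overlay a detection-head feature map on the input image as a heatmap.

    feature_map: torch tensor of shape (C, H', W') taken from a detection head.
    image_bgr:   original image as a uint8 BGR array of shape (H, W, 3).
    """
    heat = feature_map.detach().float().mean(dim=0)              # (H', W') response map
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    heat = (heat.cpu().numpy() * 255).astype(np.uint8)

    h, w = image_bgr.shape[:2]
    heat = cv2.resize(heat, (w, h))                              # match image size
    heat = cv2.applyColorMap(heat, cv2.COLORMAP_JET)             # pseudo-color
    return cv2.addWeighted(image_bgr, 1 - alpha, heat, alpha, 0)
```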

6. Conclusions

In this paper, we propose a Multi-Level Contextual and Semantic Information Aggregation Network (MCSA-Net) to address the challenges of insufficient feature representation and background confusion in small object detection in aerial images, thereby significantly improving the detection accuracy for small objects. MCSA-Net consists of three components: the Spatial-Aware Feature Selection Module (SAFM), the Multi-Level Joint Feature Pyramid Network (MJFPN), and the Attention-Enhanced Head (AEHead). The SAFM extracts multi-scale local context features and adaptively merges them, thereby enriching the feature representation of small objects. The MJFPN integrates multi-level connections, weighted fusion, and a feature aggregation network to fuse small-object features better and to learn more expressive fused features. Finally, the AEHead leverages a sparse attention mechanism to efficiently model global contextual relationships, enhancing the discrimination between small objects and the background. Ablation studies and visualization analyses validate the effectiveness of each proposed component, and extensive experiments demonstrate that MCSA-Net achieves an outstanding detection performance on the aerial image object detection benchmarks VisDrone and UAVDT, as well as on MS COCO and DOTA.
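To illustrate the flavor of the SAFM, the sketch below cascades dilated convolutions and fuses the resulting multi-scale context maps with a per-pixel selection. The (3, 1) → (3, 2) → (3, 4) kernel/dilation sequence follows the best setting in Table 2, while the softmax-based spatial selection, channel sizes, and residual connection are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class DilatedContextSelection(nn.Module):
    """Sketch of a SAFM-like block: cascaded dilated convs + spatial selection.

    The (kernel, dilation) sequence (3,1) -> (3,2) -> (3,4) mirrors the best
    setting in Table 2; the per-pixel softmax selection is an illustrative
    choice, not the authors' released implementation.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=4, dilation=4)
        # Predict one spatial weight map per context branch from pooled cues.
        self.select = nn.Conv2d(2 * 3, 3, kernel_size=7, padding=3)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        c1 = self.conv1(x)             # receptive field 3x3
        c2 = self.conv2(c1)            # receptive field 7x7
        c3 = self.conv3(c2)            # receptive field 15x15
        branches = [c1, c2, c3]

        # Spatial descriptors (channel-wise avg + max) for each branch.
        desc = torch.cat(
            [torch.cat([b.mean(1, keepdim=True), b.amax(1, keepdim=True)], 1)
             for b in branches], dim=1)                       # (B, 6, H, W)
        weights = torch.softmax(self.select(desc), dim=1)     # (B, 3, H, W)

        fused = sum(w.unsqueeze(1) * b
                    for w, b in zip(weights.unbind(1), branches))
        return self.fuse(fused) + x    # residual keeps the original detail
```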
Despite the superior detection accuracy demonstrated by MCSA-Net, some limitations still exist. In extremely dense scenes, the model struggles to accurately distinguish between different instances. Furthermore, processing ultra-high-resolution images incurs a significant computational and memory overhead. Future work will focus on precisely localizing the boundaries of different targets by introducing frequency-domain information or extracting fine-grained local features and will also explore model compression techniques to achieve a balance between the detection accuracy and inference speed.

Author Contributions

Conceptualization, Z.L. and G.H.; Methodology, Z.L. and Y.H.; Validation, Z.L. and Y.H.; Data curation, Y.H.; Writing—original draft, Z.L. and Y.H.; Writing—review & editing, G.H.; Visualization, Z.L.; Supervision, G.H.; Project administration, G.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Aviation Science Foundation under Grant 2023Z073053008 and Grant D5120220246.

Data Availability Statement

The VisDrone and UAVDT datasets are available at the URLs https://github.com/VisDrone/VisDrone-Dataset (accessed on 19 October 2024) and https://sites.google.com/view/grli-uavdt/ (accessed on 20 October 2024), respectively.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, B.; Yang, Z.P.; Chen, D.Q.; Liang, S.Y.; Ma, H. Maneuvering target tracking of UAV based on MN-DDPG and transfer learning. Def. Technol. 2021, 17, 457–466. [Google Scholar] [CrossRef]
  2. Huang, J.; Yujie, C.; Guipeng, X.; Shuangxia, B.; Li, B.; Wang, G.; Evgeny, N. GTrXL-SAC-Based Path Planning and Obstacle-Aware Control Decision-Making for UAV Autonomous Control. Drones 2025, 9, 275. [Google Scholar] [CrossRef]
  3. Zhang, H.; Wang, L.; Tian, T.; Yin, J. A review of unmanned aerial vehicle low-altitude remote sensing (UAV-LARS) use in agricultural monitoring in China. Remote Sens. 2021, 13, 1221. [Google Scholar] [CrossRef]
  4. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote sensing image super-resolution and object detection: Benchmark and state of the art. Expert Syst. Appl. 2022, 197, 116793. [Google Scholar] [CrossRef]
  5. Wan, K.; Wu, D.; Li, B.; Gao, X.; Hu, Z.; Chen, D. ME-MADDPG: An efficient learning-based motion planning method for multiple agents in complex environments. Int. J. Intell. Syst. 2022, 37, 2393–2427. [Google Scholar] [CrossRef]
  6. Geng, J.; Song, S.; Jiang, W. Dual-path feature aware network for remote sensing image semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 3674–3686. [Google Scholar] [CrossRef]
  7. He, Y.; Li, J. TSRes-YOLO: An accurate and fast cascaded detector for waste collection and transportation supervision. Eng. Appl. Artif. Intell. 2023, 126, 106997. [Google Scholar] [CrossRef]
  8. Feng, Q.; Li, B.; Liu, X.; Gao, X.; Wan, K. Low-high frequency network for spatial–temporal traffic flow forecasting. Eng. Appl. Artif. Intell. 2025, 158, 111304. [Google Scholar] [CrossRef]
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  10. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  11. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  13. Yin, Y.; Li, H.; Fu, W. Faster-YOLO: An accurate and faster object detection method. Digit. Signal Process. 2020, 102, 102756. [Google Scholar] [CrossRef]
  14. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  15. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21. [Google Scholar]
  16. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  17. Ran, Q.; Wang, Q.; Zhao, B.; Wu, Y.; Pu, S.; Li, Z. Lightweight oriented object detection using multiscale context and enhanced channel attention in remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5786–5795. [Google Scholar] [CrossRef]
  18. Zhang, L.; Zhang, L. Artificial intelligence for remote sensing data analysis: A review of challenges and opportunities. IEEE Geosci. Remote Sens. Mag. 2022, 10, 270–294. [Google Scholar] [CrossRef]
  19. Laliberte, A.S.; Rango, A. Texture and scale in object-based analysis of subdecimeter resolution unmanned aerial vehicle (UAV) imagery. IEEE Trans. Geosci. Remote Sens. 2009, 47, 761–770. [Google Scholar] [CrossRef]
  20. Song, C.; Zhang, X.; She, Y.; Li, B.; Zhang, Q. Trajectory Planning for UAV Swarm Tracking Moving Target Based on an Improved Model Predictive Control Fusion Algorithm. IEEE Internet Things J. 2025, 12, 19354–19369. [Google Scholar] [CrossRef]
  21. He, G.; Li, F.; Wang, Q.; Bai, Z.; Xu, Y. A hierarchical sampling based triplet network for fine-grained image classification. Pattern Recognit. 2021, 115, 107889. [Google Scholar] [CrossRef]
  22. Liu, L.; Xia, Z.; Zhang, X.; Peng, J.; Feng, X.; Zhao, G. Information-enhanced network for noncontact heart rate estimation from facial videos. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 2136–2150. [Google Scholar] [CrossRef]
  23. Jiang, Y.; Xie, S.; Xie, X.; Cui, Y.; Tang, H. Emotion recognition via multiscale feature fusion network and attention mechanism. IEEE Sens. J. 2023, 23, 10790–10800. [Google Scholar] [CrossRef]
  24. Zhou, L.; Zhao, S.; Wan, Z.; Liu, Y.; Wang, Y.; Zuo, X. MFEFNet: A multi-scale feature information extraction and fusion network for multi-scale object detection in UAV aerial images. Drones 2024, 8, 186. [Google Scholar] [CrossRef]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  26. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  27. Li, H.; Li, Y.; Xiao, L.; Zhang, Y.; Cao, L.; Wu, D. RLRD-YOLO: An Improved YOLOv8 Algorithm for Small Object Detection from an Unmanned Aerial Vehicle (UAV) Perspective. Drones 2025, 9, 293. [Google Scholar] [CrossRef]
  28. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  29. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2736–2746. [Google Scholar]
  30. Liu, J.J.; Hou, Q.; Cheng, M.M.; Wang, C.; Feng, J. Improving convolutional networks with self-calibrated convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10096–10105. [Google Scholar]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  33. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 280–296. [Google Scholar]
  34. Mao, Y.; Zhang, J.; Wan, Z.; Tian, X.; Li, A.; Lv, Y.; Dai, Y. Generative transformer for accurate and reliable salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 1041–1054. [Google Scholar] [CrossRef]
  35. Boroujeni, S.P.H.; Razi, A.; Khoshdel, S.; Afghah, F.; Coen, J.L.; O’Neill, L.; Fule, P.; Watts, A.; Kokolakis, N.M.T.; Vamvoudakis, K.G. A comprehensive survey of research towards AI-enabled unmanned aerial systems in pre-, active-, and post-wildfire management. Inf. Fusion 2024, 108, 102369. [Google Scholar] [CrossRef]
  36. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  37. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  38. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  39. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  40. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  41. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  42. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  43. Liu, H.I.; Tseng, Y.W.; Chang, K.C.; Wang, P.J.; Shuai, H.H.; Cheng, W.H. A denoising fpn with transformer r-cnn for tiny object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4704415. [Google Scholar] [CrossRef]
  44. Liu, X.; Leng, C.; Niu, X.; Pei, Z.; Cheng, I.; Basu, A. Find small objects in UAV images by feature mining and attention. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6517905. [Google Scholar] [CrossRef]
  45. Li, X.; Xie, Z.; Deng, X.; Wu, Y.; Pi, Y. Traffic sign detection based on improved faster R-CNN for autonomous driving. J. Supercomput. 2022, 78, 7982–8002. [Google Scholar] [CrossRef]
  46. Jobaer, S.; Tang, X.S.; Zhang, Y. A deep neural network for small object detection in complex environments with unmanned aerial vehicle imagery. Eng. Appl. Artif. Intell. 2025, 148, 110466. [Google Scholar] [CrossRef]
  47. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H. Designing network design strategies through gradient path analysis. arXiv 2022, arXiv:2211.04800. [Google Scholar] [CrossRef]
  48. Ren, S.; Zhou, D.; He, S.; Feng, J.; Wang, X. Shunted self-attention via multi-scale token aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10853–10862. [Google Scholar]
  49. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef]
  50. Yu, H.; Li, G.; Zhang, W.; Huang, Q.; Du, D.; Tian, Q.; Sebe, N. The unmanned aerial vehicle benchmark: Object detection, tracking and baseline. Int. J. Comput. Vis. 2020, 128, 1141–1159. [Google Scholar] [CrossRef]
  51. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  52. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  53. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  54. Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems, Kuala Lumpur, Malaysia, 10–12 July 2024; pp. 1–6. [Google Scholar]
  55. Chen, L.; Liu, C.; Li, W.; Xu, Q.; Deng, H. DTSSNet: Dynamic training sample selection network for UAV object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  56. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320. [Google Scholar]
  57. Du, B.; Huang, Y.; Chen, J.; Huang, D. Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 13435–13444. [Google Scholar]
  58. Duan, C.; Wei, Z.; Zhang, C.; Qu, S.; Wang, H. Coarse-grained density map guided object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2789–2798. [Google Scholar]
  59. Zhou, L.; Liu, Z.; Zhao, H.; Hou, Y.E.; Liu, Y.; Zuo, X.; Dang, L. A multi-scale object detector based on coordinate and global information aggregation for UAV aerial images. Remote Sens. 2023, 15, 3468. [Google Scholar] [CrossRef]
  60. Lin, H.; Zhou, J.; Gan, Y.; Vong, C.M.; Liu, Q. Novel up-scale feature aggregation for object detection in aerial images. Neurocomputing 2020, 411, 364–374. [Google Scholar] [CrossRef]
  61. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 13668–13677. [Google Scholar]
  62. Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density Map Guided Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  63. Zhang, H.; Liu, K.; Gan, Z.; Zhu, G.N. UAV-DETR: Efficient end-to-end object detection for unmanned aerial vehicle imagery. arXiv 2025, arXiv:2501.01855. [Google Scholar]
  64. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  65. Liu, C.; Gao, G.; Huang, Z.; Hu, Z.; Liu, Q.; Wang, Y. Yolc: You only look clusters for tiny object detection in aerial images. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13863–13875. [Google Scholar] [CrossRef]
  66. Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse detr: Efficient end-to-end object detection with learnable sparsity. arXiv 2021, arXiv:2111.14330. [Google Scholar]
  67. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  68. Wei, Z.; Duan, C.; Song, X.; Tian, Y.; Wang, H. Amrnet: Chips augmentation in aerial images object detection. arXiv 2020, arXiv:2009.07168. [Google Scholar] [CrossRef]
  69. Deng, S.; Li, S.; Xie, K.; Song, W.; Liao, X.; Hao, A.; Qin, H. A global-local self-adaptive network for drone-view object detection. IEEE Trans. Image Process. 2020, 30, 1556–1569. [Google Scholar] [CrossRef]
  70. Meethal, A.; Granger, E.; Pedersoli, M. Cascaded zoom-in detector for high resolution aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2046–2055. [Google Scholar]
  71. Song, G.; Du, H.; Zhang, X.; Bao, F.; Zhang, Y. Small object detection in unmanned aerial vehicle images using multi-scale hybrid attention. Eng. Appl. Artif. Intell. 2024, 128, 107455. [Google Scholar] [CrossRef]
  72. Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 840–849. [Google Scholar]
  73. Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast convergence of detr with spatially modulated co-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3621–3630. [Google Scholar]
  74. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-Resolution Detection Network for Small Objects. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo, Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
  75. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8232–8241. [Google Scholar]
  76. Yang, X.; Yan, J.; Liao, W.; Yang, X.; Tang, J.; He, T. Scrdet++: Detecting small, cluttered and rotated objects via instance-level feature denoising and rotation loss smoothing. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2384–2399. [Google Scholar] [CrossRef]
  77. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  78. Dai, L.; Liu, H.; Tang, H.; Wu, Z.; Song, P. Ao2-detr: Arbitrary-oriented object detection transformer. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 2342–2356. [Google Scholar] [CrossRef]
  79. Zeng, Y.; Chen, Y.; Yang, X.; Li, Q.; Yan, J. ARS-DETR: Aspect ratio-sensitive detection transformer for aerial oriented object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5610315. [Google Scholar] [CrossRef]
  80. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  81. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef]
  82. Zhang, G.; Lu, S.; Zhang, W. CAD-Net: A context-aware detection network for objects in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10015–10024. [Google Scholar] [CrossRef]
  83. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
Figure 1. Challenges in aerial image object detection. Aerial images contain abundant background information; red boxes highlight regions with small objects, while orange boxes indicate areas with occlusions.
Figure 2. The framework of the proposed MCSA-Net.
Figure 3. The architecture of the proposed Spatial-Aware Feature Selection Module (SAFM).
Figure 4. An illustration of the proposed Multi-Level Joint Feature Pyramid Network (MJFPN).
Figure 5. (a) The architecture of the AEHead. (b) An illustration of the attention computation process of the sparse attention mechanism.
Figure 6. Visual detection results on VisDrone.
Figure 7. A comparison of the detection results on the VisDrone dataset. MCSA-Net demonstrates a superior detection performance in reducing missed detections (shown in the first and second rows), distinguishing between background and foreground objects better (shown in the third row), and accurately localizing dense objects (shown in the fourth and fifth rows).
Figure 8. Normalized confusion matrices on the VisDrone dataset: (a) the baseline; (b) our MCSA-Net.
Figure 9. Comparison of the detection results on the UAVDT dataset. Our method MCSA-Net accurately captures small objects in the aerial images, demonstrating the effectiveness of our approach.
Figure 10. Visual detection results on MS COCO.
Figure 11. Comparison of detection results under different complex conditions.
Figure 12. Heatmaps of MCSA-Net and the baseline. Our method MCSA-Net significantly enhances the intensity of the response to small objects and pays greater attention to the surrounding context of the targets.
Table 1. Effects of each component on VisDrone.
With SAFM? | With MJFPN? | With AEHead? | mAP | mAP50 | mAP75
 |  |  | 29.4 | 47.7 | 30.7
 |  |  | 31.1 | 50.1 | 32.2
 |  |  | 30.2 | 49.4 | 31.3
 |  |  | 29.8 | 48.4 | 31.0
 |  |  | 31.9 | 51.6 | 33.4
 |  |  | 31.5 | 50.9 | 32.4
 |  |  | 31.4 | 51.4 | 32.9
 |  |  | 32.8 | 52.2 | 34.1
Table 2. The ablation study of the proposed SAFM. k: kernel size; d: dilation; RF: receptive field.
Kernel Size (k, d) | RFs in Sequence | mAP
(3, 1), (3, 2) | 3 × 3, 7 × 7 | 32.3
(3, 1), (5, 2) | 3 × 3, 11 × 11 | 32.6
(5, 1), (5, 2) | 5 × 5, 13 × 13 | 32.4
(3, 1), (3, 2), (3, 4) | 3 × 3, 7 × 7, 15 × 15 | 32.8
(5, 1), (5, 2), (5, 4) | 5 × 5, 13 × 13, 29 × 29 | 32.1
(3, 1), (5, 2), (7, 4) | 3 × 3, 11 × 11, 35 × 35 | 32.0
Table 3. Effects of each component of MJFPN.
With MLC? | With WF? | With FAN? | mAP
 |  |  | 31.5
 |  |  | 31.8
 |  |  | 32.0
 |  |  | 32.2
 |  |  | 32.8
 |  |  | 32.4
Table 4. An ablation study of the proposed sparse attention mechanism.
Number of Regions | Top-k | mAP | Time (ms)
N = 5 | k = 4 | 32.4 | 3.43
N = 7 | k = 4 | 32.8 | 3.20
N = 9 | k = 4 | 32.6 | 3.13
N = 7 | k = 2 | 32.3 | 3.09
N = 7 | k = 6 | 33.0 | 3.42
N = 7 | k = 8 | 32.6 | 3.60
Table 5. An ablation study on attention heads and kernel sizes in AEHead.
Attention Heads | Kernel Size | mAP
2 | 3 × 3 | 32.2
4 | 3 × 3 | 32.5
8 | 3 × 3 | 32.8
16 | 3 × 3 | 32.7
8 | 5 × 5 | 32.5
Table 6. Comparisons with previous methods on the VisDrone dataset. Bold font indicates the best-performing data in each column, and underlining indicates the second best data.
Methods | Backbone | mAP | mAP50 | mAP75 | mAPsmall | mAPmedium | mAPlarge
RetinaNet [53] | ResNet50 | 19.4 | 35.9 | 18.5 | 14.1 | 29.5 | 33.7
FRCNN [9] | ResNet50 | 21.5 | 40.0 | 20.6 | 15.4 | 34.6 | 37.1
YOLOv8 [54] | CSPDarknet | 23.9 | 39.8 | 24.4 | 14.4 | 32.3 | 34.2
DTSSNet [55] | MobileNetV2 | 25.5 | 41.1 | 26.9 | 18.6 | 34.3 | 41.2
ClusDet [56] | ResNet50 | 26.7 | 50.6 | 24.7 | 17.6 | 38.9 | 51.4
CEASC [57] | ResNet18 | 28.7 | 50.7 | 28.4 | – | – | –
CDMNet [58] | ResNeXt101 | 29.7 | 50.0 | 30.9 | 21.2 | 41.8 | 42.9
CGMDet [59] | CSPDarknet | 29.3 | 50.9 | 29.4 | 20.2 | 40.6 | 47.4
HawkNet [60] | ResNet50 | 25.6 | 44.3 | 25.8 | 19.9 | 36.0 | 39.1
QueryDet [61] | CSPDarknet | 28.3 | 48.1 | 28.8 | – | – | –
DMNet [62] | ResNet101 | 28.5 | 48.1 | 29.4 | 20.0 | 39.7 | 57.1
UAV-DETR [63] | ResNet50 | 31.5 | 51.1 | – | – | – | –
RT-DETR [38] | ResNet50 | 28.4 | 47.0 | – | – | – | –
D-DETR [64] | ResNet50 | 27.1 | 42.2 | – | – | – | –
YOLC [65] | ResNet101 | 28.9 | 51.4 | 28.3 | 20.1 | 41.6 | 47.3
MCSA-Net | CSPDarknet | 32.8 | 52.2 | 34.1 | 21.4 | 43.7 | 48.4
Table 7. Comparison with transformer-based detectors.
Model | Parameters (M) | FLOPs (G) | mAP | mAP50
Sparse DETR [66] | 40.9 | 121.0 | 27.3 | 42.5
RT-DETR-R18 [38] | 20.0 | 60.0 | 26.7 | 44.6
RT-DETR-R50 [38] | 42.0 | 136.0 | 28.4 | 47.0
UAV-DETR-R18 [63] | 20.0 | 77.0 | 29.8 | 48.8
UAV-DETR-R50 [63] | 42.0 | 170.0 | 31.5 | 51.1
MCSA-Net | 14.1 | 85.5 | 32.8 | 52.2
Table 8. Performance comparison of different object detection models.
Model | mAP | mAP50 | FPS | Parameters (M) | FLOPs (G)
YOLOv5n | 28.4 | 46.3 | 25.7 | 9.0 | 23.9
YOLOv5s | 22.8 | 38.3 | 32.1 | 2.5 | 7.1
YOLOv8n | 23.9 | 39.8 | 31.5 | 3.0 | 8.1
YOLOv8s | 29.4 | 47.7 | 24.3 | 11.1 | 28.5
MCSA-Net | 32.8 | 52.2 | 10.6 | 14.1 | 85.5
Table 9. Comparisons with previous methods on UAVDT. Bold font indicates the highest value in each column, and underline indicates the second highest value.
Methods | Backbone | mAP50 | mAP | mAP75 | mAPsmall | mAPmedium | mAPlarge
DMNet [62] | ResNet50 | 24.6 | 14.7 | 16.3 | 9.3 | 26.2 | 35.2
ClusDet [56] | ResNet50 | 26.5 | 13.7 | 12.5 | 9.1 | 25.1 | 31.2
YOLOv8 [54] | CSPDarknet | 28.2 | 16.2 | 16.8 | 10.7 | 28.1 | 25.3
FRCNN [9] | ResNet50 | 23.4 | 11.0 | 8.4 | 8.1 | 20.2 | 26.5
CenterNet [67] | Hourglass104 | 26.7 | 13.2 | 11.8 | 7.8 | 26.6 | 13.9
CEASC [57] | ResNet18 | 30.9 | 17.1 | 17.8 | – | – | –
AMRNet [68] | ResNet50 | 30.4 | 18.2 | 19.8 | 10.3 | 31.3 | 33.5
GLSAN [69] | ResNet50 | 28.1 | 17.0 | 18.8 | – | – | –
CZDDet [70] | ResNet50 | 35.5 | 19.3 | 21.3 | 13.0 | 31.3 | 36.0
QueryDet [61] | ResNet50 | 36.1 | 18.9 | 20.6 | 12.9 | 30.9 | 25.6
CDMNet [58] | ResNet50 | 35.5 | 20.7 | 22.4 | 13.9 | 33.5 | 19.8
MHA-YOLO [71] | CSPDarknet | 37.2 | 21.9 | 21.5 | 14.3 | 34.2 | 37.2
MCSA-Net | CSPDarknet | 35.7 | 23.6 | 28.0 | 20.2 | 29.3 | 28.9
Table 10. Comparison of mAP performance under normal, nighttime, and blurred conditions.
Method | mAP (Origin) | mAP (Night) | mAP (Blur)
Baseline | 29.4 | 25.2 (↓ 4.2) | 18.9 (↓ 10.5)
MCSA-Net | 35.7 | 32.1 (↓ 3.6) | 25.6 (↓ 10.1)
Table 11. Comparisons with previous methods on MS COCO. Bold font indicates the best-performing data in each column, and underlining indicates the second best data.
Methods | Backbone | mAP | mAP50 | mAP75 | mAPsmall | mAPmedium | mAPlarge
RetinaNet [53] | ResNeXt101 | 40.8 | 61.1 | 44.1 | 24.1 | 44.2 | 51.2
FRCNN [9] | ResNet101 | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9
FCOS [16] | ResNeXt101 | 44.7 | 64.1 | 48.4 | 27.6 | 47.5 | 55.6
QueryDet [61] | ResNet50 | 39.3 | 59.6 | 41.9 | 24.9 | 42.3 | 51.1
YOLOv8 [54] | CSPDarknet | 41.3 | 59.3 | 49.5 | 26.7 | 38.5 | 47.8
FSAF [72] | ResNeXt101 | 42.9 | 63.8 | 46.3 | 26.6 | 46.2 | 52.7
SMCA-DETR [73] | ResNet50 | 45.6 | 65.5 | 49.1 | 25.9 | 49.3 | 62.6
D-DETR [64] | ResNet50 | 46.2 | 65.2 | 50.0 | 28.8 | 49.2 | 61.7
HRDNet [74] | ResNet101 | 47.4 | 66.9 | 51.8 | 32.1 | 50.5 | 55.8
MCSA-Net | CSPDarknet | 48.3 | 68.7 | 52.6 | 34.5 | 51.2 | 55.6
Table 12. Comparisons with previous methods on DOTA. Bold font indicates the best-performing data in each column, and underlining indicates the second best data.
Methods | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP
SCRDet [75] | 89.98 | 80.65 | 52.09 | 68.36 | 68.36 | 60.32 | 72.41 | 90.85 | 87.94 | 86.86 | 65.02 | 66.68 | 66.25 | 68.24 | 65.21 | 72.67
SCRDet++ [76] | 90.01 | 82.32 | 61.94 | 68.62 | 69.92 | 81.17 | 78.83 | 90.86 | 86.32 | 85.10 | 65.10 | 61.12 | 77.69 | 80.68 | 64.25 | 76.24
RoI-Trans [77] | 88.64 | 78.52 | 43.44 | 75.92 | 68.81 | 73.68 | 83.59 | 90.74 | 77.27 | 81.46 | 58.39 | 53.54 | 62.83 | 58.93 | 47.67 | 69.56
AO2-DETR [78] | 87.99 | 79.46 | 45.74 | 66.64 | 78.90 | 73.90 | 73.30 | 90.40 | 80.55 | 85.89 | 55.19 | 63.63 | 51.83 | 70.15 | 60.04 | 70.91
ARS-DETR [79] | 86.61 | 77.26 | 48.84 | 66.76 | 78.38 | 78.96 | 87.40 | 90.61 | 82.76 | 82.19 | 54.02 | 62.61 | 72.64 | 72.80 | 64.96 | 73.79
R3Det [80] | 89.80 | 83.77 | 48.11 | 66.77 | 78.76 | 83.27 | 87.84 | 90.82 | 85.38 | 85.51 | 65.67 | 62.68 | 67.53 | 78.56 | 72.62 | 76.47
Gliding Vertex [81] | 89.64 | 85.00 | 52.26 | 77.34 | 73.01 | 73.14 | 86.82 | 90.74 | 79.02 | 86.81 | 59.55 | 70.91 | 72.94 | 70.68 | 57.32 | 75.02
CADNet [82] | 87.80 | 82.40 | 49.40 | 73.50 | 71.10 | 63.50 | 76.60 | 90.90 | 79.20 | 73.30 | 48.40 | 60.90 | 62.00 | 67.00 | 62.20 | 69.90
O-RCNN [83] | 89.46 | 82.12 | 54.78 | 70.86 | 78.93 | 83.00 | 88.20 | 90.90 | 87.50 | 84.68 | 63.97 | 67.69 | 74.94 | 64.84 | 52.28 | 75.87
MCSA-Net | 90.55 | 81.91 | 48.54 | 72.58 | 79.57 | 78.63 | 87.81 | 90.04 | 88.19 | 88.64 | 58.64 | 68.36 | 77.66 | 81.81 | 67.36 | 77.35
Explanation of each category: PL = plane; BD = baseball diamond; BR = bridge; GTF = ground track field; SV = small vehicle; LV = large vehicle; SH = ship; TC = tennis court; BC = basketball court; ST = storage tank; SBF = soccer-ball field; RA = roundabout; HA = harbor; SP = swimming pool; HC = helicopter.
Table 13. Comparison of the detection performance of MCSA-Net and YOLOv8s under different types of noise.
Noise Type | MCSA-Net (mAP / mAP50 / mAP75) | YOLOv8s (mAP / mAP50 / mAP75)
Origin | 32.8 / 52.2 / 34.1 | 29.4 / 47.7 / 30.7
Pixel dropout | 28.8 / 47.2 / 29.7 | 25.4 / 42.6 / 25.9
Gaussian | 21.8 / 36.4 / 22.4 | 17.4 / 29.5 / 17.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
