SPOD-YOLO: A Modular Approach for Small and Oriented Aircraft Detection in Satellite Remote Sensing Imagery

Chen, Jiajian; Guo, Pengyu; Liu, Yong; Cao, Lu; Ran, Dechao; Wang, Kai; Hu, Wei; Wan, Liyang

doi:10.3390/rs17243963

Open AccessArticle

SPOD-YOLO: A Modular Approach for Small and Oriented Aircraft Detection in Satellite Remote Sensing Imagery

by

Jiajian Chen

,

Pengyu Guo

^*

,

Yong Liu

,

Lu Cao

,

Dechao Ran

,

Kai Wang

,

Wei Hu

and

Liyang Wan

Defense Innovation Institute, Academy of Military Sciences (AMS), Beijing 100071, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(24), 3963; https://doi.org/10.3390/rs17243963

Submission received: 21 October 2025 / Revised: 30 November 2025 / Accepted: 2 December 2025 / Published: 8 December 2025

(This article belongs to the Special Issue Object Detection in Remote Sensing Images Based on Artificial Intelligence)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

We propose a Dense Window Multi-head Self-Attention (DW-MSA) mechanism that integrates multi-level features and spatial attention, together with an enhanced C3k2_OD structure using omni-dimensional dynamic convolution, to significantly improve feature representation and small-object detection performance in remote sensing images.
We design a rotation-invariant Gaussian–Cosine-based Angle-aware Loss Function (GCAL) that combines Gaussian and cosine modeling to enhance the stability and accuracy of rotation regression.

What are the implications of the main findings?

The integration of DW-MSA and C3k2_OD effectively strengthens feature learning and suppresses background interference, leading to higher recall and precision for small and densely distributed aircraft under complex conditions.
GCAL improves angular regression consistency and intersection accuracy, resulting in a notable gain in mAP50:95 and more reliable orientation estimation for multi-oriented targets.

Abstract

The accurate detection of small, densely packed and arbitrarily oriented aircraft in high-resolution remote sensing imagery remains highly challenging due to significant variations in object scale, orientation and background complexity. Existing detection frameworks often struggle with insufficient representation of small objects, instability of rotated bounding box regression and inability to adapt to complex background. To address these limitations, we propose SPOD-YOLO, a novel detection framework specifically designed for small aircraft in remote sensing images. This method is based on YOLOv11, combined with the feature attention mechanism of swintransformer, through targeted improvements on cross-scale feature modelling, dynamic convolutional adaptation, and rotational geometry optimization to achieve effective detection. Additionally, we have constructed a new dataset based on satellite remote sensing images, which has high density of small aircraft with rotated bounding box annotations to provide more realistic and challenging evaluation settings. Extensive experiments on MAR20, UCAS-AOD and the constructed dataset demonstrate that our method achieves consistent performance gains over state-of-the-art approaches. SPOD-YOLO achieves an 4.54% increase in mAP50 and a 11.78% gain in mAP50:95 with only 3.77 million parameters on the constructed dataset. These results validate the effectiveness and robustness of our approach in complex remote sensing scenarios, offering a practical advancement for the detection of small objects in aerospace imagery.

Keywords:

oriented aircraft detection; YOLOv11; remote sensing; rotated bounding box

1. Introduction

High-resolution remote sensing imagery, with its wide-area coverage and fine spatial resolution [1,2,3], has become indispensable for strategic applications such as airport security and military reconnaissance. With the rapid advancement of remote sensing imaging technologies, deep learning-based automatic object detection has emerged as a core approach to interpreting such images [4,5,6]. Among the various objects of interest, aircraft detection plays a critical role in traffic coordination, scene understanding and mission planning. However, aircraft in remote sensing scenes often exhibit a small size, a dense distribution, and arbitrary orientation, posing significant challenges to traditional detection methods.

The widely used horizontal bounding box (HBB) detection methods often suffer from overlapping and occlusion when applied to densely packed objects, limiting their accuracy and effectiveness. As a result, rotated bounding box (RBB) detection has gained prominence in remote sensing applications due to its superior ability to model object orientation [7,8]. However, approaches based on RBB still encounter two major challenges: (1) periodic discontinuity in angle prediction (e.g., sudden transitions between 179° and −1°), leading to unstable regression performance [9,10], and (2) the weak feature representation of small objects in complex backgrounds, making them vulnerable to noise and prone to missed detections [11,12].

To address these challenges, we propose a modular detection framework specifically designed for aircraft detection in remote sensing imagery. This framework comprises three functional sub-networks: a multi-level semantic fusion enhancement module, a dynamic feature response module, and a rotational geometry modeling module.

The multi-level semantic fusion module introduces a dense window-based self-attention mechanism that spatially applies sliding-window attention while integrating shallow and deep feature representations. This design substantially improves boundary localization while simultaneously strengthening the contextual perception required for reliable small-object interpretation. The dynamic feature response module improves the adaptability of the network to scale changes, texture inconsistencies, and pose perturbations by employing a multi-dimensional convolutional kernel regulation strategy, thereby enhancing generalization and feature expressiveness. For rotated bounding box regression, we propose a novel rotation loss function based on a Gaussian–Cosine distribution, which improves both the stability and accuracy of angle prediction by jointly optimizing center offset and angular error.

Moreover, existing remote sensing aircraft detection datasets, such as MAR20 [13] and RSOD [14] are primarily composed of aerial imagery with high resolution and relatively large aircraft, making them less suitable for evaluating the detection of small objects in satellite images. Although datasets like DOTA [15] and HRSC2016 [16] include aircraft categories, they contain a limited number of small objects, low spatial density, and lack sufficient variation in viewing angles, thus failing to fully reflect the challenges of small objects, multi-orientation detection. To address this gap, we construct a new aircraft detection dataset based on satellite imagery, acquired using Jilin-1 satellite imagery. This dataset includes aircraft with diverse scales, densities, and orientations, providing a more challenging and representative benchmark for remote sensing object detection.

In summary, the main contributions of this paper are as follows:

We propose a dense window-based self-attention mechanism that integrates multi-level features and spatial attention, significantly enhancing the detection performance for small objects in remote sensing images.
We design a rotation-invariant angular loss function that combines Gaussian and cosine modeling to improve the stability and accuracy of rotation regression.
We construct a novel aircraft detection dataset based on satellite imagery characterized by high density and small sizes to verify the detection ability of the model in complex environment.

By jointly optimizing the network architecture, loss function, and dataset design, this work achieves comprehensive improvements in both robustness and accuracy for aircraft detection in remote sensing imagery, offering high practical relevance for real-world applications.

2. Related Work

2.1. Object Detection for Remote Sensing Images

Object detection, a fundamental task in computer vision, aims to accurately identify objects of interest and determine their locations within an image [17]. Traditional object detection methods primarily relied on sliding window techniques combined with handcrafted features such as HOG+SVM [18] and part-based models [19]. Although these approaches achieved limited success, they often struggled with low generalization ability and poor scalability when dealing with multi-scale or complex scenes.

With the rapid development of deep learning, object detection has evolved into two primary paradigms: two-stage and one-stage detectors. Two-stage methods such as R-CNN [20], Fast R-CNN [21], and Faster R-CNN [22] operate by first generating candidate regions and then performing classification and regression, offering high accuracy but limited real-time performance. In contrast, one-stage detectors like SSD and the YOLO series prioritize speed and efficiency by treating detection as a single regression problem. Successive versions of YOLO from YOLOv3 to YOLOv7 have introduced multi-scale feature pyramids, PANet [23], CSPNet [24], and other architectural refinements to improve the balance between accuracy and speed. A comprehensive review by Ali et al. [25] comprehensively reviewed the evolution from YOLOv1 to YOLOv11, emphasizing the series achieved the best balance between performance and efficiency in different application scenarios, and has become the mainstream framework for industrial detection tasks.

While deep learning-based detectors have achieved remarkable progress on natural images, remote sensing imagery presents distinct challenges that limit the effectiveness of these general-purpose models. Remote sensing images are captured from overhead perspectives and offer high spatial resolution with wide coverage. However, aircraft, vehicles and ships are often small in size, densely distributed, arbitrarily oriented, and embedded within complex backgrounds [26,27,28]. These factors exacerbate issues related to object scale variance, occlusion, and rotation, making accurate detection significantly more difficult than in natural image settings.

To address these issues, numerous remote sensing-specific improvements have been proposed. Feature representation strategies have been a key focus, with modules such as Switchable Atrous Convolution (SAC) and Efficient Multi-Scale Attention (EMA) introduced in SED-YOLO to better model spatial context and scale diversity [29]. Similarly, Chen et al. [30] improved detection robustness for communication infrastructure monitoring by introducing risk factor modeling and linear threshold regulation to an enhanced YOLOv8 framework.

Another major challenge in remote sensing is the arbitrary orientation of objects. Horizontal bounding box (HBB) detectors often fail to accurately enclose such objects, especially in densely packed scenes, leading to overlap errors and suppression failures during non-maximum suppression. As a result, rotated bounding box (RBB) detection has become an important trend in remote sensing. For example, CR-Mask R-CNN [31] integrates rotated anchors and attention enhancement to improve small object detection, while Zhang et al. [32] propose a cascaded Transformer architecture for more robust geometric modeling. However, existing RBB-based models still suffer from limitations such as angle discontinuity and cyclic regression errors when dealing with high-density or fine-angled instances.

The detection of small objects remains particularly difficult due to their low pixel occupancy and weak features. To this end, Qiang et al. [33] proposed SCM-YOLO, which combines attention-based and cross-layer modules to boost detection accuracy without sacrificing inference efficiency. Li et al. [34] introduced a lightweight framework with variable kernel convolution and spatial pyramid pooling, which improves performance on datasets like DUO and UDD. Nevertheless, in extremely dense scenes, these methods still face challenges such as ambiguous boundaries and increased missed detections.

In summary, while general object detection techniques provide a strong foundation, remote sensing imagery demands specialized solutions. Current research efforts focus on improving small object perception, rotational bounding box regression, and multi-scale feature fusion. Despite these advances, problems such as unstable rotation prediction and weak small-object discrimination remain open challenges.

2.2. Aircraft Detection in Remote Sensing Images

Unlike general object detection tasks, aircraft captured by high-resolution satellites or aerial sensors present unique challenges: they are often extremely small relative to the image scale, densely clustered in specific areas like airports, and exhibit strong orientation consistency. These characteristics exacerbate common detection difficulties such as feature sparsity, bounding box overlap, and rotation regression instability, making aircraft detection an important and highly complex subtask within the broader domain of remote sensing object detection research.

Detecting aircraft in remote sensing imagery is particularly challenging due to their small sizes, dense clustering in airport scenes, and limited feature representation. To address these issues, researchers have proposed various small object oriented detection frameworks. EFCNet [35] enhances YOLOv5 by incorporating STEBIFPN and an ODCSP-Darknet53 backbone to strengthen multi-scale feature extraction, achieving notable improvements on DOTA and DIOR datasets. A lightweight YOLOv5-based model by Wang et al. [36] integrates Shufflenet v2 and deformable convolution with coordinate attention, providing better orientation sensitivity while maintaining computational efficiency. SA3Det [37] introduces pixel-level attention and adaptive label assignment to improve detection of small and occluded objects in dense aerial scenes. More recently, AD³I-Net [38] combines spatial and frequency-domain adaptive modules to capture fine-grained object cues, explicitly improving detection robustness for small aircraft in complex remote sensing environments. Despite these advancements, most methods remain limited by their reliance on dataset-specific optimizations and still struggle to maintain detection precision when objects are extremely small or closely packed.

Due to the strong orientation and elongated shapes of aircraft, rotated bounding boxes (RBBs) have been widely adopted to improve localization accuracy and reduce redundant background regions. SCRDet [39] introduced a rotation-aware detection framework with multi-layer feature fusion and IoU regularized loss to stabilize rotation regression for small and cluttered objects. Point RCNN [40] proposed an angle-free detection paradigm using a PointRPN module to generate rotated proposals and a corner refinement stage, effectively mitigating angular discontinuity. PolarDet [41] reformulated oriented object detection in polar coordinates, providing compact object representation and achieving state-of-the-art performance on UCAS-AOD and HRSC datasets. More recently, a Gaussian-based R-CNN [42] design incorporated a Gaussian-inspired angular loss to smooth orientation regression and enhance stability for rotated object detection in aerial imagery. Although RBB-based methods significantly improve geometric modeling and orientation estimation, challenges remain in resolving periodic angle ambiguities and ensuring robust regression performance in scenes with densely clustered, tiny angle difference aircraft.

3. Methods

3.1. General Overview

The remote sensing aircraft detection framework proposed in this study is improved on the basis of YOLOv11 [43]. The overall framework is shown in Figure 1.

Firstly, the multilayer semantic fusion enhancement module utilizes the dense window self-attention mechanism to process feature maps at different scales in the FPN—by stitching shallow and deep features and applying sliding window self-attention to achieve joint modeling of local details and contextual information, which enhances the detection of small objects. Secondly, the dynamic feature response module introduces a mechanism to adaptively adjust the weights of convolution kernels to cope with scale changes, attitude perturbations, and complex textures exhibited by objects in remote sensing images, which makes the network feature representation more flexible and adaptable. Finally, the rotational geometry modeling module optimizes the RBB regression through Gaussian–Cosine distribution loss function. This achieves error smoothing from the centroid position to the rotation regression, effectively mitigates the periodic jump problem and improves the prediction accuracy.

The overall framework takes high-resolution remote sensing images as input, extracts multi-scale features and processes them sequentially through the aforementioned three modules to produce accurate rotated bounding box predictions. Unlike conventional approaches that rely solely on convolutional or attention-based architectures, the proposed method effectively addresses key challenges in the detection of remote sensing aircraft such as object overlap, loss of small objects, and unstable rotation regression, by integrating structural fusion, dynamic adaptability, and geometric optimization, which results in significant improvements in detection accuracy and robustness.

3.2. Dense Window Multi-Head Self-Attention

Objects in remote sensing imagery are often characterized by high spatial density, large-scale variation, and complex orientation changes, all of which impose stringent requirements on feature representation. To address these challenges, we design an enhanced sliding window self-attention mechanism named Dense Window Multi-head Self-Attention (DW-MSA) as shown in Figure 2. Inspired by the Sliding Window Multi-head Self-Attention (SW-MSA) in Swin Transformer [44], DW-MSA is redesigned to incorporate a cross-layer feature fusion strategy structurally, aiming to integrate shallow contour details, spectral textures, and deeper semantic information for robust recognition of multi-scale and diverse objects morphologically.

In the DW-MSA module, we utilize the P3 and P5 layers from the Feature Pyramid Network (FPN) as input, representing fine-grained local details and high-level semantic context, respectively. Due to differences in spatial resolution and channel dimensions, the P3 feature map is first processed by a 3 × 3 convolution to expand its receptive field and enhance texture modeling. Subsequently, it is downsampled via a stride-2 pooling operation (or an equivalent strided convolution) to match the resolution of P5. Simultaneously, a 1 × 1 convolution is applied to the P5 feature map to unify its channel dimensions. These aligned feature maps are then concatenated along the channel axis to construct a fused representation

F fused

, which takes the following form:

F_{fused} = {Conv}_{1 \times 1} (Concat (Pool (P_{3}), (P_{5})))

(1)

The final

1 \times 1

convolution compresses the concatenated features back to the same channel dimension as the original P5.

To preserve spatial continuity and enhance inter-window communication, we adopt a sliding window partitioning strategy. The fused feature map is divided into overlapping local windows, within which multi-head self-attention is computed independently. The windows in DW-MSA use an overlap rate

γ

to control the shared area across windows, thereby alleviating the problem of edge information fragmentation. In this paper, the overlap rate

γ

fixed to 0.5, corresponding to a half-step sliding between adjacent windows. This design ensures a balanced overlap between local regions, allowing boundary information from neighboring windows to interact while maintaining computational efficiency. The attention calculation form within the window is modeled as follows:

F_{attn}^{(i)} = MSA (W_{γ}^{(i)} (F_{fused}))

(2)

where

W_{γ}^{(i)}

represents the ith sliding window operation with an overlap rate of

γ

.

This design effectively mitigates the edge information loss typically caused by non-overlapping windows in the original Swin Transformer and promotes receptive field sharing across adjacent regions. Through this windowed attention mechanism, DW-MSA captures long-range dependencies between object regions and improves recognition robustness under background interference.

Following the attention computation, the output features are concatenated with the original aligned P3 and P5 maps along the channel dimension. A final 1 × 1 convolution is applied for feature compression and information reconstruction. The resulting DW-MSA output incorporates three complementary types of information: (1) spatial context and global dependencies modeled via attention, (2) fine edge and texture features from the shallow P3 map, and (3) high-level semantic abstraction from the P5 map. This fused representation exhibits superior discriminability and stability for detecting small objects with blurred boundaries and significant rotational variations. The final DW-MSA output represents

F_{DW - MSA} = {Conv}_{1 \times 1} (Concat (F_{attn}^{(i)}, {Conv}_{k_{1} \times k_{1}} (Concat (Pool (P_{3}), {Conv}_{k_{2} \times k_{2}} (P_{5})))))

(3)

where the pooled

P_{3}

feature map and the transformed

P_{5}

feature map are aligned to the same spatial resolution and concatenated along the channel dimension, after which a convolution with kernel size

k_{1} \times k_{1}

is applied for feature fusion. In our implementation,

k_{1} = 1

is used to perform lightweight channel mixing after concatenation, while

k_{2} = 3

is adopted to adjust

P_{5}

before alignment.

Overall, the DW-MSA module effectively overcomes the representational limitations of traditional convolutional or pure Transformer-based architectures in complex remote sensing scenes. Its modular design, enhanced expressiveness, and cross-scale perception ability improve the detection robustness of the model significantly and provide a more informative foundation for subsequent feature interaction and object regression stages.

3.3. Omni-Dimensional Dynamic Convolution

In remote sensing object detection, small objects often exhibit irregular shapes, indistinct boundaries, and low-texture features, placing elevated demands on the feature extraction capacity of detection models. To enhance recognition accuracy under such conditions, we incorporate Omni-Dimensional Dynamic Convolution [45] (ODConv) into the YOLOv11 architecture. By integrating a multi-dimensional attention mechanism, ODConv modulates convolutional kernel weights adaptively in response to the semantic structure and spatial distribution of input features, thereby strengthening the model’s focus on salient object regions—particularly those corresponding to small objects in high-resolution remote sensing imagery.

The core mechanism of ODConv lies in its four-dimensional attention modulation, which adjusts convolution behavior across the spatial, channel, kernel, and filter dimensions. Each dimension generates an attention weight vector that calibrates the response of the convolution operation dynamically within its scope, enabling more flexible and fine-grained feature adaptation across positions, scales, and orientations.

As shown in Figure 3, ODConv performs dynamic adjustment of the convolution operation through a sequence of four attention mechanisms, each operating along a distinct dimension:

Spatial-wise Attention: this step is a multiplication operation along the spatial dimension of the convolution kernel to adjust the weights of different spatial locations in each convolution kernel;
Channel-wise Attention: this step is a multiplication operation along the dimension of the input channels, assigning different weights to each input channel
Convolution kernel-wise Attention: this step is a multiplication operation along the convolution kernel dimension. Each convolution kernel is weighted;
Filter-wise Attention: this step is a multiplication operation along the filter dimension to adjust the convolutional weights of each output channel.

By combining these attention operations, ODConv generates dynamic convolution kernels that are conditioned on the input features, thereby enhancing the flexibility and specificity of feature extraction.

The overall computation process can be summarized as follows:

y = (\sum_{i = 1}^{n} α_{w i} \cdot α_{f i} \cdot α_{c i} \cdot α_{s i} \cdot W_{i}) * x

(4)

where

α_{s i}, α_{c i}, α_{f i}, α_{w i}

represents the attention weights of spatial, channel, output channel and convolution kernel dimensions, respectively,

W_{i}

is the convolution kernel, x is the input feature map, and y is the output feature map.

The above attention mechanisms work in concert to generate dynamic convolution kernels with input sensing ability, which enables the model to respond more accurately to the changing and complex aircraft in remote sensing images in the feature extraction stage. ODConv improves the sensitivity of the network to small objects and the ability of regional expression significantly while retaining the computational structure of the original convolution module.

In the SPOD-YOLO framework, ODConv is integrated into the downsampling module C3k2 in YOLOv11, replacing traditional convolutional layers. The modified version of this module is referred to as C3k2_OD. This module is responsible for low-level feature extraction and spatial resolution reduction. During training, the attention weights across spatial, channel, and kernel dimensions are learned to automatically adapt convolutional parameters to variations in the input. Additionally, both the number and behavior of convolution kernels are adjusted dynamically, thereby enhancing the adaptability of network and improving detection performance. C3k2_OD achieves this while preserving the computational efficiency of standard convolution modules, offering a practical balance between accuracy and model complexity.

3.4. Gaussian–Cosine-Based Angle-Aware Loss Function

In remote sensing imagery, particularly in typical scenarios such as airports, aircraft often exhibit strong directional consistency and dense spatial distributions. In such scenes, traditional object detection methods that employ HBB face critical limitations. Due to the intrinsic orientation of aircraft, horizontal boxes fail to tightly enclose object boundaries, resulting in significant redundant background regions. This issue becomes more pronounced in dense scenes, where multiple HBBs tend to overlap extensively, severely impacting the performance of non-maximum suppression (NMS). These overlaps often lead to increased false positives and missed detections. In contrast, RBB detection can align with the principal axis of objects, providing more compact and precise enclosures. This alignment alleviates bounding box collisions in high-density areas and enhances detection robustness under crowded conditions.

More importantly, RBBs not only improve spatial fitting but also offer richer geometric descriptions. Each rotated box output encodes the object centroid coordinates, width, height, and orientation angle

(x, y, w, h, θ)

, where the direction and the length of the long axis jointly define the object structural signature.

Based on these considerations, this work adopts a rotated bounding box representation within the detection framework. This design not only improves localization accuracy and object separability in dense environments, but also provides structured, orientation-aware features that facilitate robust object tracking and re-identification across sequential remote sensing frames.

In the process of RBB object detection, in addition to the regression of position and size, it is also necessary to supervise the rotation angle of the object. The rotation regression adopted by most current RBB object detection methods is mainly based on linear loss functions (e.g., L1, Smooth L1). However, this loss function has obvious shortcomings in dealing with periodic angular variables, which may introduce serious gradient misjudgment and affect the learning of the real orientation of the object.

As shown in Figure 4, if the true angle of the object

θ_{g t}

is 5° and the predicted value

θ_{p r e d}

is 355°, the two are almost equivalent geometrically, and the actual

θ_{d i f f}

has only 10° deviation, but the result of linear loss calculation

θ_{d i f f}^{'}

is 350°, which is obviously unreasonable.

To solve the above problem, this part proposes a Gaussian–Cosine-based Angle-aware Loss function, which periodically maps the angular difference by introducing the cosine function, so that the loss function can give a more reasonable gradient feedback when the angular error is small in a geometric sense.

In YOLO rotated frame object detection task, the angle of rotation is defined as

θ \in [0, π)

and hence the difference between the predicted angle and the true angle ranges from

π

to

- π

. For this purpose, the angular difference needs to be multiplied by a factor of 2 and then fed into a trigonometric function to realize the periodic mapping. The rotated detection frame formed in the image when the difference between two angles is

π

is geometrically overlapping, which is the same as the correct result when the angle difference is 0. Therefore, the function value is close to 1 when the angle difference is 0 or when the value of the function is close to 1, i.e., x is 0 or

\pm π

. Based on this, the cosine function is chosen to map the angle difference

θ

to x:

x = cos (Δ θ)

(5)

Then input the obtained x into a Gaussian distribution function to obtain the final loss function:

L_{a n g l e} = 1 - \frac{1}{\sqrt{2 π} \cdot σ} e^{- \frac{{(cos (Δ θ))}^{2}}{2 σ^{2}}}

(6)

where

σ

is the standard deviation of the Gaussian distribution, controlling the width of its peak and thus affecting the sensitivity of angular differences to losses. When

σ

is large, the function is smoother and the model is less sensitive to angular errors, and conversely the function is steeper and angular deviations produce greater losses. The effect of different

σ

in the Gaussian distribution function is illustrated in Figure 5. Multiple sets of tuning experiments were conducted on the values of the angular loss weights to balance their effect on model training.

4. Experiments

This section presents comprehensive experiments conducted to assess the effectiveness of the proposed SPOD-YOLO detector on the Jilin-1 aircraft detection dataset and other representative remote sensing benchmarks. The evaluations center on three core challenges in satellite-based object detection: reliable recognition of small objects, accurate discrimination in densely populated scenes, and stable rotated-bounding-box regression under complex environmental factors including cloud occlusion, shadow-induced distortions, and background clutter.

For clarity and consistency, all qualitative visualizations in this section follow a unified annotation protocol: ground-truth bounding boxes are shown in green, whereas predicted bounding boxes from SPOD-YOLO and the compared methods are displayed in red. Based on this convention, the subsequent sections detail the experimental settings, quantitative comparisons, and visual analyses, providing a comprehensive verification of the proposed method’s performance and robustness.

4.1. Experimental Environment and Evaluation Metrics

All experiments are conducted on a single NVIDIA RTX 4090 GPU, including the model training and prediction process. During the training process, the input images are uniformly scaled to a resolution of 1024 × 1024, and the batch size is set to 8. The optimizer adopts stochastic gradient descent (SGD), with an initial learning rate of 0.01, a momentum coefficient of 0.937, a weight decay coefficient of 0.0005, and the number of training rounds is 100. The model is implemented based on the PyTorch framework, and the training process is conducted under Ubuntu 18.04 environment.

In order to comprehensively evaluate the performance of the model in the task of aircraft detection in remote sensing images, a number of standard evaluation metrics that are widely used in the field of object detection, including Precision, Recall, mAP@0.5, mAP@0.5:0.95, etc., are used in the experiments, which are defined as follows.

Precision
Precision rate is used to measure the proportion of samples predicted to be positive classes in the detection results that are actually positive classes, defined as

$Precision = \frac{TP}{TP + FP}$

(7)

where TP (True Positive) indicates the number of positive samples correctly detected and FP (False Positive) indicates the number of negative samples incorrectly detected as positive. A higher precision rate means that the model produces fewer false detections and the detection results are more reliable.
Recall
The recall rate measures the proportion of all actual positive class samples that are correctly detected and is defined as

$Recall = \frac{TP}{TP + FN}$

(8)

where FN (False Negative) denotes the number of positive samples that failed to be detected. A higher recall indicates that the model misses fewer objects in detecting them and has better detection ability.
Average Precision (AP)
In object detection, the average precision (AP) metric for a single category is usually obtained by plotting the Precision–Recall curve and calculating the area under the curve. For a category c, its AP is defined as

${AP}_{c} = \int_{0}^{1} p (r) d r$

(9)

where p(r) denotes the maximum precision at a recall of r. When calculating the AP, a threshold for determining whether the detection is correct or not needs to be set, which is usually based on the intersection over union ratio between the predicted frame and the true frame.
mAP50
mAP50 is the average value of AP for each category when the Intersection over Union (IoU) threshold is set to 0.5, defined as

$mAP 50 = \frac{1}{C} \sum_{c = 1}^{c} {AP}_{c}$

(10)

where C is the total number of categories. mAP50 is a relatively loose index, which is suitable for the preliminary measurement of model detection performance.
mAP50:95
In order to more comprehensively evaluate the performance of the model under different detection accuracy requirements, the COCO Challenge proposes a more stringent mAP@0.5:0.95 metric, i.e., with IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05, a total of 10 different IoU thresholds are set to compute the APs under each IoU separately, and then take the average:

$m A P 50 : 95 = \frac{1}{10} \sum_{I o U = 0.5}^{0.95} m A P @ I o U$

(11)

This index can comprehensively reflect the detection ability of the model under various matching criteria from loose to strict, with higher evaluation challenge and credibility, and is one of the main reference performance indexes in the current object detection field.

In addition, in order to comprehensively evaluate the practical application performance of the detection model, especially the deployment potential in resource-constrained environments, both the number of parameters (Para) and the amount of computation Giga Floating Point Operations (GFLOPs) are introduced in the experiment as evaluation metrics. The number of parameters directly reflects the storage overhead and complexity of the model. The amount of computation is usually expressed in GFLOPs, reflecting the computational overhead of the model in the inference stage. Lower GFLOPs imply faster inference and lower power consumption, which is especially suitable for applications in scenarios such as real-time detection, mobile deployment, and satellite-side inference.

4.2. Datasets

Several public datasets, such as MAR20 and UCAS-AOD [46], have greatly supported research on aircraft detection in remote sensing imagery. However, most of these datasets are based on Google Earth data (including aerial and aerospace images), where aircraft are larger and clearer, making detection less challenging. To more rigorously evaluate model performance, we not only test on these public benchmarks but also construct a new dataset using high-resolution Jilin-1 satellite imagery, which features smaller, denser aircraft and more complex scenes.

4.2.1. MAR20

The MAR20 dataset was proposed by Northwestern Polytechnical University in 2023, mainly for studying the object detection task of military aircraft in high-resolution remote sensing images. As shown in Figure 6, this dataset contains 3842 images, 20 military aircraft, and 22,341 instances, all of which have annotations in both horizontal bounding box and rotated bounding box. The MAR20 dataset contains a large number of aircraft with moderate background complexity and sparse object distribution. Its main application scenarios are aprons and surrounding airspace. However, the MAR20 dataset is mainly provided by Google Earth, which focuses on aerial data. Therefore, the resolution of objects is higher, the scale variation is smaller, the proportion of small objects is lower, and it is easier to achieve full coverage.

4.2.2. UCAS-AOD

UCAS-AOD is the dataset was first released by UCAS in 2014 and supplemented in 2015, designed for vehicle and aircraft detection missions. Among them, the aircraft detection dataset contains a total of 1000 images and about 7482 object instances.The image resolution of UCAS-AOD is 659 × 1280 or 941 × 1372, and the objects are labeled in the form of horizontal boxes. As shown in Figure 6, UCAS-AOD takes civil airport as the image background, and the objects are mainly gathered near the airport aprons, there are aircraft on the concrete floor, which is prone to misidentification problems, and the object scale varies greatly, which makes it difficult to detect them. However, the image size of UCAS-AOD is uneven and the data volume is small, making it difficult to directly train the model.

In summary, both MAR20 and UCAS-AOD are important assessment benchmarks for remote sensing aircraft detection missions and are suitable for RBB detection studies, but they still suffer from an insufficient number of small objects, a single scale distribution, and limited object density. Therefore, we construct a more challenging high-density small aircraft detection dataset on this basis to further promote the in-depth development of research in this field.

4.2.3. Jilin-1 Satellite Aircraft Detection Dataset

In order to evaluate the detection capability of different object detection networks for small aircraft targets in optical satellite remote sensing images, this paper constructed an aircraft detection dataset based on Jilin-1 satellite imagery. Unlike Google Earth data, the Jilin-1 satellite captures images from an altitude of approximately 500 km, with a ground spatial resolution of about 1 m.

During dataset construction, the raw satellite imagery was first cropped into uniform patches of 1024 × 1024 pixels to standardize input dimensions. As shown in Figure 7, scenes containing aircraft were manually selected, while images with excessive cloud cover or motion blur were excluded to ensure quality.

For images affected by cloud occlusion, a consistent annotation policy was adopted: when aircraft were only partially obscured by thin or semi-transparent clouds such that their contours or geometric structures remained discernible, the corresponding instances were still labeled using rotated bounding boxes based on visible cues and contextual information. Conversely, aircraft fully covered by opaque clouds with no identifiable structure were not annotated in order to avoid introducing label noise.

Each aircraft instance was annotated using the X-anylabeling tool with rotated bounding boxes, where the rotation angle was aligned with the aircraft’s longitudinal direction, and the object category was unified as plane. The dataset contains a total of 2503 images and 35,169 aircraft instances, among which 58.16% have pixel sizes smaller than 32 × 32, providing abundant small-object examples. More specifically, 415 instances fall within 0–16 pixels, 1513 within 16–20 pixels, and 15,477 within 20–32 pixels.

To facilitate model development while ensuring fair evaluation and preventing scene-level data leakage, the dataset was split into training, validation, and testing subsets in an 8:1:1 ratio. During partitioning, images belonging to the same airport scene were kept within the same subset to avoid unintended correlations across sets and to preserve the statistical independence needed for reliable performance assessment.

Compared with existing aircraft detection datasets, the proposed Jilin-1 dataset features smaller object sizes, higher target density, and more complex environmental conditions, such as cloud occlusion, low visibility, and cluttered backgrounds. The inclusion of annotated partially occluded aircraft further enhances the dataset’s realism and enables more comprehensive evaluation of model robustness under challenging atmospheric conditions. This makes it a valuable benchmark for verifying the robustness and generalization ability of object detection algorithms in satellite-based remote sensing scenarios.

4.3. Ablation Experiment

In order to comprehensively evaluate the effectiveness of the proposed modules in this paper in the remote sensing aircraft detection task, we conducted a systematic ablation experiment on Jilin-1 satellite aircraft detection dataset. Using YOLOv11 as the base model, this experiment verifies the effect of each module on the detection accuracy by gradually introducing the DW-MSA, ODConv, and GCAL, aiming at revealing the roles and advantages that each module plays in different scenarios. The detection results of SPOD-YOLO are shown in Figure 8, which includes all individual modules, pairwise combinations, and the configuration where all three modules are used together, which allows us to clearly observe their complementary and synergistic effects.

In the evaluation of inference efficiency, it is noted that the measured FPS exhibits inherent fluctuations across multiple runs, primarily due to variations in system load and GPU scheduling. Although the incorporation of DW-MSA, ODConv, and GCAL increases the computational load, the complete SPOD-YOLO model still maintains an inference speed of 29.9 FPS on an NVIDIA RTX 4090 GPU, compared with 34.2 FPS for the YOLOv11 baseline. These results indicate that SPOD-YOLO improve the robustness and precision of aircraft detection while preserving practical real-time applicability in satellite-based monitoring scenarios.

4.3.1. Effect of Different Modules on Extracted Features

As shown in Table 1, the addition of DW-MSA as well as ODConv improves the evaluation metrics, and this is most obvious in the Recall metrics, especially after the addition of the DW-MSA module, which proves the contribution of DW-MSA to the detection of small objects, allowing the model to effectively distinguish between the small objects and the background. To verify this conclusion, we visualize the feature maps in the Figure 9 for whether or not DW-MSA is added in the same layer as ODConv. The brighter the color, the more attention the model pays to the region. Compared to the features extracted by the original YOLOv11, since DW-MSA combines multi-layer features with rich local context features, the network shows good suppression of complex backgrounds while enhancing the feature representation of the object of interest.

These results confirm that DW-MSA and ODConv not only improve feature representation individually but also reinforce each other when used together, demonstrating a synergistic enhancement of small object discrimination and background suppression.

4.3.2. Role of GCAL on Angle Prediction

Table 1 shows the performance enhancement of the model after adding the GCAL function. The GCAL enhancement is mainly reflected in the metric of mAP50:95, due to the angle of the object is predicted more accurately, the IoU and concurrency ratio of the prediction results to the true value is larger, which makes the detection results still valid when the mAP threshold is higher. In Figure 10 we also show the difference in the detection outer frame of the aircraft in the visualized detection results after adding the GCAL loss. When the angle difference between the object orientation and the horizontal or vertical direction is large, the angular error of the RBB predicted by the model is more obvious. When the aspect ratio of the object is close to 1:1, the frequency of the prediction error is further increased, which can be avoided effectively after adding the GCAL loss, so that the predicted rotation frame is closer to the actual orientation of the object. Thanks to the effective supervision of the GCAL loss, the model is able to show better RBB prediction results.

To further investigate the influence of the Gaussian standard deviation parameter in GCAL on angular regression accuracy, we conducted a set of controlled experiments by varying

1 / σ

while keeping all other training settings unchanged. The results are summarized in Table 2. As

1 / σ

increases, the Precision metric exhibits a gradual upward trend, whereas the Recall metric shows mild fluctuations, indicating that an overly sharp angular penalty may cause the model to behave more conservatively, leading to a slight decline in the detection of marginal or low-visibility targets.

In terms of the comprehensive metric mAP50, the performance remains stable when

1 / σ = 3.3

and

1 / σ = 5

, demonstrating that these settings provide an appropriate balance between strict angular supervision and classification–localization robustness. More importantly, the mAP50:95 metric, which is highly sensitive to the accuracy of angle regression, reaches the highest value when

1 / σ = 5

. This indicates that, under this configuration, GCAL provides the most effective constraint on the rotated bounding box orientation, enabling the detector to generate geometrically tighter and more consistent predictions across a range of IoU thresholds.

By comparing Table 2 with the baseline L1 loss results reported in Table 1, it can be observed that GCAL obtains a 2.03% improvement in mAP50:95 despite only marginal changes in mAP50. This aligns with the qualitative observations in Figure 10, where GCAL substantially reduces angular jitters and misalignment errors, particularly for objects with near-square aspect ratios or large orientation deviations, demonstrating the necessity of GCAL.

4.4. Comparison Experiment

4.4.1. Experimental Analysis of MAR20

MAR20 is mainly oriented to RBB object detection in the airport area, the size of the aircraft is large, and as a single-category object detection task, the dataset has a low overall difficulty. The results of different methods on the MAR20 dataset are shown in Table 3, we compare our model with existing state-of-the-art methods of object detection in remote sensing images, and the figure shows the detection results of several methods with better quantization performance with SPOD-YOLO. From the results in Figure 11, we can find that SPOD-YOLO is able to maintain a high detection rate in different scenarios, both in the case of object clustering and similar background colors. At the same time, SPOD-YOLO can also effectively avoid the situation that similar backgrounds are mistakenly detected as objects. Compared with LEGNet, SPOD-YOLO improves 2.37% and 12.19% in mAP50 and mAP50:95, respectively, and the result proves that SPOD-YOLO has a better ability to predict the object orientation.

The experiment further analyzed the computational efficiency and model complexity of different algorithms. As shown in Table 3, although existing object detection algorithms show stable performance in detection accuracy, their computational cost and parameter size are relatively large, exceeding 200 GFLOPs and 30M parameters, making it difficult to meet the real-time inference requirements of spaceborne or edge devices. SPOD-YOLO achieves a significant improvement in accuracy while maintaining low computational overhead. Its computational cost is only 11.2 GFLOPs, with 3.77M parameters, reducing the computational cost by approximately 82.5% compared to LEGNet, while improving the mAP50:95 metric by 12.19%. This indicates that SPOD-YOLO achieves a better balance between complexity and detection accuracy, possessing higher detection efficiency and practical deployment feasibility.

4.4.2. Experimental Analysis of UCAS-AOD

The aircraft in the UCAS-AOD dataset have a large difference in scale, and the objects in different images have several times the difference in size. Meanwhile, the objects are more aggregated in the civil airport scenario, which is more suitable for testing the adaptability of the model to dense small objects. Figure 12 shows the detection results of different methods in different scenarios. Since the aircraft in the UCAS-AOD dataset are more aggregated near the airport towers, and the cross structure of the towers is more similar to that of the aircraft, which makes it easy for the existing methods to misdetect them as aircraft. Moreover, even if there are unlabeled valid objects in the labeling file, SPOD-YOLO can also detect the objects accurately. When object density increases, existing methods not only suffer from missed detections but also exhibit degraded rotated-bounding-box angle prediction, revealing clear limitations in their robustness. As shown in Table 4, LEGNet shows a better recall rate in different scenes and different sizes of objects, but also accompanied by some false detections. SPOD-YOLO is able to guarantee the detection rate of small objects and find objects of interest in remote sensing images while effectively reducing the false detection rate, which is also shown in the quantitative results, where the precision is improved by 1.33% compared with the existing optimal object detection method with similar recall.

4.4.3. Experimental Analysis of Jilin-1 Aircraft Detection Dataset

Jilin-1 satellite aircraft detection dataset contains a large number of objects with a high proportion of small objects, posing a dual challenge to both the robustness of small-object detection and the stability of angular regression. As shown in Figure 13, SPOD-YOLO demonstrates clear advantages across several common complex scenarios. In the first, second, and fourth rows, which correspond respectively to aircraft-like shadows formed after long parking, terminal buildings with aircraft-like geometric structures, and small objects embedded in backgrounds of high similarity, competing methods frequently exhibit false detections or missed detections. In contrast, SPOD-YOLO maintains strong discrimination capability, suppressing background interference while preserving the contours and orientations of true aircraft objects. In contrast, SPOD-YOLO maintains strong discrimination capability, suppressing background interference while preserving the contours and orientations of true aircraft objects.

More challenging cases arise in scenes with extremely dense aircraft distributions, as illustrated in the third row of Figure 13. Although SPOD-YOLO achieves a high recall rate, the rotated bounding boxes of adjacent aircraft may partially overlap, causing certain valid detections to be suppressed during the NMS process. This represents one of the typical failure cases for densely packed small-object scenarios.

The fifth row shows scenes heavily affected by cloud occlusion, where both cloud layers and cloud shadows significantly alter local spectral and texture patterns. SPOD-YOLO remains effective under thin or semi-transparent cloud cover by leveraging the DW-MSA module to enhance local structural cues. However, when aircraft are fully obscured by dense, opaque clouds and no visual structure is preserved, neither SPOD-YOLO nor competing methods can reliably detect the objects. This limitation reflects the inherent boundary conditions of optical remote sensing imagery rather than a flaw specific to the model.

In addition to accuracy, the inference efficiency comparison further confirms the practicality of the proposed approach. As reported in Table 5, SPOD-YOLO attains the highest performance among all methods while achieving an inference speed approximately 5 FPS higher than the other models (without YOLOv8n). This balance between accuracy and efficiency highlights the suitability of SPOD-YOLO for real-time monitoring applications in satellite-based scenarios.

Beyond the qualitative observations, a scale-specific quantitative analysis was further conducted on the Jilin-1 aircraft detection dataset. To enable a more fine-grained understanding of model behavior across different aircraft sizes, the test split was additionally reorganized into three size-oriented subsets. Aircraft were categorized into small, medium, and large targets according to the pixel area of their rotated bounding boxes, following the commonly used thresholds in remote-sensing detection studies: objects with area smaller than 32 × 32 pixels are defined as small, those between 32 × 32 and 64 × 64 as medium, and those larger than 96 × 96 as large. The three subsets were created by filtering the full test set based on these thresholds and therefore serve as size-focused evaluation partitions rather than exact decompositions of the full-test evaluation.

The corresponding results, summarized in Table 6, Table 7 and Table 8, show that SPOD-YOLO achieves the most significant improvements on small and medium aircraft, which constitute the majority of targets in this dataset. For small objects, SPOD-YOLO attains the highest precision and recall simultaneously, improving mAP50 and mAP50:95 by a large margin compared with the baseline YOLOv11 and other state-of-the-art detectors. This demonstrates the enhanced ability of model to extract weak structural cues, suppress background interference, and accurately localize densely distributed small aircraft, which are typically the most challenging category in optical remote sensing imagery.

For medium-sized targets, SPOD-YOLO also maintains consistently strong performance, outperforming all competing methods in both mAP50 and mAP50:95 while sustaining a balanced precision–recall trade-off. The improvement in angular regression accuracy brought by GCAL contributes significantly to this stability, enabling the model to retain high confidence even when objects exhibit notable orientation variations.

Although large aircraft are relatively easier to detect and the performance gap among methods becomes smaller, SPOD-YOLO still delivers competitive results and avoids the substantial recall degradation observed in several YOLO-based detectors. This suggests that the proposed enhancements, particularly DW-MSA and ODConv, do not compromise the representation of large objects and allow the model to maintain reliable detection across different scales. The overall balanced performance indicates that SPOD-YOLO effectively unifies small-object sensitivity with large-object robustness, rather than optimizing for only a single size range.

It is also important to note that the mAP values obtained on the three size-specific subsets are not expected to match or upper-bound the overall mAP on the full test set. Since the per-scale evaluations are conducted on filtered and size-biased subsets—rather than being mutually exclusive partitions of the test set—their mAP scores reflect performance within specialized distributions rather than the aggregate performance across all objects. Consequently, the overall mAP50 (98.26) is computed on the complete test split and is not constrained by the per-scale results reported in Table 6, Table 7 and Table 8.

Overall, as supported by the quantitative results in Table 5, SPOD-YOLO achieves improvements of 4.54% in mAP50 and 11.78% in mAP50:95 over LEGNet, demonstrating strong robustness across multiple challenging conditions while also revealing the natural limits imposed by the sensing modality itself.

5. Discussion

This study addresses three challenges in aircraft detection from optical remote sensing imagery: the difficulty of identifying small aircraft targets, the interference caused by complex backgrounds, and the instability of rotated bounding-box regression. The proposed SPOD-YOLO integrates three components, DW-MSA, ODConv, and GCAL to strengthen multi-scale representation learning, foreground–background discrimination, and geometric modeling of oriented targets.

Experimental results on multiple benchmarks, particularly the newly constructed Jilin-1 aircraft detection dataset, demonstrate that SPOD-YOLO achieves notable improvements in several critical aspects. First, DW-MSA effectively enhances the representation of small and densely distributed aircraft by combining shallow contour cues and deep semantic features within overlapping local attention windows. This leads to significant improvements in recall and contour fidelity under dense-object conditions. Second, ODConv provides stronger adaptability to scene variations by dynamically adjusting kernel weights across spatial, channel, and kernel dimensions, enabling the detector to better suppress background clutter such as runway markings, shadows, and airport structures with aircraft-like geometry. Third, the GCAL loss substantially reduces discontinuities in rotated box regression, resulting in smoother angle estimation and higher mAP50:95 scores in multi-orientation detection scenarios.

Beyond accuracy, SPOD-YOLO maintains competitive computational efficiency. This balance of precision and efficiency highlights the practical value of the method in real-world monitoring systems.

6. Conclusions

This paper presents SPOD-YOLO, a lightweight yet high-precision aircraft detection framework tailored for remote sensing imagery. By incorporating DW-MSA, ODConv, and GCAL loss function, the method significantly enhances multi-scale feature fusion, dynamic feature modeling, and rotated bounding box regression. Extensive evaluations on public datasets and the proposed Jilin-1 aircraft detection dataset confirm that SPOD-YOLO delivers substantial improvements in small-object recognition, orientation stability, and robustness against background complexity. Notably, the model achieves 98.26% mAP50 and 80.81% mAP50:95 on the Jilin-1 dataset, surpassing existing state-of-the-art approaches while maintaining efficient inference.

However, object detection methods based on single-source data may reduce the ability of model to distinguish object boundaries when dealing with camouflage patterns or near background tones, which may result in missed detections or positioning offsets, making it difficult to achieve high-precision detection. Future research will further introduce multimodal data (eg., IR, SAR) and semantic-level background suppression mechanisms to enhance the robustness of the model in complex scenes.

Author Contributions

Methodology, J.C.; Software, J.C.; Validation, J.C.; Formal analysis, J.C. and L.W.; Investigation, J.C. and W.H.; Resources, Y.L.; Data curation, J.C., P.G., D.R. and W.H.; Writing—original draft, J.C. and W.H.; Writing—review & editing, J.C., P.G., Y.L. and L.W.; Visualization, J.C.; Supervision, P.G., Y.L., L.C., D.R. and K.W.; Project administration, Y.L.; Funding acquisition, P.G. and L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, Grant Number 62301590.

Data Availability Statement

The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhang, B.; Wu, Y.; Zhao, B.; Chanussot, J.; Hong, D.; Yao, J.; Gao, L. Progress and Challenges in Intelligent Remote Sensing Satellite Systems. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1814–1822. [Google Scholar]
Zhao, Q.; Yu, L.; Du, Z.; Peng, D.; Hao, P.; Zhang, Y.; Gong, P. An overview of the applications of earth observation satellite data: Impacts and future trends. Remote Sens. 2022, 14, 1863. [Google Scholar]
Sun, E.; Cui, Y.; Liu, P.; Yan, J. A decade of deep learning for remote sensing spatiotemporal fusion: Advances, challenges, and opportunities. arXiv 2025, arXiv:2504.00901. [Google Scholar] [CrossRef]
Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar]
Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
Yasir, M.; Wan, J.; Liu, S.; Sheng, H.; Xu, M.; Hossain, M. Coupling of deep learning and remote sensing: A comprehensive systematic literature review. Int. J. Remote Sens. 2023, 44, 157–193. [Google Scholar] [CrossRef]
Guo, P.; Celik, T.; Liu, N.; Li, H.C. Break through the border restriction of horizontal bounding box for arbitrary-oriented ship detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4005505. [Google Scholar] [CrossRef]
Wang, K.; Wang, Z.; Li, Z.; Su, A.; Teng, X.; Liu, M.; Yu, Q. Oriented object detection in optical remote sensing images using deep learning: A survey. arXiv 2023, arXiv:2302.10473. [Google Scholar]
Qian, W.; Yang, X.; Peng, S.; Yan, J.; Guo, Y. Learning modulated loss for rotated object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 2458–2466. [Google Scholar]
Xu, H.; Liu, X.; Xu, H.; Ma, Y.; Zhu, Z.; Yan, C.; Dai, F. Rethinking boundary discontinuity problem for oriented object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17406–17415. [Google Scholar]
Wang, X.; Wang, A.; Yi, J.; Song, Y.; Chehri, A. Small object detection based on deep learning for remote sensing: A comprehensive review. Remote Sens. 2023, 15, 3265. [Google Scholar]
Zhang, X.; Zhang, T.; Wang, G.; Zhu, P.; Tang, X.; Jia, X.; Jiao, L. Remote sensing object detection meets deep learning: A metareview of challenges and advances. IEEE Geosci. Remote Sens. Mag. 2023, 11, 8–44. [Google Scholar]
Yu, W.; Cheng, G.; Wang, M.; Yao, Y.; Xie, X.; Yao, X.; Han, J. MAR20: A benchmark for military aircraft recognition in remote sensing images. Natl. Remote Sens. Bull. 2024, 27, 2688–2696. [Google Scholar] [CrossRef]
Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017; SciTePress: Setúbal, Portugal, 2017; Volume 2, pp. 324–331. [Google Scholar]
Trigka, M.; Dritsas, E. A comprehensive survey of machine learning techniques and models for object detection. Sensors 2025, 25, 214. [Google Scholar] [CrossRef]
Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645. [Google Scholar] [CrossRef] [PubMed]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 9197–9206. [Google Scholar]
Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
Ali, M.L.; Zhang, Z. The YOLO framework: A comprehensive review of evolution, applications, and benchmarks in object detection. Computers 2024, 13, 336. [Google Scholar] [CrossRef]
Yuan, X.; Chakravarty, A.; Gu, L.; Wei, Z.; Lichtenberg, E.; Chen, T. An Empirical Study of Methods for Small Object Detection from Satellite Imagery. arXiv 2025, arXiv:2502.03674. [Google Scholar] [CrossRef]
Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327. [Google Scholar] [CrossRef]
Mei, S.; Lian, J.; Wang, X.; Su, Y.; Ma, M.; Chau, L.P. A comprehensive study on the robustness of deep learning-based image classification and object detection in remote sensing: Surveying and benchmarking. J. Remote Sens. 2024, 4, 0219. [Google Scholar] [CrossRef]
Wei, X.; Li, Z.; Wang, Y. SED-YOLO based multi-scale attention for small object detection in remote sensing. Sci. Rep. 2025, 15, 3125. [Google Scholar] [CrossRef]
Chen, S.; Chen, J.; Shi, M.; Lin, S.; Zhang, W.; Liu, X. Research on anti-breakout system of power optical fiber communication network based on remote sensing image target detection. In Proceedings of the International Conference on Computer Vision and Image Processing (CVIP 2024), Chennai, India, 20–22 December 2025; SPIE: Bellingham, WA, USA, 2025; Volume 13521, pp. 144–152. [Google Scholar]
Wan, M.; Zhong, G.; Wu, Q.; Zhao, X.; Lin, Y.; Lu, Y. CR-Mask RCNN: An Improved Mask RCNN Method for Airport Runway Detection and Segmentation in Remote Sensing Images. Sensors 2025, 25, 657. [Google Scholar] [CrossRef]
Zhang, X.; Wang, C.; Yang, X. Enhanced Multiscale Vision Transformer with Cascaded Feature Fusion for Efficient Object Detection in Remote Sensing Images. J. Circuits Syst. Comput. 2025, 34, 2550197. [Google Scholar] [CrossRef]
Qiang, H.; Hao, W.; Xie, M.; Tang, Q.; Shi, H.; Zhao, Y.; Han, X. SCM-YOLO for Lightweight Small Object Detection in Remote Sensing Images. Remote Sens. 2025, 17, 249. [Google Scholar] [CrossRef]
Li, T.; Gang, Y.; Li, S.; Shang, Y. A small underwater object detection model with enhanced feature extraction and fusion. Sci. Rep. 2025, 15, 2396. [Google Scholar] [CrossRef]
Wang, Y.; Li, Z.; Zhu, S.; Wei, X. EFCNet for small object detection in remote sensing images. Sci. Rep. 2025, 15, 20393. [Google Scholar] [CrossRef]
Wang, J.; Bai, Z.; Zhang, X.; Qiu, Y. A lightweight remote sensing aircraft object detection network based on improved yolov5n. Remote Sens. 2024, 16, 857. [Google Scholar] [CrossRef]
Wang, W.; Cai, Y.; Luo, Z.; Liu, W.; Wang, T.; Li, Z. SA3Det: Detecting rotated objects via pixel-level attention and adaptive labels assignment. Remote Sens. 2024, 16, 2496. [Google Scholar] [CrossRef]
Zhao, Y.; Yang, T.; Wang, S.; Su, H.; Sun, H. Adaptive dual-domain dynamic interactive network for oriented object detection in remote sensing images. Remote Sens. 2025, 17, 950. [Google Scholar] [CrossRef]
Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 8232–8241. [Google Scholar]
Zhou, Q.; Yu, C. Point RCNN: An angle-free framework for rotated object detection. Remote Sens. 2022, 14, 2605. [Google Scholar] [CrossRef]
Zhao, P.; Qu, Z.; Bu, Y.; Tan, W.; Guan, Q. Polardet: A fast, more precise detector for rotated target in aerial images. Int. J. Remote Sens. 2021, 42, 5831–5861. [Google Scholar] [CrossRef]
Yang, X.; Mohamed, A.S.A. Gaussian-based R-CNN with large selective kernel for rotated object detection in remote sensing images. Neurocomputing 2025, 620, 129248. [Google Scholar] [CrossRef]
Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Li, C.; Zhou, A.; Yao, A. Omni-dimensional dynamic convolution. arXiv 2022, arXiv:2209.07947. [Google Scholar] [CrossRef]
Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 3735–3739. [Google Scholar]
Yu, X.; Lu, J.; Lin, M.; Zhou, L.; Ou, L. MKIoU loss: Toward accurate oriented object detection in aerial images. J. Electron. Imaging 2023, 32, 033030. [Google Scholar] [CrossRef]
Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16794–16805. [Google Scholar]
Lu, W.; Chen, S.B.; Li, H.D.; Shu, Q.L.; Ding, C.H.; Tang, J.; Luo, B. Legnet: Lightweight edge-Gaussian driven network for low-quality remote sensing image object detection. arXiv 2025, arXiv:2503.14012. [Google Scholar]

Figure 1. The overall framework diagram of object detection network.

Figure 2. Structure of Dense Window Multi-head Self-Attention Module.

Figure 3. Schematic of Omni-Dimensional Dynamic Convolution (* denotes matrix multiplication.).

Figure 4. Angle error obtained from different types of loss functions.

Figure 5. Gaussian Distribution with Different 1/

σ

.

Figure 5. Gaussian Distribution with Different 1/

σ

.

Figure 6. Display of existing aircraft detection datasets.

Figure 7. Display of Jilin-1 satellite aircraft detection dataset.

Figure 8. Qualitative comparison between ground truth and detection results of SPOD-YOLO. (Top row: ground truth; Bottom row: SPOD-YOLO results. The figure covers multiple representative scenarios, including cloud occlusion, dense and aligned aircraft, aircraft near terminal buildings, and regular airport scenes).

Figure 9. Feature map visualization under different module configurations in SPOD-YOLO.

Figure 10. Visual comparison of rotation regression results before and after applying GCAL (Top: with L1 loss; Bottom: with GCAL).

Figure 11. Comparison of detection results on MAR20. (The five rows illustrate: objects with colors similar to the background, images with high exposure intensity, very small aircraft objects, densely arranged aircraft, and aircraft with shadow artifacts formed by long-time parking).

Figure 12. Comparison of detection results on UCAS-AOD. (The five rows illustrate: scenes containing background structures visually similar to aircraft, cases with missing annotations in the original dataset, densely distributed aircraft in complex layouts, airport terminal structures easily confused with aircraft, and cases where overlapping ground-truth boxes occur due to tight aircraft spacing).

Figure 13. Comparison of detection results on Jilin-1 aircraft detection dataset. (The five rows illustrate: aircraft shadows produced by long parking, aircraft located close to terminal buildings, small and densely distributed aircraft, aircraft embedded in background regions with high visual similarity, and aircraft partially obscured by cloud cover).

Table 1. Results of Ablation Experiment (DW = DW-MSA, OD = ODConv, P = Precision ↑, R = Recall ↑, mAP50 ↑, mAP50:95 ↑, GFLOPs ↓, Para ↓).

DW	OD	GCAL	P	R	mAP50	mAP50:95	GFLOPs	Para	FPS
×	×	×	90.65	91.00	95.18	75.33	6.7	2.53	34.2
✓	×	×	92.57	94.32	97.47	78.81	11.3	3.79	29.3
×	✓	×	92.65	92.81	96.93	77.67	6.4	2.80	33.7
×	×	✓	91.84	92.57	95.50	76.33	6.7	2.53	32.6
✓	✓	×	93.12	93.34	97.40	78.78	11.2	3.77	29.2
✓	×	✓	94.34	94.59	97.98	79.04	11.3	3.79	30.0
×	✓	✓	92.97	93.17	97.29	77.97	6.4	2.80	32.6
✓	✓	✓	93.67	95.69	98.26	80.81	11.2	3.77	29.9

Notes: ↑ indicates that a higher value represents better performance; ↓ indicates that a lower value represents better efficiency or compactness.

Table 2. Performance comparison under different values of

1 / σ

in GCAL.

Table 2. Performance comparison under different values of

1 / σ

in GCAL.

$1 / σ$	P	R	mAP50	mAP50:95
2.5	92.81	95.55	98.16	80.08
3.3	93.43	95.02	98.26	80.72
5	93.28	96.25	98.26	80.81
10	93.57	95.27	98.08	80.00
20	93.84	94.43	98.03	79.96

Table 3. Performance Comparison on MAR20.

Method	P	R	mAP50	mAP50:95	GFLOPs	Para	FPS
Rotated_RetinaNet [47]	94.76	94.28	96.19	65.45	209.58	36.13	27.2
R3Det [48]	91.23	98.39	98.27	73.54	225.63	36.86	22.8
LSKNet [49]	99.36	94.51	94.46	75.21	173.66	30.96	25.0
LEGNet [50]	99.01	97.04	97.03	76.53	148.83	20.64	24.2
YOLOv8n	97.43	94.94	98.17	81.99	8.3	2.9	31.1
SPOD-YOLO (Ours)	99.05	99.37	99.41	88.72	11.2	3.77	30.4

Table 4. Performance Comparison on UCAS-AOD.

Method	P	R	mAP50	mAP50:95	GFLOPs	Para	FPS
Rotated_RetinaNet	81.65	93.81	93.29	60.22	209.58	36.13	26.0
R3Det	51.16	97.73	96.90	64.51	225.63	36.86	23.5
LSKNet	95.10	98.11	97.59	71.42	173.66	30.96	24.2
LEGNet	93.07	98.36	97.77	73.86	148.83	20.64	23.8
YOLOv8n	96.07	95.22	97.64	81.57	8.3	2.93	33.6
SPOD-YOLO (Ours)	96.43	98.08	98.52	85.49	11.2	3.77	29.3

Table 5. Performance Comparison on Jilin-1 Aircraft Detection Dataset.

Method	P	R	mAP50	mAP50:95	GFLOPs	Para	FPS
Rotated_RetinaNet	27.07	86.76	81.01	40.51	209.58	36.13	25.1
R3Det	15.30	89.10	85.86	47.41	225.63	36.86	23.3
LSKNet	77.38	94.11	93.07	66.80	173.66	30.96	24.3
LEGNet	72.34	94.64	93.72	69.03	148.83	20.64	23.5
YOLOv8n	91.18	90.55	93.63	68.93	8.3	2.93	32.4
SPOD-YOLO (Ours)	93.67	95.69	98.26	80.81	11.2	3.77	29.9

Table 6. Detection performance for small-size aircraft on the Jilin-1 dataset.

Method	Nums	P	R	mAP50	mAP50:95
GT	1606	–	–	–	–
Rotated_RetinaNet	215	68.37	9.15	6.77	2.70
R3Det	1349	80.95	68.00	57.53	28.82
LSKNet	1434	81.21	87.48	75.36	43.94
LEGNet	1062	79.19	52.37	46.07	19.41
YOLOv8	1944	72.42	81.10	65.73	34.11
YOLOv11 (baseline)	1730	72.48	87.73	71.87	38.21
SPOD-YOLO (Ours)	1732	82.45	90.10	80.28	48.37

Table 7. Detection performance for medium-size aircraft on the Jilin-1 dataset.

Method	Nums	P	R	mAP50	mAP50:95
GT	983	–	–	–	–
Rotated_RetinaNet	903	72.54	66.63	57.75	34.59
R3Det	929	75.58	74.26	65.95	40.07
LSKNet	1159	72.13	85.05	66.18	42.81
LEGNet	1279	66.22	86.16	61.89	36.89
YOLOv8	972	72.94	72.13	59.61	34.73
YOLOv11 (baseline)	989	76.14	76.60	62.65	41.63
SPOD-YOLO (Ours)	1032	76.68	80.06	66.94	45.65

Table 8. Detection performance for large-size aircraft on the Jilin-1 dataset.

Method	Nums	P	R	mAP50	mAP50:95
GT	265	–	–	–	–
Rotated_RetinaNet	209	88.04	69.43	64.98	39.42
R3Det	241	87.97	77.74	71.63	46.03
LSKNet	277	86.28	90.19	83.07	58.32
LEGNet	284	84.86	90.94	81.42	55.28
YOLOv8	169	88.17	56.23	53.68	29.87
YOLOv11 (baseline)	200	87.50	66.04	62.89	36.13
SPOD-YOLO (Ours)	233	88.41	80.00	71.92	50.16

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, J.; Guo, P.; Liu, Y.; Cao, L.; Ran, D.; Wang, K.; Hu, W.; Wan, L. SPOD-YOLO: A Modular Approach for Small and Oriented Aircraft Detection in Satellite Remote Sensing Imagery. Remote Sens. 2025, 17, 3963. https://doi.org/10.3390/rs17243963

AMA Style

Chen J, Guo P, Liu Y, Cao L, Ran D, Wang K, Hu W, Wan L. SPOD-YOLO: A Modular Approach for Small and Oriented Aircraft Detection in Satellite Remote Sensing Imagery. Remote Sensing. 2025; 17(24):3963. https://doi.org/10.3390/rs17243963

Chicago/Turabian Style

Chen, Jiajian, Pengyu Guo, Yong Liu, Lu Cao, Dechao Ran, Kai Wang, Wei Hu, and Liyang Wan. 2025. "SPOD-YOLO: A Modular Approach for Small and Oriented Aircraft Detection in Satellite Remote Sensing Imagery" Remote Sensing 17, no. 24: 3963. https://doi.org/10.3390/rs17243963

APA Style

Chen, J., Guo, P., Liu, Y., Cao, L., Ran, D., Wang, K., Hu, W., & Wan, L. (2025). SPOD-YOLO: A Modular Approach for Small and Oriented Aircraft Detection in Satellite Remote Sensing Imagery. Remote Sensing, 17(24), 3963. https://doi.org/10.3390/rs17243963

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

SPOD-YOLO: A Modular Approach for Small and Oriented Aircraft Detection in Satellite Remote Sensing Imagery

Highlights

Abstract

1. Introduction

2. Related Work

2.1. Object Detection for Remote Sensing Images

2.2. Aircraft Detection in Remote Sensing Images

3. Methods

3.1. General Overview

3.2. Dense Window Multi-Head Self-Attention

3.3. Omni-Dimensional Dynamic Convolution

3.4. Gaussian–Cosine-Based Angle-Aware Loss Function

4. Experiments

4.1. Experimental Environment and Evaluation Metrics

4.2. Datasets

4.2.1. MAR20

4.2.2. UCAS-AOD

4.2.3. Jilin-1 Satellite Aircraft Detection Dataset

4.3. Ablation Experiment

4.3.1. Effect of Different Modules on Extracted Features

4.3.2. Role of GCAL on Angle Prediction

4.4. Comparison Experiment

4.4.1. Experimental Analysis of MAR20

4.4.2. Experimental Analysis of UCAS-AOD

4.4.3. Experimental Analysis of Jilin-1 Aircraft Detection Dataset

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI