1. Introduction
Against the backdrop of the deep integration of Information and Communication Technology (ICT) and the industrial internet, intelligent object perception in optical remote sensing images has become a core driver of digital transformation in the manufacturing industry. With the rapid development of satellite technology [1] and the widespread application of drones [2], high-resolution optical remote sensing imaging based on an integrated space-air-ground perception network has been widely applied in industrial scenarios such as plant inspection, equipment monitoring, logistics scheduling, and security management. However, optical remote sensing images are characterized by complex backgrounds and significant scale variations. Typical targets such as aircraft, vehicles, and ships often occupy only dozens of pixels, or even just a few [3,4]. These targets exhibit extremely limited features and face inherent challenges including insufficient spatial resolution and severe background interference. Especially in densely distributed scenes, tiny objects are prone to feature confusion, causing traditional detection models to lose critical texture and contour information during the feature extraction stage. Furthermore, traditional methods rely on complex preprocessing and post-processing algorithms (e.g., Non-Maximum Suppression), which increase the burden on model design and computational resources. Therefore, it is of great significance to conduct an in-depth analysis of the development history and limitations of existing detection methods.
Optical remote sensing image object detection technology has evolved from traditional handcrafted feature methods to deep learning approaches. Early detection systems primarily relied on handcrafted features such as HOG and SIFT, which were combined with sliding window strategies for object recognition. However, they suffered from insufficient multi-scale adaptability and vulnerability to background interference. When object textures resembled the background or were occluded, traditional methods were prone to false positives and missed detections. More critically, key targets such as vehicles and ships in optical remote sensing images only occupy a minimal number of pixels. After multiple downsampling operations, their feature information is almost entirely lost, leading to a substantial decrease in detection rates. These issues severely constrain the practical application of optical remote sensing object detection technologies. With the rise of deep learning, object detection methods based on convolutional neural networks (CNNs) have gradually become mainstream. Two-stage detectors such as Faster R-CNN [5] generate candidate regions through Region Proposal Networks (RPNs) and then perform classification and regression, achieving excellent accuracy. However, they suffer from high computational complexity and limited capability in detecting small targets. Single-stage detectors such as the YOLO [6] series and SSD [7] directly predict object locations and categories on feature maps, offering fast detection speeds suitable for real-time scenarios. Nevertheless, they typically rely on complex preprocessing and post-processing algorithms, which consequently affect their overall performance.
In recent years, DETR (Detection Transformer) [8] has attracted significant attention due to its encoder–decoder structure and global self-attention mechanism. Unlike the YOLO series, DETR directly predicts bounding boxes through a set of learnable object queries, enabling end-to-end training and avoiding the issue of redundant bounding boxes. However, DETR still faces several challenges in optical remote sensing applications, such as slow training convergence, high computational costs, low efficiency in processing high-resolution images, and computational complexity that increases quadratically with the number of image tokens. To address these issues, RT-DETR (Real-Time DETR) [9] was proposed. It significantly improves detection speed and accuracy while maintaining the end-to-end advantages of DETR through multi-scale feature fusion and a hybrid encoder architecture. Its core innovations include adopting a lightweight Feature Pyramid Network (FPN) [10] to replace single-scale feature extraction, designing a dynamic sparse attention mechanism to reduce computational complexity, and introducing a task-specific query initialization strategy to accelerate convergence. These improvements enable RT-DETR to achieve state-of-the-art accuracy on MS COCO [11] while maintaining real-time inference speed.
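To make the quadratic growth of attention cost concrete, the short sketch below counts the tokens that a self-attention layer would process at several feature-map strides, together with the size of the resulting attention matrix. The 640 × 640 input and the strides 8, 16, and 32 are illustrative assumptions, and the function name is hypothetical.

```python
def attention_cost(height, width, stride):
    """Return (token_count, attention_matrix_entries) for one feature map.

    Each spatial position of the stride-reduced map is one token, and
    self-attention compares every token with every other token.
    """
    tokens = (height // stride) * (width // stride)
    return tokens, tokens * tokens


if __name__ == "__main__":
    for stride in (32, 16, 8):  # assumed backbone strides, coarse to fine
        tokens, entries = attention_cost(640, 640, stride)
        print(f"stride {stride:2d}: {tokens:5d} tokens, {entries:,} attention entries")
```

Halving the stride quadruples the token count and multiplies the attention matrix by sixteen, which illustrates why global attention over high-resolution feature maps quickly becomes expensive.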
Although RT-DETR has improved its detection performance for conventional targets through multi-scale feature fusion enhancements, it still falls short in addressing the unique challenges of small object detection in optical remote sensing images. The limitation lies in the fact that the upsampling operations in the feature pyramid fail to effectively reconstruct the detailed features of small objects while overly aggressive downsampling strategies cause pixel-level information of tiny targets to be largely lost at early stages. This makes it difficult for subsequent fusion modules to integrate sufficiently effective multi-scale information. Additionally, when processing high-resolution images, the global attention mechanism of Transformers tends to severely dilute the already weak features of small objects amidst the vast amount of background information. Furthermore, due to the inadequate representation of small objects in the feature pyramid, the limited query vectors are more inclined to match prominent medium and large targets, leading to missed detections when faced with densely distributed small objects. These limitations at the level of feature scale transformation and fusion collectively constrain the performance of RT-DETR in small object detection in optical remote sensing images.
To address the aforementioned issues, this study proposes ORSSO-DETR (Optical Remote Sensing Small Object Detection Transformer). The model is based on RT-DETR-R18 and involves a redesign of its efficient hybrid encoder. The performance of ORSSO-DETR is evaluated on the public optical remote sensing image datasets NWPU VHR-10 and RSOD across various application scenarios, including but not limited to the detection of typical object types such as aircraft, ships, and vehicles. The specific improvements are as follows:
- (1) To enhance the detection performance for densely distributed small targets, we introduce a combination of the C3K2 and FCM modules to improve the integration of high-level features into low-level features.
- (2) To effectively mitigate feature loss for multi-scale targets while maintaining computational efficiency, we reconstruct the downsampling module based on dynamic convolution and design an input-adaptive dimensionality reduction strategy. Meanwhile, the EUCB module is adopted to better preserve high-frequency texture details.
- (3) To improve feature expression capability and semantic alignment, we optimize the hybrid encoder structure by introducing a 1 × 1 convolution to enhance high-level features before cross-scale feature fusion.
- (4) To enhance the model’s nonlinear expressiveness and its sensitivity to weak features under complex backgrounds, we adopt the DyT module to replace the original normalization layer in the Transformer.
2. Literature Review
To address the problems of cluttered backgrounds and significant scale variations in optical remote sensing images, attention mechanisms and multi-scale feature fusion strategies have been widely adopted to improve detection performance. Shi et al. [12] proposed a small object detection method employing an adaptive multi-level feature fusion module (AMFFM) and an attention-augmented high-resolution head (AAHRH). AMFFM upsamples high-level features through semantic context modeling and refines low-level features for noise removal, then fuses the enhanced multi-level features based on spatial and channel significance. AAHRH enhances the perception of small objects by embedding cross-dimensional interaction with the attention mechanism, achieving state-of-the-art performance on multiple datasets. Li et al. [13] addressed the challenges of cluttered backgrounds and large-scale variations in ship detection by proposing a dual attention and scale-aware feature alignment network. This network comprises a deformable spatial attention module with channel integration and a bidirectional flow alignment network, which enhance discrimination between ships and distractors and resolve spatial misalignment in feature fusion, respectively. A dynamic alpha complete intersection over union loss based on target prior knowledge further refines the detector for maritime ships. Ji et al. [14] proposed EFR-ACENet based on explicit feature reconstruction and adaptive context enhancement. The explicit feature reconstruction module preserves detailed information of small objects, while the adaptive context enhancement module integrates contextual features from different receptive fields.
Small objects in remote sensing images commonly suffer from weak textures, scale variations, and dense arrangements. Relying solely on feature pyramids is insufficient to fully exploit contextual information. Xu et al. [15] proposed a prior-guided context fusion network with three novel components. The prior-guided context fusion module integrates multi-scale features and applies prior-guided dynamic channel weighting to address weak textures, the depthwise aggregator refines feature aggregation using dilated convolutions and dynamic feature adjustment for precise multi-scale detection in densely packed small-object environments, and the prior-guided small-object detector leverages prior knowledge to reduce background interference. Wang et al. [16] proposed a position-guided dynamic receptive field network that establishes positional guidance relationships for small objects across different feature layers, preventing small objects from vanishing or being submerged in features. A combined head structure utilizes additional supervised information extracted from small objects, and a dynamic perception algorithm based on feature construction optimizes receptive fields and feature hierarchies.
Beyond dedicated designs for small objects, researchers have also explored holistic network architectures better suited to the characteristics of remote sensing images. Yan et al. [17] proposed an adaptive semantic network that innovatively integrates Transformer and CNN technologies in a dual-branch encoder, capturing both global dependencies and local fine-grained details. The network also features an adaptive semantic matching module, an adaptive feature enhancement module, and a multi-scale fine-grained inference module, achieving strong performance in optical remote sensing image detection. Lee et al. [18] proposed LSHNet, a dual-branch architecture consisting of an edge encoder that leverages structure features using edges as a prior structure and an image encoder that extracts context features. Through image-structure fusion, local-global feature fusion, and semantic cue updating modules, LSHNet effectively integrates hierarchical information, improving boundary clarity and background clutter suppression. Sun et al. [19] built an end-to-end channel-enhanced remodeling network. A channel enhance module enhances shallow features using channel attention with a no-downscaling strategy, while a redefined feature module reconstructs deep features and generates global attention features via dimensional transformation and feature relationship aggregation. Multi-scale features are cascaded to produce the final output. Zeng et al. [20] introduced a conditional diffusion transformer network, designing a Transformer-based progressive cross-stage fusion decoding unit, a patch strategy, and an encoder feature enhancement module. Diffusion-guided feature learning improves detection accuracy. Fang et al. [21] proposed an interactive edge awareness network with an interactive edge awareness module that integrates a three-branch feature interaction mechanism and an average-max pooling difference strategy, adaptively enhancing edge representations without explicit edge supervision. A context attention guidance module and a semantic consistency preservation module suppress background clutter and alleviate semantic degradation, achieving excellent performance across multiple evaluation metrics.
Although the above work has advanced the development of remote sensing image object detection, challenges such as complex background interference, dense multi-scale object distributions, and severe loss of small object details in optical remote sensing images remain fundamentally unresolved. The upsampling in conventional feature pyramids merely magnifies resolution mechanically, failing to reconstruct the texture and contour information lost by small objects during downsampling, leaving the fused features insufficiently discriminative for tiny objects. Conventional downsampling compresses features in a fixed pattern, irreversibly erasing the spatial information of small objects occupying merely tens or even a few pixels at the early stages of feature extraction. Moreover, traditional normalization layers in Transformers struggle to capture the fine-grained feature differences between complex backgrounds and small objects, limiting the model’s sensitivity to weak targets. These structural deficiencies collectively constrain further improvements in small object detection accuracy.
3. Materials and Methods
3.1. RT-DETR Network Structure
RT-DETR (Real-Time Detection Transformer) is a real-time object detection model based on the Transformer architecture that was proposed by Baidu. Its core innovation lies in adopting an end-to-end detection paradigm, completely eliminating the NMS (Non-Maximum Suppression) post-processing step relied upon by traditional YOLO series models. The RT-DETR model primarily consists of four core components: the backbone network, hybrid encoder, uncertainty-minimal query selection, and decoder. The backbone network extracts multi-scale feature maps from the input image, generating three feature maps of different sizes (S3, S4, S5), thereby providing a rich foundation of visual information for subsequent encoder processing. The hybrid encoder plays a key role in multi-scale feature fusion and interaction. It uses the Transformer encoder to model global contextual information through intra-scale feature interaction and integrates features of different scales from the backbone network through cross-scale fusion. The uncertainty-minimal query selection optimizes the input to the decoder by selecting the encoder features with the lowest uncertainty as initial object queries, thereby accelerating model convergence and improving detection accuracy. The decoder receives the object queries and the features output by the encoder, gradually refines the query vectors through a cross-attention mechanism, and directly outputs the final detection predictions. RT-DETR offers multiple model versions of varying scales, including RT-DETR-R18, RT-DETR-R34, RT-DETR-R50, and RT-DETR-R101, all based on the ResNet backbone network. Parameter count, computational complexity, and detection accuracy all increase progressively across these versions. Among them, RT-DETR-R18 has the fewest parameters, the lowest computational cost, and the fastest training and inference speed.
Although this method achieves excellent performance in general object detection tasks, it still exhibits significant limitations in the context of small object detection in optical remote sensing imagery. Optical remote sensing images typically feature complex backgrounds and contain numerous objects that are extremely small in scale, densely distributed, and highly susceptible to interference. Existing methods often struggle to adequately preserve and effectively exploit the discriminative features of such tiny objects, resulting in insufficient detection sensitivity and suboptimal localization accuracy for small targets. Motivated by these challenges, this paper presents a dedicated study to address them.
3.2. ORSSO-DETR Network Structure
In this paper, RT-DETR-R18 is selected as the baseline model, and improvements are made from three perspectives: feature alignment, sampling fidelity, and nonlinear enhancement. To address the problem of small object location information being lost in deep networks, the C3K2 module [22] is introduced to replace the original Fusion module. Compared with the original Fusion module, C3K2 achieves more lightweight and efficient multi-scale feature extraction. Meanwhile, Feature Complementary Mapping (FCM) [23] is introduced. FCM actively integrates shallow spatial location information into deep semantics, directly compensating for the location information loss caused by downsampling in the backbone network. Their synergy enhances the alignment between spatial location information and deep semantic features, thereby improving the localization capability for small objects. To address the issues of insufficient detail reconstruction during upsampling and overly aggressive downsampling that causes small object pixels to disappear, the Efficient Upsampling Convolutional Block (EUCB) [24] is adopted to replace the original upsampling, and a dynamic-convolution-based downsampling module (DynamicConv DownSample) [25] is designed. EUCB preserves high-frequency texture details through its “interpolation–feature extraction–channel interaction” pipeline. Dynamic convolution downsampling is chosen because fixed convolution kernels cannot adapt to the drastic scale variations of targets in remote sensing images, whereas dynamic convolution adaptively adjusts the receptive field according to the input content, preventing small objects from vanishing completely in deep feature maps. To address the insufficient nonlinear expressiveness of normalization layers under complex backgrounds and the dilution of weak features, the standard normalization layer in the Attention-based Intra-scale Feature Interaction (AIFI) is replaced with Dynamic Tanh (DyT) [26]. DyT does not require computing running statistics and achieves nonlinear scaling and extreme value compression through learnable parameters, enhancing the model’s sensitivity to weak features while preserving the functionality of normalization.
In addition, the structure of the hybrid encoder is slightly modified. During the fusion of low-level features into high-level features, the high-level features are first processed by a 1 × 1 convolution, followed by batch normalization and ReLU activation, before being fused. This adjusts the channel dimensions, enhances feature representation capability, and provides optimized feature inputs for subsequent fusion. The network architecture of the proposed ORSSO-DETR (Optical Remote Sensing Small Object Detection Transformer) is illustrated in Figure 1.
The processing pipeline of ORSSO-DETR consists of four stages. In the first stage, the ResNet-18 backbone extracts three-scale feature maps S3, S4, and S5. In the second stage, the hybrid encoder performs feature fusion and enhancement. C3K2 enables lightweight multi-scale feature extraction, while FCM integrates shallow spatial location information into deep semantics to compensate for localization loss. EUCB upsampling preserves high-frequency texture details, and dynamic convolution downsampling adaptively adjusts the receptive field to prevent small objects from disappearing. Furthermore, during the fusion of low-level features into high-level features, the high-level features are first processed by a 1 × 1 convolution followed by batch normalization and ReLU activation to adjust channel dimensions and enhance representational capability. Meanwhile, in the attention-based intra-scale feature interaction, DyT replaces the standard normalization layer to enhance the model’s nonlinear response to weak features. In the third stage, the uncertainty-minimal query selection strategy filters high-quality object queries. In the fourth stage, the decoder refines the object queries through cross-attention and directly outputs bounding boxes and class predictions. The entire pipeline is executed end-to-end without requiring NMS post-processing.
3.2.1. C3K2 Module
The original model suffers from inefficient gradient information flow and low feature reuse rates during the feature extraction process, resulting in insufficient integration of deep semantic information and shallow detail features. This makes it particularly challenging to effectively retain key detail features for small-sized object detection. To address this, we introduce the C3K2 module for structural optimization. This module employs a dual-path design that establishes a collaborative mechanism combining deep feature extraction in the main branch with original feature preservation in the side branch. The bottleneck structure intelligently optimizes computational load through channel dimension transformation, while the final fusion of branch features creates cross-level feature reuse pathways. This enhances gradient propagation efficiency and improves the representation quality of multi-scale features, thereby boosting the model’s capability to capture details and localization accuracy for small targets, while significantly reducing both parameter count and computational complexity without compromising feature extraction capability.
The module uses a 1 × 1 standard convolution block as the input preprocessing layer and splits the features into dual branches channel-wise through a Split operation. The main branch performs deep feature extraction through two consecutive Bottleneck modules. Each Bottleneck first calculates the channel number of the intermediate hidden layer based on the input and output channel numbers along with an expansion ratio. This approach reduces the channel number in intermediate layers to lower computational costs without compromising model performance. Subsequently, two convolution operations are defined: the first maps the input data from the original channel number to a reduced hidden channel number, while the second restores the data from the hidden channel number to the target output channel number. The side branch preserves the original features as supplementary information, and finally, the features from both branches are fused through a concatenation operation. The structure of C3K2 is shown in Figure 2.
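The channel bookkeeping described above can be illustrated with a small calculation. The sketch below is not the C3K2 implementation: it assumes that the hidden width is the output width multiplied by an expansion ratio of 0.5, that both Bottleneck convolutions use 3 × 3 kernels without bias, and that the Split divides the channels into two equal halves. All function names are hypothetical.

```python
def conv_weights(c_in, c_out, k):
    """Weight count of a k x k convolution without bias."""
    return c_in * c_out * k * k


def bottleneck_weights(c_in, c_out, expansion=0.5, k=3):
    """Weights of a two-convolution bottleneck with a reduced hidden width."""
    hidden = int(c_out * expansion)  # assumed rule for the hidden channel number
    return conv_weights(c_in, hidden, k) + conv_weights(hidden, c_out, k)


def split_channels(channels):
    """Channel-wise split into (main branch, side branch) halves."""
    main = channels // 2
    return main, channels - main


if __name__ == "__main__":
    main, side = split_channels(256)
    reduced = bottleneck_weights(main, main)
    full_width = 2 * conv_weights(main, main, 3)
    print(f"main={main}, side={side}, bottleneck={reduced}, full-width={full_width}")
```

Under these assumptions, narrowing the hidden layer halves the number of weights relative to two full-width 3 × 3 convolutions, which is the cost saving the Bottleneck design aims for.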
3.2.2. Feature Complementary Mapping (FCM)
The continuous downsampling operations in the backbone network of the original model lead to a significant reduction in spatial resolution, causing the fine-grained positional information of small targets to be severely lost in deeper feature layers. Additionally, when feature pyramids perform fusion, the direct fusion of features at different scales results in coordinate mapping deviations. These factors collectively prevent the precise localization information of small targets from being effectively integrated with deep semantic features, thereby compromising detection accuracy. To address the insufficient alignment between spatial positional information and deep semantic features, this paper introduces the Feature Complementary Mapping (FCM) module. The FCM aims to integrate more spatial positional information into rich semantic information, enhancing the perception capability for small targets in optical remote sensing images. This module propagates shallow spatial positional information to deeper layers of the network, mitigating the loss of spatial information during downsampling in the backbone network and achieving better alignment with deep semantic information, thereby improving the localization capability for small targets. The structure of FCM is shown in Figure 3.
First, the input feature map $X \in \mathbb{R}^{C \times H \times W}$ is split along the channel dimension into two parts according to a certain ratio α: one part is used to extract rich semantic information, and the other part is used to preserve low-level spatial information. The calculation formula is as follows:

$$X_1, X_2 = \mathrm{Split}(X) \quad (1)$$

where $X_1 \in \mathbb{R}^{\alpha C \times H \times W}$, $X_2 \in \mathbb{R}^{(1-\alpha) C \times H \times W}$.
Next, 3 × 3 convolution and 1 × 1 convolution are applied to these two parts, respectively, to obtain richer feature information $X^C$ and relatively weaker information $X^S$, while $X^S$ retains substantial shallow-level information. The calculation formulas are as follows:

$$X^C = F_C(X_1) \quad (2)$$

$$X^S = F_S(X_2) \quad (3)$$

where $X^C, X^S \in \mathbb{R}^{C \times H \times W}$. $X^C$ contains rich channel information and $X^S$ contains original spatial position information, where $F_C$ represents the mapping for learning semantic information, and $F_S$ represents the mapping for learning spatial position information.
Subsequently, the feature information from the two branches undergoes channel interaction and spatial interaction, respectively, assigning unique weights to the information in each channel or spatial location. The channel interaction consists of depthwise convolution, adaptive average pooling, and sigmoid. The spatial interaction consists of 1 × 1 spatial convolution, batchnorm, and sigmoid. The calculation formulas are as follows:

$$\omega_1 = \sigma(\mathrm{AvgPool}(\mathrm{DWConv}(X^C))) \quad (4)$$

$$\omega_2 = \sigma(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(X^S))) \quad (5)$$

where $\omega_1$ represents the channel information weight, and $\omega_2$ represents the spatial position information weight.
Finally, the obtained weights are projected into the feature information of the other branch to achieve complementary fusion, compensating for the missing feature information in each branch. The processed features are then aggregated to obtain the feature $X^{FCM}$, which incorporates dual mappings of spatial and semantic relationships. The calculation formula is as follows:

$$X^{FCM} = (\omega_1 \otimes X^S) \oplus (\omega_2 \otimes X^C) \quad (6)$$

where $\otimes$ denotes element-wise multiplication and $\oplus$ denotes addition.
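The cross-weighting step can be sketched in plain Python as follows. This is a deliberately simplified illustration rather than the FCM implementation: the depthwise convolution, the 1 × 1 convolution, and batch normalization are omitted, so the channel weight is just the sigmoid of each channel's global average and the spatial weight is the sigmoid of the channel-wise mean at each pixel. It also assumes that the channel weight is computed from the semantic branch and applied to the spatial branch, and vice versa. Feature maps are nested lists indexed as [channel][row][column], and all function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_weights(feat):
    """One weight per channel: sigmoid of the global average of that channel."""
    weights = []
    for channel in feat:
        values = [v for row in channel for v in row]
        weights.append(sigmoid(sum(values) / len(values)))
    return weights


def spatial_weights(feat):
    """One weight per pixel: sigmoid of the mean over channels at that pixel."""
    n_ch, height, width = len(feat), len(feat[0]), len(feat[0][0])
    return [[sigmoid(sum(feat[c][i][j] for c in range(n_ch)) / n_ch)
             for j in range(width)] for i in range(height)]


def complementary_fusion(x_sem, x_spa):
    """Cross-weight two equally shaped [C][H][W] features and add them."""
    w_c = channel_weights(x_sem)  # derived from the semantic branch
    w_s = spatial_weights(x_spa)  # derived from the spatial branch
    n_ch, height, width = len(x_sem), len(x_sem[0]), len(x_sem[0][0])
    return [[[w_c[c] * x_spa[c][i][j] + w_s[i][j] * x_sem[c][i][j]
              for j in range(width)] for i in range(height)]
            for c in range(n_ch)]


if __name__ == "__main__":
    semantic = [[[0.0, 2.0], [1.0, 1.0]], [[4.0, 0.0], [0.0, 0.0]]]
    spatial = [[[1.0, 0.0], [0.0, 3.0]], [[1.0, 0.0], [0.0, 1.0]]]
    print(complementary_fusion(semantic, spatial))
```

Each branch is rescaled by a weight derived from the other branch before the two are summed, so positions that are salient in the spatial branch emphasize the semantic features at the same location, and channels that are salient in the semantic branch emphasize the corresponding spatial features.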
3.2.3. Efficient Upsampling Convolutional Block (EUCB)
RT-DETR [9] suffers from feature information loss and insufficient cross-channel interaction during the upsampling process, resulting in weaker detail preservation capability when recovering feature map resolution, which affects the effectiveness of multi-scale feature fusion. To address this, we introduce an Efficient Upsampling Convolutional Block (EUCB) for optimization. This module establishes a progressive processing pipeline of “interpolation–feature extraction–channel interaction,” maintaining feature integrity while enhancing resolution, effectively preserving local feature details during spatial resolution improvement.
Specifically, bilinear interpolation is first used to upsample the input x by a factor of 2, followed by a depthwise separable convolution to extract local features while maintaining low parameter counts and computational complexity. Then, channel shuffle is applied to break the isolation between channel groups and enhance cross-group information interaction, thereby improving the overall expressive power of the model. Finally, the shuffled features are fed into a pointwise convolution for deep fusion across channels, ultimately outputting the enhanced feature map. The formula for EUCB is shown in Equation (7). The structure of EUCB is illustrated in Figure 4.
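Of the four steps in this pipeline, channel shuffle is the least self-explanatory, so a minimal sketch of that step alone is given here. It operates on a flat list of channel indices rather than on tensors, and the function name and group count are illustrative assumptions rather than the EUCB code.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    The list is viewed as a (groups, channels_per_group) grid, transposed,
    and flattened, so that neighbouring outputs come from different groups.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


if __name__ == "__main__":
    # Channels 0-2 belong to group 0 and channels 3-5 to group 1.
    print(channel_shuffle(list(range(6)), groups=2))
```

After the shuffle, adjacent channels originate from different groups, so the subsequent pointwise convolution mixes information that the grouped or depthwise stage had kept separate.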
3.2.4. DynamicConv DownSample Module
RT-DETR employs fixed 3 × 3 convolution kernels for downsampling operations. This design lacks adaptability to targets of varying scales during feature extraction. Especially when dealing with optical remote sensing images characterized by complex multi-scale object distributions and significant scale variations, fixed convolution kernels struggle to effectively extract and retain critical features. To address this issue, this paper introduces a dynamic convolution mechanism to replace the convolution in the original downsampling, thereby redesigning the downsampling module. This module can dynamically adjust convolution kernel parameters, enabling the network to adaptively modify its feature extraction strategy based on input features, thus compensating for the shortcomings of traditional downsampling when handling multi-scale optical remote sensing objects.
The core idea of dynamic convolution is to generate convolution kernel parameters dynamically based on input features. Specifically, a global descriptor vector of the feature map is extracted through global average pooling, followed by a fully connected layer to generate fusion weights for each expert convolution kernel. Multiple expert convolution kernels are then weighted and fused into a dynamic convolution kernel. This process endows the convolution operation with adaptability to different semantic contents, as mathematically expressed in Equation (8). The structure of the DynamicConv DownSample Module is illustrated in Figure 5.
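The kernel-generation procedure can be sketched in plain Python as below. The sketch covers only the mixing of expert kernels: global average pooling, one fully connected layer, and a weighted sum. Normalizing the fusion weights with a softmax is an assumption, since the text does not state how the weights are normalized. Single-channel k × k kernels are used for brevity, and all names are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def global_average_pool(feat):
    """Collapse a [C][H][W] feature map to one mean value per channel."""
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0]))
            for ch in feat]


def expert_weights(feat, fc_weight, fc_bias):
    """Map the pooled descriptor to one mixing weight per expert kernel."""
    descriptor = global_average_pool(feat)
    logits = [sum(w * d for w, d in zip(row, descriptor)) + b
              for row, b in zip(fc_weight, fc_bias)]
    return softmax(logits)  # assumed normalization of the fusion weights


def fuse_kernels(experts, weights):
    """Weighted sum of equally shaped k x k expert kernels."""
    size = len(experts[0])
    return [[sum(w * e[i][j] for w, e in zip(weights, experts))
             for j in range(size)] for i in range(size)]


if __name__ == "__main__":
    feature = [[[1.0, 3.0]], [[2.0, 2.0]]]   # two channels, 1 x 2 each
    fc_weight = [[0.5, 0.0], [0.0, 0.5]]     # two experts, two input channels
    fc_bias = [0.0, 0.0]
    experts = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
    pi = expert_weights(feature, fc_weight, fc_bias)
    print(pi, fuse_kernels(experts, pi))
```

In a downsampling module, the fused kernel would then be applied as an ordinary strided convolution, so the per-input cost is one small pooling and linear step plus a single convolution, regardless of how many expert kernels are mixed.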
3.2.5. Dynamic Tanh
Traditional normalization layers exhibit insufficient nonlinear expressiveness when processing multi-scale features in optical remote sensing images, making it difficult to adapt to the distribution differences across various hierarchical features. Specifically, LayerNorm relies on per-sample statistics and therefore behaves identically during training and inference, but its linear transformation struggles to capture complex nonlinear patterns of features. BatchNorm utilizes batch statistics but is sensitive to small batch sizes and unfriendly to varying sequence lengths. Other contemporary normalization methods, while having their own advantages, are still limited by fixed linear transformation forms when handling multi-scale features. To address this issue, this paper introduces Dynamic Tanh (DyT) to replace the original normalization layers in the Transformer encoder. Unlike the above normalization techniques, DyT is an element-wise operation with learnable parameters. By incorporating a learnable scaling parameter $\alpha$ and the hyperbolic tangent function $\tanh(\cdot)$, it achieves scaling of input activations and nonlinear compression of extreme values without the need to compute activation statistics. The learned parameters control the input scaling, output scaling, and offset of the tanh function, so that the function responds almost linearly to small activations and saturates smoothly for extreme ones. DyT thereby achieves more effective feature extraction and representation across data of varying scales and characteristics compared to traditional normalization layers in Transformers, without requiring hyperparameter tuning. The specific formulation is shown in Equation (9).

$$\mathrm{DyT}(x) = \gamma \odot \tanh(\alpha x) + \beta \quad (9)$$

where α is a learnable scalar parameter that allows the input to be scaled differently based on its range while accounting for variations in the scale of x. γ and β are learnable per-channel vector parameters, serving the same purpose as those in traditional normalization layers, enabling the output to be rescaled to any desired range. Compared with LayerNorm and BatchNorm, DyT offers several notable advantages. It does not need to maintain running statistics, making computation more efficient. It is insensitive to batch size and accommodates small-batch training well. Moreover, its nonlinear transformation (tanh) provides richer feature representation capabilities. The structure of DyT in ORSSO-DETR is illustrated in Figure 6.
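As a minimal illustration of the DyT operation, which scales the input by α, passes it through tanh, and then applies the per-channel scale γ and offset β, the sketch below processes a single token represented as a list of channel values. The parameter values in the example (α = 0.5, γ = 1, β = 0) are illustrative assumptions and not the trained values of ORSSO-DETR; in a network these parameters would be learned.

```python
import math


def dyt(x, alpha, gamma, beta):
    """Apply Dynamic Tanh to one token.

    x, gamma and beta are per-channel lists of equal length; alpha is a
    single scalar shared by all channels.
    """
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]


if __name__ == "__main__":
    token = [0.0, 0.2, 1.0, 100.0]
    out = dyt(token, alpha=0.5, gamma=[1.0] * 4, beta=[0.0] * 4)
    print([round(v, 4) for v in out])
```

Small activations pass through almost linearly, while the extreme value 100.0 is squashed to roughly 1, and no mean or variance is computed at any point.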
4. Results
4.1. Experimental Environment
To validate the effectiveness of the proposed improved model, this study established a unified experimental platform based on the Windows 11 operating system and the PyTorch 2.0.1 deep learning framework. The specific configurations are as follows: the hardware system utilizes an NVIDIA GeForce RTX 4060 graphics card (8 GB GDDR6), while the software environment is configured with the Python 3.11 interpreter, using RT-DETR-R18 as the baseline. The detailed experimental environment is shown in Table 1.
During the model training process, uniform hyperparameter settings were adopted. Key training parameters, including input image size, number of workers, and learning rate, were kept consistent across all experiments. Throughout the training phase, the experimental parameters remained unchanged, balancing training efficiency and model performance while ensuring experimental reproducibility. The specific parameter configurations are detailed in Table 2.
In the comparative experiments, all compared models were trained under the same hardware environment and software framework. To make the comparison as fair as possible, we unified the following settings: input image size of 640 × 640 pixels, AdamW optimizer, and batch size of 4. The learning rate and number of training epochs were set according to the official recommendations of each model. All models were initialized with their official pre-trained weights. We acknowledge that the number of training epochs varies across models because of their different convergence characteristics; we rely on the officially recommended configurations to reflect each model’s performance as closely as possible.
4.2. Datasets
The NWPU VHR-10 dataset [27] used in this study is publicly available and was released by Northwestern Polytechnical University (NWPU). It is a very-high-resolution (VHR) optical remote sensing dataset designed specifically for object detection. The dataset comprises 650 pan-sharpened color images with spatial resolutions of 0.5–2 m/pixel, which allows fine-grained detail to be captured. Its detailed annotations and diverse scenes make it one of the most important benchmark datasets in the field of remote sensing image analysis. The images are sourced from Google Earth and other commercial satellites and are stored in RGB three-channel format. The dataset includes ten categories: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle. Each object is annotated with a bounding box, and each image is also assigned a category label indicating the primary land cover type depicted. The original annotation files are stored in TXT format. To meet the MS COCO format requirements of RT-DETR, these files were uniformly converted into JSON format. The dataset was split into training and test sets at a ratio of 8:2, resulting in 520 images for training and 130 images for testing.
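The TXT-to-COCO conversion and the 8:2 split can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes each annotation line has the form `(x1,y1),(x2,y2),class_id`, that class ids 1 to 10 follow the category order listed above, and that image sizes are supplied by the caller. The helper names `txt_to_coco` and `split_8_2`, the file name, and the sample values are hypothetical.

```python
import random
import re
from typing import Dict, List, Tuple

# Assumed class order (ids 1..10), taken from the category list in the text.
CATEGORIES = ["airplane", "ship", "storage tank", "baseball diamond",
              "tennis court", "basketball court", "ground track field",
              "harbor", "bridge", "vehicle"]

# Assumed line format: (x1,y1),(x2,y2),class_id
LINE_RE = re.compile(r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\),\s*(\d+)")


def txt_to_coco(txt_by_image: Dict[str, str],
                sizes: Dict[str, Tuple[int, int]]) -> dict:
    """Build a COCO-style dict from per-image annotation text."""
    images, annotations = [], []
    ann_id = 1
    for img_id, (name, text) in enumerate(sorted(txt_by_image.items()), start=1):
        width, height = sizes[name]
        images.append({"id": img_id, "file_name": name,
                       "width": width, "height": height})
        for match in LINE_RE.finditer(text):
            x1, y1, x2, y2, cls = map(int, match.groups())
            w, h = x2 - x1, y2 - y1
            # COCO boxes are stored as [x_min, y_min, width, height].
            annotations.append({"id": ann_id, "image_id": img_id,
                                "category_id": cls, "bbox": [x1, y1, w, h],
                                "area": w * h, "iscrowd": 0})
            ann_id += 1
    categories = [{"id": i, "name": n} for i, n in enumerate(CATEGORIES, start=1)]
    return {"images": images, "annotations": annotations, "categories": categories}


def split_8_2(names: List[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically and split image names 8:2."""
    names = sorted(names)
    random.Random(seed).shuffle(names)
    cut = len(names) * 8 // 10
    return names[:cut], names[cut:]


# Hypothetical sample: one image with two annotated objects.
coco = txt_to_coco({"001.jpg": "(563,478),(630,573),1\n(10,20),(40,60),10"},
                   {"001.jpg": (958, 808)})
train, test = split_8_2([f"{i:03d}.jpg" for i in range(1, 651)])
```

Applied to 650 image names, the split yields 520 training and 130 test images, matching the counts reported above; the seed used by the authors is not stated, so the actual image assignment may differ.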
To assess the generalizability of the proposed method and provide a more comprehensive evaluation of its performance, experiments were also conducted on another optical remote sensing dataset, RSOD [28]. RSOD is a publicly available dataset released by Wuhan University, specifically designed for remote sensing image object detection tasks. The dataset comprises 936 aerial images sourced from Google Maps, with a spatial resolution ranging from 0.5 to 3 m/pixel. It annotates four typical object classes in remote sensing applications: aircraft, oil tank, overpass, and playground. The image sizes in the dataset vary from 512 × 512 pixels to 1961 × 1193 pixels, and the classes differ widely in scale, from small aircraft to large playgrounds.
Figure 7 visualizes sample images from the NWPU VHR-10 and RSOD datasets. The characteristics of these two datasets are summarized in Table 3.
4.3. Assessment Indicators
To objectively evaluate model performance, this study adopts mAP (mean Average Precision) as the core metric for detection accuracy. mAP comprehensively reflects the model’s detection capability by calculating the average precision under different IoU thresholds. IoU (Intersection over Union) is a metric that measures the degree of overlap between the predicted box and the ground truth box, where a higher value indicates more accurate localization.
This study reports mAP under three IoU settings: mAP@0.50 is the average precision at an IoU threshold of 0.50, which evaluates the model’s basic detection performance under lenient localization requirements; mAP@0.50:0.95 is the average precision averaged across ten IoU thresholds ranging from 0.50 to 0.95 (with a step size of 0.05), comprehensively reflecting the model’s overall detection and localization performance under varying precision requirements; and mAP@0.75 is the average precision at an IoU threshold of 0.75, which evaluates the model’s high-precision localization capability under stringent localization requirements. The calculation formulas for mAP and IoU are shown in Equations (10) and (11), respectively.
On this basis, to specifically evaluate the model’s detection capability for small objects, this study further adopts the scale definition of the MS COCO dataset and uses $\mathrm{mAP}_s$, $\mathrm{mAP}_m$, and $\mathrm{mAP}_l$ to evaluate the model’s detection accuracy for small, medium, and large targets, respectively. All three are computed under the mAP@0.50:0.95 metric, i.e., averaged over IoU thresholds from 0.50 to 0.95. Specifically, $\mathrm{mAP}_s$ represents the average precision for small targets (area < 32 × 32 pixels), $\mathrm{mAP}_m$ represents the average precision for medium targets (32 × 32 ≤ area < 96 × 96 pixels), and $\mathrm{mAP}_l$ represents the average precision for large targets (area ≥ 96 × 96 pixels). These three metrics enable a more detailed evaluation of the model’s detection performance across different object scales and are particularly crucial for validating the effectiveness of improvements made for small object detection.
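$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i \tag{10}$$
where N is the number of categories and $\mathrm{AP}_i$ represents the average precision of the i-th class.
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \tag{11}$$
where A is the prediction box, B is the ground truth box, $|A \cap B|$ is the area of the intersection of the two boxes (overlapping area), and $|A \cup B|$ is the area of the union of the two boxes (total area).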
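The IoU and mAP definitions, together with the MS COCO scale buckets, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: boxes are axis-aligned `(x_min, y_min, x_max, y_max)` tuples, the scale bucket is decided from bounding-box area (COCO itself uses the annotated object area), and the function names are hypothetical. Computing each per-class AP from a precision-recall curve is outside the scope of this sketch.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def mean_ap(ap_per_class: Sequence[float]) -> float:
    """mAP as the unweighted mean of per-class AP values."""
    if not ap_per_class:
        raise ValueError("at least one class AP is required")
    return sum(ap_per_class) / len(ap_per_class)


def coco_scale(box: Box) -> str:
    """Assign a box to the small / medium / large bucket by its area."""
    area = box_area(box)
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"


# Two 10 x 10 boxes overlapping on a 5 x 10 strip: IoU = 50 / 150.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

A prediction counts as a true positive at mAP@0.50 when its IoU with a ground truth box of the same class is at least 0.50; the same test with a threshold of 0.75 gives the stricter setting described above.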
In terms of model efficiency, GFLOPs is adopted to measure the model’s computational complexity, reflecting the computational resource consumption during inference; Parameters is adopted to assess the model’s scale and memory footprint to determine the degree of lightweight for its storage and deployment.
4.4. Model Performance Experiments
Figure 8 displays the comparative curves of the original model and the improved model proposed in this paper on two metrics: mAP@0.5 and mAP@0.5:0.95. In the first comparison of mAP@0.5, the method presented in this paper maintains a stable advantage throughout the entire training cycle, with accuracy significantly higher than that of the RT-DETR baseline model. The second comparison of mAP@0.5:0.95 further validates the effectiveness of our method, as it achieves better performance on both key evaluation metrics. This indicates that the improvements proposed in this paper are not only effective under a single IoU threshold but also demonstrate superiority under the more rigorous multi-threshold comprehensive evaluation.
The specific experimental results of ORSSO-DETR on the NWPU VHR-10 dataset are presented in Table 4. Compared with the baseline model RT-DETR-R18, the proposed method achieves clear improvements in optical remote sensing image detection. In terms of mean average precision, it achieves 96.2% mAP@0.50 (+3.9 percentage points), 74.1% mAP@0.75 (+4.3 points), and 64.4% mAP@0.50:0.95 (+2.9 points). The detection capability for targets at different scales is also enhanced: the detection accuracy for small targets ($\mathrm{mAP}_s$) reaches 51.5% (+4.7 points); for medium targets ($\mathrm{mAP}_m$), it reaches 65.1% (+2.0 points); and for large targets ($\mathrm{mAP}_l$), it reaches 65.0% (+5.5 points). In terms of model efficiency, the proposed method has 20.0 M parameters and 61.0 GFLOPs, achieving higher detection accuracy at a computational cost comparable to that of the baseline model.
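The gains quoted here are absolute differences between two mAP percentages. The short sketch below separates that absolute gain from the relative gain, using the baseline values reported for RT-DETR-R18 in the comparative experiments (92.3% mAP@0.50 and 61.5% mAP@0.50:0.95); the function name `gain` is hypothetical.

```python
from typing import Tuple


def gain(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = improved - baseline
    relative = 100.0 * points / baseline
    return round(points, 2), round(relative, 2)


map50_gain = gain(92.3, 96.2)      # mAP@0.50 on NWPU VHR-10
map5095_gain = gain(61.5, 64.4)    # mAP@0.50:0.95 on NWPU VHR-10
```

For mAP@0.50 this gives an absolute gain of 3.9 points, which corresponds to a relative gain of about 4.2% over the baseline.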
The visual comparison of detection results before and after the improvement is shown in Figure 9. The ORSSO-DETR model performs well in optical remote sensing image object detection tasks. From the visualization results of multiple representative scenarios, it can be observed that the model accurately identifies various object types, including airplanes, sports fields, vehicles, and ships. Under complex background interference in particular, it maintains strong localization and classification capabilities, and it remains robust for small-scale targets and densely distributed scenes. For example, in scenarios such as urban streets, sea surface ships, and airport runways, the model correctly outlines objects with high confidence, clear boundaries, and few false detections. Compared to the original model, the improved version shows fewer missed and false detections.
Figure 10 presents the comparative curves of the original model and the proposed improved model on the RSOD dataset in terms of mAP@0.5 and mAP@0.5:0.95. Although the improvement on this dataset is smaller than that on the NWPU VHR-10 dataset, the improved model consistently outperforms the original model on both metrics, which indicates that the benefit of the proposed method is not limited to the NWPU VHR-10 dataset.
The experimental results of ORSSO-DETR on the RSOD dataset are presented in Table 5. Compared with the baseline model RT-DETR-R18, the proposed method improves every detection metric. Specifically, it achieves 97.6% mAP@0.50, 83.8% mAP@0.75, and 71.2% mAP@0.50:0.95, outperforming the baseline by 2.5, 7.5, and 2.8 percentage points, respectively. In terms of scale-wise detection performance, the gains are largest for small targets ($\mathrm{mAP}_s$: 37.8%, +6.6 points) and large targets ($\mathrm{mAP}_l$: 76.2%, +3.1 points), while medium targets improve only marginally ($\mathrm{mAP}_m$: 71.9%, +0.1 points). Regarding model efficiency, the proposed method has 20.05 M parameters and a computational cost of 61.0 GFLOPs, which are comparable to those of the baseline model. These results indicate that the proposed method enhances detection accuracy for optical remote sensing images without a substantial increase in computational cost.
Figure 11 presents the visual comparison results of ORSSO-DETR and RT-DETR on the RSOD dataset. Four categories are selected: aircraft, overpass, oil tank, and playground, with one group for each category, resulting in a total of four comparative image pairs. From the results, it can be observed that the detection confidence of ORSSO-DETR is mostly higher than that of the original RT-DETR model. Moreover, under complex background interference, ORSSO-DETR effectively mitigates the missed detection issues present in the original model. For instance, in the first image, a relatively small aircraft that is missed by the original model is accurately detected by the proposed model. These results further validate the reliability and accuracy of the proposed model in optical remote sensing image object detection tasks.
4.5. Comparative Experimental Studies
In this study, the proposed model was compared with the original RT-DETR-R18, YOLO series models (e.g., YOLOv8-n, YOLOv9-m, YOLOv10-m, YOLOv11-n, YOLOv12-m), as well as several recent remote sensing models including AL-YOLOv8 and FEMT-YOLO, and the latest DETR series model DEIM-R18. The performance of all models was evaluated based on mAP@50 (%), mAP@50:95 (%), GFLOPs, and Parameters (M), as shown in
Table 6. The proposed method achieved 96.2% mAP@50 and 64.4% mAP@50:95 accuracy after 100 training epochs with an input size of 640 × 640, surpassing all compared models. Specifically, it outperformed YOLOv8-n (93.7%, 300 epochs), YOLOv9-m (89.8%, 300 epochs), YOLOv10-m (81.5%, 300 epochs), YOLOv11-n (92.45%, 100 epochs), YOLOv12-m (83.5%, 300 epochs), AL-YOLOv8 (92.2%, 200 epochs), FEMT-YOLO (89.7%, 300 epochs), and DEIM-R18 (86.43%, 120 epochs). Notably, compared with the baseline RT-DETR-R18, which achieved 92.3% mAP@50 and 61.5% mAP@50:95 with 59.9 GFLOPs and 20.1 M parameters, our method improved mAP@50 by 3.9 percentage points (from 92.3% to 96.2%) and mAP@50:95 by 2.9 percentage points (from 61.5% to 64.4%) while maintaining comparable model complexity with 20.0 M parameters and 61.0 GFLOPs. These results fully demonstrate the effectiveness and superiority of our proposed approach. The visualization of comparative experiments is shown in
Figure 12.
Table 7 presents the performance comparison of different methods on the RSOD dataset. In terms of detection accuracy, the proposed method achieves 97.6% mAP@50 and 71.2% mAP@50:95, outperforming all compared models. Compared with the baseline model RT-DETR-R18 (95.1% mAP@50, 68.4% mAP@50:95), the proposed method improves these metrics by 2.5 and 2.8 percentage points, respectively, validating the effectiveness of the improvements. Among the YOLO series models, YOLOv11-n achieves the best performance with 94.6% mAP@50 and 69.7% mAP@50:95 while having only 2.59 M parameters and 6.4 GFLOPs, demonstrating significant advantages in lightweight design. However, the proposed method still leads by 3.0 percentage points in mAP@50 and 1.5 points in mAP@50:95. YOLOv9-m and YOLOv12-m achieve 92.2% and 90.1% mAP@50, respectively, at computational costs of 132.4 and 60.1 GFLOPs. Among other remote sensing models, AL-YOLOv8 achieves 96.9% mAP@50, which is closest to our method, but its mAP@50:95 is only 66.1%, trailing our method by 5.1 percentage points, although its parameter count and GFLOPs are relatively low. FEMT-YOLO achieves 94.8% mAP@50 and 63.0% mAP@50:95 on the RSOD dataset with only 3.23 M parameters and 7.8 GFLOPs. Although it is very lightweight, its detection accuracy falls short of our method. DEIM-R18, as a representative model of the DETR series, achieves 94.2% mAP@50 and 67.5% mAP@50:95 with 19.9 M parameters and 59.9 GFLOPs. Compared with our method, its mAP@50 and mAP@50:95 are 3.4 and 3.7 percentage points lower, respectively. In terms of model efficiency, the proposed method has 20.0 M parameters and 61.0 GFLOPs, which are comparable to the baseline model, so the highest detection accuracy is achieved without a notable increase in computation.
In summary, the proposed method achieves a favorable balance between detection accuracy and computational efficiency on the RSOD dataset, validating its superiority in optical remote sensing object detection tasks. The visualization of comparative experiments is shown in
Figure 12.
4.6. Ablation Experiment
To validate the impact of each proposed module on model performance, we designed systematic ablation experiments, with the results shown in Table 8. All experiments start from the same baseline (RT-DETR-R18). The experiments cover five key modules: EUCB, DynamicConv DownSample (DyConv), C3K2, FCM, and DyT. Four evaluation metrics are reported: mAP@50, mAP@50:95, GFLOPs, and Parameters.
Individual module contributions. Starting from the baseline of 92.3% mAP@50 and 61.5% mAP@50:95, each module was added individually. Adding EUCB alone improved mAP@50 by 1.6 percentage points (to 93.9%) and mAP@50:95 by 1.5 points (to 63.0%), with a moderate computational cost (61.01 GFLOPs). Adding DyConv alone achieved the largest overall improvement, with gains of 2.6 points in mAP@50 (to 94.9%) and 2.8 points in mAP@50:95 (to 64.3%), while keeping the computational cost close to the baseline (59.94 GFLOPs). Adding C3K2 alone also increased mAP@50 by 2.6 points (to 94.9%) and showed a clear lightweight advantage, reducing GFLOPs from 59.90 to 55.42 and parameters from 20.0 M to 18.9 M. Adding FCM alone improved mAP@50:95 by 1.7 points (to 63.2%), indicating its benefit for localization accuracy. In contrast, DyT alone yielded a limited gain in mAP@50 (+1.2 points) and a decrease of 1.1 points in mAP@50:95, suggesting that DyT is most effective when combined with other modules.
Module combination analysis. The combination of EUCB and DyConv achieved 95.2% mAP@50 and 64.1% mAP@50:95, confirming the effectiveness of the sampling optimization strategy. Notably, the combination of EUCB and FCM achieved the highest mAP@50 among the two-module combinations discussed here, reaching 95.5% (+3.2 points), together with 63.9% mAP@50:95 (+2.4 points), which suggests strong complementarity between texture detail preservation (EUCB) and spatial-semantic alignment (FCM). The combination of C3K2 and FCM achieved 63.3% mAP@50:95 (+1.8 points) while maintaining a relatively low computational cost (58.83 GFLOPs).
Finally, when all modules were combined, the model achieved its best overall performance: mAP@50 increased to 96.2% and mAP@50:95 reached 64.4%, with the computational load held at 61.03 GFLOPs and the parameter count at 20.0 M. These results indicate that the proposed modules improve model performance from different perspectives and that their combination yields the best overall results while keeping computational cost close to that of the baseline. Experimental data are shown in Table 8 and visualized in Figure 13, where the blue dashed line represents the mAP@50 of the proposed method, and the red dashed line represents the mAP@50:95 of the proposed method.
4.7. Statistical Significance Analysis
To verify the robustness and statistical significance of the proposed ORSSO-DETR model against the baseline RT-DETR-R18, we conducted five independent training runs for each model under identical settings using different random seeds and reported the mean ± standard deviation of three key metrics (mAP50, mAP50:95, and $\mathrm{mAP}_s$) over the five runs. The results show that ORSSO-DETR outperforms the baseline on all three metrics: mAP50 increased from 92.3% ± 0.4% to 96.0% ± 0.2% (+3.7 percentage points), mAP50:95 from 61.5% ± 0.3% to 63.9% ± 0.4% (+2.4 points), and $\mathrm{mAP}_s$ from 46.8% ± 0.4% to 51.8% ± 0.3% (+5.0 points). The p-values obtained from the independent-samples t-test are 0.0025 for mAP50, 0.0018 for mAP50:95, and 0.0013 for $\mathrm{mAP}_s$, all below 0.01. The improvements brought by ORSSO-DETR are therefore statistically significant at the 0.01 level, which supports the reliability and reproducibility of the proposed method. Experimental data are shown in Table 9.
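A two-sample t statistic can be computed directly from the mean ± standard deviation summaries of the two groups of runs. The sketch below uses Welch's unequal-variance form and assumes that the reported standard deviations are sample standard deviations over n = 5 runs; the paper states neither the test variant nor the per-run values, so numbers from this sketch are illustrative and need not reproduce the reported p-values. The function name `welch_t` is hypothetical.

```python
import math
from typing import Tuple


def welch_t(mean_a: float, std_a: float, n_a: int,
            mean_b: float, std_b: float, n_b: int) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's t-test.

    Inputs are summary statistics: group mean, sample standard deviation,
    and number of runs for each of the two groups.
    """
    var_a = std_a ** 2 / n_a   # squared standard error of group A's mean
    var_b = std_b ** 2 / n_b   # squared standard error of group B's mean
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# mAP50 summaries from the text: baseline 92.3 ± 0.4, ORSSO-DETR 96.0 ± 0.2.
t_stat, dof = welch_t(92.3, 0.4, 5, 96.0, 0.2, 5)
```

Turning the statistic into a p-value requires the cumulative distribution function of Student's t distribution, which the Python standard library does not provide; with SciPy installed, `scipy.stats.ttest_ind_from_stats` accepts the same summary statistics and returns the p-value directly.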
5. Discussion
Experimental results demonstrate that the proposed ORSSO-DETR model achieves consistent performance improvements over the baseline RT-DETR-R18 on both the NWPU VHR-10 and RSOD datasets, with mAP@50 increased by 3.9 and 2.5 percentage points and mAP@50:95 increased by 2.9 and 2.8 percentage points, respectively, and the five-run statistical analysis indicates that the gains are statistically significant. These results validate the effectiveness of the modifications. Compared with other mainstream object detectors, ORSSO-DETR exhibits superior performance in handling small and densely distributed objects under complex background interference, and the low standard deviations across five independent runs indicate the model’s robustness to random initialization and training variations. Nevertheless, several limitations should be noted. Although the parameter count and computational cost of ORSSO-DETR are comparable to those of the baseline, the computational overhead is still non-negligible for resource-constrained edge devices. In addition, the current evaluation is limited to two public datasets, and the generalization ability to other optical remote sensing scenarios (e.g., different resolutions, sensor types, or weather conditions) requires further validation.
On the fairness and limitations of comparative experiments. Several limitations need to be pointed out. First, the number of training epochs differs across models because of differences in convergence speed; we followed official recommendations so that each model could reach its intended performance. Second, the complexity of some compared models varies, but this study focused on detection accuracy. Third, all models used official default hyperparameters without tuning on these datasets, which may limit the upper bound of performance for some models. Nevertheless, we unified the hardware, input size, optimizer, and batch size, and reported the standard deviations and statistical test results from multiple runs. These measures improve the fairness of the comparison and the reliability of the conclusions. Future work may adopt unified training epochs and individual hyperparameter tuning for further validation.
6. Conclusions
This paper proposes ORSSO-DETR, a model that improves the hybrid encoder of RT-DETR to suit the characteristics of optical remote sensing images. It addresses key challenges in optical remote sensing image object detection, including complex background interference, difficulties in multi-scale object detection, and the susceptibility of small object features to being lost. Specifically, to enhance the detection performance for densely distributed small targets, this study integrates the C3K2 multi-scale module with the FCM feature complementary mapping module to improve the integration of high-level features into low-level features. To mitigate feature loss for multi-scale targets while preserving high-frequency texture details and maintaining computational efficiency, a dynamic convolution-based downsampling module is designed and the EUCB upsampling technique is adopted. To improve feature expression capability and semantic alignment, the encoder structure is optimized by introducing a 1 × 1 convolution to enhance high-level features before cross-scale feature fusion. Furthermore, the Dynamic Tanh module is adopted to replace the original normalization layer in the Transformer, enhancing the model’s nonlinear expressiveness and its sensitivity to weak features under complex backgrounds. Experimental results show that on the NWPU VHR-10 dataset, compared with the original RT-DETR, mAP@50 and mAP@50:95 improved by 3.9 and 2.9 percentage points, respectively. On the RSOD dataset, mAP@50 and mAP@50:95 improved by 2.5 and 2.8 percentage points, respectively. In the comparative experiments, the proposed method also achieved higher accuracy than the other mainstream models evaluated, which supports the effectiveness of the improved algorithm.
Author Contributions
Conceptualization, Y.C. and J.L.; methodology, J.L. and Y.C.; software, Y.Y.; validation, K.W., Y.J. and R.G.; formal analysis, Y.J. and Y.Z.; investigation, J.L. and Y.Y.; resources, R.G.; data curation, Y.Z. and J.L.; writing—original draft preparation, J.L. and Y.J.; writing—review and editing, Y.C., K.W. and R.G.; visualization, Y.Y.; supervision, Y.C.; project administration, Y.C. and R.G.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.
Funding
This study was funded by the Science and Technology Planning Projects of the Xinjiang Production and Construction Corps (Grant No. 2023AB020). The work was partially supported by XPCC Projects (Grant No. BTBKXM-2025-Y33), Shihezi University Projects (Grant No. RCZK2018C09), the Ideological and Political Special Projects of Shihezi University (Grant No. SZZX201906), the Collaborative Education Projects of the Ministry of Education of China (Grant No. 221001141130337) and the Supply-Demand Docking Employment and Education Projects of the Ministry of Education of China (Grant Nos. 2023122950976, 2023122958628).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments and constructive suggestions, which helped in improving the quality of the paper. Special thanks go to Dongrui Wang for her contributions to resources, formal analysis, and data curation.
Conflicts of Interest
Authors Yaohui Chang, Jin Li and Runhua Geng were employed by the company Xinjiang Tianye (Group) Co., Ltd. The remaining authors declare that they have no commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| DETR | Detection Transformer |
| RT-DETR | Real-Time Detection Transformer |
| ORSSO-DETR | Optical Remote Sensing Small Object Detection Transformer |
| FCM | Feature Complementary Mapping |
| EUCB | Efficient Upsampling Convolutional Block |
| DyConv | Dynamic Convolution |
| DyT | Dynamic Tanh |
| AIFI | Attention-based Intra-scale Feature Interaction |
| FPN | Feature Pyramid Network |
| CNN | Convolutional Neural Network |
| NMS | Non-Maximum Suppression |
| RPN | Region Proposal Network |
| HOG | Histogram of Oriented Gradients |
| SIFT | Scale-Invariant Feature Transform |
| mAP | mean Average Precision |
| IoU | Intersection over Union |
| AMFFM | Adaptive Multi-Level Feature Fusion Module |
| AAHRH | Attention-Augmented High-Resolution Head |
References
- Bagwari, N.; Kumar, S.; Verma, V.S. A comprehensive review on segmentation techniques for satellite images. Arch. Comput. Methods Eng. 2023, 30, 4325–4358.
- Ahmad, T.; Morel, A.; Cheng, N.; Palaniappan, K.; Calyam, P.; Sun, K.; Pan, J. Future UAV/Drone Systems for Intelligent Active Surveillance and Monitoring. ACM Comput. Surv. 2025, 58, 35.
- Wang, K.; Wang, Z.; Li, Z.; Su, A.; Teng, X.; Pan, E.; Liu, M.; Yu, Q. Oriented Object Detection in Optical Remote Sensing Images Using Deep Learning: A Survey. Artif. Intell. Rev. 2025, 58, 350.
- Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 213–229.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 740–755.
- Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhu, G.; Yuan, B. Adaptive feature fusion with attention-guided small target detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5623116.
- Li, H.; Zhang, T.; Jiang, S.; Zhi, X.; Wang, D.; Gao, L.; Yang, D.; Li, J. DASFA-Net for accurate ship detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5622114.
- Ji, J.; Zhao, Y.; Li, A.; Ma, X.; Wang, C.; Lin, Z. EFR-ACENet: Small object detection for remote sensing images based on explicit feature reconstruction and adaptive context enhancement. Eng. Appl. Artif. Intell. 2025, 151, 110722.
- Xu, Y.; He, W.; Zhang, G.; Qi, Q.; Chen, S.; Wu, J. From weak textures to dense arrangements: Leveraging prior knowledge for small-object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 3000915.
- Wang, L.; Li, J.; Zhang, J.; Zhuo, L.; Tian, Q. Position guided dynamic receptive field network: A small object detection friendly to optical and SAR images. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 8265–8282.
- Yan, R.; Yan, L.; Geng, G.; Cao, Y.; Zhou, P.; Meng, Y. ASNet: Adaptive semantic network based on transformer–CNN for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608716.
- Lee, S.; Cho, S.; Park, C.; Park, S.; Kim, J.; Lee, S. LSHNet: Leveraging structure-prior with hierarchical features updates for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5642516.
- Sun, L.; Wang, Q.; Chen, Y.; Zheng, Y.; Wu, Z.; Fu, L. CRNet: Channel-enhanced remodeling-based network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5618314.
- Zeng, C.; Zhang, J.; Kwong, S. Learning conditional diffusion transformer for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2026, 1–14.
- Fang, X.; Wang, Q.; Chen, Q.; Li, G. Interactive edge awareness network for salient object detection in optical remote sensing images. Expert Syst. Appl. 2026, 316, 131879.
- Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 26 October 2024).
- Xiao, Y.; Xu, T.; Xin, Y.; Zhang, L.; Wang, C. FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 25–27 February 2025; Volume 39, pp. 8673–8681.
- Rahman, M.M.; Munir, M.; Marculescu, R. EMCAD: Efficient Multi-Scale Convolutional Attention Decoding for Medical Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 11769–11779.
- Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. ParameterNet: Parameters Are All You Need. arXiv 2023, arXiv:2306.14525.
- Zhu, J.; Chen, X.; He, K.; LeCun, Y.; Liu, Z. Transformers without Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 14901–14911.
- Cheng, G.; Zhou, P.; Han, J. Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415.
- Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate Object Localization in Remote Sensing Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498.
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 26 October 2024).
- Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 1–21.
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
- Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524.
- Zhang, F.; Tian, C.; Li, X.; Yang, N.; Zhang, Y. AL-YOLOv8: A Small Object Detection Algorithm for Remote Sensing Images Based on an Improved YOLOv8s. Sensors 2026, 26, 2016.
- Cao, B.; Yang, Z.; Zhou, P.; Chen, H. FEMT-YOLO: Frequency-Enhanced Multi-Scale Network for Small Object Detection in Aerial Images. Results Eng. 2026, 29, 109726.
- Huang, S.; Lu, Z.; Cun, X.; Yu, Y.; Zhou, X.; Shen, X. DEIM: DETR with Improved Matching for Fast Convergence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 15162–15171.
Figure 1.
Overall Architecture of ORSSO-DETR.
Figure 2.
Structure of the C3K2 Module.
Figure 3.
Structure of the FCM Module.
Figure 4.
Structure of the EUCB Module.
Figure 5.
Structure of the DynamicConv DownSample Module.
Figure 6.
Transformer with DyT.
Figure 7.
Sample images from two baseline datasets.
Figure 8.
mAP Curves on NWPU VHR-10.
Figure 9.
Comparison of Test Results of RT-DETR and ORSSO-DETR on NWPU VHR-10.
Figure 10.
mAP Curves on RSOD dataset.
Figure 11.
Comparison of Test Results of RT-DETR and ORSSO-DETR on RSOD.
Figure 12.
Visualization of comparative experiments: (a) results on the NWPU VHR-10 dataset; (b) results on the RSOD dataset.
Figure 13.
Ablation experiment performance visualization.
Table 1.
Experimental Environment Information.
| Classification | Configuration |
|---|---|
| System Environment | Windows 11 |
| GPU | GeForce RTX 4060 |
| CPU | Intel(R) Core(TM) i7-13650HX |
| Framework | PyTorch 2.0.1 |
| Programming Language | Python 3.11 |
Table 2.
Experimental Parameter Information.
| Parameter | Value |
|---|---|
| Learning Rate | 0.0001 |
| Image Size | 640 × 640 |
| Epochs | 100 |
| Batch Size | 4 |
| Weight Decay | 0.0001 |
| Optimizer | AdamW |
| Number of Workers | 4 |
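
The settings in Table 2 can be collected into a single configuration object so that every run uses the same values. The following is a minimal sketch only: the class name `TrainConfig`, its field names, and the helper `steps_per_epoch` are hypothetical and not taken from the authors' code, and the worker-count row is read here as the number of data-loading worker processes, which is an assumption.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters copied from Table 2; the field names are illustrative."""
    learning_rate: float = 1e-4
    image_size: tuple = (640, 640)
    epochs: int = 100
    batch_size: int = 4
    weight_decay: float = 1e-4
    optimizer: str = "AdamW"
    num_workers: int = 4  # assumed to mean data-loading worker processes


def steps_per_epoch(num_train_images: int, cfg: TrainConfig) -> int:
    """Optimizer steps per epoch, counting the last partial batch."""
    return -(-num_train_images // cfg.batch_size)  # ceiling division


cfg = TrainConfig()
print(asdict(cfg))
```

Given a training split size (not stated in Table 2), `steps_per_epoch(n, cfg)` gives the number of iterations per epoch at batch size 4.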
Table 3.
Dataset characteristics and statistics.
| Dataset | Images | Categories | Total Instances | Avg. Instances/Image | Resolution Range | Spatial Resolution | Image Sources |
|---|---|---|---|---|---|---|---|
| NWPU VHR-10 | 650 | 10 | 1705 | 4.6 | 533–1728 × 597–1028 | 0.5–2 m/pixel | Google Earth/Vaihingen/Potsdam |
| RSOD | 936 | 4 | 6950 | 7.1 | 800–1500 × 800–1500 | 0.5–3 m/pixel | Google Maps/Google Earth |
Table 4.
Comparison of mAP between RT-DETR-R18 and ORSSO-DETR on NWPU VHR-10.
| Method | mAP50 | mAP75 | mAP50:95 | mAPs | mAPm | mAPl | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|---|---|
| RT-DETR-R18 | 92.3 | 69.8 | 61.5 | 46.8 | 63.1 | 59.5 | 59.9 | 20.1 |
| Ours | 96.2 | 74.1 | 64.4 | 51.5 | 65.1 | 65.0 | 61.0 | 20.0 |
Table 5.
Comparison of mAP of RT-DETR-R18 and ORSSO-DETR on RSOD.
| Method | mAP50 | mAP75 | mAP50:95 | mAPs | mAPm | mAPl | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|---|---|
| RT-DETR-R18 | 95.1 | 76.3 | 68.4 | 31.2 | 71.8 | 73.1 | 59.9 | 20.08 |
| Ours | 97.6 | 83.8 | 71.2 | 37.8 | 71.9 | 76.2 | 61.0 | 20.05 |
Table 6.
Performance comparison with different methods on NWPU VHR-10.
| Model Name | Epochs | Input Shape | mAP@50 (%) | mAP@50:95 (%) | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|
| YOLOv8-n [29] | 300 | 640 × 640 | 93.7 | 59.8 | 28.8 | 11.14 |
| YOLOv9-m [30] | 300 | 640 × 640 | 89.8 | 54.9 | 132.5 | 32.7 |
| YOLOv10-m [31] | 300 | 640 × 640 | 81.5 | 47.2 | 64.0 | 16.5 |
| YOLOv11-n [22] | 100 | 640 × 640 | 92.45 | 60.6 | 6.5 | 2.6 |
| YOLOv12-m [32] | 300 | 640 × 640 | 83.5 | 57.5 | 59.5 | 19.6 |
| AL-YOLOv8 [33] | 200 | 640 × 640 | 92.2 | 62.8 | 18.6 | 16.8 |
| FEMT-YOLO [34] | 300 | 640 × 640 | 89.7 | 55.3 | 7.8 | 3.23 |
| DEIM-R18 [35] | 120 | 640 × 640 | 86.43 | 54.7 | 59.9 | 19.9 |
| RT-DETR-R18 [9] | 100 | 640 × 640 | 92.3 | 61.5 | 59.9 | 20.1 |
| Ours | 100 | 640 × 640 | 96.2 | 64.4 | 61.0 | 20.0 |
Table 7.
Performance comparison with different methods on RSOD.
| Model Name | Epochs | Input Shape | mAP@50 (%) | mAP@50:95 (%) | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|
| YOLOv8-n [29] | 300 | 640 × 640 | 90.6 | 61.9 | 28.8 | 11.14 |
| YOLOv9-m [30] | 300 | 640 × 640 | 92.2 | 63.9 | 132.4 | 32.7 |
| YOLOv10-m [31] | 300 | 640 × 640 | 87.2 | 62.7 | 64.0 | 16.5 |
| YOLOv11-n [22] | 100 | 640 × 640 | 94.6 | 69.7 | 6.4 | 2.59 |
| YOLOv12-m [32] | 300 | 640 × 640 | 90.1 | 65.2 | 60.1 | 19.6 |
| AL-YOLOv8 [33] | 200 | 640 × 640 | 96.9 | 66.1 | 18.6 | 16.8 |
| FEMT-YOLO [34] | 300 | 640 × 640 | 94.8 | 63.0 | 7.8 | 3.23 |
| DEIM-R18 [35] | 120 | 640 × 640 | 94.2 | 67.5 | 59.9 | 19.9 |
| RT-DETR-R18 [9] | 100 | 640 × 640 | 95.1 | 68.4 | 59.9 | 20.1 |
| Ours | 100 | 640 × 640 | 97.6 | 71.2 | 61.0 | 20.0 |
Table 8.
Overall architecture ablation experiment results on NWPU VHR-10.
| RT-DETR | EUCB | DyConv | C3K2 | FCM | DyT | mAP@50 (%) | mAP@50:95 (%) | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|---|---|---|
| √ | | | | | | 92.3 | 61.5 | 59.90 | 20.0 |
| √ | √ | | | | | 93.9 | 63.0 | 61.01 | 20.2 |
| √ | | √ | | | | 94.9 | 64.3 | 59.94 | 20.1 |
| √ | | | √ | | | 94.9 | 62.1 | 55.42 | 18.9 |
| √ | | | | √ | | 94.1 | 63.2 | 63.31 | 20.5 |
| √ | | | | | √ | 93.5 | 60.4 | 59.90 | 20.1 |
| √ | √ | √ | | | | 95.2 | 64.1 | 62.06 | 20.7 |
| √ | | √ | √ | | | 92.4 | 62.3 | 56.47 | 19.5 |
| √ | | | √ | √ | | 93.9 | 63.3 | 58.83 | 19.4 |
| √ | | √ | | √ | | 94.0 | 63.1 | 64.36 | 21.0 |
| √ | √ | | | √ | | 95.5 | 63.9 | 64.4 | 20.6 |
| √ | √ | √ | √ | | | 94.6 | 62.6 | 57.62 | 19.6 |
| √ | √ | √ | | √ | | 94.2 | 63.5 | 65.46 | 21.2 |
| √ | √ | √ | | | √ | 94.4 | 64.3 | 62.06 | 20.7 |
| √ | √ | √ | √ | √ | | 95.8 | 63.9 | 61.03 | 20.0 |
| √ | √ | √ | √ | √ | √ | 96.2 | 64.4 | 61.03 | 20.0 |
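
One way to read the ablation in Table 8 is to compare the gain of each single-module variant with the gain of the full model. The sketch below does this arithmetic for mAP@50 only, using values copied from the table; the names `SINGLE_MODULE` and `gain` are illustrative choices and not part of the original work.

```python
# mAP@50 (%) on NWPU VHR-10, copied from Table 8.
BASELINE = 92.3      # RT-DETR alone
FULL_MODEL = 96.2    # all five modules enabled

SINGLE_MODULE = {    # RT-DETR plus exactly one module
    "EUCB": 93.9,
    "DyConv": 94.9,
    "C3K2": 94.9,
    "FCM": 94.1,
    "DyT": 93.5,
}


def gain(score: float, reference: float = BASELINE) -> float:
    """Absolute mAP@50 gain in percentage points, rounded to one decimal."""
    return round(score - reference, 1)


single_gains = {name: gain(score) for name, score in SINGLE_MODULE.items()}
sum_of_single_gains = round(sum(single_gains.values()), 1)
full_gain = gain(FULL_MODEL)

for name, g in single_gains.items():
    print(f"{name:7s} +{g:.1f}")
print(f"sum of single-module gains: +{sum_of_single_gains:.1f}")
print(f"full model gain:            +{full_gain:.1f}")
```

The single-module gains sum to 9.8 points, while the full model gains 3.9 points over the baseline, so the modules' contributions overlap rather than add up.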
Table 9.
Repeated experiment results on NWPU VHR-10.
| Method | mAP50 | mAP50:95 | mAPs |
|---|---|---|---|
| RT-DETR-R18 | 92.3 ± 0.4 | 61.5 ± 0.3 | 46.8 ± 0.4 |
| Ours | 96.0 ± 0.2 | 63.9 ± 0.4 | 51.8 ± 0.3 |
| Improvement | +3.7 | +2.4 | +5.0 |
| p-value | 0.0025 | 0.0018 | 0.0013 |
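
The "Improvement" row of Table 9 is the difference between the two mean scores for each metric. As a small check, the sketch below recomputes those differences from the reported means and also gives the gain relative to the baseline; the function name `improvements` is a hypothetical helper, and the relative figures are derived here rather than reported in the table. The p-values are not reproduced, since the number of runs and the statistical test are not given in this section.

```python
# Mean scores (%) over repeated runs on NWPU VHR-10, copied from Table 9.
BASELINE = {"mAP50": 92.3, "mAP50:95": 61.5, "mAPs": 46.8}  # RT-DETR-R18
PROPOSED = {"mAP50": 96.0, "mAP50:95": 63.9, "mAPs": 51.8}  # ORSSO-DETR


def improvements(baseline: dict, proposed: dict) -> dict:
    """Return (absolute gain in points, relative gain in %) per metric."""
    result = {}
    for metric, base in baseline.items():
        diff = proposed[metric] - base
        result[metric] = (round(diff, 1), round(100.0 * diff / base, 1))
    return result


gains = improvements(BASELINE, PROPOSED)
for metric, (absolute, relative) in gains.items():
    print(f"{metric:9s} +{absolute:.1f} points ({relative:.1f}% relative)")
```

The absolute gains match the table (+3.7, +2.4, +5.0), and the small-object metric mAPs shows the largest relative gain.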