Article

YOLO-SBA: A Multi-Scale and Complex Background Aware Framework for Remote Sensing Target Detection

1 Laboratory for Big Data and Decision, College of Systems Engineering, National University of Defense Technology, Deya Road, Changsha 410073, China
2 State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics and Information Systems, College of Electronic Science, National University of Defense Technology, Deya Road, Changsha 410073, China
3 Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu, 90014 Oulu, Finland
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(12), 1989; https://doi.org/10.3390/rs17121989
Submission received: 21 April 2025 / Revised: 27 May 2025 / Accepted: 3 June 2025 / Published: 9 June 2025

Abstract

Remote sensing target detection faces significant challenges in handling multi-scale targets, with the high similarity in color and shape between targets and backgrounds in complex scenes further complicating the detection task. To address this challenge, we propose a multi-Scale and complex Background Aware network for remote sensing target detection, named YOLO-SBA. Our proposed YOLO-SBA first processes the input through the Multi-Branch Attention Feature Fusion Module (MBAFF) to extract global contextual dependencies and local detail features. It then integrates these features using the Bilateral Attention Feature Mixer (BAFM) for efficient fusion, enhancing the saliency of multi-scale target features to tackle target scale variations. Next, we utilize the Gated Multi-scale Attention Pyramid (GMAP) to perform channel–spatial dual reconstruction and gating fusion encoding on multi-scale feature maps. This enhances target features while finely suppressing spectral redundancy. Additionally, to prevent the loss of effective information extracted by key modules during inference, we improve the downsampling method using Asymmetric Dynamic Downsampling (ADDown), maximizing the retention of image detail information. We achieve the best performance on the DIOR, DOTA, and RSOD datasets. On the DIOR dataset, YOLO-SBA improves mAP by 16.6% and single-category detection AP by 0.8–23.8% compared to the existing state-of-the-art algorithm.

1. Introduction

Remote sensing target detection is pivotal in various domains, including defense, agriculture, and urban development. Its core value lies in the ability to acquire information about the Earth’s surface in a non-contact, large-scale, and timely manner, providing important data for scientific decision-making and social development. In recent years, the development of deep learning has significantly advanced target detection technology [1,2,3,4]. The YOLO series [5,6,7,8,9,10,11] of detectors, while continuously improving detection accuracy, have effectively reduced model computational costs, providing efficient solutions for real-time applications and edge deployment. Sun et al.’s research [12] has demonstrated the great potential of YOLO in remote sensing target detection. However, the diversity of target scales and complex backgrounds in remote sensing images pose significant challenges to general target detection algorithms.
In remote sensing scenarios [13], it is common to have targets of different categories and scales or targets of the same category but different sizes within the same field of view. As shown in Figure 1, we can intuitively see that current target detection algorithms face significant challenges in handling multi-scale targets. These algorithms struggle to balance different sizes of targets, and the interference from complex backgrounds exacerbates the difficulty of detection, leading to frequent missed or false detections. In Figure 1a, where golf courses and running tracks are present, the algorithm exhibits insufficient sensitivity for small targets and excessive sensitivity for large targets, leading to false and missed detections. In Figure 1b, featuring dense urban structures, the intricate background of the remote sensing image creates a visually chaotic scene, where the color and shape similarities between targets and backgrounds make differentiation very challenging. In Figure 1c, under the coexistence of multi-scale variations and complex backgrounds, the scale differences between complex urban backgrounds and targets cause the algorithm to lose its ability to discriminate between targets and backgrounds, leading to severe false and missed detections. Hence, there is a pressing requirement to develop an advanced target detection algorithm capable of tackling the challenges arising from diverse target scales and complex backgrounds in remote sensing.
To tackle the challenge of detecting targets at multiple scales, researchers have explored various approaches. Zhang et al. [15] introduced a label assignment approach centered on the central region, aiming to provide additional positive samples for small targets, thereby enhancing the instance and feature level representations of these targets. Zhong et al. [16] introduced the PSWP-DETR detector, which integrates multi-scale features through a scale-aware difference module to capture edge features and local textures. However, these methods often overlook the interplay between context information extraction and feature fusion when handling multi-scale targets, thus not fully exploiting the processing advantages of advanced attention mechanisms for multi-scale information fusion. In terms of complex background processing, Yang et al. [17] proposed the Frequency Self-Attention Refinement module to eliminate redundant information across various frequency bands. Zhang et al. [18] proposed the Balanced Decoupled Loss for handling foreground and background information losses. However, these algorithms often ignore the extraction and enhancement of detail information while suppressing background redundancy. The suppression of complex backgrounds and the enhancement of multi-scale target feature saliency should complement each other, but existing methods have not adequately addressed their relationship.
To address the aforementioned challenges, this paper proposes a multi-scale and complex background-aware network for remote sensing target detection, named YOLO-SBA. We adopt the network structure of YOLO for efficient feature extraction, achieving channel-level enhancement for multi-scale targets and fine-grained suppression of complex backgrounds, while maximizing the retention of image detail information to significantly improve the capability of remote sensing target detection. Specifically, we propose the Multi-Branch Attention Feature Fusion Module (MBAFF), which combines multi-branch dilated convolutions with the Bilateral Attention Feature Mixer (BAFM) to efficiently fuse multi-scale information and mitigate feature weakening. Next, the Gated Multi-scale Attention Pyramid (GMAP) suppresses spectral redundancy through a channel–spatial dual reconstruction mechanism and enhances target saliency while suppressing background noise through adaptive gating fusion. Additionally, we replace the downsampling used throughout network inference with Asymmetric Dynamic Downsampling (ADDown), which combines decompositional depthwise separable convolution downsampling with a dynamic channel attention mechanism to enhance edge sensing and prevent information loss. These innovations enable the model to fully learn key information during global network inference, enhancing its effectiveness and robustness, thereby improving the performance of remote sensing target detection.
Enhancing the saliency of multi-scale targets effectively reduces background interference, while complex background suppression improves the distinguishability of targets at all scales. These two aspects complement each other, jointly improving the effectiveness of remote sensing target detection.
Breakthroughs and Contributions of This Paper:
1. This study introduces a multi-scale and complex background-aware network for remote sensing target detection, termed YOLO-SBA. This algorithm is designed to tackle the inherent difficulties posed by variations in target scales and complex backgrounds in the context of remote sensing target detection. It extends the model’s receptive field, enhances the feature representation of targets at multiple scales, and reduces the noise redundancy in complex backgrounds.
2. We proposed the Multi-Branch Attention Feature Fusion Module (MBAFF). It combines a multi-branch dilated convolution structure with the Bilateral Attention Feature Mixer (BAFM) to capture contextual information of feature maps at different spatial scales. By incorporating an efficient attention mechanism, it enhances the expression of target features, mitigating the feature weakening issue caused by target multi-scale variations.
3. We proposed the Gated Multi-scale Attention Pyramid (GMAP). By feeding the feature maps derived from the multi-level receptive field pyramid into a channel–spatial dual reconstruction mechanism and a learnable gating mechanism for fusion attention, it achieves adaptive background noise suppression while enhancing features.
4. We proposed Asymmetric Dynamic Downsampling (ADDown). By combining decompositional depthwise separable convolution downsampling with a dynamic channel attention mechanism, this mechanism fully retains the key information extracted by each module during downsampling, significantly enhancing the model’s robustness.
5. We performed comprehensive experiments on publicly available remote sensing target detection datasets, including DIOR [14], DOTA-v1.0 [19], and RSOD [20], demonstrating that the proposed algorithm YOLO-SBA outperforms existing target detection algorithms. On the DIOR [14] dataset, YOLO-SBA achieved a 16.6% improvement in mAP and a 0.8–23.8% improvement in single-category detection AP compared to the current state-of-the-art algorithm. Additionally, ablation studies were conducted to verify the efficacy of the proposed modules.

2. Related Work

This section primarily explores the three issues that this paper focuses on: target size variability, complex background interference, and the improvement of downsampling algorithms. It provides a detailed introduction to the research background and development history of each issue.

2.1. Remote Sensing Image Target Detection Algorithms for Multi-Scale Targets

Compared to conventional targets, remote sensing targets exhibit greater variations in scale. This variability is evident not only between different categories of targets (for example, aircraft and vehicles are typically small, while buildings and playgrounds are typically large) but also within the same category, where significant differences in size can be observed [21,22]. To address this, some researchers focus on learning multi-scale features.
Li et al. [23] introduced an enhanced R³Det algorithm, replacing the traditional feature pyramid network with a search-based architecture. This enables the network to adaptively learn and choose feature combinations for updates, thereby enriching multi-scale feature information. Moreover, they integrated shallow features into the context information enhancement module to enhance the semantic information of small targets, facilitating the detection of water surface vessels. Teng et al. [24] introduced a Multi-Scale Perception (MSP) module, combining the global contextual cues it extracts with the local spatial contextual correlations encoded by Clip Long Short-Term Memory. They further utilized these rich semantic features to develop an Adaptive Anchor module, alleviating the scale variations of remote sensing image targets. Huang et al. [25] employed a feature pyramid structure with multi-scale feature representations, integrating deformable convolution modules to learn multi-scale features of targets. Zhou et al. [26] divided each convolutional layer within the multi-scale structure into sub-layers of equal size. By obtaining super-scale features through multi-scale sub-layers and enhancing information transmission within convolutional layers, they improved multi-scale target detection. Li et al. [27] introduced an enhanced feature pyramid network architecture, enabling multi-scale detection through the creation of top-down feature maps. Zhong et al. [16] handled multi-scale information by designing the scale-aware PdConv and the multi-scale feature extraction module SDM. Guo et al. [28] achieved multi-scale feature fusion by calculating cross-attention between low-resolution and high-resolution feature maps, transferring the semantic information of the high-resolution feature maps to the low-resolution ones.
Existing algorithms have achieved remarkable advancements in multi-scale feature extraction and enhancement. Nevertheless, they have paid less attention to the interplay between context information extraction and feature fusion during the feature extraction process. The effective integration of advanced attention mechanisms with multi-scale feature extraction remains largely unexplored. Therefore, in this paper, we introduce MBAFF, which combines a multi-branch dilated convolution structure with a bilateral attention mechanism. This method incorporates effective attention mechanisms into multi-scale feature integration, comprehensively comprehending image context and enhancing detection performance for targets of different scales.

2.2. Remote Sensing Image Target Detection Algorithms for Complex Backgrounds

Remote sensing images, with their extensive coverage, encompass numerous targets across diverse categories. Yet, only a limited number of these categories are annotated for detection, leading to the background dominating the image [29,30,31,32,33,34]. Targets in remote sensing are frequently embedded in intricate backgrounds, presenting difficulties for detection.
To address this, researchers typically focus on reducing background interference. Yang et al. [35] introduced an automatic aircraft target detection algorithm that fuses multi-scale circular frequency filters with convolutional neural networks. This algorithm utilizes multi-scale circular frequency filters to remove noise from images and extract candidate regions of aircraft targets at different scales. Ye et al. [36] aimed to suppress background clutter interference by fusing features from different layers in a hierarchical manner. Subsequently, the combined features were mapped into a reduced-dimensionality subspace via a feature selection mechanism to capture more pertinent contextual details. Wang et al. [37] introduced a background filtering module grounded in the principle of attention mechanisms to mitigate background interference in optical remote sensing images. Gao et al. [38] handle complex background interference by designing a spatial bias module (SBM) and a multi-task enhancement structure (MES). Ma et al. [39] reduce mutual interference between different categories by dividing the feature maps into different channels and separating features according to categories, effectively suppressing background interference and improving detection accuracy. Yang et al. [17] propose BAC-FSAR, which achieves filtering of different frequency information by introducing feature processing in the frequency domain, using channel-adaptive pooling as a low-frequency feature filter, and using deep convolution to extract high-frequency features.
Similar to the research problem addressed in this paper, TMAFNet [33], when handling complex background tasks, does not design an independent background noise suppression algorithm. Instead, it indirectly suppresses redundant information through target enhancement and selective fusion. In contrast, the method proposed in this paper for dealing with complex backgrounds is significantly different and has been experimentally proven to outperform TMAFNet [33] in terms of performance on the DIOR [14] dataset.
Existing background suppression algorithms often focus solely on eliminating background noise or merely suppressing noise by enhancing target features. However, if both target feature saliency enhancement and background suppression can be performed simultaneously, it would be more conducive to suppressing redundant information and thus improving overall performance. In this study, we propose GMAP, which inputs multi-receptive field feature maps extracted by a cascaded pooling structure into a channel–spatial dual reconstruction mechanism. Through a gating fusion mechanism, GMAP enhances the saliency of target features while better suppressing redundant information from complex backgrounds.

2.3. Evolution of Downsampling Algorithm Improvements

With the widespread application of deep learning, numerous researchers have focused on enhancing downsampling techniques to retain valuable information during network inference and improve the accuracy of subsequent tasks.
Son et al. [40] introduced a degradation modeling approach using adaptive downsampling techniques. This method utilizes adversarial training strategies and adaptive data loss to dynamically capture the low-frequency distribution characteristics of real low-resolution images. This allows the super-resolution network to adaptively adjust downsampling parameters to match real complex degradation patterns. Yang et al. [41] compressed the original high-resolution images to extremely low resolutions through a multi-scale residual structure. They also utilized a joint feature representation of directional coding features and local binary patterns to retain texture information under extreme dimensionality reduction conditions.
However, these methods overlook the need for fine-grained information in remote sensing images and are prone to losing key information extracted during network inference, which in turn affects the accuracy of target detection. Therefore, this paper proposes Asymmetric Dynamic Downsampling (ADDown), which combines decompositional depthwise separable convolution downsampling with a dynamic channel attention mechanism. This enhances independent sensing of horizontal and vertical edges, achieving efficient utilization of spatial information and adaptive enhancement of channel features. During the downsampling phase of global network inference, it fully retains the detail information of the image.
This study tackles the difficulties posed by variations in target scales and intricate backgrounds in remote sensing target detection by incorporating algorithm designs that enhance the prominence of multi-scale targets and mitigate complex background interference. It constructs a multi-scale and complex background-aware network called YOLO-SBA, significantly improving target detection accuracy.

3. Method

This part offers an in-depth explanation of the YOLO-SBA algorithm, tackling the two inherent significant issues in remote sensing target detection: multi-scale target distribution and complex background interference. We adopt the network structure of YOLOv10 for efficient feature extraction, achieving channel-level enhancement for multi-scale targets and fine-grained suppression of complex backgrounds, while preserving image detail information to substantially enhance remote sensing target detection capabilities.
As shown in Figure 2, YOLO-SBA first processes the extracted feature maps through the Multi-Branch Attention Feature Fusion Module (MBAFF), combining the multi-branch information extracted by dilated convolutions with the Bilateral Attention Feature Mixer (BAFM) to facilitate contextual feature integration and improve the feature representation for targets at various scales. Then, it utilizes the Gated Multi-scale Attention Pyramid (GMAP) to process the differentiated multi-level receptive field features through a channel–spatial dual reconstruction mechanism, finely suppressing spectral redundancy. Through the integration of a gating fusion mechanism, the model’s capability to differentiate targets from backgrounds is improved, thereby reducing background interference. Additionally, Asymmetric Dynamic Downsampling (ADDown) achieves efficient utilization of spatial information and adaptive enhancement of channel features by combining decompositional depthwise separable convolution downsampling with a dynamic channel attention mechanism, ensuring that the fine-grained effective information extracted by key modules is not lost during downsampling. Ultimately, the detection task is executed via the YOLOv10 [6] detection head. Thanks to the careful design of the network architecture and feature processing, YOLO-SBA demonstrates superior ability in differentiating target information from redundant backgrounds, resulting in improved performance in remote sensing target detection.
In addition, we retained two classic modules of YOLOv10 [6], namely PSA (Partial Self-Attention) and C2fCIB, which are briefly introduced below. PSA is an efficient self-attention mechanism introduced in YOLOv10. It divides the feature map extracted from the backbone into two parts and only inputs one part into a submodule consisting of multi-head self-attention and a feed-forward network. Then, the two parts of the feature map are connected and fused through convolution operations to efficiently capture global features. C2fCIB is an efficient feature extraction module proposed in YOLOv10 [6]. It uses cheap depthwise convolution for spatial mixing and efficient pointwise convolution for channel mixing to achieve more efficient feature extraction. C2fCIB divides the input feature map into two parts, one of which is processed by CIB, and the other is directly output. Finally, the two parts are fused through convolution to improve model performance and efficiency, saving computing resources while enhancing feature representation capabilities.
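To make the retained PSA design more concrete, the following is a minimal PyTorch sketch of the partial self-attention idea described above (split the channels, run multi-head self-attention and a feed-forward network on one half only, then concatenate and fuse with a 1 × 1 convolution). This is not the official YOLOv10 code; channel counts, head number, and the FFN width are illustrative assumptions, and the channel count is assumed to be even.

```python
import torch
import torch.nn as nn

class PSASketch(nn.Module):
    """Simplified partial self-attention: attend over only half of the channels."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        half = channels // 2
        self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(half, half * 2), nn.ReLU(), nn.Linear(half * 2, half))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x1, x2 = x.chunk(2, dim=1)                   # split the feature map along channels
        t = x2.flatten(2).transpose(1, 2)            # (B, HW, C/2) token sequence
        t = t + self.attn(t, t, t)[0]                # multi-head self-attention on one half
        t = t + self.ffn(t)                          # feed-forward network
        x2 = t.transpose(1, 2).reshape(b, c // 2, h, w)
        return self.fuse(torch.cat([x1, x2], dim=1)) # concatenate both halves and fuse by 1x1 conv
```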

3.1. Multi-Branch Attention Feature Fusion Module (MBAFF)

The MBAFF (Multi-Branch Attention Feature Fusion) module is an innovative feature fusion structure designed for tasks involving multi-scale variations in targets. The core of this structure consists of two primary components: the Multi-Branch Dilated Convolution Structure and the Bilateral Attention Feature Mixer (BAFM). Figure 3 illustrates that the operational procedure of this module is divided into two crucial phases. Initially, the input feature map is processed through a parallel multi-branch dilated convolution mechanism to capture features across multiple scales. Subsequently, the BAFM component adaptively fuses the heterogeneous features through a bilateral attention mechanism. During the feature extraction phase, the multi-branch structure utilizes dilated convolution operations with varying dilation rates. Through the parallel computation of differentiated receptive fields, it effectively captures the contextual information of the feature map at various spatial scales. This architectural design overcomes the scale constraints of conventional single-branch convolutions, substantially improving the feature responsiveness of targets and boosting the expressiveness of multi-scale target features.
The multi-branch dilated convolution structure substantially improves the feature representation ability of targets across various scales via the combined approach of multi-scale feature extraction and dynamic receptive field enlargement. This structure is composed of multiple parallel branches, where each branch is processed through the D-Bottleneck shown in Figure 4 and cascaded dilated convolutions. Its core is manifested in three aspects:
1. By adaptively adjusting the dilation rate (d) parameter, the structure accomplishes an exponential increase in the receptive field while preserving the spatial resolution of the feature map. For instance, with a 3 × 3 convolution kernel and a default configuration of d = 2 (illustrated in Figure 5), the effective receptive field of a single convolutional layer expands from 3 × 3 to 5 × 5. As shown in Figure 4, after passing through three D-Bottleneck modules, the accumulated receptive field reaches 25 × 25. This progressive expansion mechanism effectively captures the cross-regional correlation features of multi-scale targets.
2. The multi-branch structure realizes heterogeneous multi-scale feature representation. The base branch (Branch_1) retains the detailed texture at the original resolution, while the deeper branches adopt a step-wise increasing dilation rate to encode global information. This design enables the network to simultaneously perceive local features and global contextual information.
3. By fusing the outputs of branches with different receptive fields through concatenation operations and then utilizing feature reorganization along the channel dimension, the multi-scale representation capability is enhanced.
Specifically, the feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ input into the MBAFF first undergoes channel expansion via a 1 × 1 convolution. It is then split into two base branches, $\mathrm{Branch}_1$ and $\mathrm{Branch}_2$, and iteratively generates new branches through n cascaded D-bottleneck modules. Here, the D-bottleneck denotes a residual convolutional block with a dilation rate. Subsequently, all branches $\{\mathrm{Branch}_1, \mathrm{Branch}_2, \ldots, \mathrm{Branch}_{n+2}\}$ are concatenated along the channel dimension. In each D-bottleneck unit, we expand the receptive field through dilated convolutions, with the computational process as follows:
$$\mathrm{Branch}_{k} = \mathrm{Conv}_{3\times 3}(\mathrm{Branch}_{k},\ \mathrm{dilation} = d)$$
$$\mathrm{Branch}_{k+1} = \mathrm{Conv}_{3\times 3}(\mathrm{Branch}_{k},\ \mathrm{dilation} = d)$$
The dilation rate $d$ controls the receptive field range, and in this paper $d = 4$. The effective receptive field is expanded to $(d+1)(c-1)+1$, where $c$ is the size of the convolutional kernel; the basis for selecting the dilation rate is detailed in Section 4.5 through ablation experiments. Lastly, a convolutional layer with reduced filters restores the initial channel dimension.
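As a concrete illustration of this computation, the following is a minimal PyTorch sketch (not the authors' implementation) of a chain of D-Bottleneck blocks spawning new branches at dilation rate d = 4. The 1 × 1 expansion/reduction widths, normalization, and activation choices are assumptions, and the channel count is assumed to be even.

```python
import torch
import torch.nn as nn

class DBottleneck(nn.Module):
    """Residual block built around a dilated 3x3 convolution (spatial size preserved)."""
    def __init__(self, channels: int, dilation: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.bn(self.conv(x)))

class MultiBranchDilated(nn.Module):
    """Two base branches plus n branches generated by cascaded D-Bottlenecks."""
    def __init__(self, channels: int, n: int = 3, dilation: int = 4):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels, 1)
        self.blocks = nn.ModuleList([DBottleneck(channels // 2, dilation) for _ in range(n)])
        self.reduce = nn.Conv2d(channels // 2 * (n + 2), channels, 1)

    def forward(self, x):
        x = self.expand(x)
        b1, b2 = x.chunk(2, dim=1)            # Branch_1 keeps original-resolution detail
        branches = [b1, b2]
        for block in self.blocks:             # each D-Bottleneck spawns a new branch
            branches.append(block(branches[-1]))
        return self.reduce(torch.cat(branches, dim=1))  # concatenate, then restore channels
```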
The multi-branch dilated convolution structure substantially enhances the detection capability of targets at different scales in remote sensing images by simultaneously optimizing feature resolution and receptive field. This architecture, while preserving the spatial resolution of the input feature map, improves the model’s adaptability to adapt to targets of varying sizes via the use of different dilation rates in parallel branches. It addresses the issue of feature weakening in the distribution of multiple targets. The tensor obtained after concatenating multi-scale features is then input into the Bilateral Attention Feature Mixer (BAFM) for cross-dimensional feature refinement, as shown in Figure 6.
The Bilateral Attention Feature Mixer (BAFM) accomplishes accurate integration of features across layers using a dual-path attention mechanism, as depicted in Figure 6.
This module receives two input feature maps: Feature Map $X \in \mathbb{R}^{B \times C \times H \times W}$ (the original input feature) and Residual Feature Map $Y \in \mathbb{R}^{B \times C \times H \times W}$ (the residual feature extracted by the Multi-Branch Dilated Convolution Structure). The feature fusion process consists of two core steps: initially, the input features undergo cross-modal initialization, where feature map X is element-wise added to feature map Y to generate the baseline feature matrix. The baseline feature matrix, by preserving the common information between feature map X and feature map Y, constructs a feature base with rich feature information. Subsequently, we input the baseline feature matrix into both the local attention path and the global attention path to compute local and global attention weights, respectively.
In the local attention path, local feature encoding is first performed. The local feature encoding process, designed in this study, includes four processing stages: Initially, channel compression reduces the original number of channels to C / r , extracting local inter-channel correlations. Subsequently, a ReLU activation function introduces spatial non-linearity through non-linear mapping, enhancing the expression capability of local patterns. Then, convolutional computation restores the original channel dimension, preserving the integrity of local features. Finally, the processed result is input into the Sigmoid function to produce a spatial attention weight matrix, thus completing the calculation of local attention weights.
In the network structure of the global attention computation path, the hierarchical compression and adaptive weighting processes are closely integrated and performed continuously. The main steps are as follows: Firstly, the input feature map is compressed in the spatial dimension through a global average pooling (GAP) layer, resulting in a global average feature for each channel. This converts the two-dimensional feature map into a scalar value for each channel, achieving preliminary extraction and compression of global contextual information.
The tensor output after global average pooling encodes the global spatial response intensity of each channel. Subsequently, through dimensionality reduction, ReLU activation function for non-linear transformation, and channel dimension reconstruction, it captures the cross-channel dependencies, realizes non-linear interaction between channels, and generates channel attention weights. Finally, the global attention weights are combined with the local attention weights, and a Sigmoid function is utilized to produce the global–local dual-path combined attention map.
By adaptively fusing the input features Feature map X and Residual Feature map Y based on the computed fusion weights from the dual pathways, the network can dynamically adjust the fusion ratio according to the feature importance, thereby further enhancing the feature expression capability for targets at different scales.
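The following minimal PyTorch sketch illustrates the dual-path (local + global) attention fusion described above. It is not the authors' exact BAFM: the reduction ratio r, applying a single shared Sigmoid after summing the two paths, and the complementary weighting of X and Y are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class BAFMSketch(nn.Module):
    """Bilateral attention fusion of an input feature X and a residual feature Y."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        mid = max(channels // r, 4)
        # local path: pointwise channel compression/expansion at every spatial position
        self.local = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(), nn.Conv2d(mid, channels, 1))
        # global path: GAP -> channel bottleneck -> per-channel weights
        self.global_ = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.ReLU(), nn.Conv2d(mid, channels, 1))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        base = x + y                                                # baseline feature matrix
        w = torch.sigmoid(self.local(base) + self.global_(base))   # combined dual-path attention map
        return x * w + y * (1.0 - w)                                # adaptive fusion of X and Y
```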

3.2. Gated Multi-Scale Attention Pyramid (GMAP)

The Gated Multi-scale Attention Pyramid (GMAP) module is developed to tackle the challenge of complex backgrounds in remote sensing images. It proposes an innovative architecture that combines cascaded pooling with a dual attention reconstruction mechanism. As shown in Figure 7, GMAP consists of three core components: 1. Cascaded Pooling Multi-scale Feature Extraction. This component constructs a multi-level receptive field feature pyramid. 2. Channel–Spatial Dual Reconstruction Mechanism. This mechanism suppresses spectral redundancy, weakens the interference of complex backgrounds, and enhances the saliency of targets. 3. Adaptive Gating Fusion. This component dynamically balances the weights of multi-scale features. The module’s computational procedure can be succinctly described as follows: initially, the Feature Map is fed into a cascaded pooling structure to perform multi-scale feature extraction. The extracted feature tensors undergo channel reconstruction and spatial reconstruction, respectively. Finally, a gating mechanism computes fusion weights to direct the integration of the enhanced multi-scale features. Following this, we offer an in-depth explanation of the cascaded pooling architecture, the dual reconstruction approach, and the gating fusion method.

3.2.1. Cascaded Pooling Structure

Given the input feature map $F_{\mathrm{in}} \in \mathbb{R}^{B \times C \times H \times W}$, the GMAP employs three cascaded max pooling operations to obtain the outputs $y_1$, $y_2$, and $y_3$, generating a feature pyramid $\{F_{\mathrm{in}}, y_1, y_2, y_3\}$ that includes the original input features, with its effective receptive field expanding progressively:
$$R = \{\, r,\ 3r-2,\ 5r-4,\ 7r-6 \,\},$$
where $r$ is the base receptive field. When the kernel size $k = 5$, the sequence of receptive fields is 5 × 5 → 9 × 9 → 13 × 13 → 17 × 17, covering multi-scale context from global to local. The multi-scale feature tensor obtained through the cascaded pooling structure contains more extensive and rich information compared to the original feature map.
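A minimal sketch of the cascaded pooling step follows, assuming stride-1 max pooling with padding so spatial size is preserved (the same construction used in SPPF-style modules); kernel size 5 follows the text.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)  # spatial size preserved

def cascaded_pyramid(f_in: torch.Tensor):
    """Build the {F_in, y1, y2, y3} pyramid; each pooling enlarges the receptive field."""
    y1 = pool(f_in)
    y2 = pool(y1)
    y3 = pool(y2)
    return [f_in, y1, y2, y3]
```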

3.2.2. Dual Reconstruction Mechanism

Channel Reconstruction
We enhance the feature representation capability related to the target by suppressing redundant channels in multispectral remote sensing images through channel reconstruction. Firstly, we split the input $X$ along the channel dimension into two sub-features, $X_1$ and $X_2$. Subsequently, we apply grouped depthwise convolutions to each branch to obtain $\hat{X}_i$. The $\hat{X}_i$ is then input into the SE (Squeeze-and-Excitation) mechanism to adaptively modify the channel weights:
$$z = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} \hat{X}_i(h, w)$$
$$s = \sigma\big(W_2\,\delta(W_1 z)\big)$$
$$X_i^{\mathrm{att}} = s \odot \hat{X}_i$$
Equations (4)–(6) describe the core computational steps of the SE (Squeeze-and-Excitation) attention mechanism. First, through global average pooling (GAP) as described in Equation (4), the input feature map $\hat{X}_i$ is averaged across the spatial dimensions (height $H$ and width $W$), resulting in a global response $z$ for each channel. Then, using the channel weighting computation in Equation (5), a channel weight vector $s$ is calculated through two fully connected layers (with weight matrices $W_1$ and $W_2$ for the first and second layers, respectively) and ReLU and Sigmoid activation functions. This vector $s$ is employed to rescale each channel of the input feature map. Eventually, via the feature rescaling in Equation (6), the channel weight vector $s$ undergoes element-wise multiplication (denoted by $\odot$) with the input feature map $\hat{X}_i$, yielding the attention-refined feature map $X_i^{\mathrm{att}}$. This mechanism adaptively modulates the response of each channel, allowing the network to concentrate on the most task-relevant features, thereby improving model performance. The channel reconstruction weights $X_{\text{Channel Reconstruction}}$ are obtained by adding the attention weights calculated based on $X_1$ and $X_2$.
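A minimal PyTorch sketch of this channel-reconstruction step (split, grouped depthwise convolution, SE reweighting per Equations (4)–(6), then summation) is shown below. The SE reduction ratio, kernel size of the depthwise convolutions, and the assumption of an even channel count are illustrative, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-Excitation block implementing Eqs. (4)-(6)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        mid = max(channels // r, 4)
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # Eq. (4): global average pooling -> z
            nn.Conv2d(channels, mid, 1), nn.ReLU(),      # delta(W1 z)
            nn.Conv2d(mid, channels, 1), nn.Sigmoid())   # Eq. (5): s = sigma(W2 delta(W1 z))

    def forward(self, x):
        return x * self.fc(x)                            # Eq. (6): X_att = s ⊙ X_hat

class ChannelReconstruction(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.dw1 = nn.Conv2d(half, half, 3, padding=1, groups=half)  # grouped depthwise conv on X1
        self.dw2 = nn.Conv2d(half, half, 3, padding=1, groups=half)  # grouped depthwise conv on X2
        self.se1, self.se2 = SE(half), SE(half)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                        # split along the channel dimension
        return self.se1(self.dw1(x1)) + self.se2(self.dw2(x2))  # add the attention-weighted halves
```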
Space Reconstruction
We enhance the spatial saliency of small targets and suppress background clutter through spatial reconstruction.
Here, we compute the spatial attention map $A \in \mathbb{R}^{B \times 1 \times H \times W}$ through cross-channel aggregation:
$$A_{\mathrm{average}} = \frac{1}{C}\sum_{c=1}^{C} X_{\text{Channel Reconstruction}}(c)$$
$$A_{\max} = \max_{c} X_{\text{Channel Reconstruction}}(c)$$
$$A = \sigma\big(\mathrm{Conv}_{7\times 7}([A_{\mathrm{average}};\, A_{\max}])\big)$$
We then apply secondary weighting to the attention map obtained from channel reconstruction:
$$X_{\text{Space Reconstruction}} = A \odot X_{\text{Channel Reconstruction}}$$
By utilizing the dual-channel reconstruction mechanism, we amplify the prominence of target feature representations while mitigating the impact of non-target areas.
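The spatial-reconstruction step in Equations (7)–(10) maps directly onto a few tensor operations; the short sketch below illustrates it, with only the 7 × 7 convolution's padding chosen by assumption so the map keeps its spatial size.

```python
import torch
import torch.nn as nn

class SpatialReconstruction(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x_cr: torch.Tensor) -> torch.Tensor:
        a_avg = x_cr.mean(dim=1, keepdim=True)              # Eq. (7): cross-channel average map
        a_max = x_cr.amax(dim=1, keepdim=True)              # Eq. (8): cross-channel max map
        a = torch.sigmoid(self.conv(torch.cat([a_avg, a_max], dim=1)))  # Eq. (9): 7x7 conv + sigmoid
        return a * x_cr                                     # Eq. (10): secondary spatial weighting
```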

3.2.3. Gated Fusion

We facilitate the integration of the enhanced multi-scale feature tensor via the creation of gating signals.
For each scale feature $f_i \in \{f_0, f_1, f_2, f_3\}$: initially, we compute a channel descriptor $e_i$ through Global Average Pooling (GAP). Subsequently, we introduce a learnable method for generating gating weights:
$$g_i = \mathrm{sigmoid}(W_g e_i + b_g)$$
Here, both $W_g$ and $b_g$ are learnable parameters of the network, and the sigmoid function is utilized as the activation mechanism to produce the gating weights.
Finally, the weights derived from the gating process direct the weighted integration of the multi-scale feature tensor:
$$w_i = \frac{\exp(g_i)}{\sum_{j=0}^{3}\exp(g_j)}, \qquad F_{\mathrm{Fusion}} = \sum_{i=0}^{3} w_i \cdot f_i$$
Based on the previous description, $f_0$ dominates small target detection, while $f_3$ dominates large-area target recognition. Utilizing the Gated Multi-scale Attention Pyramid (GMAP), we amplify the prominent features of targets while effectively reducing the superfluous details from intricate backgrounds.
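A minimal sketch of the gated fusion in Equations (11)–(12) follows: each scale's GAP descriptor passes through its own learnable linear gate, the gate values are softmax-normalized, and the four scale features are summed with those weights. Whether the gate is a single scalar per scale (as assumed here) or per-channel is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    def __init__(self, channels: int, num_scales: int = 4):
        super().__init__()
        # one learnable gate (W_g, b_g) per scale, producing a scalar gating signal
        self.gates = nn.ModuleList([nn.Linear(channels, 1) for _ in range(num_scales)])

    def forward(self, feats):                               # feats = [f0, f1, f2, f3]
        g = [torch.sigmoid(gate(f.mean(dim=(2, 3))))        # e_i via GAP, then Eq. (11)
             for gate, f in zip(self.gates, feats)]
        g = torch.stack(g, dim=1)                           # (B, 4, 1) gating signals
        w = F.softmax(g, dim=1)                             # Eq. (12): normalized weights w_i
        fused = sum(w[:, i].unsqueeze(-1).unsqueeze(-1) * f # weighted sum -> F_Fusion
                    for i, f in enumerate(feats))
        return fused
```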

3.3. Asymmetric Dynamic Downsampling (ADDown)

To ensure that the key modules better fulfill their roles and that the effective information they extract is not lost during downsampling, this paper proposes ADDown, as shown in Figure 8. ADDown combines asymmetric decompositional depthwise separable convolution downsampling with a dynamic channel attention mechanism. This enhances the independent sensing of horizontal and vertical edges, effectively prevents information loss, and achieves efficient utilization of spatial information and adaptive enhancement of channel features. During the downsampling phase of global network inference, this mechanism fully retains the extracted key information, significantly enhancing the model’s robustness.
Figure 8 demonstrates that the ADDown module utilizes a two-step feature transformation framework: Initially, channel dimension alignment is achieved via a 1 × 1 convolutional layer; subsequently, the input undergoes downsampling through asymmetric depthwise separable convolution; finally, feature calibration is accomplished via dynamic compression-ratio channel attention. Below, we detail the computational workflow of ADDown.
The input feature map $X \in \mathbb{R}^{B \times C_1 \times H \times W}$ undergoes a learnable 1 × 1 convolution to accurately modify the number of feature channels, yielding $X' \in \mathbb{R}^{B \times C_2 \times H \times W}$.
Following the adjustment of the channel count to obtain $X'$, it is fed into a decomposed depthwise separable convolution to improve the response ability towards target features while performing spatial downsampling.
$$X_h = f_{3\times 1}(X') = \sum_{i=-1}^{1} W_h[i, 0]\cdot X'[:, :, h+i, w]$$
$$X_v = f_{1\times 3}(X_h) = \sum_{j=-1}^{1} W_v[0, j]\cdot X_h[:, :, h, w+j]$$
In the equations, $W_h \in \mathbb{R}^{C_2 \times 3}$ and $W_v \in \mathbb{R}^{C_2 \times 3}$ represent the depthwise convolution kernels for the horizontal and vertical directions, respectively. Through separable convolution, these kernels achieve an equivalent receptive field of 5 × 5:
$$R_{\mathrm{eff}} = R_{3\times 1} \oplus R_{1\times 3} = 3 + (3 - 1) = 5$$
Why Choose Decomposed Depthwise Separable Convolution? The square structure of a traditional 3 × 3 convolutional kernel is susceptible to variations in shape orientation. When the target shape has varied directions, its edge response will diminish due to the fixed geometric layout of the convolutional kernel. Decomposed convolution achieves spatial decoupling through independent horizontal and vertical convolutions, as shown in Formulas (16)–(18).
$$R(\theta) = \sum_{i=-1}^{1}\sum_{j=-1}^{1} W_{i,j}\cdot X_{x+i,\, y+j}$$
$$R_h = \sum_{i=-1}^{1} W_{h,i}\cdot X_{x+i,\, y}$$
$$R_v = \sum_{j=-1}^{1} W_{v,j}\cdot X_{x,\, y+j}$$
This separable operation enables the convolution kernel to respond independently to horizontal/vertical features. When the target direction or shape changes by $\theta$ degrees, the response intensity of the original 3 × 3 convolution decreases due to the angle deviation, following a $\cos\theta$ attenuation pattern. In contrast, decomposed convolution maintains rotational sensitivity through dynamic weight combination ($W_h\cdot\cos\theta + W_v\cdot\sin\theta$).
Finally, through feature enhancement with an attention mechanism of adaptive compression ratio, the response to effective information is further enhanced.
$$z = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} X[:, :, h, w]$$
$$q = \delta\big(W_2(\delta(W_1 z))\big)$$
$$A = \sigma(q)$$
Here, $W_1 \in \mathbb{R}^{\max(C_2/16,\,4) \times C_2}$ and $W_2 \in \mathbb{R}^{C_2 \times \max(C_2/16,\,4)}$ form the dynamic bottleneck layer. When $C_2 < 64$, a minimum intermediate dimension of 4 channels is maintained to prevent excessive compression and potential information loss.
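Putting the three stages together, the following minimal PyTorch sketch of ADDown performs 1 × 1 channel alignment, asymmetric (3 × 1 then 1 × 3) depthwise convolutions with stride 2 for downsampling, and channel attention with the dynamically bounded bottleneck of Equations (19)–(21). The stride placement and the absence of normalization layers are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ADDownSketch(nn.Module):
    def __init__(self, c1: int, c2: int):
        super().__init__()
        self.align = nn.Conv2d(c1, c2, kernel_size=1)                     # 1x1 channel alignment
        # asymmetric depthwise convolutions: horizontal (3x1) then vertical (1x3), each halving one dimension
        self.dw_h = nn.Conv2d(c2, c2, (3, 1), stride=(2, 1), padding=(1, 0), groups=c2)
        self.dw_v = nn.Conv2d(c2, c2, (1, 3), stride=(1, 2), padding=(0, 1), groups=c2)
        mid = max(c2 // 16, 4)                                            # dynamic bottleneck width
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                      # Eq. (19): GAP -> z
            nn.Conv2d(c2, mid, 1), nn.ReLU(),                             # delta(W1 z)
            nn.Conv2d(mid, c2, 1), nn.ReLU(),                             # Eq. (20): q = delta(W2 ...)
            nn.Sigmoid())                                                 # Eq. (21): A = sigma(q)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.align(x)
        x = self.dw_v(self.dw_h(x))        # edge-aware downsampling (H and W each halved)
        return x * self.attn(x)            # dynamic channel attention calibration
```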

4. Experiments

In this section, we perform comprehensive experiments on three publicly available remote sensing target detection benchmark datasets: DIOR [14], DOTAv1.0 [19], and RSOD [20]. These experiments demonstrate the superiority of the YOLO-SBA introduced in this paper and the efficacy of the three modules: Multi-Branch Attention Feature Fusion Module (MBAFF), Gated Multi-scale Attention Pyramid (GMAP), and Asymmetric Dynamic Downsampling (ADDown). Section 4.1 presents the three datasets widely used in remote sensing object detection; Section 4.2 outlines the evaluation criteria used for experimental validation; in Section 4.3, we elaborate on the experimental equipment and parameter settings; in Section 4.4, we compare our approach with state-of-the-art target detection algorithms used in remote sensing, demonstrating that the YOLO-SBA algorithm presented in this study is at the forefront of research; in Section 4.5, we perform ablation studies to validate the efficacy of the proposed modules; and in Section 4.6, we visually analyze the superiority of the algorithm and its detection capability on remote sensing images.

4.1. Datasets

This study performs experiments on three prominent remote sensing object detection datasets: DIOR [14], DOTAv1.0 [19], and RSOD [20].
DIOR [14] datasets. The DIOR [14] dataset, released by China’s Northwestern Polytechnical University in 2020, serves as a comprehensive benchmark for object detection in optical remote sensing images. It comprises 23,463 images with an 800 × 800 pixel resolution, spanning 20 categories including airplane (AL), airport (AT), baseball field (BF), basketball court (BC), bridge (B), chimney (C), dam (D), expressway service area (ESA), expressway toll station (ETS), harbor (HB), golf course (GC), ground track field (GTF), overpass (O), ship (S), stadium (SD), storage tank (ST), tennis court (TC), train station (TS), vehicle (V), and windmill (WM). The dataset includes 190,288 annotated horizontal bounding box instances across various spatial resolutions (0.5 to 30 m), encompassing urban, rural, mountainous, and coastal scenes. With its diverse categories and numerous small targets, DIOR is widely employed for assessing target detection algorithm performance.
DOTAv1.0 [19] datasets. The DOTAv1.0 [19] dataset, released by Wuhan University in 2018, serves as a benchmark for target detection in aerial images. It includes 2806 remote sensing images with a resolution of 4000 × 4000 pixels (some varying from 800 × 800 to 20,000 × 20,000 pixels). The dataset includes 15 common target categories: Plane (PL), Baseball Diamond (BD), Bridge (BR), Ground Track Field (GTF), Small Vehicle (SV), Large Vehicle (LV), Ship (S), Tennis Court (TC), Basketball Court (BC), Storage Tank (ST), Soccer-ball Field (SBF), Roundabout (R), Harbor (HB), Swimming Pool (SP), and Helicopter (HC). It contains a total of 188,282 annotated bounding boxes. In this paper, the DOTA [19] dataset is processed by cropping and stitching to ensure that the image resolution does not exceed 1024 × 1024. Then, 21,046 images are randomly selected from the processed dataset for experiments.
RSOD [20] datasets. The RSOD [20] (Remote Sensing Object Detection) dataset, released by Wuhan University in 2015, functions as an open benchmark for target detection in remote sensing imagery. It includes four primary target categories: aircraft (446 images with 4993 annotations), oil tanks (165 images with 1586 annotations), playgrounds (189 images with 191 annotations), and overpasses (176 images with 180 annotations). The dataset contains a total of 976 images and 6950 annotated instances, each with bounding box coordinates and category labels. The images have an 800 × 800 pixel resolution and depict various scenes including urban areas, suburban regions, and transportation hubs. The target sizes vary considerably, from small oil tanks around 10 pixels to large airport facilities spanning several hundred pixels.

4.2. Evaluation Metrics

We adopt mAP [42] (Mean Average Precision) as the primary evaluation metric to assess the model’s performance. The computation of this metric depends on the equilibrium between Precision and Recall, defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Here, TP (True Positive) represents the number of correct detections where the Intersection over Union (IoU) between the predicted bounding box and the ground truth exceeds a threshold (commonly 0.5). FP (False Positive) indicates the number of incorrect detections with insufficient or excessive IoU, whereas FN (False Negative) denotes the number of missed true targets. The IoU is calculated as the ratio of the overlapping area to the total area of the predicted and actual bounding boxes:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$
For each category, the predicted bounding boxes are sorted in descending order according to their confidence levels. Then, Precision and Recall values are computed progressively at various confidence thresholds to generate the P-R curve (with Recall on the x-axis and Precision on the y-axis). The area beneath this curve represents the AP (Average Precision) for that category:
$$AP = \int_{0}^{1} P(r)\, dr$$
We also introduce size-based evaluation metrics: APs (Average Precision for small targets), APm (Average Precision for medium targets), and APl (Average Precision for large targets), to comprehensively assess the model’s ability to detect targets across various scales. APs is for small targets with a pixel area less than 32 × 32; APm evaluates the detection performance for medium targets with a pixel area between 32 × 32 and 96 × 96; APl evaluates the detection performance for large targets with a pixel area greater than 96 × 96.
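The metric pipeline described above (confidence-sorted detections, cumulative Precision/Recall, and area under the P-R curve) can be illustrated with a short NumPy sketch; the IoU-based TP/FP labels and ground-truth count are taken as given, and the toy data at the end are purely illustrative, not results from the paper.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class from detection confidences and TP/FP flags (IoU threshold 0.5 assumed upstream)."""
    order = np.argsort(-np.asarray(scores))                   # sort detections by descending confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)                          # Recall = TP / (TP + FN)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)    # Precision = TP / (TP + FP)
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):                       # integrate the P-R curve: AP = ∫ P(r) dr
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# toy example: 4 detections for one class against 3 ground-truth boxes
print(average_precision(scores=[0.9, 0.8, 0.7, 0.6], is_tp=[1, 0, 1, 1], num_gt=3))
```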

4.3. Training Details

This study follows the efficient design principles of the Ultralytics [10] framework for experimental setup and parameter configuration. The SGD optimizer is used with an initial learning rate of $lr_0 = 0.01$, a momentum of 0.937, and a weight decay of $5 \times 10^{-4}$. The batch size ranges from 4 to 16 to balance memory usage and convergence stability. The input size is set to 1024 × 1024. Data augmentation includes Mosaic augmentation followed by RandomPerspective, which involves rotation, shearing, and perspective transformations; the scale parameter is set to 0.5, while shear is set to 0.0. Experiments are performed on an Ubuntu system equipped with an NVIDIA GeForce RTX 4090 GPU and 64 GB RAM.
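For reference, a minimal sketch of how these stated hyperparameters would map onto an Ultralytics-style training call is given below. The model configuration name, dataset YAML, and epoch count are placeholders introduced here for illustration, not values reported by the authors.

```python
from ultralytics import YOLO

model = YOLO("yolov10s.yaml")     # baseline YOLOv10 config (placeholder); YOLO-SBA would swap in its own modules
model.train(
    data="dior.yaml",             # hypothetical dataset configuration file
    imgsz=1024,                   # input size 1024 x 1024
    batch=16,                     # batch size chosen in the 4-16 range depending on memory
    optimizer="SGD",
    lr0=0.01, momentum=0.937, weight_decay=5e-4,
    mosaic=1.0, scale=0.5, shear=0.0,   # Mosaic + RandomPerspective settings from the text
    epochs=300,                   # placeholder epoch count
)
```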

4.4. Comparison with State-of-the-Art Algorithms

In this part, we evaluate our approach against traditional target detection methods and cutting-edge algorithms across three datasets.
Comparison on DIOR [14] dataset. On the DIOR [14] dataset, we evaluated our approach against 18 sophisticated algorithms, as detailed in Table 1. The YOLO-SBA algorithm introduced in this paper achieved an mAP of 89.8% on the DIOR [14] dataset, marking a 16.6% improvement over the current leading AGMF-Net [38] algorithm. Among the 20 categories in the DIOR [14] dataset, we achieved the best detection accuracy in all 20 categories: 97.5% (Airplane, AL), 93.3% (Airport, AT), 96.5% (Baseball Field, BF), 92.1% (Basketball Court, BC), 71.2% (Bridge, B), 89.9% (Chimney, C), 85.6% (Dam, D), 96.9% (Expressway Service Area, ESA), 91.8% (Expressway Toll Station, ETS), 90.0% (Golf Course, GC), 93.2% (Ground Track Field, GTF), 77.9% (Harbor, HB), 76.7% (Overpass, O), 95.6% (Ship, S), 97.5% (Stadium, SD), 93.1% (Storage Tank, ST), 97.2% (Tennis Court, TC), 82.3% (Train Station, TS), 81.8% (Vehicle, V), and 96.8% (Windmill, WM).
Comparison on DOTA [19] dataset. We evaluated our approach against 18 cutting-edge algorithms on the DOTA [19] dataset, as detailed in Table 2. The YOLO-SBA algorithm presented in this paper achieved a mean Average Precision (mAP) of 76.14% on the DOTA [19] dataset, demonstrating a 0.81% improvement over the current leading FSoD-Net [43] algorithm. Among the 15 categories in the DOTA [19] dataset, we achieved the best detection accuracy in 4 categories: 86.8% (Harbor, HB), 88.7% (Large Vehicle, LV), 65.1% (Soccer-ball Field, SBF), and 95.6% (Tennis Court, TC).
Table 1. Comparison with state-of-the-art algorithms on the DIOR [14] dataset. The metric is mAP(%), and the bold values denote the top performance, while blue values signify the second-best results.
| Method | mAP | AL | AT | BF | BC | B | C | D | ESA | ETS | GC | GTF | HB | O | S | SD | ST | TC | TS | V | WM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLT [44] | 60.3 | 64.7 | 69.0 | 32.8 | 87.8 | 32.3 | 71.5 | 45.8 | 54.9 | 55.8 | 65.6 | 66.3 | 49.9 | 49.9 | 87.7 | 30.3 | 73.3 | 82.0 | 29.9 | 52.4 | 74.0 |
| Faster-RCNN [45] | 53.6 | 51.4 | 61.6 | 62.2 | 80.7 | 27.0 | 74.2 | 37.3 | 53.4 | 45.1 | 69.6 | 61.8 | 43.7 | 49.0 | 56.1 | 41.8 | 39.6 | 73.9 | 44.7 | 34.0 | 65.3 |
| Cascade R-CNN [46] | 67.4 | 81.2 | 81.4 | 90.1 | 81.1 | 46.9 | 81.5 | 68.1 | 84.2 | 65.3 | 74.5 | 81.6 | 37.8 | 60.4 | 68.9 | 88.8 | 60.5 | 81.0 | 57.5 | 47.7 | 80.7 |
| HSF-Net [47] | 54.9 | 51.0 | 62.1 | 60.2 | 80.7 | 27.4 | 72.8 | 43.8 | 52.1 | 45.2 | 67.5 | 58.7 | 34.0 | 48.9 | 70.8 | 45.5 | 53.5 | 72.9 | 46.8 | 37.4 | 67.0 |
| SSD [42] | 60.9 | 92.9 | 48.3 | 77.9 | 77.2 | 35.7 | 66.6 | 46.5 | 47.9 | 58.5 | 37.3 | 53.7 | 47.9 | 39.9 | 59.6 | 77.5 | 80.8 | 88.3 | 55.6 | 38.8 | 86.7 |
| YOLOv3 [5] | 59.9 | 67.5 | 54.7 | 65.8 | 86.8 | 34.2 | 73.5 | 34.3 | 55.7 | 49.6 | 67.3 | 68.9 | 54.3 | 51.7 | 86.8 | 40.3 | 67.8 | 83.9 | 32.3 | 49.1 | 73.6 |
| YOLOv4 [7] | 65.9 | 78.3 | 69.7 | 69.6 | 88.3 | 38.7 | 76.9 | 48.6 | 57.6 | 57.9 | 74.1 | 70.6 | 52.8 | 56.8 | 89.4 | 50.4 | 77.5 | 83.3 | 46.0 | 55.7 | 75.9 |
| RFB-Net [48] | 62.6 | 93.8 | 48.3 | 80.2 | 77.7 | 40.7 | 67.5 | 47.4 | 48.5 | 70.0 | 37.3 | 54.9 | 48.2 | 43.1 | 66.1 | 78.0 | 80.0 | 89.1 | 55.6 | 40.4 | 88.3 |
| RetinaNet [49] | 65.2 | 63.4 | 47.8 | 83.3 | 78.4 | 48.2 | 67.9 | 49.4 | 53.2 | 73.6 | 66.3 | 59.1 | 47.6 | 45.7 | 72.0 | 82.4 | 80.7 | 90.1 | 55.0 | 47.7 | 92.9 |
| CenterNet [50] | 61.6 | 64.0 | 66.3 | 65.7 | 86.3 | 34.8 | 73.1 | 41.1 | 60.8 | 54.2 | 73.0 | 66.0 | 45.3 | 53.3 | 81.3 | 53.5 | 63.7 | 80.9 | 44.1 | 46.3 | 78.8 |
| EfficientDet D0 [51] | 59.9 | 91.9 | 47.3 | 79.2 | 76.0 | 36.0 | 66.5 | 46.8 | 65.6 | 47.5 | 37.0 | 53.3 | 45.8 | 39.8 | 54.9 | 77.2 | 76.3 | 87.3 | 54.4 | 31.9 | 85.1 |
| EfficientDet D2 [51] | 61.5 | 72.5 | 66.2 | 64.3 | 87.0 | 33.2 | 74.0 | 43.1 | 71.8 | 54.3 | 55.6 | 52.7 | 37.4 | 47.0 | 86.1 | 66.8 | 70.9 | 81.1 | 50.2 | 43.3 | 73.0 |
| EfficientDet D4 [51] | 66.1 | 86.5 | 57.4 | 75.7 | 85.2 | 33.5 | 75.4 | 65.6 | 80.1 | 67.4 | 58.3 | 71.4 | 35.6 | 50.6 | 78.8 | 90.3 | 61.8 | 82.9 | 54.6 | 30.0 | 81.5 |
| S3FD [52] | 54.7 | 71.7 | 46.3 | 72.7 | 75.5 | 31.0 | 71.9 | 47.9 | 51.7 | 49.6 | 67.2 | 70.9 | 23.0 | 45.1 | 44.3 | 61.3 | 43.9 | 80.0 | 30.0 | 34.3 | 74.8 |
| M2Det [53] | 57.6 | 63.3 | 67.6 | 71.1 | 80.2 | 32.1 | 72.5 | 58.0 | 62.3 | 52.5 | 69.8 | 63.5 | 41.3 | 52.0 | 35.9 | 66.3 | 35.9 | 72.2 | 52.4 | 25.0 | 78.3 |
| FSoD-Net [43] | 71.8 | 88.9 | 66.9 | 86.8 | 90.2 | 45.5 | 79.6 | 48.2 | 86.9 | 75.5 | 67.0 | 77.3 | 53.6 | 59.7 | 78.3 | 69.9 | 75.0 | 91.4 | 52.3 | 52.0 | 90.6 |
| TMAPNet [33] | 72.9 | 92.2 | 77.7 | 75.0 | 91.3 | 47.1 | 78.6 | 53.6 | 67.1 | 66.2 | 78.5 | 76.3 | 64.9 | 61.4 | 90.5 | 72.2 | 75.4 | 90.7 | 62.1 | 55.2 | 83.2 |
| AGMF-Net [38] | 73.2 | 90.9 | 72.8 | 79.3 | 89.7 | 44.7 | 81.4 | 59.3 | 66.0 | 62.7 | 73.8 | 79.2 | 65.0 | 61.7 | 91.7 | 78.6 | 75.8 | 90.7 | 60.0 | 58.0 | 83.1 |
| YOLO-SBA (ours) | 89.8 | 97.5 | 93.3 | 96.5 | 92.1 | 71.2 | 89.9 | 85.6 | 96.9 | 91.8 | 90.0 | 93.2 | 77.9 | 76.7 | 95.6 | 97.5 | 93.1 | 97.2 | 82.3 | 81.8 | 96.8 |
Table 2. Comparison with state-of-the-art algorithms on the DOTA [19] dataset. The metric is mAP(%), and the bold values denote the top performance, while blue values signify the second-best results.
MethodmAPBDBCBGTFHBHLVPRSSVSBFSTSPTC
YOLT [44]61.4460.570.9322.0937.5456.5822.1670.4887.2057.6284.4370.7552.7078.9655.7693.94
Faster-RCNN [45]56.1363.7050.1029.6054.8665.7148.3762.5976.1540.9167.728.5955.8467.3343.6486.89
HSF-Net [47]62.3669.8869.4037.7057.9866.8441.7864.2479.9652.8171.8466.7647.4761.7959.1587.87
SSD [42]50.1932.8339.4023.0238.8265.2122.1662.8485.3734.8352.3133.4452.2967.3355.7387.30
YOLOv3 [5]65.6768.7866.8245.9351.9274.0356.6860.6793.9145.0185.5650.1252.4583.4755.8593.88
YOLOv4 [7]67.9772.8260.0470.0072.2764.5360.2167.3977.0979.2978.9749.7661.4560.9660.9783.81
RFB-Net [48]64.0349.7663.8335.7436.3563.2120.5377.5387.7567.9390.0578.0033.5789.5472.7893.84
RetinaNet [49]63.2768.1762.3365.9676.2265.9648.5062.7872.9980.9068.5925.5157.7851.3157.8184.20
CenterNet [50]73.9478.5666.1145.3953.3978.8666.8280.2497.3769.0290.3062.1664.8685.7575.6394.58
EfficientDet D0 [51]61.9664.6364.6344.2272.2551.1455.0552.3993.0942.7481.5738.8752.1278.2744.5492.18
EfficientDet D2 [51]68.6173.6060.5172.1474.6264.5760.5067.3877.3780.2878.6450.4661.8261.5962.1183.58
EfficientDet D4 [51]70.7272.2364.0173.2673.2365.3562.0263.1593.2779.9485.6159.7040.9275.9459.5692.66
S3FD [52]30.2230.6036.4323.7125.1233.919.1126.9236.3316.5127.0125.1128.5227.2026.2381.01
M2Det [53]34.4859.3443.9127.4240.6335.2218.2126.9136.1349.8418.119.1135.3227.3317.3172.70
ICN [54]68.1674.3079.0647.7070.3267.0250.2367.8281.3662.9069.9864.8953.6478.264.1790.76
Rol Transformer [55]69.5678.5277.2743.4475.9262.8347.6773.6888.6453.5483.5968.8158.3981.4658.9390.74
Rotated-YOLOX [56]66.3265.1776.5845.2454.9756.0344.8872.7188.7857.8879.6072.9741.3682.2365.2390.87
FSoD-Net [43]75.3371.1378.9351.3457.4671.3464.8682.1593.1170.6492.3978.1853.3293.4878.6992.87
YOLO-SBA (ours)76.1475.4068.5052.4067.8086.8066.4088.7095.9065.4090.8074.4065.1083.0065.9095.60
Comparison on RSOD [20] dataset. On the RSOD [20] dataset, we evaluated our method against eight sophisticated algorithms, as detailed in Table 3. The YOLO-SBA algorithm introduced in this study achieved an mAP of 94.7% on the RSOD [20] dataset, representing a 0.4% improvement over the current leading AGMF-Net [38] algorithm. In the RSOD [20] dataset, across its 4 categories, we obtained the top detection accuracy of 97.56% for the Aircraft class.

4.5. Ablation Study

To demonstrate the efficacy of the YOLO-SBA algorithm introduced in this paper, as well as the MBAFF module, GMAP module, and ADDown module, this section showcases three ablation experiments performed on the DIOR [14], DOTA [19], and RSOD [20] datasets. The results are presented in Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9. Specifically, Table 4, Table 5 and Table 6 present the comparison of the YOLO-SBA or modules with the baseline in terms of mAP and detection accuracy for each category on each dataset. Table 7, Table 8 and Table 9 display the APs, APm, and APl values for each dataset, comparing the proposed algorithm or modules with the baseline. These comparisons offer a more precise evaluation of the individual impacts of the modules and the overall effectiveness of the algorithm.
Baseline: The baseline for this section is YOLOv10 [6], which extracts features from remote sensing images using an enhanced version of CSPNet [62]. Subsequently, it employs SPPF and PSA for feature integration and augmentation. Ultimately, the refined feature maps are fed into the detection head to complete the object detection task. The baseline achieves mAP values of 85.9%, 72.00%, and 88.6% on the DIOR [14], DOTA [19], and RSOD [20] datasets, respectively.
MBAFF: The MBAFF module captures local details and global semantic information through a multi-branch structure with differentiated receptive fields. It also implements adaptive feature enhancement through a dual-channel attention mechanism, strengthening the feature response of targets with multi-scale variations. As shown in Table 4, Table 5 and Table 6, after adding the MBAFF module, the model’s mAP value increased by 2.5%, 3.0%, and 4.7% on the DIOR [14], DOTA [19], and RSOD [20] datasets, respectively. In the evaluation of single-category target detection, the single-category AP value on the DIOR [14] dataset increased by 0.35–7.17%, as shown in Table 4, fully demonstrating the module’s capability to handle target detection at different scales. From Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9, we can see that after adding the MBAFF module, both APl and mAP values showed significant improvements, with APl and mAP increasing by 2.5–4.7% on the three datasets. On the RSOD [20] dataset, the AP value for the overpass category increased by 16.20%.
GMAP: The GMAP module integrates multi-scale feature extraction with a gating fusion mechanism. It extracts multi-scale representations through a cascaded pooling structure, utilizes a dual-channel reconstruction mechanism to finely suppress background noise interference, and adaptively modulates the fusion weights of different scale features using a gating fusion mechanism. This amplifies the prominence of target features and more effectively reduces the noise interference from complex backgrounds. As shown in Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9, after adding the GMAP module to the baseline, the AP values for target detection in various categories improved significantly. On the DIOR [14] dataset, the AP values for bridge (B), dam (D), expressway toll station (ETS), overpass (O), storage tank (ST), train station (TS), and vehicle (V) categories increased by 7.01%, 4%, 6.31%, 5.17%, 3.29%, 6.3%, and 7.17%, respectively. As shown in Table 7, Table 8 and Table 9, after adding the GMAP module, the mAP values on the three datasets improved, with significant increases in APs and APm values. The APs value on the DIOR [14] dataset increased by 4.7%, while on the RSOD [20] and DOTA [19] datasets, the APs increased by 0.8%, demonstrating the effective suppression of complex background influence and the improvement in detecting small targets.
ADDown: Integrating asymmetric decompositional depthwise separable convolution downsampling with a dynamic channel attention mechanism for feature calibration effectively retains more key information and enhances the model’s robustness. As shown in Table 7, Table 8 and Table 9, after adding the ADDown module to the baseline, the mAP values on the DIOR [14], DOTA [19], and RSOD [20] datasets increased by 1.0%, 1.31%, and 0.8%, respectively. On the RSOD [20] dataset, APm increased by 1.1%. On the DIOR [14] dataset, APs increased by 4.9% and APm by 2.5%. On the DOTA [19] dataset, APs improved by 0.9%, showcasing its prowess in detecting medium and small targets and highlighting the module’s ability to preserve crucial detail information.
YOLO-SBA: As shown in Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9, integrating the three modules into the baseline gives YOLO-SBA significant improvements in mAP: on the DIOR [14], DOTA [19], and RSOD [20] datasets, the mAP increases by 3.9%, 4.14%, and 6.1%, respectively, compared to the baseline, and the single-category AP values increase by 0.19–29.94%. This indicates that the three modules work synergistically, validating the effectiveness of the proposed algorithm.
To analyze each module and the interactions between modules in more detail, we conducted ablation experiments on the RSOD [20] dataset for the three pairwise combinations MBAFF+GMAP, MBAFF+ADDown, and GMAP+ADDown, as shown in Table 10.
MBAFF+GMAP. Combining Table 6 and Table 10 makes the role of each module clearer. Adding MBAFF+GMAP to the baseline improves accuracy significantly, but compared with using only MBAFF or only GMAP, the AP on aircraft and oil tank decreases, and the mAP also drops. Densely inserting MBAFF into the backbone strengthens the model's perception of multi-scale information, which is then passed to GMAP to suppress redundant background information. However, during transmission through the network there is no guidance for key information, and the ordinary downsampling stages weaken the transfer of target details. When this weakened detail information is fed directly into GMAP for redundancy suppression, fine-grained target cues within the multi-scale features are easily mistaken for noise, so the mAP falls below that obtained with MBAFF alone.
MBAFF+ADDown. Adding MBAFF+ADDown to the baseline improves performance, but the result is worse than using MBAFF alone. After MBAFF is densely inserted into the backbone, the network's perception of multi-scale information is enhanced, and with ADDown transmitting information effectively, the model absorbs a large amount of image content, including redundant background information. Because nothing at the end of the backbone filters this redundancy, the unfiltered complex information interferes with detection; in this configuration, the more faithful information transmission provided by ADDown is not conducive to the model's judgment of targets, so performance falls short of using MBAFF alone.
GMAP+ADDown. Adding GMAP+ADDown to the baseline improves performance significantly compared with using only GMAP or only ADDown. The model's perception capability is unchanged, so it captures the same amount of image information, but GMAP+ADDown strengthens both information transmission and information filtering, and together these two effects account for the improvement.
As shown in Table 11, we evaluated the parameter count, computational complexity, and inference time of the proposed modules and of the full algorithm. Compared to the baseline, YOLO-SBA does not significantly increase model complexity or computational load. Although inference is slightly slower, detection performance improves substantially, and the inference time remains at the millisecond level, meeting real-time requirements. This indicates that in practical application scenarios, such as autonomous driving, security surveillance, and human-computer interaction, YOLO-SBA can respond quickly and provide accurate detection results, supporting efficient operation and timely decision-making.
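For reference, the sketch below shows a typical way to obtain the parameter count and per-image latency reported in such comparisons; the input size, warm-up count, and device are assumptions, and GFLOPS would additionally require a profiler such as thop or fvcore.

```python
# Sketch of how parameter count and average per-image latency are commonly
# measured for a detector; values depend on hardware and input resolution.
import time
import torch

def count_parameters_m(model) -> float:
    return sum(p.numel() for p in model.parameters()) / 1e6   # in millions

@torch.no_grad()
def average_latency_ms(model, input_size=(1, 3, 640, 640), runs=100, device="cuda"):
    model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    for _ in range(10):                      # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0
```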
To justify the choice of the expansion rate d in the MBAFF module, we conducted ablation experiments on the RSOD [20] dataset with different expansion rates, as shown in Figure 9. The results show that the model performs best when the expansion rate d = 4 , so this paper uses an expansion rate of 4.
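As background for this choice, a quick calculation under the assumption of 3 × 3 branch kernels shows how the effective receptive field grows with d:

```python
# Effective kernel size of a k x k convolution with dilation rate d:
#   k_eff = k + (k - 1) * (d - 1)
# Assuming 3 x 3 branch kernels, d = 4 corresponds to a 9 x 9 receptive field.
k = 3
for d in (1, 2, 4, 8):
    k_eff = k + (k - 1) * (d - 1)
    print(f"dilation d={d}: effective kernel {k_eff} x {k_eff}")
```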

4.6. Visual Analysis

In Figure 10, we provide a visual demonstration of the YOLO-SBA algorithm’s performance across various tasks on the DIOR [14], RSOD [20], and DOTA [19] datasets. The tasks include detecting small, medium, and large objects, identifying targets in complex scenarios, and handling multi-scale detection within the same background. It is clear that the YOLO-SBA algorithm exhibits excellent performance in all five scenarios.
As shown in Figure 11, for different scenes (such as mountains and cities), compared to using only YOLOv10 [6], the heatmaps generated by YOLO-SBA more accurately highlight the target regions and reduce background noise interference. This indicates that YOLO-SBA has significant advantages in handling the dual challenges of complex backgrounds and multi-scale targets, providing more precise and robust detection results.
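The paper does not state how these attention heatmaps are produced; one common choice is a Grad-CAM-style visualization, sketched below under that assumption. Here `model`, `target_layer`, and `score_fn` are placeholders for the detector, the feature layer being inspected, and a scalar score of interest.

```python
# Hedged sketch of a Grad-CAM-style heatmap for a generic CNN detector;
# this is an assumed visualization method, not necessarily the one used here.
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, score_fn):
    """Return a [0, 1] heatmap of target_layer activations weighted by gradients.

    score_fn maps the model output to a scalar (e.g., the top objectness score).
    """
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    try:
        model.zero_grad()
        score = score_fn(model(image))
        score.backward()
        weights = grads["a"].mean(dim=(2, 3), keepdim=True)        # channel weights
        cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)   # normalize to [0, 1]
    finally:
        h1.remove()
        h2.remove()
    return cam
```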

5. Discussion

This study introduces a multi-scale and complex background-aware network named YOLO-SBA, tailored for remote sensing target detection. The efficacy of the algorithm and its individual modules is validated through experimental evaluation. From the experimental results, it is evident that each module exhibits different advantages in the detection task. In this section, we analyze the proposed algorithm and the roles of each module, and provide explanations for the anomalous performance in the experiments.
MBAFF. Incorporating the MBAFF module into the baseline results in a notable enhancement of multi-scale target detection capabilities. On the DIOR [14] dataset, the single-category AP values increased by 0.35–7.17%. In other datasets, the detection accuracy for various categories also showed significant improvements, demonstrating the efficacy of this module in multi-scale target detection.
GMAP. Integrating the GMAP module into the baseline enhances detection accuracy across all categories, particularly for medium and small targets. The APs and APm metrics on the three datasets showed significant improvements. We believe this is primarily due to better suppression of complex backgrounds, which allows medium and small targets previously hidden in complex backgrounds to be detected, thus improving accuracy.
ADDown. After adding the ADDown module, the detection accuracy for all categories improves uniformly, with clear advantages for medium and small targets, indicating that ADDown effectively retains critical detail information during network inference. On the RSOD [20] dataset, however, the APs value after adding ADDown is lower than the baseline while the mAP still increases. Because training and model selection are guided by overall mAP, the final weights reflect a trade-off made to achieve better mAP performance; this only indicates that the gain from ADDown on the RSOD [20] dataset is modest, not that the module is ineffective.
YOLO-SBA. YOLO-SBA, by integrating the advanced characteristics of three modules, exhibits excellent performance in detecting multi-scale targets and targets in complex backgrounds, significantly surpassing the baseline model. Specifically, the input remote sensing images, after target saliency enhancement by MBAFF, provide good guidance for background suppression by GMAP. The collaboration of these two modules improves the model’s capability to differentiate targets from backgrounds. Simultaneously, ADDown retains detail information during the downsampling process of network inference, ensuring that the effective information extracted by key modules is preserved. Its design optimizes the functionality of MBAFF and GMAP. The synergistic integration of these three modules improves the model’s capability to detect targets across multiple scales and mitigate complex backgrounds, thereby increasing the accuracy of remote sensing target detection.
The different performances of YOLO-SBA on the three datasets. We tested the proposed YOLO-SBA algorithm extensively. On the DIOR [14] and RSOD [20] datasets, YOLO-SBA demonstrates excellent detection accuracy, reaching mAP values of 89.8% and 94.7%, respectively, with significant improvements for most categories and surpassing the current state-of-the-art algorithms. This indicates that YOLO-SBA handles these datasets exceptionally well and can accurately detect and classify various targets.
However, on the DOTA [19] dataset, although YOLO-SBA improves the mAP and achieves the best performance, its accuracy is noticeably lower than on the DIOR [14] and RSOD [20] datasets. This is mainly due to the characteristics of the DOTA [19] dataset itself and some limitations of the algorithm. Image resolutions in DOTA [19] range from 800 × 800 to 20,000 × 20,000 pixels, far larger than in the other datasets, which increases the demand for computing resources and affects detection speed and efficiency. In addition, because the dataset is so large, researchers usually crop the original images into patches below 1024 × 1024 for training and testing; the original 2806 DOTA [19] images expand to roughly 140,000 images after cropping and conversion, which further raises the computational burden. For convenience of training and testing, this paper randomly selected 21,046 images from the expanded dataset, including 7492 background images (images without targets); the selected images contain nearly 200,000 target instances, which is a substantial challenge for the algorithm. Moreover, after cropping, some targets lie at image edges and show only a corner or half of the object, and stitching cropped pieces can produce images with discontinuous backgrounds. Together, these factors may explain why the algorithm's performance on DOTA [19], including the baseline results, is less impressive than on the other datasets, and they point to directions for further improvement and optimization.
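For illustration, the sliding-window tiling commonly used to prepare DOTA images might look like the sketch below; the tile size and overlap are assumptions, since the paper only states that the crops are below 1024 × 1024.

```python
# Illustrative sketch of sliding-window tiling for very large aerial images.
# Tile size and overlap are assumed values; images smaller than the tile are
# expected to be padded or clamped by the downstream cropping code.
def tile_origins(width: int, height: int, tile: int = 1024, overlap: int = 200):
    """Yield (x, y) top-left corners so that tiles of size `tile` cover the image."""
    step = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, step))
    ys = list(range(0, max(height - tile, 0) + 1, step))
    # Make sure the right and bottom borders are covered.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    for y in ys:
        for x in xs:
            yield x, y

# Example: origins = list(tile_origins(4000, 3000))
```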

6. Conclusions

This paper proposes a multi-scale and complex background-aware network named YOLO-SBA for remote sensing target detection, focusing on addressing the challenges of multi-scale target diversity and complex background coexistence. Firstly, we designed the MBAFF module, which is tailored for multi-scale variation target detection tasks. It employs a multi-branch dilated convolution structure with differentiated receptive fields and utilizes a dual-path attention mechanism to adaptively fuse heterogeneous features, enhancing the model’s ability to express features of multi-scale variation targets. Next, to address the challenges of complex backgrounds, we proposed the GMAP module, which adopts an innovative architecture combining cascaded pooling with a dual attention reconstruction mechanism. It integrates the attention generated by a gating mechanism to guide the fusion process, achieving feature enhancement and background suppression, and significantly improving the model’s target detection accuracy in complex backgrounds. Additionally, we designed the ADDown module to ensure that the effective information extracted by key modules is not lost during network inference. It performs downsampling through decompositional depthwise separable convolution, better preserving the key information in the feature maps and enhancing the quality of the downsampled feature maps, collaboratively addressing the challenges of remote sensing image target detection with key modules.
The proposed YOLO-SBA tackles the challenges of multi-scale variations and complex backgrounds in remote sensing images, significantly improving the accuracy of remote sensing target detection. Experimental results on the DIOR [14], DOTA [19], and RSOD [20] datasets show that the algorithm achieves state-of-the-art detection accuracy, and ablation experiments demonstrate the effectiveness of the proposed modules, advancing remote sensing target detection.

Author Contributions

Conceptualization: Y.Y. and Y.G.; Methodology: Y.Y. and X.Z.; Validation: Y.W., Y.G. and T.J.; Formal Analysis: Y.Y. and J.C.; Investigation: Y.W. and J.C.; Resources: Y.G. and X.Z.; Data Curation: T.J.; Writing—Original Draft Preparation: Y.Y.; Writing—Review & Editing: Y.W. and J.C.; Visualization: Y.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Natural Science Foundation of China (NSFC) under grant number 61873274.

Data Availability Statement

The data presented in this study are openly available at https://drive.google.com/drive/folders/1UdlgHk49iu6WpcJ5467iT-UqNPpx__CC (accessed on 7 July 2022) (DetectIon in Optical Remote sensing images, DIOR), https://captain-whu.github.io/DOTA/dataset.html (accessed on 26 January 2018) (Dataset for Object deTection in Aerial images, DOTA) and https://github.com/RSIA-LIESMARS-WHU/RSOD-Dataset- (accessed on 30 April 2019) (Remote Sensing Object Detection, RSOD).

Acknowledgments

The authors would like to thank the National Natural Science Foundation of China (NSFC) for the financial support that made this research possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guo, Y.; Chen, S.; Zhan, R.; Wang, W.; Zhang, J. LMSD-YOLO: A lightweight YOLO algorithm for multi-scale SAR ship detection. Remote Sens. 2022, 14, 4801. [Google Scholar] [CrossRef]
  2. Chen, J.; Liu, L.; Deng, W.; Liu, Z.; Liu, Y.; Wei, Y.; Liu, Y. Refining Pseudo Labeling via Multi-Granularity Confidence Alignment for Unsupervised Cross Domain Object Detection. IEEE Trans. Image Process. 2025, 34, 279–294. [Google Scholar] [CrossRef]
  3. Ruan, Z.; Wei, Y.; Guo, Y.; Xie, Y. Hybrid attentive prototypical network for few-shot action recognition. Complex Intell. Syst. 2024, 10, 8249–8272. [Google Scholar] [CrossRef]
  4. Pu, N.; Chen, W.; Liu, Y.; Bakker, E.M.; Lew, M.S. Dual gaussian-based variational subspace disentanglement for visible-infrared person re-identification. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2149–2158. [Google Scholar] [CrossRef]
  5. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  6. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  7. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  8. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. Yolov6 v3.0: A full-scale reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar] [CrossRef]
  9. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  10. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al. ultralytics/yolov5: v3.0. Zenodo 2020. [Google Scholar] [CrossRef]
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  12. Sun, H.; Yao, G.; Zhu, S.; Zhang, L.; Xu, H.; Kong, J. SOD-YOLOv10: Small Object Detection in Remote Sensing Images Based on YOLOv10. IEEE Geosci. Remote Sens. Lett. 2025, 22, 8000705. [Google Scholar] [CrossRef]
  13. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327. [Google Scholar] [CrossRef]
  14. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  15. Zhang, T.; Zhang, X.; Zhu, X.; Wang, G.; Han, X.; Tang, X.; Jiao, L. Multistage enhancement network for tiny object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611512. [Google Scholar] [CrossRef]
  16. Zhong, X.; Zhan, J.; Xie, Y.; Zhang, L.; Zhou, G.; Liang, M.; Yang, K.; Guo, Z.; Li, L. Adaptive Deformation-Learning and Multiscale-Integrated Network for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5611619. [Google Scholar] [CrossRef]
  17. Yang, X.; Zhang, S.; Duan, S.; Yang, W. An effective and lightweight hybrid network for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5600711. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Mei, S.; Ma, M.; Han, Z. Adaptive composite feature generation for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5631716. [Google Scholar] [CrossRef]
  19. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
  20. Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
  21. Dong, Z.; Wang, M.; Wang, Y.; Zhu, Y.; Zhang, Z. Object detection in high resolution remote sensing imagery based on convolutional neural networks with suitable object scale features. IEEE Trans. Geosci. Remote Sens. 2019, 58, 2104–2114. [Google Scholar] [CrossRef]
  22. Li, G.; Liu, Z.; Zeng, D.; Lin, W.; Ling, H. Adjacent context coordination network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2022, 53, 526–538. [Google Scholar] [CrossRef]
  23. Li, J.; Li, Z.; Chen, M.; Wang, Y.; Luo, Q. A new ship detection algorithm in optical remote sensing images based on improved R3Det. Remote Sens. 2022, 14, 5048. [Google Scholar] [CrossRef]
  24. Teng, Z.; Duan, Y.; Liu, Y.; Zhang, B.; Fan, J. Global to local: Clip-LSTM-based object detection from remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5603113. [Google Scholar] [CrossRef]
  25. Huang, H.; Huo, C.; Wei, F.; Pan, C. Rotation and scale-invariant object detector for high resolution optical remote sensing images. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1386–1389. [Google Scholar] [CrossRef]
  26. Zhou, L.; Zheng, C.; Yan, H.; Zuo, X.; Liu, Y.; Qiao, B.; Yang, Y. RepDarkNet: A multi-branched detector for small-target detection in remote sensing images. ISPRS Int. J. Geo-Inf. 2022, 11, 158. [Google Scholar] [CrossRef]
  27. Li, Y.; Huang, Q.; Pei, X.; Jiao, L.; Shang, R. RADet: Refine feature pyramid network and multi-layer attention network for arbitrary-oriented object detection of remote sensing images. Remote Sens. 2020, 12, 389. [Google Scholar] [CrossRef]
  28. Guo, Y.; Wu, H.; Yang, S.; Cai, Z. Crater-DETR: A Novel Transformer Network for Crater Detection Based on Dense Supervision and Multiscale Fusion. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5614112. [Google Scholar] [CrossRef]
  29. Zhang, L.; Wang, Y.; Huo, Y. Object detection in high-resolution remote sensing images based on a hard-example-mining network. IEEE Trans. Geosci. Remote Sens. 2020, 59, 8768–8780. [Google Scholar] [CrossRef]
  30. Han, J.; Zhang, D.; Cheng, G.; Guo, L.; Ren, J. Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning. IEEE Trans. Geosci. Remote Sens. 2014, 53, 3325–3337. [Google Scholar] [CrossRef]
  31. Wang, B.; Zhao, Y.; Li, X. Multiple instance graph learning for weakly supervised remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5613112. [Google Scholar] [CrossRef]
  32. Cheng, B.; Li, Z.; Xu, B.; Yao, X.; Ding, Z.; Qin, T. Structured object-level relational reasoning CNN-based target detection algorithm in a remote sensing image. Remote Sens. 2021, 13, 281. [Google Scholar] [CrossRef]
  33. Gao, T.; Liu, Z.; Zhang, J.; Wu, G.; Chen, T. A task-balanced multiscale adaptive fusion network for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5613515. [Google Scholar] [CrossRef]
  34. Bai, Z.; Li, G.; Liu, Z. Global–local–global context-aware network for salient object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2023, 198, 184–196. [Google Scholar] [CrossRef]
  35. Yang, X.; Hou, L.; Zhou, Y.; Wang, W.; Yan, J. Dense label encoding for boundary discontinuity free rotation detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 15819–15829. [Google Scholar]
  36. Ye, X.; Xiong, F.; Lu, J.; Zhou, J.; Qian, Y. F3-Net: Feature Fusion and Filtration Network for Object Detection in Optical Remote Sensing Images. Remote Sens. 2020, 12, 4027. [Google Scholar] [CrossRef]
  37. Wang, H.; Li, H.; Qian, W.; Diao, W.; Zhao, L.; Zhang, J.; Zhang, D. Dynamic pseudo-label generation for weakly supervised object detection in remote sensing images. Remote Sens. 2021, 13, 1461. [Google Scholar] [CrossRef]
  38. Gao, T.; Li, Z.; Wen, Y.; Chen, T.; Niu, Q.; Liu, Z. Attention-free global multiscale fusion network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5603214. [Google Scholar] [CrossRef]
  39. Ma, W.; Wu, Y.; Zhu, H.; Zhao, W.; Wu, Y.; Hou, B.; Jiao, L. Adaptive Feature Separation Network for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5639717. [Google Scholar] [CrossRef]
  40. Son, S.; Kim, J.; Lai, W.S.; Yang, M.H.; Lee, K.M. Toward real-world super-resolution via adaptive downsampling models. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8657–8670. [Google Scholar] [CrossRef]
  41. Yang, Z.; Leng, L.; Min, W. Extreme downsampling and joint feature for coding-based palmprint recognition. IEEE Trans. Instrum. Meas. 2020, 70, 5005112. [Google Scholar] [CrossRef]
  42. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  43. Wang, G.; Zhuang, Y.; Chen, H.; Liu, X.; Zhang, T.; Li, L.; Dong, S.; Sang, Q. FSoD-Net: Full-scale object detection from optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602918. [Google Scholar] [CrossRef]
  44. Van Etten, A. You only look twice: Rapid multi-scale object detection in satellite imagery. arXiv 2018, arXiv:1805.09512. [Google Scholar] [CrossRef]
  45. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  46. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  47. Li, Q.; Mou, L.; Liu, Q.; Wang, Y.; Zhu, X.X. HSF-Net: Multiscale deep feature embedding for ship detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 7147–7161. [Google Scholar] [CrossRef]
  48. Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
  49. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  50. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar] [CrossRef]
  51. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  52. Zhang, S.; Zhu, X.; Lei, Z.; Shi, H.; Wang, X.; Li, S.Z. S3fd: Single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 192–201. [Google Scholar]
  53. Zhao, Q.; Sheng, T.; Wang, Y.; Tang, Z.; Chen, Y.; Cai, L.; Ling, H. M2det: A single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January 2019; Volume 33, pp. 9259–9266. [Google Scholar]
  54. Azimi, S.M.; Vig, E.; Bahmanyar, R.; Körner, M.; Reinartz, P. Towards multi-class object detection in unconstrained remote sensing imagery. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 150–165. [Google Scholar]
  55. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  56. Liu, F.; Chen, R.; Zhang, J.; Xing, K.; Liu, H.; Qin, J. R2YOLOX: A lightweight refined anchor-free rotated detector for object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5632715. [Google Scholar] [CrossRef]
  57. Xu, Z.; Xu, X.; Wang, L.; Yang, R.; Pu, F. Deformable convnet with aspect ratio constrained nms for object detection in remote sensing imagery. Remote Sens. 2017, 9, 1312. [Google Scholar] [CrossRef]
  58. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
  59. Dong, R.; Xu, D.; Zhao, J.; Jiao, L.; An, J. Sig-NMS-based faster R-CNN combining transfer learning for small target detection in VHR optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8534–8545. [Google Scholar] [CrossRef]
  60. Guo, Y.; Ji, J.; Lu, X.; Xie, H.; Tong, X. Geospatial object detection with single shot anchor-free network. In Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 280–283. [Google Scholar] [CrossRef]
  61. Zhang, J.; Zhu, H.; Wang, P.; Ling, X. ATT squeeze U-Net: A lightweight network for forest fire detection and recognition. IEEE Access 2021, 9, 10858–10870. [Google Scholar] [CrossRef]
  62. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
Figure 1. The heatmap visualization results of YOLOv10 [6] for remote sensing image target detection, where the source image is from the DIOR [14] dataset. The first row shows the Ground Truth, with yellow boxes and text labels indicating the location and category information of the targets. The second row shows the heatmap visualization of model attention during the YOLOv10 [6] detection process. (a–c) show the detection results of YOLOv10 [6] in multi-scale target scenarios, complex background scenarios, and scenarios where multi-scale targets and complex backgrounds coexist. Red boxes highlight the targets not captured in the heatmap, indicating that the general object detection model YOLOv10 [6] does not perform well in remote sensing target detection tasks.
Figure 2. Network Structure Diagram of the YOLO-SBA. Among them, MBAFF, GMAP, and ADDown are modules proposed in this paper. MBAFF and ADDown are extensively integrated into both the backbone and neck components, whereas GMAP is incorporated solely into the backbone section.
Figure 3. Network Structure Diagram of the MBAFF Module.
Figure 4. Diagram of the D-Bottleneck Architecture.
Figure 5. Visual understanding of dilated convolutions enhancing the convolutional receptive field.
Figure 6. Schematic diagram of the BAFM Module.
Figure 7. Network structure diagram of Gated Multi-scale Attention Pyramid (GMAP).
Figure 8. ADDown Network Architecture Diagram.
Figure 9. Performance of the MBAFF module added on the RSOD [20] dataset under different expansion rates d.
Figure 10. Remote sensing object detection results based on YOLO-SBA in various scenarios. (a) dense small object detection; (b) medium object detection; (c) large object detection; (d) object detection in complex environments; (e) multi-scale object detection within the same background.
Figure 11. The heatmap visualization of model attention during the target detection process of YOLOv10 [6] and YOLO-SBA on common remote sensing images, corresponding to the images in Section 1. To more clearly perceive the changes in the heatmaps that reflect the performance of the model in target detection, we visualize the true annotations and the detection results of YOLOv10 [6] and YOLO-SBA together. The yellow boxes from top to bottom represent the true positions of the targets in the dataset, the detection results of YOLOv10 [6], and the detection results of YOLO-SBA.
Table 3. Comparison with state-of-the-art algorithms on the RSOD [20] dataset. The metric is mAP(%), and the bold values denote the top performance, while blue values signify the second-best results.
Method | mAP | Aircraft | Oil Tank | Overpass | Playground
Faster RCNN [45] | 84.50 | 70.84 | 90.19 | 78.74 | 98.09
Cascade R-CNN [46] | 91.30 | 94.20 | 96.10 | 83.20 | 99.00
Deformable R-FCN [57] | 87.90 | 71.87 | 90.35 | 89.59 | 99.88
Soft-NMS [58] | 86.60 | 76.10 | 90.30 | 81.30 | 98.80
Sig-NMS [59] | 89.40 | 80.60 | 90.60 | 87.40 | 99.10
SSAFNet [60] | 92.80 | 95.75 | 98.39 | 84.66 | 92.50
ATTS [61] | 91.30 | 94.50 | 96.00 | 74.90 | 99.90
AGMF-Net [38] | 94.30 | 96.02 | 99.02 | 82.43 | 99.70
YOLO-SBA (ours) | 94.70 | 97.56 | 96.22 | 85.81 | 99.31
Table 4. Ablation study of different YOLO-SBA modules on the DIOR [14] dataset. Metrics are mAP(%) and AP(%) for each category, and the bold values denote the top performance, while blue values signify the second-best results.
Method | mAP | AL | AT | BF | BC | B | C | D | ESA | ETS | GC | GTF | HB | O | S | SD | ST | TC | TS | V | WM
Baseline | 85.90 | 95.88 | 90.21 | 96.04 | 90.54 | 61.70 | 86.95 | 78.41 | 96.71 | 85.41 | 86.88 | 90.97 | 73.57 | 70.56 | 95.04 | 94.40 | 89.12 | 96.72 | 72.81 | 72.68 | 94.20
+MBAFF | 88.40 | 96.90 | 92.15 | 96.35 | 91.05 | 68.48 | 88.21 | 81.67 | 96.62 | 88.11 | 88.62 | 93.27 | 75.52 | 75.61 | 95.35 | 97.15 | 92.21 | 97.45 | 77.12 | 79.85 | 96.68
+GMAP | 88.70 | 97.20 | 91.05 | 96.27 | 91.51 | 68.71 | 88.79 | 82.41 | 97.15 | 91.72 | 88.26 | 92.87 | 75.75 | 75.73 | 95.55 | 97.43 | 92.41 | 97.07 | 79.11 | 79.85 | 96.14
+ADDown | 86.90 | 95.62 | 90.22 | 94.60 | 89.80 | 65.67 | 86.86 | 78.67 | 95.43 | 89.81 | 86.81 | 91.27 | 74.80 | 73.62 | 94.11 | 95.12 | 91.18 | 95.56 | 76.58 | 78.45 | 95.16
YOLO-SBA | 89.80 | 97.50 | 93.30 | 96.50 | 92.10 | 71.20 | 89.90 | 85.60 | 96.90 | 91.80 | 90.00 | 93.20 | 77.90 | 76.70 | 95.60 | 97.50 | 93.10 | 97.20 | 82.30 | 81.80 | 98.80
Table 5. Ablation study of different YOLO-SBA modules on the DOTA [19] dataset. Metrics are mAP(%) and AP(%) for each category, and the bold values denote the top performance, while blue values signify the second-best results.
Method | mAP | BD | BC | B | GTF | HB | HL | V | P | R | S | SV | SBF | ST | SP | TC
Baseline | 72.00 | 77.87 | 68.59 | 50.68 | 58.39 | 83.89 | 36.46 | 88.25 | 94.88 | 57.05 | 90.20 | 76.42 | 54.46 | 83.11 | 63.04 | 96.11
+MBAFF | 75.00 | 76.00 | 66.74 | 52.00 | 70.00 | 86.80 | 53.90 | 88.60 | 95.60 | 65.30 | 90.60 | 75.50 | 59.50 | 82.50 | 66.20 | 95.10
+GMAP | 73.44 | 73.40 | 71.50 | 50.50 | 63.40 | 85.30 | 42.70 | 89.10 | 94.80 | 60.70 | 90.50 | 75.80 | 61.30 | 82.00 | 64.90 | 95.70
+ADDown | 73.31 | 73.60 | 71.30 | 50.10 | 63.30 | 85.10 | 42.50 | 89.10 | 94.80 | 59.50 | 90.40 | 75.80 | 61.20 | 82.00 | 65.20 | 95.70
YOLO-SBA | 76.14 | 75.40 | 68.50 | 52.40 | 67.80 | 86.80 | 66.40 | 88.70 | 95.90 | 65.40 | 90.80 | 74.40 | 65.10 | 83.00 | 65.90 | 95.60
Table 6. Ablation study of different YOLO-SBA modules on the RSOD [20] dataset. Metrics are mAP(%) and AP(%) for each category, and the bold values denote the top performance, while blue values signify the second-best results.
Method | mAP | Aircraft | Oil Tank | Overpass | Playground
Baseline | 88.6 | 96.10 | 95.43 | 68.86 | 94.03
+MBAFF | 93.3 | 95.97 | 95.15 | 85.06 | 97.09
+GMAP | 89.1 | 95.87 | 94.68 | 70.76 | 95.00
+ADDown | 89.4 | 95.65 | 95.37 | 71.31 | 95.08
YOLO-SBA | 94.7 | 97.56 | 96.22 | 85.81 | 99.31
Table 7. Ablation study of different YOLO-SBA modules on the DIOR [14] dataset. Metrics include APs(%), APm(%), and APl(%), and the bold values denote the top performance, while blue values signify the second-best results.
Method | APs | APm | APl
Baseline | 26.2 | 52.3 | 76.4
+MBAFF | 30.2 | 54.7 | 80.0
+GMAP | 30.9 | 54.5 | 80.0
+ADDown | 31.1 | 54.8 | 77.7
YOLO-SBA | 32.7 | 56.7 | 81.2
Table 8. Ablation study of different YOLO-SBA modules on the DOTA [19] dataset. Metrics include APs(%), APm(%), and APl(%), and the bold values denote the top performance, while blue values signify the second-best results.
Method | APs | APm | APl
Baseline | 27.5 | 49.4 | 55.0
+MBAFF | 27.8 | 49.7 | 59.5
+GMAP | 28.3 | 49.7 | 57.2
+ADDown | 28.4 | 49.6 | 56.5
YOLO-SBA | 27.6 | 50.0 | 61.1
Table 9. Ablation study of different YOLO-SBA modules on the RSOD [20] dataset. Metrics include APs(%), APm(%), and APl(%), and the bold values denote the top performance, while blue values signify the second-best results.
Method | APs | APm | APl
Baseline | 35.5 | 67.0 | 66.0
+MBAFF | 35.6 | 67.9 | 68.8
+GMAP | 36.3 | 68.3 | 67.5
+ADDown | 34.9 | 68.1 | 66.1
YOLO-SBA | 38.4 | 69.2 | 72.0
Table 10. Ablation experiments on the RSOD [20] dataset with two modules combined.
Method | mAP | Aircraft | Oil Tank | Overpass | Playground
Baseline | 88.6 | 96.10 | 95.43 | 85.06 | 97.09
+MBAFF+GMAP | 93.15 | 94.92 | 95.81 | 82.86 | 99.00
+MBAFF+ADDown | 90.31 | 95.59 | 96.04 | 76.31 | 93.30
+GMAP+ADDown | 90.66 | 95.98 | 96.61 | 74.2 | 95.86
Table 11. Evaluation of the proposed algorithm in terms of parameter quantity, GFLOPS, and inference speed.
Method | Parameters | GFLOPS | Inference Time (DIOR [14]) | Inference Time (DOTA [19]) | Inference Time (RSOD [20])
Baseline | 8.1 M | 24.6 | 4.0 ms | 4.0 ms | 3.9 ms
+MBAFF | 10.0 M | 25.5 | 5.4 ms | 5.4 ms | 5.0 ms
+GMAP | 8.0 M | 24.9 | 4.2 ms | 4.2 ms | 3.7 ms
+ADDown | 7.4 M | 20.2 | 4.8 ms | 4.8 ms | 4.0 ms
YOLO-SBA | 10.1 M | 25.8 | 5.6 ms | 5.6 ms | 4.8 ms
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
