A Progressive Semantic-Aware Fusion Network for Remote Sensing Object Detection

Li, Lerong; Wang, Jiayang; Liao, Yue; Qian, Wenbin

doi:10.3390/app15084422

Open AccessArticle

A Progressive Semantic-Aware Fusion Network for Remote Sensing Object Detection

by

Lerong Li

,

Jiayang Wang

^*

,

Yue Liao

and

Wenbin Qian

School of Software, Jiangxi Agricultural University, Nanchang 330045, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(8), 4422; https://doi.org/10.3390/app15084422

Submission received: 25 March 2025 / Revised: 12 April 2025 / Accepted: 15 April 2025 / Published: 17 April 2025

Download

Browse Figures

Versions Notes

Abstract

Object detection in remote sensing images has gained prominence alongside advancements in sensor technology and earth observation systems. Although current detection frameworks demonstrate remarkable achievements in natural imagery analysis, their performance degrades when applied to remote imaging scenarios due to two inherent limitations: (1) complex background interference, which causes object features to be easily obscured by noise, leading to reduced detection accuracy; (2) the variation in object scales leads to a decrease in the model’s generalization ability. To address these issues, we propose a progressive semantic-aware fusion network (ProSAF-Net). First, we design a shallow detail aggregation module (SDAM), which adaptively integrates features across different channels and scales in the early Neck stage through dynamically adjusted fusion weights, fully exploiting shallow detail information to refine object edge and texture representation. Second, to effectively integrate shallow detail information and high-level semantic abstractions, we propose a deep semantic fusion module (DSFM), which employs a progressive feature fusion mechanism to incrementally integrate deep semantic information, strengthening the global representation of objects while effectively complementing the rich shallow details extracted by SDAM, enhancing the model’s capability in distinguishing objects and refining spatial localization. Furthermore, we develop a spatial context-aware module (SCAM) to fully exploit both global and local contextual information, effectively distinguishing foreground from background and suppressing interference, thus improving detection robustness. Finally, we propose auxiliary dynamic loss (ADL), which adaptively adjusts loss weights based on object scales and utilizes supplementary anchor priors to expedite parameter convergence during coordinate regression, thereby improving the model’s positioning accuracy for targets. Extensive experiments on the RSOD, DIOR, and NWPU VHR-10 datasets demonstrate that our method outperforms other state-of-the-art methods.

Keywords:

feature fusion; contextual information; remote sensing images; object detection

1. Introduction

Serving as the fundamental pillar for automated analysis of earth observation imagery, object detection in remote sensing applications has attracted considerable research interest during the past decade, emerging as a critical intersection where computer vision methodologies converge with geospatial data processing paradigms [1]. The primary objective of this task is to accurately identify and localize specific objects in remote sensing images, providing essential technical support for various real-world applications, such as disaster response [2], urban monitoring [3] and traffic management [4]. However, compared to object detection in natural scenes, remote sensing object detection presents more complex challenges, as illustrated in Figure 1, including the diversity of scene environments, the complexity of object categories, and significant variations in object scales. In particular, high-resolution remote sensing images often contain small-sized objects with high-density distributions, which are further affected by illumination conditions, cloud cover, and atmospheric interference. These factors necessitate more robust and generalizable detection algorithms. Moreover, since remote sensing data typically encompass large-scale scenes with highly diverse object characteristics, many deep learning-based detection algorithms, despite their impressive performance in general object detection tasks, still encounter numerous challenges in complex remote sensing scenarios. Enhancing detection accuracy, model stability, and computational efficiency remains an urgent problem in this field.

At present, deep neural network-driven object detectors are systematically categorized into two primary groups: anchor-based and anchor-free methods [5,6,7]. Anchor-based approaches are further divided into two-stage [8,9,10] and one-stage detection frameworks [11,12,13,14]. A well-known example of the two-stage paradigm is Faster R-CNN [10], which initially employs a region proposal network (RPN) for candidate region generation, coupled with parallel computational pathways for object categorization and bounding box coordinate optimization. While these methods achieve high detection accuracy, their computational demands are substantial, thereby constraining their applicability in time-sensitive operational scenarios. In contrast, YOLO and RetinaNet [13] represent one-stage detection methods, which directly predict object categories and bounding box coordinates on feature maps without requiring a separate region proposal step, significantly improving efficiency. YOLO leverages convolutional neural networks (CNNs) to provide end-to-end predictions, balancing speed and accuracy, whereas RetinaNet addresses class imbalance between foreground and background objects. Meanwhile, anchor-free methods such as FCOS [5] and CenterNet [7] eliminate the reliance on predefined anchor boxes and instead predict object center points, widths, heights, or bounding box coordinates directly. These techniques reduce the number of hyperparameters and improve adaptability, making them particularly effective for remote sensing object detection, where object scales vary significantly.

Given the unique features of remote sensing imagery, Scholars have been actively exploring efficient multi-scale object detection and feature fusion strategies. For instance, Ruan et al. [15] introduced a hierarchical feature aggregation and saliency extraction architecture through multi-scale contextual integration while preserving detailed texture information through multi-scale pooling. Liu et al. [16] design an adaptive feature hierarchy architecture that employs dynamic channel-spatial attention mechanisms to contextually integrate multi-resolution feature representations across spectral and spatial domains, thereby extracting more discriminative representations. Similarly, Gao et al. [17] developed a global contextual fusion architecture that harvests and optimizes conceptual abstractions within hierarchical feature representations, mitigating the impact of complex backgrounds on foreground object detection. Moreover, Dong et al. [18] introduced a scale-adaptive deformable attention architecture that constructs attention mechanisms through adaptively modulated multi-resolution perceptual domains. This module effectively captures deformable multi-scale features, making it particularly well-suited for detecting remote sensing objects of varying shapes and sizes while also improving attention accuracy. Zhao et al. [19] proposed an adaptive spatial feature fusion (ASFF) structure, which assigns spatial feature weights at different levels to optimize feature fusion, enhancing remote sensing object representations while minimizing feature loss. Additionally, Zhang et al. [20] designed a semantic fusion (SF) module that reduces semantic discrepancies between multi-scale feature maps by computing intra-feature map pixel relationships and inter-feature map correlations through matrix multiplication operations. Furthermore, Wang et al. [21] introduced the full-scale object detection framework, an innovative single-phase architecture integrating a hierarchical multi-resolution feature extractor with a dimension-agnostic localization module. Despite these advancements, many existing approaches have yet to effectively address the fusion of shallow and deep features, nor have they incorporated loss functions specifically designed to handle objects of varying scales efficiently.

To overcome these challenges, we present a progressive semantic-aware fusion network (ProSAF-Net), which integrates multiple specialized components to enhance detection accuracy. Given that shallow feature layers retain essential fine-grained details crucial for deep feature interactions, we introduce the shallow detail aggregation module (SDAM). This module utilizes Mamba’s selective scanning and dynamic weighting mechanism to efficiently capture and refine fine details from low-level features. To ensure these details are effectively leveraged, we propose the deep semantic fusion module (DSFM), which progressively integrates and enhances fine-grained features from shallow layers with deep semantic representations. This fusion process strengthens feature discrimination and improves semantic expressiveness, contributing to more robust object detection. Furthermore, we introduce the spatial context awareness module (SCAM), designed to extract both global and local contextual information. By effectively distinguishing foreground objects from complex background noise, this module enhances detection precision. Finally, we develop an auxiliary dynamic loss function, which dynamically adjusts loss weights based on object scale variations, facilitating faster model convergence and improved detection performance across objects of different sizes.

In summary, the key contributions of our work are as follows:
- We propose ProSAF-Net, a progressive semantic-aware fusion network tailored for remote sensing object detection. This approach effectively addresses key challenges such as complex background interference and substantial variations in object scales.
- To leverage the complementary advantages of fine-grained details from shallow features and high-level semantics from deep features, we design the shallow detail aggregation module (SDAM) and deep semantic fusion module (DSFM). These modules enable effective collaboration between detail extraction and semantic enhancement, improving overall detection performance in complex backgrounds.
- To comprehensively exploit contextual information, we introduce a spatial context awareness module (SCAM) that filters out background noise effectively and optimizes the recognition of small targets.
- To better handle objects of varying scales, we design an auxiliary dynamic loss function that adaptively adjusts the loss weights for different object scales, thereby optimizing detection across diverse scale variations.

The structure of this paper is arranged as follows. Section 2 provides a review of existing studies on feature fusion and the utilization of contextual information in remote sensing. Section 3 introduces the proposed ProSAF-Net in detail. Section 4 presents the experimental results and discussions. Finally, Section 5 concludes the paper.

2. Related Work

2.1. Feature Fusion

In computer vision, integrating features across multiple levels and scales plays a crucial role in enhancing model performance. Early fusion techniques, such as direct concatenation or weighted summation, struggled to effectively capture complex dependencies between features. In contrast, convolutional neural network (CNN)-based multi-scale fusion methods utilize cascaded or parallel architectures to merge shallow texture details with deep semantic information, significantly improving the robustness and expressive power of the model. For example, the feature pyramid network (FPN) [22] enhances object detection across different scales by introducing a top-down fusion pathway that combines adjacent feature maps. However, this approach can lead to the gradual loss of high-resolution details in deeper layers. To address this issue, Liu et al. [23] introduced a path aggregation network (PANet), which builds upon FPN by incorporating an additional bottom-up pathway. This enhancement facilitates complementary fusion between high-level semantic features and low-level spatial details, thereby improving localization accuracy. Several FPN-based extensions have been developed to further enhance feature representation. Channel-enhanced feature pyramid network (CE-FPN) [24] integrates sub-pixel residual fusion, a fine-grained context refinement module, and a channel-aware attention mechanism to mitigate aliasing effects and extract richer feature representations. Liu et al. [25] proposed a gated ladder feature pyramid network (GLFPN), which gathers feature information from the backbone network at multiple scales and employs a gating mechanism to selectively fuse information from different layers, effectively reducing redundant information. To improve feature extraction in remote sensing object detection, Yu et al. [26] introduced a global-to-local feature fusion network (GLF-Net), which adopts a global-to-local fusion strategy to enhance multi-scale feature utilization. Jiang et al. [27] developed a bidirectional dense feature pyramid network (BDFPN), which expands the scale of feature pyramids while integrating skip connections for more efficient multi-scale fusion. Yi et al. [28] proposed the self-attention-guided global context feature fusion (SGCFF) strategy, which incorporates high-level contextual information across different spatial levels in FPN. Moreover, they incorporated deformable convolution and a channel-spatial attention mechanism to refine fine-grained details and mitigate spatial distortions across objects of varying scales. While these methods have explored various multi-level feature fusion techniques, they often fall short in effectively coordinating interactions between shallow and deep features. To overcome this limitation, we introduce the shallow detail aggregation module (SDAM) and the deep semantic fusion module (DSFM), which fully leverage fine-grained shallow details and high-level semantic representations. This design allows the model to capture additional learning opportunities and well-integrated feature representations.

2.2. Contextual Information Exploration

Contextual information denotes the environmental or background attributes of pixels or localized regions, enabling computational models to interpret spatial positioning, dimensional characteristics, and morphological properties of objects, along with their interdependencies within an image. This global semantic framework transcends reliance on isolated local features by integrating structural cues that reflect holistic scene composition. For instance, the context-aware detection network (CAD-Net) [29] innovatively combines macro-level scene context with micro-level object relationships through a dual-stream architecture. Its integrated spatial-scale attention module dynamically prioritizes informative regions while optimizing feature resolution across scales, thereby refining detection robustness. Concurrently, Zhou et al. [30] devised a context-aware pixel aggregation (CPA) strategy, employing multi-scale convolutional operations to adaptively characterize variably sized objects. Their complementary context-aware feature aggregation (CFA) framework further enhances semantic coherence through graph-based relational modeling. Zhang et al. [31] advanced this paradigm with a dynamic context exploration (DCE) framework, which synergizes three operational layers: (1) adaptive local context retrieval via environmental scanning, (2) relational feature optimization through object interaction mapping, and (3) global feature enrichment for cross-scale consistency. Parallel innovations include Zhao et al.’s study [32], which proposed a novel scene context detection network (SCDNet), disentangling scene–object interactions via dedicated feature decoders. Building upon these advancements, our proposed spatial context awareness module (SCAM) introduces a dual-branch architecture that hierarchically integrates global scene semantics with localized spatial dependencies, effectively suppressing background interference while amplifying discriminative object features.

3. Proposed Method

3.1. Overall Framework

As shown in Figure 2, we propose ProSAF-Net, which adopts YOLOv5 [33] as the baseline detector. While YOLOv12 [34] introduces more advanced architectural designs and achieves high performance on general object detection tasks, it also comes with significantly increased model complexity and computational costs. In contrast, YOLOv5 offers a favorable balance between accuracy and efficiency, which is critical for small object detection in remote sensing scenarios where high-resolution images and limited computational resources are common. Moreover, YOLOv5 provides better extensibility for integrating our proposed modules, making it a more practical choice for this study. The pipeline initiates with the backbone network extracting hierarchical features {C2, C3}, which capture multi-scale visual patterns critical for subsequent processing. To counteract the feature degradation inherent in deep convolutional operations, a shallow detail aggregation module (SDAM) is integrated during the feature fusion phase, where C3 features are synergistically combined with upsampled outputs to preserve high-frequency texture details. Building upon this, three cascaded deep semantic fusion modules (DSFM) progressively refine feature representations by establishing bidirectional interactions between low-level spatial details and high-level semantic abstractions, thereby enhancing cross-scale information propagation. Prior to detection heads, spatial context awareness modules (SCAM) implement multi-granularity context modeling through parallel global scene analysis and local pattern correlation, effectively suppressing background interference while amplifying discriminative object characteristics. To address the substantial scale variations of remote sensing targets, an auxiliary dynamic loss is introduced with a self-adjusting weight allocation mechanism that dynamically prioritizes optimization objectives based on target size distribution, ultimately enhancing cross-scale detection consistency. Technical implementations of SDAM, DSFM, SCAM, and the novel loss function will be systematically detailed in subsequent sections.

3.2. Shallow Detail Aggregation Module

Shallow feature details are crucial in remote sensing object detection. However, while shallow features contain rich edge and texture information, they are also highly susceptible to noise interference. Therefore, a well-designed fusion mechanism is needed to enhance discriminative features while suppressing background distractions. Mamba [35], an efficient sequence modeling approach based on state space models (SSM), captures long-range dependencies without explicitly computing global self-attention, providing benefits over conventional Transformer [36] models by enhancing computational efficiency and improving global perception of shallow features.

Therefore, we propose the shallow detail aggregation module (SDAM), which integrates the Mamba structure with channel-spatial attention mechanisms, enabling efficient modeling of global dependencies in shallow features while enhancing key region representation. As illustrated in Figure 3, two feature maps

X_{1} \in R^{C \times H \times W}

and

X_{2} \in R^{C \times H \times W}

of the same dimensions are first concatenated along the channel dimension, followed by a 1 × 1 convolution for feature compression, obtaining the compressed feature representation X. Based on VMamba [37], a linear projection is utilized to increase the feature channels to

λ

C (where

λ

represents a predefined expansion factor), then processed through depthwise convolution, SiLU activation, 2D-SSM processing, and layer normalization (LN). The process can be as follows:

\begin{matrix} X & = Conv (Concat (X_{1}, X_{2})) \end{matrix}

(1)

\begin{matrix} X_{4} & = LN (2 D - SSM (SiLU (DWConv (Linear (X))))) \end{matrix}

(2)

where Conv refers to a 1 × 1 convolution, while DWConv signifies depthwise convolution, and 2D-SSM is a 2D selective scanning module.

To further enhance feature selection capability, the second branch processes

X_{3}

(the initial concatenated feature map) through channel-based attention and spatial-based attention mechanisms, thereby improving the network’s ability to focus on key object regions. CA computes channel-wise importance weights using global average pooling (GAP) and an MLP transformation, strengthening essential channel features. SA extracts spatial features through max-pool and average-pool, producing an attention map that directs the network to emphasize relevant regions. This process can be as follows:

\begin{matrix} X_{5} & = SA (CA (X_{3})) \end{matrix}

(3)

where CA denotes channel attention, and SA represents spatial attention.

Finally, The Hadamard product is utilized to fuse features from both branches, followed by a linear projection to restore the original channel dimensions 2C. The process can be as follows:

\begin{matrix} X_{6} & = Linear (X_{4} ⊙ X_{5}) \end{matrix}

(4)

\begin{matrix} X_{o u t} & = Conv (X_{6} \oplus X_{3}) \end{matrix}

(5)

where ⊙ represents the Hadamard product, and Conv denotes a 1 × 1 convolution.

Unlike mainstream feature fusion structures, such as FPN, BiFPN, and PANet, which primarily focus on top-down or bidirectional propagation of multi-scale features, the proposed shallow detail aggregation module (SDAM) emphasizes the preservation and refinement of shallow visual details. Its architecture is inspired by a lightweight encoder–decoder structure, specifically tailored to enhance small object detection in remote sensing images. This design inherently supports multi-scale feature fusion: the encoder progressively extracts semantic representations, while the decoder gradually restores spatial resolution through upsampling operations, thereby retaining key positional cues and boundary information of small objects. Compared to traditional deep convolutional stacks, this architecture significantly reduces computational overhead while maintaining detection accuracy, which is crucial for large-scale remote sensing applications that demand both speed and efficiency. While FPN and its variants often suffer from the loss of fine-grained textures due to repeated downsampling and abstract semantic transformations, our SDAM addresses this limitation by introducing skip connections that enable contextual re-aggregation. This mechanism enhances the preservation of spatially localized information and strengthens detail representation. Consequently, SDAM effectively mitigates the semantic degradation of shallow features during deep processing, achieving a balanced integration of local detail awareness and global semantic understanding, which is particularly suited for detecting small targets in cluttered remote sensing scenarios.

3.3. Deep Semantic Fusion Module

Deep semantic information is critical for improving detection accuracy and robustness. This is particularly important in remote sensing images, where small targets are often embedded in cluttered or complex backgrounds. In such cases, shallow features, while rich in fine-grained details, may fail to capture sufficient semantic cues, making it difficult to accurately distinguish targets from the background. Deep semantic features, by contrast, provide essential category-level discrimination and global contextual understanding, effectively compensating for the limitations of low-level representations. Although SDAM efficiently captures detailed low-level features, it is essential to combine the benefits of shallow details with high-level semantics to achieve robust detection. To this end, we propose the deep semantic fusion module (DSFM), which employs a progressive fusion strategy for hierarchical feature integration. Unlike traditional fusion strategies that merge multi-scale features in a single step (e.g., FPN, BiFPN), DSFM progressively aligns and aggregates deep features in a stage-wise manner. This design ensures smooth semantic transitions across different levels of the network, effectively mitigating semantic gaps and preserving spatial coherence. In remote sensing applications, such gradual fusion is particularly beneficial, as it enables the model to build discriminative and spatially consistent feature representations, especially for small or occluded objects. Specifically, we incorporate three DSFMs in the later stages of the network’s neck. These modules facilitate semantic information flow from deep layers while maintaining close interactions with SDAM outputs. This design enhances the complementarity between detailed spatial features and global semantic cues, leading to improved detection performance across various types and scales of objects.

As shown in Figure 4, the two input features

Y_{1}

,

Y_{2}

of the DSFM with the same dimensions, and the squeeze-and-excitation (SE) mechanism, is applied to adaptively recalibrate the channel features, enhancing the focus on important features. Specifically, global average pooling (GAP) and global max pooling (GMP) are applied to

Y_{1}

and

Y_{2}

, respectively, to obtain global features along the channel dimension. The extracted features are concatenated and passed through a convolution layer to refine the channel attributes. Subsequently, an MLP module is employed to compress and then restore the features. Finally, a Sigmoid activation function then computes channel attention weights, serving to dynamically refine the original input features. The channel-enhanced features

Y_{3}

and

Y_{4}

are merged along the channel axis to form the final output Y. This process can be as follows:

\begin{matrix} Y_{3} & = SE (Y_{1}) \otimes Y_{1} \end{matrix}

(6)

\begin{matrix} Y_{4} & = SE (Y_{2}) \otimes Y_{2} \end{matrix}

(7)

\begin{matrix} Y & = Concat (Y 3, Y 4) \end{matrix}

(8)

where SE stands for the squeeze-and-excitation mechanism, and Concat(.) refers to concatenation along the channel dimension.

After enhancing the channel features, we incorporate the spatial attention (SA) mechanism to enhance the model’s capability in emphasizing target regions and strengthening the representation of spatial structural information. Unlike channel attention, which primarily contributes to global semantic modeling, spatial attention focuses on the distribution of targets across the spatial dimensions, thereby optimizing feature weighting across different regions and enhancing the model’s responsiveness to key areas. Specifically, the input feature Y is processed through both mean pooling and maximum pooling, each capturing different aspects of the information effectively: average pooling primarily extracts overall background information, facilitating global distribution modeling, whereas max pooling emphasizes the most salient local features, enhancing the representation of target regions. Subsequently, the pooled features

Y_{5}

and

Y_{6}

are combined along the channel axis, resulting in a fused representation

Y_{7}

that incorporates a variety of statistical information. Next, a convolution layer is applied to further compress the channel information, strengthening spatial correlations. Once the features are compressed, a Sigmoid activation function generates the spatial attention weight

Y_{8}

, which is used to adjust the spatial positions within the input feature map. This attention weight performs pixel-wise scaling, allowing the model to dynamically emphasize high-response regions and suppress background noise or irrelevant areas. This optimization improves the model’s spatial awareness of targets. This process can be as follows:

\begin{matrix} Y_{5} & = Mean (Y) \end{matrix}

(9)

\begin{matrix} Y_{6} & = Max (Y) \end{matrix}

(10)

\begin{matrix} Y_{7} & = Concat (Y_{5}, Y_{6}) \end{matrix}

(11)

\begin{matrix} Y_{8} & = Sigmoid (Conv (Y_{7})) \end{matrix}

(12)

\begin{matrix} Y_{o u t} & = Y_{8} \otimes Y \oplus Y \end{matrix}

(13)

where Mean and Max refer to average pooling and max pooling performed along the channel dimension, respectively, Conv represents a 1 × 1 convolution, and

Y_{o u t}

refers to the final output of the DSFM module.

DSFM effectively captures deep semantic information and adaptively adjusts feature representations across different scales and contextual environments, ensuring that target features maintain strong discriminative capabilities even in complex backgrounds. Moreover, by employing a progressive fusion strategy, DSFM refines semantic features across various detection stages, boosting the model’s overall perception, and ultimately enhancing the precision and reliability of object detection. The choice of the squeeze-and-excitation (SE) attention mechanism is motivated by its effectiveness in enhancing channel interdependencies with minimal computational overhead. In remote sensing scenarios, where targets are often small and embedded in cluttered backgrounds, SE helps emphasize category-relevant feature channels by adaptively recalibrating channel-wise responses. Unlike more complex attention mechanisms (e.g., non-local or transformer-based attention), SE is lightweight, easy to integrate into existing CNN backbones, and has been widely validated to improve performance without significantly increasing model size. Therefore, SE offers an ideal balance between effectiveness and efficiency for our deep semantic fusion design.

3.4. Spatial Context Awareness Module

The SDAM significantly improves the resolution of low-level features, thereby preserving critical edge definitions and textural patterns essential for detecting diminutive targets. Meanwhile, DSFM progressively integrates deep semantic information through channel enhancement and spatial attention mechanisms, thereby enhancing the model’s discriminative capacity across diverse object categories. However, in complex remote sensing scenarios, relying solely on shallow detail features and deep semantic representations remains insufficient for accurately delineating target regions, particularly under conditions of intense environmental noise and substantial fluctuations in target dimensions. In such cases, the synergistic interaction between local details and global contextual information becomes essential. To address this challenge, we propose SCAM to reinforce spatial dependencies within feature maps and enhance detection robustness by leveraging both global spatial information and local perception capabilities, as illustrated in Figure 5.

SCAM is composed of two branches: one for global spatial attention and the other for local spatial attention. Given an input feature map A, we first divide it along the channel dimension to obtain two feature maps of identical dimensions,

A_{1}

and

A_{2}

. In the first branch, a convolution layer is applied to extract the spatial distribution information of the input features, mapping them into a single-channel feature representation. A Softmax normalization is then performed to emphasize the attention weights of key regions. The normalized attention weights

A_{3}

are multiplied element-wise with the first branch’s input feature

A_{1}

to suppress background interference while enhancing the responses in target regions. Additionally, to further improve feature representation, a multilayer perceptron (MLP) is introduced for nonlinear transformation. Lastly, a residual connection is utilized to maintain gradient stability and retain the initial feature data. The result

A_{o u t 1}

from this branch is expressed as follows:

\begin{matrix} A_{3} & = Softmax (Conv (A_{1})) \end{matrix}

(14)

\begin{matrix} A_{o u t 1} & = MLP (A_{3} \otimes A_{1}) \oplus A_{1} \end{matrix}

(15)

where Conv refers to a 1 × 1 convolution, MLP consists of two 1 × 1 convolution layers, a batch normalization module followed by a ReLU activation. The transpose operation is omitted here for simplicity.

The second branch primarily consists of a convolution layer and a depthwise separable convolution, where the former is used for feature transformation, and the latter effectively captures local spatial information. Additionally, multiple matrix operations are performed to enhance the depth of attention learning. The output of this branch

A_{o u t 2}

can be as follows:

\begin{matrix} A_{4} & = DWConv (Conv (A_{2})) \end{matrix}

(16)

\begin{matrix} A_{5} & = Conv (A_{4} \oplus A_{1}) \otimes A_{1} \end{matrix}

(17)

\begin{matrix} A_{o u t 2} & = Sigmoid (A_{5}) \oplus A_{1} \end{matrix}

(18)

where DWConv indicates a depthwise separable convolution with a 3 × 3 kernel, Conv denotes a pointwise convolution with a 1 × 1 kernel, and

A_{4}

and

A_{5}

correspond to the intermediate features of this branch.

The global spatial attention branch captures global context information from an overall perspective, while the local spatial attention branch enhances the capability to represent features in localized regions. Finally, we fuse the two sets of features by channel concatenation to preserve both global and local information simultaneously.

\begin{matrix} A_{o u t} & = Concat (A_{o u t 1}; A_{o u t 2}) \end{matrix}

(19)

where Concat(.) signifies merging along the channel axis, while

A_{o u t}

denotes the SCAM output.

3.5. Auxiliary Dynamic Loss Function

IoU loss [38] accurately quantifies the spatial congruence between detected region proposals and their corresponding ground-truth annotations, enabling the model to capture spatial localization details during training. Serving as a key element in widely used loss functions used for bounding box prediction, IoU is defined as follows:

\begin{matrix} IoU & = \frac{A \cap B}{A \cup B} \end{matrix}

(20)

where IoU loss, which quantifies the overlap between the predicted bounding box (A) and the ground truth (B), is defined as follows:

\begin{matrix} L_{I o U} & = 1 - I o U \end{matrix}

(21)

To date, loss functions based on IoU have increasingly become the standard approach, with most approaches incorporating additional loss terms on top of IoU. Generalized IoU (GIoU) [39] addresses the issue of zero gradients when two bounding boxes do not overlap. Distance IoU (DIoU) [40] considers not only the overlap between two boxes but also their relative distance and scale. However, the aspect ratio, one of the three key factors in bounding box regression, has not yet been explicitly accounted for in the loss function. This was later introduced by Complete IoU (CIoU) [41], which incorporates both the overlap and shape similarity between two bounding boxes by considering their aspect ratio. In remote sensing images, the wide range of object sizes presents difficulties for conventional IoU-based methods. Inner-IoU [42] alleviates this issue by introducing supplementary bounding boxes of different sizes to calculate the loss, which is adaptively based on different regression samples. This approach inspired us to explore dynamically adjusting specific loss terms to better accommodate objects of different scales. Scale-aware Distance IoU (SDIoU) [43] achieves this by dynamically adjusting the influence of object size on scale and localization losses, thereby improving the model’s capability in detecting objects across multiple scales. Inspired by these advancements, we propose an auxiliary dynamic IoU Loss (ADIoU Loss), which integrates scale-adaptive IoU computation with auxiliary bounding boxes to optimize the bounding box alignment mechanism while improving target localization precision. The ADIoU loss function is formulated as follows:

\begin{matrix} L_{1} & = 1 - I o U^{i n n e r} + α v, L_{2} = \frac{ρ^{2} (b_{p}, b_{g t})}{c^{2}} \end{matrix}

(22)

\begin{matrix} R_{O C} & = \frac{w_{o} \times h_{o}}{w_{c} \times h_{c}} \end{matrix}

(23)

\begin{matrix} β_{B} & = min (\frac{B_{g t}}{B_{g t max}} \times R_{O C} \times δ, δ) \end{matrix}

(24)

\begin{matrix} β_{M} & = min (\frac{M_{g t}}{M_{g t max}} \times R_{O C} \times δ, δ) \end{matrix}

(25)

\begin{matrix} β_{L_{1}} & = 1 - δ + β_{B}, β_{L_{2}} = 1 + δ - β_{B} \end{matrix}

(26)

\begin{matrix} L_{A D} & = β_{L_{1}} \times L_{1} + β_{L_{2}} \times L_{2} \end{matrix}

(27)

where the

I o U^{i n n e r}

metric quantifies spatial alignment precision, while parameter v regulates aspect ratio consistency between predicted and annotated targets. Spatial displacement metric

ρ

(.) calculates the Euclidean distance separating the geometric centers of detected bounding boxes and their annotated references. The

R_{O C}

encodes scale correspondence between input imagery and feature representations, computed as the ratio of original image dimensions to current feature map resolutions.

B_{b}

and

B_{m}

are the influence coefficients for the bounding box and the mask mentioned in SDIoU. According to SDIoU,

B_{g t max}

=

M_{g t max}

= 81 is the maximum size defined by the Optical Instrument Engineers Association (Zhang, Cong, and Wang, 2003) [44]. Loss modulation parameters dynamically adjust based on target bounding area, with variation bounds controlled by hyperparameter

δ

,

β_{L_{1}}

and

β_{L_{2}}

are the influence factors for

L_{1}

and

L_{2}

, respectively.

4. Experimental Results and Discussion

To rigorously evaluate the performance of our methodology, we designed a comprehensive empirical validation framework. This section systematically outlines the experimental design, encompassing four critical components: (1) benchmark datasets employed for validation, (2) implementation protocols detailing technical configurations, (3) quantitative assessment criteria, and (4) comparative analysis of experimental outcomes with critical discussions. The methodological framework is structured across these dimensions to ensure thorough verification of the proposed approach.

4.1. Datasets

To comprehensively evaluate the performance of the proposed ProSAF-Net, we conduct experiments on three widely used remote sensing object detection datasets: RSOD [45], NWPU VHR-10 [46], and DIOR [47]. These datasets contain images with complex backgrounds, covering diverse object categories and varying scales, making them highly challenging. The specific information for each dataset is outlined below.

4.1.1. RSOD Dataset

The RSOD dataset, developed by Wuhan University in 2015 for object detection in remote sensing imagery, includes four distinct object categories: airplanes, oil tanks, overpasses, and playgrounds. Specifically, the airplane category includes 446 images with 4993 annotated instances; the oil tank category contains 165 images with 1586 annotated objects; the overpass category covers 176 images with 180 annotated instances; and the playground category comprises 189 images with 191 annotated instances. To assess the performance of our approach, we randomly divide the dataset into a training set and a validation set with a 75% to 25% ratio.

4.1.2. NWPU VHR-10 Dataset

The NWPU VHR-10 dataset, curated by Northwestern Polytechnical University, is an openly accessible benchmark for remote sensing detection tasks. Comprising 800 high-resolution images sourced from Google Earth and the Vaihingen database, this collection includes 650 object-bearing scenes and 150 background-only references. It contains 3755 annotated instances spanning ten categories: vehicle (VE), bridge (BR), harbor (HA), ground track field (GTF), basketball court (BC), tennis court (TC), baseball diamond (BD), storage tank (ST), ship (SH), and airplane (AP). All objects are manually labeled with horizontal bounding boxes. We adopt a 75–25% split for training and testing phases to ensure experimental reproducibility.

4.1.3. DIOR Dataset

The DIOR dataset serves as a comprehensive open-access benchmark for object detection in high-resolution aerial imagery. This extensively utilized resource contains 23,463 images (800 × 800 pixels) with 192,472 meticulously annotated instances across 20 land-use categories: windmill (WM), train station (TS), tennis court (TC), storage tank (ST), stadium (SD), ground track field (GF), ship (SP), golf course (GC), overpass (OP), harbor (HB), vehicle (VH), expressway toll station (ES), expressway service area (EA), dam (DM), chimney (CM), bridge (BG), basketball court (BC), baseball field (BF), airport (AT), and airplane (AL). Following standardized evaluation protocols, the dataset is organized into three subsets: 5862 training samples, 5863 validation samples, and 11,738 testing samples, maintaining methodological rigor and cross-study comparability.

4.2. Experimental Details

Our experiments operate within a Windows 11 environment, constructed using PyTorch 2.0.1. Algorithmic implementations leverage an NVIDIA RTX 4070 Ti GPU (12GB VRAM) for accelerated processing. Optimization protocols adopt stochastic gradient descent (SGD) with an initialized learning rate of 0.01, complemented by

L_{2}

regularization (

λ

= 0.0005) and momentum coefficient of 0.937. Training procedures utilize mini-batches of 16 samples, with input imagery standardized to 640 × 640 resolution across all benchmark datasets. Model evaluation employs the mean average precision (mAP@0.5) metric, calculated at the intersection-over-union threshold of 0.5 for detection quality assessment.

4.3. Evaluation Metrics

Our detection framework adopts the well-established average precision (AP) metric for holistic performance evaluation. AP quantification involves numerical integration of the precision–recall (PR) curve’s enclosed area, providing a robust measurement of detection efficacy. Precision quantifies the ratio of true positive detections to total predicted positives, whereas recall assesses detection completeness by measuring identified positives against all annotated targets. These metrics are mathematically defined as:

\begin{matrix} Precision & = \frac{TP}{TP + FP} \end{matrix}

(28)

\begin{matrix} Recall & = \frac{TP}{TP + FN} \end{matrix}

(29)

where true positive (TP), false positive (FP), and false negative (FN), respectively, denote correct detections, erroneous predictions, and undetected ground-truth instances. The intersection over union (IoU) metric calculates spatial alignment precision between detected and annotated bounding regions, establishing critical evaluation criteria for detection systems. For multi-object detection scenarios, we implement mean average precision (mAP@0.5) as the principal performance indicator, computed by averaging category-specific AP values at the 0.5 IoU threshold. The mathematical formulation for mAP derivation is expressed as:

\begin{matrix} AP & = \int_{0}^{1} P (R) d R \end{matrix}

(30)

\begin{matrix} mAP & = \frac{1}{n} \sum_{i = 1}^{n} {AP}_{i} \end{matrix}

(31)

where P corresponds to the precision metric, R signifies the recall rate, n indicates the total categorical classes, and APi quantifies the average precision metric specific to class index (where i in [1, n]) across all detection instances.

4.4. Experimental Results and Discussion

4.4.1. Results on the RSOD Dataset

Benchmark evaluations against ten state-of-the-art detectors in Table 1 demonstrate the superior performance of ProSAF-Net, which establishes a new benchmark with a peak mAP of 97.12%. This performance breakthrough validates the technical efficacy of our design, particularly excelling in small object detection with class-leading AP scores of 96.56% for aircraft (predominantly sub-pixel targets) and 98.37% for oil tanks. The technical superiority stems from the synergistic integration of SDAM and PSFM modules: SDAM preserves high-resolution spatial details through shallow feature retention, while PSFM hierarchically aggregates contextual semantics via cross-layer attention mechanisms. Their co-optimization generates discriminative representations that balance localization precision and semantic richness. Equally important, ProSAF-Net achieves second-ranked AP on large-scale overpass detection, confirming its multi-scale adaptability. Visualizations in Figure 6 empirically validate this capability, demonstrating simultaneous high-precision detection of macro-scale structures (e.g., playgrounds) and micro-scale targets (e.g., aircraft) under complex backgrounds. These results collectively evidence the framework’s robustness across extreme scale variations in remote sensing scenarios.

4.4.2. Results on the NWPU VHR-10 Dataset

In Table 2, we compare our ProSAF-Net with several advanced methods, including YOLOv3 [12], Faster R-CNN [10], FMSSD [54], CAD-Net [29], EGAT [55], Cascade R-CNN [49], YOLOv7 [14], CANet [56], ABNet [16], and TBNet [53]. Among these approaches, our ProSAF-Net achieves the highest mAP of 96.57%. Specifically, it achieves the best AP in three out of ten categories: ship, baseball field, and bridge. The ship category consists of relatively small objects in the images, empirically validating the operational efficacy of ProSAF-Net in sub-pixel target recognition. The outstanding capability in small object detection stems not only from efficient feature fusion but also from the integration of contextual information, which is essential. This is primarily achieved through SCAM, which leverages both global and local contextual information to effectively distinguish foreground objects from the background while suppressing background noise. For baseball field and bridge categories, the resemblance between their appearances and surrounding environments poses a significant challenge. The fact that ProSAF-Net performs well in these categories indicates its robustness in distinguishing objects from complex backgrounds. While our method may not yield the highest performance in every category, it ranks second in storage tank and vehicle categories, further demonstrating its competitive performance. To better illustrate the effectiveness of our approach, Figure 7 visualizes the detection results produced by ProSAF-Net. Some of the images have relatively complex backgrounds. For example, in the first row’s third image, small boats on the sea can be confused with ocean waves, which poses a challenge for object detection. However, our method effectively detects and localizes these targets. Additionally, objects such as basketball and tennis courts exhibit colors similar to their surrounding environments, further demonstrating the robustness of our approach in distinguishing objects from complex backgrounds.

4.4.3. Results on the DIOR Dataset

We conducted an evaluation of ProSAF-Net on the DIOR dataset and compared its performance with several state-of-the-art detectors, such as YOLOv3 [12], DFPN-YOLO [57], Faster R-CNN [10], CenterNet [7], ASSD [58], ABNet [16], CF2PN [59], CSFFNet [60], MSA R-CNN [61], SESA-Net [52], TMAFNet [62], SRAF-Net [63], FSoD-Net [21], MSFC-Net [64], and CoF-Net [65]. As shown in Table 3, ProSAF-Net is the only method surpassing 76% mAP, achieving the best AP in three out of twenty categories, namely airport (85.6%), highway toll station (80.6%), and overpass (62.5%). While our approach attains the top AP in merely three categories, it ranks second in bridge, ship, and tennis court, demonstrating its effectiveness in detecting objects of varying scales. This is particularly important for large-scale datasets like DIOR. Furthermore, the ADIoU loss function is instrumental in driving this performance boost, as it dynamically adjusts the scale and position-related loss terms based on object size, improving the model’s precision in pinpointing objects across various scales. Notably, a closer look at Table 3 reveals that CoF-Net also performs well, achieving the highest AP in three categories and ranking second in three others. Its overall mAP reaches 75.8%, which may be attributed to its mechanism of progressively enhancing feature representations and selecting stronger samples, thereby enabling the model to learn more discriminative features. However, our ProSAF-Net still outperforms CoF-Net by 0.3%, primarily due to the ADIoU loss function, which optimizes detection performance across different object scales. The detection outcomes of ProSAF-Net on the DIOR dataset are illustrated in Figure 8, showcasing diverse environments with intricate backgrounds and varying object scales. For example, when detecting storage tanks, both large and small instances are accurately identified with minimal precision discrepancy. This further demonstrates that our method can effectively detect objects under scale variations while maintaining high detection accuracy.

In addition to its competitive detection performance, ProSAF-Net holds significant practical value in real-world applications. For instance, it can be employed in precision agriculture for pest or crop monitoring, in military reconnaissance for detecting vehicles or aircraft under complex backgrounds, and in tasks such as anti-counterfeiting or encryption that require high-resolution small object recognition. These applications benefit from the model’s robustness to background noise and its effectiveness in handling small-scale targets. Furthermore, the three remote sensing datasets used in this study (RSOD, NWPU VHR-10, and DIOR) encompass diverse scene environments and varying object scales, such as urban areas, maritime regions, and mountainous terrains. As a result, the proposed model demonstrates strong performance across complex backgrounds, indicating its potential for deployment in a wide range of real-world domains.

4.4.4. Computational Efficiency Analysis

To better evaluate the practicality of ProSAF-Net in real-world applications, we further analyze its computational efficiency. Specifically, we assess the inference speed of our model and compare it with several representative state-of-the-art object detectors on the DIOR dataset. Figure 9 illustrates the trade-off between detection accuracy (measured by mAP) and inference speed (measured in FPS). ProSAF-Net achieves a favorable balance between accuracy and computational cost. Among all compared methods, ProSAF-Net maintains the highest mAP while also significantly outperforming other competitors in terms of inference speed. It is worth noting that although CoF-Net exhibits relatively high detection accuracy, its inference speed is notably lower, which may pose a challenge for deployment in time-sensitive applications. In addition, the lightweight design of SDAM and the progressive feature integration strategy in DSFM contribute to efficient feature reuse and reduced computational overhead. These results confirm that ProSAF-Net is not only accurate but also efficient, making it a practical solution for both offline analysis and real-time detection tasks in the field of remote sensing.

4.5. Ablation Study

To assess the contribution of each module in our proposed method, we perform ablation studies on the RSOD and NWPU VHR-10 datasets. The corresponding results are summarized in Table 4.

4.5.1. Effect of SDAM

The shallow detail aggregation module (SDAM) in ProSAF-Net is architecturally optimized to preserve high-frequency spatial features and refine edge localization accuracy. As shown in Table 4, when only SDAM is added, the mAP reaches 95.37% on the RSOD dataset and 94.39% on RSOD (+1.73% over baseline) and 94.39% on NWPU VHR-10 (+1.09% over baseline), empirically validating its efficacy in enhancing low-level feature discriminability. Furthermore, when SDAM is combined with DSFM, the mAP improves to 95.93% on RSOD and 95.30% on NWPU VHR-10, indicating that the hierarchical shallow-to-deep feature fusion mechanism implemented by SDAM and PSFM contributes to better detection performance.

4.5.2. Effect of DSFM

The integration of surface-level features and high-level semantic knowledge is vital for the accurate detection of objects in RSI. This is because complex backgrounds in remote sensing images often lead to significant information loss due to multiple downsampling operations, thereby significantly compromising the model’s capacity to develop discriminative feature representations through optimized representational learning. To address this issue, we introduce the deep semantic fusion module (DSFM). When only DSFM is used in the baseline model, yielding mAP scores of 95.08% on the RSOD benchmark and 94.47% on the NWPU VHR-10 validation set, achieving improvements of 1.44% and 1.17%, respectively. These results suggest that DSFM effectively extracts high-level deep semantic features and enhances detection accuracy even in the absence of SDAM’s complementary advantages.

4.5.3. Effect of SCAM

For remote sensing images with complex backgrounds, contextual information is essential, as it provides additional clues for object recognition. To exploit this, we propose the spatial context awareness module (SCAM). As shown in Table 4, when only SCAM is used, the mAP reaches 94.73% on RSOD and 94.71% on NWPU VHR-10, with improvements of 1.09% and 1.41%, respectively. Positioned before the detection head, SCAM effectively leverages both global and local contextual information, enhancing the model’s ability to classify and localize objects. When SCAM, SDAM, and DSFM are combined, the mAP further increases to 96.79% on RSOD and 96.01% on NWPU VHR-10, achieving improvements of 3.15% and 2.31% over the baseline model. This significant performance gain highlights the strong complementarity between these three modules. Their combined effect effectively alleviates issues like intricate backgrounds and variations in object scales, ultimately enhancing the model’s detection accuracy.

4.5.4. Effect of ADIoU

To tackle the challenges arising from object scale variations in RSI, the auxiliary dynamic IoU (ADIoU) loss function is proposed as the localization loss in our model. By incorporating ADIoU, the mAP further improves from 96.79% to 97.12% on RSOD and from 96.01% to 96.56% on NWPU VHR-10. These results confirm that ADIoU effectively enhances the model’s accuracy by dynamically adjusting the impact of scale and position during localization. To better showcase the interpretability of ADIoU, we perform a qualitative comparison on the RSOD dataset, as illustrated in Figure 9. This includes visualizing the detection outcomes of ProSAF-Net both with and without ADIoU, in addition to those from the baseline model. Additionally, to highlight the optimization effect of ADIoU on object scale variations, all selected images contain objects with varying scales. As observed in Figure 10, compared to the third-row images, the fourth-row images demonstrate overall improved detection performance after adopting ADIoU as the localization loss. Moreover, the results exhibit a noticeable enhancement in contrast to the baseline, providing additional evidence of ADIoU’s effectiveness.

5. Conclusions

In this study, we present a novel approach for detecting objects in RSI, ProSAF-Net. First, we introduce the shallow detail aggregation module (SDAM) to extract rich shallow-detail information and spatial structure features. Then, we design the deep semantic fusion module (DSFM) to effectively capture deep semantic information. By integrating SDAM and DSFM, our method facilitates the interaction between fine-grained shallow details and high-level semantic features, thereby enhancing the model’s ability to perform object classification and localization. Unlike conventional unidirectional feature propagation methods, which may lead to mismatches between semantic and detailed information, our progressive feature fusion mechanism allows shallow features to be refined under semantic guidance, while deep features are enriched with detailed information. This allows the model to acquire richer and more distinctive feature representations. Considering the complex backgrounds in RSI, we further design the spatial context awareness module (SCAM), which fully leverages both global and local contextual information through interactive mechanisms. This helps effectively distinguish foreground objects from background clutter, enhancing the detection robustness of the model. Finally, we propose a new loss function named ADIoU, which combines the benefits of Inner-IoU and SDIoU. ADIoU adapts the impact of scale and position losses based on object scale variations, while also integrating an auxiliary bounding box mechanism to speed up the bounding box regression. Experimental outcomes show that our model achieves mAP scores of 97.12% on RSOD, 96.56% on NWPU VHR-10, and 76.1% on DIOR, outperforming all comparison algorithms. The hierarchical integration of shallow details, deep semantics, and spatial context information in ProSAF-Net establishes a novel design paradigm for multi-scale object detection in complex remote sensing scenarios, demonstrating strong robustness and generalization capability. In the future, we plan to extend ProSAF-Net to multi-modal data and investigate lightweight variants for deployment on resource-constrained platforms.

Author Contributions

Conceptualization, L.L. and J.W.; methodology, L.L. and Y.L.; software, L.L. and W.Q.; validation, L.L. and Y.L.; writing—original draft preparation, L.L.; writing—review and editing, L.L. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Project of the Jiangxi Provincial Department of Education under Grant GJJ190240 and the Key Research and Development Project of the Jiangxi Provincial Department of Science and Technology under Grant 20192BBEL50039.

Data Availability Statement

All datasets used for training and evaluating the performance of our proposed approach are publicly available and can be accessed from [45,46,47].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ProSAF-Net	Progressive Semantic-Aware Fusion Network
SDAM	Shallow Detail Aggregation Module
DSFM	Deep Semantic Fusion Module
SCAM	Spatial Context-Aware Module
ADL	Auxiliary Dynamic Loss
RCNN	Region Convolutional Neural Network
RPN	Region Proposal Network
mAP	Mean Average Precision
GAP	Global Average Pooling
GMP	Global Max Pooling
RSI	Remote Sensing Image

References

Zhang, X.; Zhang, T.; Wang, G.; Zhu, P.; Tang, X.; Jia, X.; Jiao, L. Remote sensing object detection meets deep learning: A metareview of challenges and advances. IEEE Geosci. Remote Sens. Mag. 2023, 11, 8–44. [Google Scholar] [CrossRef]
Ganci, G.; Cappello, A.; Bilotta, G.; Del Negro, C. How the variety of satellite remote sensing data over volcanoes can assist hazard monitoring efforts: The 2011 eruption of Nabro volcano. Remote Sens. Environ. 2020, 236, 111426. [Google Scholar] [CrossRef]
Xie, W.; Zhang, X.; Li, Y.; Lei, J.; Li, J.; Du, Q. Weakly supervised low-rank representation for hyperspectral anomaly detection. IEEE Trans. Cybern. 2021, 51, 3889–3900. [Google Scholar] [CrossRef] [PubMed]
Wang, Q.; Gao, J.; Yuan, Y. A joint convolutional neural networks and context transfer for street scenes labeling. IEEE Trans. Intell. Transp. Syst. 2018, 19, 1457–1470. [Google Scholar] [CrossRef]
Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
Wu, Y.; Zhang, K.; Wang, J.; Wang, Y.; Wang, Q.; Li, X. GCWNet: A global context-weaving network for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5619912. [Google Scholar] [CrossRef]
Zhang, Y.; Wu, C.; Zhang, T.; Liu, Y.; Zheng, Y. Self-attention guidance and multiscale feature fusion-based UAV image object detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6004305. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–23 June 2023; pp. 7464–7475. [Google Scholar]
Ruan, H.; Qian, W.; Zheng, Z.; Peng, Y. A Decoupled Semantic–Detail Learning Network for Remote Sensing Object Detection in Complex Backgrounds. Electronics 2023, 12, 3201. [Google Scholar] [CrossRef]
Liu, Y.; Li, Q.; Yuan, Y.; Du, Q.; Wang, Q. ABNet: Adaptive balanced network for multiscale object detection in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5614914. [Google Scholar] [CrossRef]
Gao, T.; Niu, Q.; Zhang, J.; Chen, T.; Mei, S.; Jubair, A. Global to local: A scale-aware network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5615614. [Google Scholar] [CrossRef]
Dong, X.; Qin, Y.; Fu, R.; Gao, Y.; Liu, S.; Ye, Y.; Li, B. Multiscale deformable attention and multilevel features aggregation for remote sensing object detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6510405. [Google Scholar] [CrossRef]
Zhao, W.; Kang, Y.; Chen, H.; Zhao, Z.; Zhao, Z.; Zhai, Y. Adaptively attentional feature fusion oriented to multiscale object detection in remote sensing images. IEEE Trans. Instrum. Meas. 2023, 72, 5008111. [Google Scholar] [CrossRef]
Zhang, Y.; Liu, T.; Yu, P.; Wang, S.; Tao, R. SFSANet: Multiscale Object Detection in Remote Sensing Image Based on Semantic Fusion and Scale Adaptability. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4406410. [Google Scholar] [CrossRef]
Wang, G.; Zhuang, Y.; Chen, H.; Liu, X.; Zhang, T.; Li, L.; Dong, S.; Sang, Q. FSoD-Net: Full-scale object detection from optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602918. [Google Scholar] [CrossRef]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
Luo, Y.; Cao, X.; Zhang, J.; Guo, J.; Shen, H.; Wang, T.; Feng, Q. CE-FPN: Enhancing channel information for object detection. Multimed. Tools Appl. 2022, 81, 30685–30704. [Google Scholar] [CrossRef]
Liu, N.; Celik, T.; Li, H.C. Gated ladder-shaped feature pyramid network for object detection in optical remote sensing images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 6001505. [Google Scholar] [CrossRef]
Yu, L.; Hu, H.; Zhong, Z.; Wu, H.; Deng, Q. GLF-Net: A target detection method based on global and local multiscale feature fusion of remote sensing aircraft images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4021505. [Google Scholar] [CrossRef]
Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. Mffsodnet: Multi-scale feature fusion small object detection network for uav aerial images. IEEE Trans. Instrum. Meas. 2024, 21, 5015214. [Google Scholar] [CrossRef]
Yi, Q.; Zheng, M.; Shi, M.; Weng, J.; Luo, A. AFANet: A Multi-Backbone Compatible Feature Fusion Framework for Effective Remote Sensing Object Detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6015805. [Google Scholar] [CrossRef]
Zhang, G.; Lu, S.; Zhang, W. CAD-Net: A context-aware detection network for objects in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10015–10024. [Google Scholar] [CrossRef]
Zhou, Y.; Hu, H.; Zhao, J.; Zhu, H.; Yao, R.; Du, W.L. Few-shot object detection via context-aware aggregation for remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6509605. [Google Scholar] [CrossRef]
Zhang, Z.; Gong, P.; Sun, H.; Wu, P.; Yang, X. Dynamic local and global context exploration for small object detection. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
Zhao, Z.; Du, J.; Li, C.; Fang, X.; Xiao, Y.; Tang, J. Dense tiny object detection: A scene context guided approach and a unified benchmark. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5606913. [Google Scholar] [CrossRef]
Ultralytics. ultralytics/yolov5: v7.0—YOLOv5 SOTA Realtime Instance Segmentation. 2022. [CrossRef]
Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, I.; Polosukhin, Ł. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An Advanced Object Detection Network. In Proceedings of the 24th ACM International Conference on Multimedia (ACM MM), Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520. [Google Scholar]
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 658–666. [Google Scholar]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef]
Zhang, H.; Xu, C.; Zhang, S. Inner-iou: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection. arXiv 2024, arXiv:2412.16986. [Google Scholar] [CrossRef]
Zhang, W.; Cong, M.; Wang, L. Algorithms for optical weak small targets detection and tracking. In Proceedings of the International Conference on Neural Networks and Signal Processing, Nanjing, China, 14–17 December 2003; Volume 1, pp. 643–647. [Google Scholar]
Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote. Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS: Improving Object Detection with One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
Zhou, Z.; Zhu, Y. KLDet: Detecting tiny objects in remote sensing images via kullback-leibler divergence. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4703316. [Google Scholar] [CrossRef]
Guo, Y.; Tong, X.; Xu, X.; Liu, S.; Feng, Y.; Xie, H. An anchor-free network with density map and attention mechanism for multiscale object detection in aerial images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6516705. [Google Scholar] [CrossRef]
Ma, W.; Wang, X.; Zhu, H.; Yang, X.; Yi, X.; Jiao, L. Significant feature elimination and sample assessment for remote sensing small objects’ detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5615115. [Google Scholar] [CrossRef]
Li, Z.; Wang, Y.; Xu, D.; Gao, Y.; Zhao, T. TBNet: A texture and boundary-aware network for small weak object detection in remote-sensing imagery. Pattern Recogn. 2025, 158, 110976. [Google Scholar] [CrossRef]
Wang, P.; Sun, X.; Diao, W.; Fu, K. FMSSD: Feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3377–3390. [Google Scholar] [CrossRef]
Tian, S.; Kang, L.; Xing, X.; Tian, J.; Fan, C.; Zhang, Y. A relation-augmented embedded graph attention network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1000718. [Google Scholar] [CrossRef]
Shi, L.; Kuang, L.; Xu, X.; Pan, B.; Shi, Z. CANet: Centerness-aware network for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5603613. [Google Scholar] [CrossRef]
Sun, Y.; Liu, W.; Gao, Y.; Hou, X.; Bi, F. A dense feature pyramid network for remote sensing object detection. Appl. Sci. 2022, 12, 4997. [Google Scholar] [CrossRef]
Xu, T.; Sun, X.; Diao, W.; Zhao, L.; Fu, K.; Wang, H. ASSD: Feature aligned single-shot detection for multiscale objects in aerial imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607117. [Google Scholar] [CrossRef]
Huang, W.; Li, G.; Chen, Q.; Ju, M.; Qu, J. CF2PN: A cross-scale feature fusion pyramid network based remote sensing target detection. Remote Sens. 2021, 13, 847. [Google Scholar] [CrossRef]
Cheng, G.; Si, Y.; Hong, H.; Yao, X.; Guo, L. Cross-scale feature fusion for object detection in optical remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 431–435. [Google Scholar] [CrossRef]
Sagar, A.S.; Chen, Y.; Xie, Y.; Kim, H.S. MSA R-CNN: A comprehensive approach to remote sensing object detection and scene understanding. Expert Syst. Appl. 2024, 241, 122788. [Google Scholar] [CrossRef]
Gao, T.; Liu, Z.; Zhang, J.; Wu, G.; Chen, T. A task-balanced multiscale adaptive fusion network for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5613515. [Google Scholar] [CrossRef]
Liu, J.; Li, S.; Zhou, C.; Cao, X.; Gao, Y.; Wang, B. SRAF-Net: A scene-relevant anchor-free object detection network in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5405914. [Google Scholar] [CrossRef]
Zhang, T.; Zhuang, Y.; Wang, G.; Dong, S.; Chen, H.; Li, L. Multiscale semantic fusion-guided fractal convolutional object detection network for optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5608720. [Google Scholar] [CrossRef]
Zhang, C.; Lam, K.M.; Wang, Q. Cof-net: A progressive coarse-to-fine framework for object detection in remote-sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5600617. [Google Scholar] [CrossRef]

Figure 1. Sample images from the DIOR dataset. (a) Images with complex backgrounds, (b,c) images with variations in object scales.

Figure 2. The overall framework of ProSAF-Net. SDAM represents the shallow detail aggregation module, DSFM denotes the deep semantic fusion module, and SCAM refers to the spatial context awareness module.

Figure 3. Structure of shallow detail aggregation module.

Figure 4. Structure of deep semantic fusion module.

Figure 5. Structure of spatial context awareness module.

Figure 6. Representative examples of detection outcomes from ProSAF-Net on the RSOD dataset.

Figure 7. Representative examples of detection outcomes from ProSAF-Net on the NWPU VHR-10 dataset. These images cover a variety of complex environments, such as mountainous areas, urban streets, and ocean surfaces. Additionally, they include objects of various scales, ranging from small objects like airplanes and ships to large objects such as ground track fields and basketball courts. Our method demonstrates high detection accuracy across all scenarios.

Figure 8. Representative examples of detection outcomes from ProSAF-Net on the DIOR dataset. The scenes in these images are more diverse and varied, with many small objects present. Moreover, it covers a wide range of categories in the dataset, which effectively demonstrates the generalization capability of our method. Overall, our method demonstrates high detection accuracy across all objects.

Figure 9. mAP–FPS comparison of eight detection models, where ProSAF-Net achieves the best balance between accuracy and speed.

Figure 10. Qualitative comparison of detection results: (a) original image, (b) baseline model detection, (c) ProSAF-Net detection without ADIoU, and (d) ProSAF-Net detection with ADIoU.

Table 1. Comparison with the current state-of-the-art on the RSOD dataset is presented. The top-performing results are shown in bold, and the runner-up results are underlined.

Method	mAP	Aircraft	Oil Tank	Overpass	Playground
soft-NMS [48]	86.60	76.10	90.30	81.30	98.80
Faster R-CNN [10]	88.10	71.30	90.7	90.90	99.70
YOLOv3 [12]	89.40	88.60	94.50	75.90	99.90
FPN [13]	90.91	90.58	94.47	80.18	98.49
Cascade R-CNN [49]	91.30	94.20	96.10	83.20	99.00
KLDet [50]	93.22	94.46	94.88	86.58	96.95
YOLOv5 [33]	93.64	95.58	97.79	81.68	99.50
ABNet [16]	94.17	91.49	96.14	89.61	99.44
DA²FNet [51]	94.78	95.72	97.73	90.56	95.12
SESA-Net [52]	95.15	95.69	97.89	90.23	96.82
TBNet [53]	96.98	95.24	96.76	95.92	100
ProSAF-Net (Ours)	97.12	96.56	98.37	94.71	98.86

Table 2. Comparison with the current state-of-the-art methods on the NWPU VHR-10 dataset. The top-performing results are shown in bold, and the runner-up results are underlined.

Method	mAP	AP	SH	ST	BD	TC	BC	GTF	HA	BR	VE
YOLOv3 [12]	87.27	99.55	81.82	80.30	98.26	80.56	81.82	99.47	74.31	89.61	86.98
Faster R-CNN [10]	88.30	99.70	85.58	99.27	95.93	88.22	92.08	99.73	92.11	43.37	86.60
FMSSD [54]	90.40	99.70	89.90	90.30	98.20	86.00	96.80	99.60	75.60	80.10	88.20
CAD-Net [29]	91.50	97.00	77.90	95.60	93.60	87.60	87.10	99.60	100	86.20	89.90
EGAT [55]	92.01	97.32	96.72	97.16	96.51	86.64	94.46	94.18	86.18	80.14	90.79
Cascade R-CNN [49]	92.19	99.54	88.53	95.98	94.46	94.02	88.21	97.16	91.45	82.30	90.25
YOLOv7 [14]	92.94	99.52	92.60	96.61	97.78	93.85	90.78	95.60	89.54	83.72	89.38
YOLOv5 [33]	93.30	99.50	94.80	84.75	97.94	97.36	94.79	99.00	89.22	86.76	88.84
CANet [56]	93.33	99.99	85.99	99.27	97.28	97.80	84.77	98.38	90.38	89.16	90.25
ABNet [16]	94.21	100	92.58	97.77	97.76	99.26	95.98	99.86	94.26	69.04	95.62
TBNet [53]	95.44	100	95.09	97.35	97.35	97.55	98.96	100	94.64	80.77	92.71
ProSAF-Net (Ours)	96.56	99.50	98.01	98.07	98.73	89.56	95.18	98.60	94.05	91.69	93.68

Table 3. Comparison with the current state-of-the-art methods on the DIOR dataset. The top-performing results are shown in bold, and the runner-up results are underlined.

Method	mAP	AL	AT	BF	BC	BG	CM	DM	EA	ES	GC	GF	HB	OP	SP	SD	ST	TC	TS	VH	WM
Faster R-CNN [10]	54.1	53.6	49.3	78.8	66.2	28.0	70.9	62.3	69.0	55.2	68.0	56.9	50.2	50.1	27.7	73.0	39.8	75.2	38.6	23.6	45.4
YOLOv3 [12]	57.1	72.2	29.2	74.0	78.6	31.2	69.7	56.9	48.6	54.4	31.1	61.1	44.9	49.7	87.4	70.6	68.7	87.3	29.4	48.3	78.7
CenterNet [7]	63.9	73.6	58.0	69.7	88.5	36.2	76.9	47.9	52.7	53.9	60.5	62.6	45.7	52.6	88.2	63.7	76.2	83.7	51.3	54.4	79.5
CF2PN [59]	67.3	78.3	78.3	76.5	88.4	37.0	71.0	59.9	71.2	51.2	75.6	77.1	56.8	58.7	76.1	70.6	55.5	88.8	50.8	36.9	86.4
CSFFNet [60]	68.0	57.2	79.6	70.1	87.4	46.1	76.6	62.7	82.6	73.2	78.2	81.6	50.7	59.5	73.3	63.4	58.5	85.9	61.9	42.9	86.9
DFPN-YOLO [57]	69.3	80.2	76.8	72.7	89.1	43.4	76.9	72.3	59.8	56.4	74.3	71.6	63.1	58.7	81.5	40.1	74.2	85.8	73.6	49.7	86.5
SRAF-Net [63]	69.7	88.4	76.5	92.6	87.9	35.8	83.8	58.6	86.8	66.8	76.4	82.8	16.2	58.0	59.4	80.9	55.6	90.6	52.0	53.2	91.0
MSFC-Net [64]	70.1	85.8	76.2	74.4	91.1	44.2	78.1	55.5	60.9	59.5	76.9	73.7	49.6	57.2	89.6	69.2	76.5	86.7	51.8	55.2	84.3
ASSD [58]	71.1	85.6	82.4	75.8	89.5	40.7	77.6	64.7	67.1	61.7	80.8	78.6	62.0	58.0	84.9	76.7	65.3	87.9	62.4	44.5	76.3
FSoD-Net [21]	71.8	88.9	66.9	86.8	90.2	45.5	79.6	48.2	86.9	75.5	67.0	77.3	53.6	59.7	78.3	69.9	75.0	91.4	52.3	52.0	90.6
YOLOv5 [33]	72.0	81.3	81.2	74.0	90.1	45.9	79.7	59.7	64.2	63.8	78.7	74.9	61.3	61.3	89.6	69.2	79.3	89.0	58.8	57.5	80.9
ABNet [16]	72.8	66.8	84.0	74.9	87.7	50.3	78.2	67.8	85.9	74.2	79.7	81.2	55.4	61.6	75.1	74.0	66.7	87.0	62.2	53.6	89.1
TMAFNet [62]	72.9	92.2	77.7	75.0	91.3	47.1	78.6	53.6	67.1	66.2	78.5	76.3	64.9	61.4	90.5	72.2	75.4	90.7	62.1	55.2	83.2
MSA R-CNN [61]	74.3	92.9	73.8	93.2	87.3	43.0	90.6	58.9	69.2	58.0	83.3	84.2	57.3	62.4	68.8	91.8	81.3	90.9	53.7	72.2	74.5
SESA-Net [52]	74.6	93.5	78.2	91.9	82.2	46.8	89.3	61.4	67.2	67.5	73.3	76.7	55.8	62.0	78.8	84.8	82.4	89.8	54.4	76.7	79.1
CoF-Net [65]	75.8	84.0	85.3	82.6	90.0	47.1	80.7	73.3	89.3	74.0	84.5	83.2	57.4	62.2	82.9	77.6	68.2	89.9	68.7	49.3	85.2
ProSAF-Net (Ours)	76.1	76.3	85.6	82.7	90.3	48.9	80.2	68.0	86.8	80.6	79.9	80.7	61.5	62.5	90.2	73.4	74.4	90.9	62.3	56.2	90.0

Table 4. Ablation study on the RSOD and NWPU VHR-10 datasets. The mAP(R) represents the mAP on the RSOD dataset, while mAP(N) represents the mAP on the NWPU VHR-10 dataset.

Exp	SDAM	PSFM	SCAM	ADIoU	mAP (R) (%)	mAP (N) (%)	Params(M)	FLOPs(G)
1	✗	✗	✗	✗	93.64	93.30	7.03	16.0
2	✓	✗	✗	✗	95.37	94.39	8.01	18.3
3	✗	✓	✗	✗	95.08	94.47	7.45	16.3
4	✗	✗	✓	✗	94.73	94.71	7.47	16.6
5	✓	✓	✗	✗	95.93	95.30	8.41	18.6
6	✓	✓	✓	✗	96.79	96.01	8.85	19.2
7	✓	✓	✓	✓	97.12	96.56	8.85	19.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, L.; Wang, J.; Liao, Y.; Qian, W. A Progressive Semantic-Aware Fusion Network for Remote Sensing Object Detection. Appl. Sci. 2025, 15, 4422. https://doi.org/10.3390/app15084422

AMA Style

Li L, Wang J, Liao Y, Qian W. A Progressive Semantic-Aware Fusion Network for Remote Sensing Object Detection. Applied Sciences. 2025; 15(8):4422. https://doi.org/10.3390/app15084422

Chicago/Turabian Style

Li, Lerong, Jiayang Wang, Yue Liao, and Wenbin Qian. 2025. "A Progressive Semantic-Aware Fusion Network for Remote Sensing Object Detection" Applied Sciences 15, no. 8: 4422. https://doi.org/10.3390/app15084422

APA Style

Li, L., Wang, J., Liao, Y., & Qian, W. (2025). A Progressive Semantic-Aware Fusion Network for Remote Sensing Object Detection. Applied Sciences, 15(8), 4422. https://doi.org/10.3390/app15084422

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Progressive Semantic-Aware Fusion Network for Remote Sensing Object Detection

Abstract

1. Introduction

2. Related Work

2.1. Feature Fusion

2.2. Contextual Information Exploration

3. Proposed Method

3.1. Overall Framework

3.2. Shallow Detail Aggregation Module

3.3. Deep Semantic Fusion Module

3.4. Spatial Context Awareness Module

3.5. Auxiliary Dynamic Loss Function

4. Experimental Results and Discussion

4.1. Datasets

4.1.1. RSOD Dataset

4.1.2. NWPU VHR-10 Dataset

4.1.3. DIOR Dataset

4.2. Experimental Details

4.3. Evaluation Metrics

4.4. Experimental Results and Discussion

4.4.1. Results on the RSOD Dataset

4.4.2. Results on the NWPU VHR-10 Dataset

4.4.3. Results on the DIOR Dataset

4.4.4. Computational Efficiency Analysis

4.5. Ablation Study

4.5.1. Effect of SDAM

4.5.2. Effect of DSFM

4.5.3. Effect of SCAM

4.5.4. Effect of ADIoU

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI