1. Introduction
With the rapid advancement of Earth observation technologies, high-resolution remote sensing imagery has become an indispensable data foundation for numerous fields, including environmental monitoring, urban planning, and national defense security [1,2]. Among these, object detection, as the core technology for automatically identifying and locating objects of interest within vast image datasets, directly impacts the efficiency and accuracy of information extraction, and its performance remains a long-standing research focus and challenge in this domain [3,4].
Although deep learning-based object detection models, particularly single-stage detectors, have achieved remarkable success on natural images [5], their direct application to remote sensing imagery remains fraught with significant challenges. This stems primarily from the inherent characteristics of remote sensing images. First, the background is highly complex and contains many ground object types, so objects of interest are often obscured by large numbers of similar objects or intricate textures, resulting in insufficient feature discriminability. Second, object scales vary extremely: large facilities spanning hundreds of pixels coexist with small vehicles comprising only tens of pixels, placing high demands on the model's feature extraction and multi-scale perception capabilities. Third, the pronounced class imbalance makes it difficult for models to adequately learn the features of minority classes [6,7].
Researchers have proposed a series of improvements from different perspectives to address these challenges, but existing studies tend to optimize individual components of detection networks in isolation, such as feature extraction, regression strategies, or activation functions [8]. This paradigm largely overlooks the potential synergistic effects among components within the network and fails to establish a coordinated, mutually reinforcing optimization mechanism at the system level, which has become a critical bottleneck limiting further improvements in model performance.
Building on this observation, this paper proposes an object detection algorithm named AFDNet. Its core concept is to move beyond traditional isolated optimization by constructing an internal collaborative evolution mechanism, which promotes mutual reinforcement and co-optimization among the network's three core components (feature perception, bounding box regression, and nonlinear transformation) during training, forming a powerful internal optimization cycle. The main contributions of this paper are as follows:
(1) Introduces a channel and spatial dual-attention module (CSDAM) that integrates contextual information from both spatial and channel dimensions through parallel attention paths, adaptively enhancing key object features while effectively suppressing complex background interference.
(2) Designs a dynamic bounding box optimization module (DBBOM) that significantly improves localization accuracy and regression stability for multi-scale objects by incorporating distance-aware and scale-normalization strategies.
(3) Introduces a Gaussian error linear activation unit (GeLU), offering superior nonlinear fitting capability while maintaining computational efficiency and thereby enhancing the model's representation of fine-grained features.
(4) Comprehensive experiments on the public remote sensing datasets RSOD and NWPU VHR-10 show that AFDNet significantly outperforms existing mainstream methods on key metrics such as mAP@50 and F1-Score, fully verifying the effectiveness of the co-evolution mechanism and the superiority of the proposed algorithm.
This study proposes a novel approach to address the aforementioned challenges by establishing an internal co-evolution mechanism. The following sections will provide a detailed discussion on the specific design and implementation of the AFDNet algorithm, experimental validation, and analysis of results.
3. Algorithm Model
The overall architecture of the AFDNet algorithm is shown in Figure 1. Following the general design paradigm of single-stage detectors, it consists of three components: the backbone network, the Feature Pyramid Network (Neck), and the detection head. The core of the algorithm lies in the targeted design and optimization of the backbone network and detection head through a collaborative evolution mechanism.
Backbone: The backbone is responsible for extracting multi-level feature representations from input images. This algorithm adopts a CSPDarknet-based structure as its foundation, integrating CSDAM and GeLU to enhance its feature extraction capabilities. The network input size is 640 × 640 × 3. Through convolution and downsampling operations, it progressively generates feature maps at different scales (e.g., 160 × 160, 80 × 80, 40 × 40, and 20 × 20), providing foundational features for subsequent multi-scale detection.
Feature Pyramid Network: The Feature Pyramid Network (FPN) receives outputs from the backbone network, fusing features from different levels to enhance the model’s detection capability for multi-scale objects. This algorithm employs the Path Aggregation Network (PANet) structure in this component. By utilizing bidirectional pathways (top-down and bottom-up) combined with lateral connections, this architecture effectively merges deep semantic information with shallow localization information, constructing an enhanced multi-scale feature pyramid (P3, P4, and P5).
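To make the bidirectional fusion concrete, the following is a minimal PyTorch sketch of a PANet-style neck: a top-down path with lateral connections followed by a bottom-up path. The class name SimplePAN, the channel widths, and the 1x1/3x3 layer choices are illustrative assumptions, not the exact AFDNet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplePAN(nn.Module):
    def __init__(self, channels=(256, 512, 1024), out_ch=256):
        super().__init__()
        # lateral 1x1 convs align the backbone outputs (C3, C4, C5)
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)
        # 3x3 convs smooth the fused top-down maps
        self.td_conv = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in channels)
        # strided convs carry shallow localization cues back up (bottom-up path)
        self.bu_conv = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1) for _ in channels[:-1])

    def forward(self, c3, c4, c5):
        # top-down: deep semantics flow to shallow levels
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p3, p4, p5 = (conv(p) for conv, p in zip(self.td_conv, (p3, p4, p5)))
        # bottom-up: shallow localization information flows to deep levels
        n3 = p3
        n4 = p4 + self.bu_conv[0](n3)
        n5 = p5 + self.bu_conv[1](n4)
        return n3, n4, n5  # enhanced pyramid (P3, P4, P5)
```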
Detection Head: The detection head receives multi-scale features from the feature pyramid and performs the final classification and localization tasks. This algorithm employs a decoupled design for the detection head, comprising two independent branches.
Classification Branch (Cls): This branch consists of Conv + BN + GeLU operations and is responsible for predicting the confidence scores of object categories contained within each anchor box.
Regression Branch (Bbox): This branch integrates DBBOM to predict the precise location and dimensions of the object’s bounding box.
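A hedged sketch of the decoupled head described above is given below: a Conv + BN + GeLU classification branch and a separate regression branch. The class name DecoupledHead, the number of stacked layers, and the anchors-per-location value are illustrative assumptions; the DBBOM loss applied to the regression outputs is discussed in Section 3.2.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=10, num_anchors=1):
        super().__init__()
        def conv_bn_gelu(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out), nn.GELU())
        self.cls_branch = nn.Sequential(conv_bn_gelu(in_ch, in_ch), conv_bn_gelu(in_ch, in_ch))
        self.reg_branch = nn.Sequential(conv_bn_gelu(in_ch, in_ch), conv_bn_gelu(in_ch, in_ch))
        self.cls_pred = nn.Conv2d(in_ch, num_anchors * num_classes, 1)  # per-class confidences
        self.box_pred = nn.Conv2d(in_ch, num_anchors * 4, 1)            # box offsets (x, y, w, h)

    def forward(self, x):
        return self.cls_pred(self.cls_branch(x)), self.box_pred(self.reg_branch(x))
```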
Through deep feature extraction from the backbone network, multi-scale information fusion via the feature pyramid, and precise classification and localization by the detection head, this architecture constructs a comprehensive co-evolutionary detection framework. It provides robust technical support for object recognition in complex remote sensing scenarios.
3.1. Channel and Spatial Dual-Attention Module
In remote sensing image object detection, complex terrain backgrounds often intertwine with key objects, posing a severe challenge to the model’s discrimination capabilities. Traditional attention mechanisms typically employ serial structures or simple parallel approaches, struggling to fully capture the deep interactions between the channel and spatial dimensions. Furthermore, these methods often normalize attention weights using the softmax function, which is prone to vanishing or exploding gradients when encountering extreme input values, thereby limiting the model’s convergence stability and robustness. To overcome these limitations, we designed the Channel and Spatial Dual-Attention Module (CSDAM). Through a parallel dual-path architecture and an innovative feature fusion mechanism, CSDAM achieves the precise enhancement of key features while effectively suppressing background interference.
While preserving the efficiency of lightweight attention mechanisms, this module achieves performance breakthroughs by reconstructing feature interaction mechanisms. Compared to classical methods like CBAM, CSDAM employs a collaborative attention paradigm based on batch normalization (BN), establishing a dynamically coupled weight generation mechanism between the channel and spatial dimensions.
In the channel attention branch, the module innovatively utilizes the scaling factor inherent in the batch normalization operation to characterize the importance of each channel. The magnitude of this factor is learned during training, and its physical meaning is to measure the degree of variation in the corresponding channel features across batches of data. The batch normalization operation itself is defined as shown in Equation (1).
The channel attention sub-module is shown in Figure 2, where the BN scaling factor of each channel serves as that channel's importance weight. The sub-module takes the feature map as input and outputs the re-weighted feature map given by Equation (2).
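To make the channel branch concrete, the following is a hedged PyTorch sketch in which channel importance is taken from the BatchNorm scaling factors, normalized across channels, and mapped to attention weights with a sigmoid. The class name ChannelAttentionBN and the exact normalization may differ from the form used in Equation (2); this follows the common BN-based attention formulation.

```python
import torch
import torch.nn as nn

class ChannelAttentionBN(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn(x)
        gamma = self.bn.weight.abs()          # per-channel BN scaling factors
        weight = gamma / gamma.sum()          # relative channel importance
        out = out * weight.view(1, -1, 1, 1)  # scale each channel by its importance
        return x * torch.sigmoid(out)         # re-weight the input features
```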
In the spatial attention branch, we extend the normalization concept to evaluate the importance of each position in the feature map. This branch first processes the input features through a standard convolutional layer to obtain an intermediate feature map. It then applies batch normalization along the spatial dimension and generates spatial attention weights via the sigmoid function with pixel normalization, as shown in Equation (3); the equation involves the spatial batch normalization operator and the weights of the preceding convolutional layer. The resulting pixel attention map is illustrated in Figure 3.
This parallel fusion mechanism, combined with a weight generation strategy based on batch-normalized statistics, enables the CSDAM to reliably enhance key features and suppress background interference while maintaining lightweight computational properties.
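Below is a hedged sketch of the spatial branch and the parallel fusion. The spatial branch follows the description above (convolution, BN over the response, sigmoid pixel weights); the fusion function simply sums the two branch outputs, which is an assumption about the exact combination used in CSDAM. The names SpatialAttentionBN and csdam_fuse are illustrative; the channel branch can be the ChannelAttentionBN class from the previous sketch.

```python
import torch
import torch.nn as nn

class SpatialAttentionBN(nn.Module):
    """Spatial branch: conv -> BN over the response -> sigmoid pixel weights."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        s = self.bn(self.conv(x))   # intermediate feature, spatially normalized
        return x * torch.sigmoid(s) # per-pixel re-weighting

def csdam_fuse(x, channel_branch, spatial_branch):
    # Parallel dual-path fusion: both branches see the same input and their
    # outputs are combined (a simple sum here; the paper's fusion may differ).
    return channel_branch(x) + spatial_branch(x)
```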
3.2. Bounding Box Regression Optimization Module
The performance of object detection largely depends on the accuracy of bounding box regression. Traditional IoU and its variants face significant challenges when handling extreme scale variations and complex scene distributions in remote sensing images: on one hand, training data inevitably contains numerous low-quality samples, and traditional geometric penalty mechanisms excessively amplify the loss weights of these samples, causing the model optimization direction to deviate; on the other hand, the enormous scale differences make it difficult for traditional methods to establish balanced regression gradients across objects of different sizes.
To address this issue, we propose the dynamic bounding box optimization module (DBBOM). This module significantly enhances the accuracy and robustness of bounding box regression by establishing a synergistic mechanism that is both distance-aware and scale-adaptive. Unlike traditional static loss functions, DBBOM introduces a dynamic weight adjustment strategy, enabling the model to adaptively shift its regression focus based on object characteristics.
The core innovation of DBBOM lies in constructing a dual-attention mechanism: the distance attention module and the scale-aware module. The distance attention module dynamically adjusts regression weights for samples at different positions by establishing a spatial relationship model between the predicted boxes and ground-truth boxes. This mechanism is represented as shown in Equation (4).
In Equation (4), the distance attention weight is the constructed distance attention mechanism, and its calculation is given by Equation (5).
This design embodies three theoretical advantages. First, the distance attention weight is introduced as a dynamic gradient scaling factor, drawing motivation from established loss modification techniques such as Focal Loss, which prioritizes hard or misclassified samples, and the scaling factors used in IoU-based losses to smooth the loss for small gradients. Specifically, the weight serves as an adaptive mechanism to amplify the prediction gradient of samples with moderate optimization potential (i.e., those of average quality or low confidence), guiding the model to focus on these challenging yet trainable predictions. This mechanism provides a theoretically grounded way to achieve balanced sample optimization, preventing the gradient signal from being dominated by already high-quality or very low-quality samples. Second, the denominator term acts as a gradient clipping mechanism, mitigating the impact of abnormal gradients on training stability. Finally, when prediction boxes exhibit high overlap with ground-truth boxes, the mechanism automatically strengthens the optimization of center-point distances, enabling an intelligent switch between regression stages. The denominator terms denote the dimensions of the minimum enclosing box, and the superscript * indicates that this term is detached from the computational graph to prevent convergence slowdowns.
During training, this module exhibits useful adaptive properties: for high-quality anchor boxes, the center-distance term approaches zero, naturally reducing the influence of the distance attention weight; for average-quality samples, its amplification effect ensures that the model receives sufficient optimization signals. This dynamic balancing effectively mitigates the issue of low-quality samples dominating the training process and significantly enhances the model's generalization capability in complex remote sensing scenarios.
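The following is a minimal sketch of a distance-aware, dynamically weighted IoU regression loss in the spirit of DBBOM. The exponential distance-attention term and the detached enclosing-box normalizer follow the description of Equations (4) and (5) above, but the function name dbbom_like_loss and the exact functional form are illustrative assumptions, not AFDNet's verbatim implementation.

```python
import torch

def dbbom_like_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    # intersection over union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared center-point distance between prediction and ground truth
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    dist2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    # squared diagonal of the minimum enclosing box, detached from the graph
    # so the normalizer acts only as a scale (the "*" separation noted above)
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    diag2 = ((ex2 - ex1) ** 2 + (ey2 - ey1) ** 2).detach()

    # distance attention weight: amplifies gradients for moderately misaligned boxes
    r_dist = torch.exp(dist2 / (diag2 + eps))
    return (r_dist * (1.0 - iou)).mean()
```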
This module forms deep synergy with the previously mentioned CSDAM feature enhancement module. The refined feature representations provided by CSDAM lay a solid foundation for bounding box regression, while this module further optimizes regression accuracy through an intelligently weighted loss function. The two modules mutually reinforce each other, constituting a complete co-evolutionary system: high-quality features generate more accurate initial predictions, and precise regression loss in turn guides the optimization direction of the feature extraction network.
3.3. Gaussian Error Linear Unit
In deep neural networks, activation functions are the core components that introduce nonlinearity, and their design decisively affects a model's representational capacity and training stability. From traditional ReLU to Swish and Mish, researchers have sought to enhance network expressiveness by refining activation functions. However, existing choices still face limitations when handling complex feature patterns in remote sensing images: the hard zero boundary of ReLU leads to dead neurons; functions such as Swish mitigate gradient vanishing but carry higher computational complexity; and although GeLU performs exceptionally well in Transformers, its exact form is costly to deploy in computationally intensive detection tasks.
To address these issues, this paper adopts the Gaussian Error Linear Unit (GeLU), which is built on the Gaussian error function, as the activation throughout AFDNet. The core idea of this unit is to integrate stochastic regularization into the activation process, achieving a smoother nonlinear transformation through probabilistic modeling. The complete mathematical definition of GeLU is shown in Equation (6).
Here, Φ(x) denotes the cumulative distribution function of the standard normal distribution, whose specific expression is given by Equation (7).
Here, erf(·) is the Gaussian error function. From a probabilistic perspective, this function can further be expressed in expectation form, as shown in Equation (8):
where μ and σ represent the mean and standard deviation of the normal distribution, respectively. Considering the difficulty of directly computing the above integrals, we derive two efficient computational approximations, the first of which is shown in Equation (9).
A more concise sigmoid approximation is derived, as shown in Equation (10).
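For reference, the following sketch implements the exact GeLU and its two standard approximations, which the text's Equations (6)-(10) appear to describe. The function names and the numerical constants are the published GeLU forms, not AFDNet-specific values.

```python
import math
import torch

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF expressed via erf
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh-based approximation
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    # lighter sigmoid-based approximation
    return x * torch.sigmoid(1.702 * x)

x = torch.linspace(-4, 4, 9)
print(gelu_exact(x), gelu_tanh(x), gelu_sigmoid(x), sep="\n")
```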
Compared to the ReLU function, the GeLU function possesses a non-zero gradient in the negative value region, thereby avoiding the issue of dead neurons. Additionally, GeLU exhibits greater smoothness near zero than ReLU, facilitating easier convergence during training. It is worth noting that GeLU involves more complex computations, thus requiring greater computational resources. Graphs of the ReLU and GeLU functions are shown in Figure 4 and Figure 5.
The core advantages of GeLU underpin its effectiveness in remote sensing object detection. Its probability-based design keeps a non-zero gradient in the negative region, which effectively alleviates the dead neuron problem, while its continuous differentiability near the origin ensures stable gradient propagation. These nonlinear fitting and activation characteristics enhance the model's ability to extract detailed information under low-resolution or weak-object conditions and enable deep collaboration with the CSDAM feature enhancement module and the dynamic bounding box optimization module, jointly building a high-performance detection framework.
Within AFDNet’s co-evolutionary framework, GeLU serves as a foundational component. This unit forms a deep complementarity with the CSDAM: GeLU’s smooth nonlinear transformation provides a robust foundation for feature enhancement, while its stable gradient flow ensures training convergence for the dynamic bounding box optimization module.
4. Experimental Results and Analysis
4.1. Experimental Setup
Table 1 details the hardware and software configurations used in the experiments. Training was performed on NVIDIA RTX 4090 GPUs (NVIDIA, Santa Clara, CA, USA), providing a robust computational foundation. The software environment runs on the Windows 11 operating system, using PyTorch 2.1.0 with CUDA 12.1, and data analysis was performed with libraries such as Matplotlib 3.7.1. Regarding training parameters, input images were uniformly resized to 640 × 640 pixels; the initial learning rate was set to 0.01 with the Adam optimizer, the batch size was four, and training ran for 100 epochs.
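For convenience, the stated training settings are collected below as a plain Python dictionary. Only values given in the text are included; any further hyperparameters (e.g., weight decay or learning-rate schedule) are unspecified and therefore omitted.

```python
# Experimental configuration as reported above (Table 1 and the text).
train_config = {
    "img_size": (640, 640),
    "optimizer": "Adam",
    "initial_lr": 0.01,
    "batch_size": 4,
    "epochs": 100,
    "framework": "PyTorch 2.1.0 + CUDA 12.1",
    "gpu": "NVIDIA RTX 4090",
}
```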
In terms of implementation details, the AFDNet model is built upon the PyTorch framework, adopting the standard workflow of single-stage detectors. This includes a backbone feature extraction network, a multi-scale feature fusion module, and a detection head. While the foundational implementation references industry-recognized mature design paradigms to ensure fair comparisons, the core contribution of this paper lies in proposing a collaborative evolution mechanism and instantiating three key modules: CSDAM, DBBOM, and GeLU. These modules constitute the core innovation of AFDNet, whose design philosophy is versatile enough to be independently transferred to other detection frameworks. A progressive learning strategy is employed during training, dynamically adjusting loss function weights to balance classification and regression tasks, ensuring the three innovative modules can be co-optimized.
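A hedged sketch of the progressive strategy mentioned above is shown below: the relative weights of the classification and regression losses are adjusted over the training schedule. The linear ramp and the specific weight values are assumptions for illustration; the paper does not specify the exact schedule.

```python
def loss_weights(epoch, total_epochs=100, w_cls=1.0, w_reg_start=0.5, w_reg_end=1.5):
    # Linearly shift emphasis toward regression as training progresses (assumed schedule).
    t = epoch / max(total_epochs - 1, 1)
    w_reg = w_reg_start + t * (w_reg_end - w_reg_start)
    return w_cls, w_reg

def total_loss(cls_loss, reg_loss, epoch):
    # Combine the two task losses with the epoch-dependent weights.
    w_cls, w_reg = loss_weights(epoch)
    return w_cls * cls_loss + w_reg * reg_loss
```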
4.2. Dataset Introduction
The RSOD dataset is an open-source dataset for object detection in remote sensing images, which has attracted significant attention due to its substantial number of image samples and diverse object categories. It covers four representative types of ground objects: aircraft, oil tanks, overpasses, and playgrounds, providing researchers with abundant resources for training and testing. Specifically, the dataset includes approximately 446 aircraft images (containing 4993 objects), 165 oil tank images (containing 1586 objects), 176 overpass images (containing 180 objects), and 189 playground images (containing 191 objects).
To enable a more comprehensive evaluation of algorithm performance, this paper also utilizes the NWPU VHR-10 dataset. As a highly influential benchmark in the field of remote sensing, it contains 10 common object categories: aircraft, ships, oil tanks, baseball fields, tennis courts, basketball courts, athletic tracks, harbors, bridges, and vehicles. The dataset consists of over 800 high-resolution remote sensing images, including 650 images with objects and 150 background images. With a wider variety of object types and more complex scenes, it is often used to validate the generalization capability and robustness of object detection algorithms in diverse remote sensing scenarios.
Figure 6 shows sample remote sensing images from both datasets, visually illustrating the characteristics of typical ground objects such as aircraft, oil tanks, overpasses, and playgrounds in remote sensing imagery. Together, the RSOD and NWPU VHR-10 datasets provide essential support for remote sensing object detection research—the former focuses on the fine-grained detection of specific categories, while the latter offers broader category coverage and scene diversity.
4.3. Indicator Analysis
To comprehensively evaluate model performance, this paper employs seven key metrics that are widely recognized in the object detection field, covering both detection accuracy and computational efficiency.
mAP (mean average precision) serves as a key metric for evaluating a model’s overall performance in multi-class object detection. This paper employs mAP@50, which represents the mean average precision value at an Intersection over Union (IoU) threshold of 0.5. This metric first calculates the average precision (AP) for each class, then averages all class AP values to obtain mAP@50.
Precision measures the proportion of True Positive samples correctly predicted as positive, reflecting the reliability of detection results.
Recall evaluates a model's ability to identify positive samples, calculated as the proportion of correctly detected positives relative to all ground-truth positives.
The F1-Score is the harmonic mean of precision and recall, providing a balanced and unified assessment of both.
The evaluation metric formulas are shown in Table 2.
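The snippet below computes precision, recall, and F1 from true positive, false positive, and false negative counts, matching the standard formulas referenced in Table 2. mAP@50 additionally requires per-class precision-recall curve integration at an IoU threshold of 0.5, which is omitted here for brevity; the example counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn, eps=1e-9):
    precision = tp / (tp + fp + eps)            # reliability of detections
    recall = tp / (tp + fn + eps)               # coverage of ground-truth objects
    f1 = 2 * precision * recall / (precision + recall + eps)  # harmonic mean
    return precision, recall, f1

print(precision_recall_f1(tp=90, fp=10, fn=20))  # illustrative counts
```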
FLOPs (Floating Point Operations) quantify the computational complexity of a model by measuring the total number of Floating Point Operations required for a single forward pass. This metric reflects the theoretical computational cost and is hardware-independent, making it crucial for evaluating model efficiency and suitability for resource-constrained environments.
The number of parameters indicates the model’s scale and complexity, representing the total learnable weights in the network. Fewer parameters generally suggest lower memory requirements and reduced risk of overfitting, while more parameters may indicate higher model capacity.
FPS (Frames Per Second) measures the practical inference speed by counting how many images a model can process per second. This hardware-dependent metric is crucial for real-time applications and is significantly influenced by the implementation environment, including GPU capability and software optimization.
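The following is a hedged sketch of how parameter count and FPS are typically measured for a PyTorch detector; FLOPs are usually obtained from a separate profiler (e.g., thop or fvcore), which is not shown. The tiny Sequential model is a placeholder standing in for AFDNet, and the warm-up/iteration counts are arbitrary.

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.GELU()).eval()  # placeholder network
params_m = sum(p.numel() for p in model.parameters()) / 1e6              # Parameters/M

x = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    for _ in range(5):                        # warm-up iterations
        model(x)
    start = time.perf_counter()
    n = 50
    for _ in range(n):
        model(x)
    fps = n / (time.perf_counter() - start)   # images processed per second
print(f"Params: {params_m:.2f} M, FPS: {fps:.1f}")
```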
These metrics collectively provide comprehensive insights into both the detection performance and practical deployment potential of the evaluated models.
4.4. Comparative Experimental Analysis
The performance of the proposed AFDNet model was comprehensively evaluated through comparative experiments against several mainstream object detection models on the RSOD and NWPU VHR-10 datasets. The comparison included single-stage detectors (like the YOLO series and SSD) and Transformer-based detectors (DETR, RT-DETR).
Table 3 compares the performance of AFDNet and the other object detection models on the RSOD dataset. Overall, AFDNet shows a clear advantage: it reaches 95.16% mAP@50, surpassing all other models, including the latest YOLO variants (e.g., 92.91% for YOLOv10x and 93.26% for YOLOv11) and other high-performance detectors (e.g., 93.51% for RT-DETR), giving it the highest detection accuracy on RSOD. Regarding the individual accuracy indicators, AFDNet's precision reaches 93.57%, second only to YOLOv5x, and its recall of 82.02% is the highest among all models reporting complete metrics, meaning it misses fewer objects. Its F1-Score of 85.50%, which balances precision and recall, also places it in the leading position. Notably, while maintaining top accuracy, AFDNet's GFLOPs (109.0) and parameter count (87.31 M) are better balanced than those of some high-precision models (e.g., YOLOX with 281.97 GFLOPs and 99.11 M parameters). Although its FPS of 63.84 is lower than that of lightweight models, given its substantial lead in mAP@50, AFDNet offers the best precision-recall trade-off and the highest detection accuracy on this dataset.
Table 4 summarizes the performance of each object detection model on the more challenging NWPU VHR-10 dataset. AFDNet again performs strongly, particularly in detection accuracy and recall. With an mAP@50 of 96.52%, AFDNet surpasses all comparison models, exceeding even the strongest baselines on this dataset, RT-DETR (95.89%) and YOLOv11 (95.63%), and further verifying its generalization ability and robustness. Among the detailed indicators, AFDNet's recall reaches 92.89%, the highest among all models reporting complete metrics and far above the others (e.g., 86.41% for YOLOv11), indicating a very low missed detection rate in complex scenes. At the same time, its precision of 95.42% ensures highly accurate detections, and its F1-Score of 93.80% also ranks first, demonstrating AFDNet's strong ability to accurately identify and localize objects in remote sensing images. In terms of computational overhead, AFDNet's FLOPs (109.0 G) and parameter count (87.31 M) remain within a reasonable range, indicating that its high performance does not simply come from a larger model. In summary, on the NWPU VHR-10 benchmark for high-resolution remote sensing object detection, AFDNet delivers the best overall performance, maintaining high detection accuracy (mAP@50) while significantly improving recall and F1-Score.
Overall, AFDNet's consistent advantages across the evaluation metrics verify the effectiveness of its co-evolutionary mechanism. The model achieves a substantial increase in recall while maintaining high precision, easing the classic trade-off between precision and recall in object detection. This performance stems from the synergistic contribution of the three core modules: CSDAM strengthens feature representation, DBBOM improves bounding box regression accuracy, and the GeLU unit guarantees the network's nonlinear expressive power.
4.5. Ablation Experiment Analysis
To fully verify the independent contributions and synergistic effects of the three core modules proposed in this paper (CSDAM, DBBOM, and GeLU), we use a generic single-stage detector architecture as the baseline model and conduct systematic ablation experiments on the RSOD and NWPU VHR-10 datasets. The results are presented in Table 5 and Table 6, respectively.
Table 5 shows the independent contributions and cumulative effects of the three core modules (CSDAM, DBBOM, and GeLU) on the RSOD dataset. Starting from the baseline model in Step 1 (mAP@50 of 90.64%), the gradual introduction of each module brings continuous and significant performance gains. Introducing CSDAM (Step 2) immediately raises mAP@50 to 91.49% and improves recall by 1.70 percentage points, verifying CSDAM's effectiveness in enhancing feature representation. Adding DBBOM (Step 3) then lifts mAP@50 to 93.05%, a further gain of 1.56 percentage points, clearly showing DBBOM's role in improving localization accuracy. Finally, with the module combination fixed, the nonlinear activation function is compared: GeLU (Step 6) achieves 95.16% mAP@50, exceeding Swish (94.61%) and Mish (95.02%) and demonstrating that GeLU provides the most stable and efficient nonlinear transformation. The synergy of the three modules ultimately brings AFDNet's mAP@50 to 95.16%, an improvement of 4.52 percentage points over the baseline, fully confirming the effectiveness of all the modules and their optimization potential on the RSOD dataset.
Table 6 shows the ablation results on the more challenging NWPU VHR-10 dataset, which further confirm the robustness of each module. Starting from the baseline model (Step 1, mAP@50 of 92.87%), introducing CSDAM (Step 2) brings a marked improvement, with mAP@50 reaching 95.13% (+2.26 percentage points) and recall surging from 70.15% to 85.71%, highlighting CSDAM's ability to accurately recall objects in complex high-resolution remote sensing images. Integrating DBBOM (Step 3) further raises mAP@50 to 95.77% and recall to 89.18%, confirming DBBOM's continued benefit to bounding box regression accuracy. Consistent with the RSOD results, GeLU (Step 6) again outperforms Swish (96.24%) and Mish (96.43%) in the activation function comparison, reaching the highest mAP@50 of 96.52% and showing that GeLU offers the most favorable nonlinearity while preserving localization accuracy. In summary, the three modules form a strong positive optimization loop on the NWPU VHR-10 dataset, stabilizing the model's mAP@50 at the best level of 96.52% and verifying their excellent performance and complementary synergy in complex remote sensing object detection.
Comprehensive analysis shows that CSDAM successfully enhances the discrimination of features; DBBOM focuses on optimizing the precise positioning of the target bounding box; and GeLU provides stable and efficient nonlinear support. These three innovative modules each play a unique and complementary key role. Their organic combination makes the performance of the entire detection system reach the optimal state, which fully proves the effectiveness and innovation of the proposed method.
4.6. Detection Effect
Figure 7 and Figure 8 present the visual detection results of four ablation models: (a) Baseline, (b) Baseline + CSDAM, (c) Baseline + CSDAM + DBBOM, and (d) AFDNet.
Figure 7 showcases the detection performance on the RSOD dataset, demonstrating results across four typical categories: aircraft (red), oil tank (green), overpass (cyan), and playground (purple). The models’ detection capabilities for remote sensing objects of varying scales and shapes are clearly visible, with each category accurately annotated by bounding boxes of distinct colors.
Figure 8 showcases the detection performance on the NWPU VHR-10 dataset, demonstrating results across four categories: aircraft (red), baseball field (green), ship (blue), and tennis court (cyan/red). These visualizations provide an intuitive representation of the models’ actual performance in complex remote sensing scenarios, offering crucial visual references for subsequent quantitative analysis.