Article

HAF-YOLO: Dynamic Feature Aggregation Network for Object Detection in Remote-Sensing Images

College of Weaponry Engineering, Naval University of Engineering, Wuhan 430030, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2708; https://doi.org/10.3390/rs17152708
Submission received: 23 June 2025 / Revised: 27 July 2025 / Accepted: 29 July 2025 / Published: 5 August 2025

Abstract

The growing use of remote-sensing technologies has placed greater demands on object-detection algorithms, which still struggle with small object sizes, complex backgrounds, large scale variation, and dense object distributions. To address these challenges and improve detection precision in remote-sensing images, this study proposes a hierarchical adaptive feature aggregation network (HAF-YOLO) that incorporates three core modules: the dynamic-cooperative multimodal fusion architecture (DyCoMF-Arch), the multiscale wavelet-enhanced aggregation network (MWA-Net), and the spatial-deformable dynamic enhancement module (SDDE-Module). DyCoMF-Arch builds a hierarchical feature pyramid using multistage spatial compression and expansion, with dynamic weight allocation to extract salient features. MWA-Net applies wavelet-transform-based convolution to decompose features, preserving high-frequency detail and enhancing the representation of small-scale objects. SDDE-Module integrates spatial coordinate encoding and multidirectional convolution to reduce localization interference and overcome the limitations of fixed sampling for geometric deformations. Experiments on the NWPU VHR-10 and DIOR datasets show that HAF-YOLO achieved mAP50 scores of 85.0% and 78.1%, improving on YOLOv8 by 4.8% and 3.1%, respectively, while maintaining a low computational cost of 11.8 GFLOPs and outperforming other YOLO models. Ablation studies validated the effectiveness of each module and their combined optimization. This study presents a novel approach for remote-sensing object detection, with theoretical and practical value.

1. Introduction

In recent years, the rapid development of remote-sensing technologies, coupled with a significant reduction in manufacturing costs, has driven their widespread adoption in various fields, such as agricultural monitoring [1], geological exploration [2], infrastructure inspection [3], and smart-city management [4]. Remote-sensing satellites provide wide-area geographical information from a high-altitude perspective, offering efficient data support for real-time monitoring [5], object tracking [6], and data analysis. However, images captured by unmanned aerial vehicles (UAVs) are often highly complex. This complexity arises from several factors, including intricate background interference, severe illumination variations, extremely small size of some objects, and wide variability in object scales. These factors make small-object detection one of the biggest challenges in the field of remote sensing [7]. Specifically, small objects in remote-sensing images, such as vehicles, small buildings, or vegetation, typically occupy only a tiny fraction of the image (e.g., fewer than 32 × 32 pixels). Their blurred feature representations and susceptibility to background interference further exacerbate the difficulty of accurate detection [8].
Traditional object-detection methods rely on handcrafted feature extraction algorithms, such as histogram of oriented gradients (HOG) [9] and scale-invariant feature transform (SIFT) [10]. Despite moderate success in earlier stages, these methods adapt poorly to complex scenes, particularly when addressing issues such as low object resolution and scale variation [11]. With breakthroughs in deep learning, convolutional neural networks (CNNs) [12] have significantly improved the robustness of object detection by automatically learning hierarchical feature representations. CNNs extract semantic information through convolution and down-sampling operations. However, multiple down-sampling stages often cause a loss of spatial details, thereby limiting detection precision [13].
To solve this issue, researchers have proposed multiscale feature fusion approaches, such as feature pyramid networks (FPNs) [13] and bidirectional feature pyramid networks (Bi-FPNs) [14], which enhance feature representation for small objects by integrating high-resolution shallow features with deep semantic information. Among single-stage detection algorithms, the You Only Look Once (YOLO) series has stood out because of its efficient end-to-end framework, which achieves a balance between accuracy and speed by integrating deeper backbone networks with attention mechanisms. However, object detection in remote-sensing images continues to pose unique challenges: Certain objects in satellite images are extremely small and densely distributed, leading to high false-negative rates; at the same time, the wide range of object scales imposes stringent requirements on the network’s multiscale detection capabilities.
To solve these issues, we propose a hierarchical adaptive fusion network (HAF-YOLO). The network consists of three main modules: the dynamic-cooperative multimodal fusion architecture (DyCoMF-Arch), multiscale wavelet-enhanced aggregation network (MWA-Net), and spatial-deformable dynamic enhancement module (SDDE-Module). The DyCoMF-Arch module constructs a hierarchical feature pyramid through multistage spatial compression and expansion operations and uses a dynamic weight allocation mechanism to accurately extract salient features from objects at multiple scales. The MWA-Net employs wavelet-transform-based convolution to decompose features and establishes both a high-frequency detail preservation mechanism and a dynamic weighted fusion strategy. The SDDE-Module integrates spatial coordinate encoding with multidirectional convolutional feature extraction to improve the model’s adaptability to the spatial distribution of small objects.
These innovations effectively address challenges related to scale variation, complex backgrounds, and the difficulty of small-object detection in remote-sensing data, while enhancing the network’s multiscale feature representation and small-object localization accuracy. Compared with existing methods, including traditional multiscale fusion networks such as FPN and Bi-FPN, our approach introduces more dynamic and adaptive mechanisms tailored for remote sensing complexities. FPN and Bi-FPN architectures are effective in fusing features across different scales, but they often adopt fixed fusion strategies, which may not fully adapt to the spatial and contextual variations of small objects in high-resolution remote sensing imagery. Similarly, YOLOv8 improves detection performance through architectural refinements and decoupled heads, yet it still suffers from limited adaptability in handling dense and geometrically variant small objects due to its fixed convolutional design and standard upsampling modules. In contrast, our proposed HAF-YOLO architecture introduces adaptive fusion strategies, including dynamic weight allocation, wavelet-enhanced multiscale decomposition, and spatial-deformable attention mechanisms. These enable our model to dynamically recalibrate feature importance across layers and regions, improving robustness against background clutter and geometric deformations.
The contributions of this study are summarized as follows:
  • We propose DyCoMF-Arch, which constructs a multigranularity feature space through multilevel progressive sampling operations. By integrating BiFPN-Concat for cross-level feature interaction, the architecture enhances multiscale object-detection capabilities under complex remote-sensing scenes.
  • We design MWA-Net as a replacement for the original C2f module. This network constructs multipath branches to fully extract image features and employs a dynamic fusion branch based on the BiFPN structure to achieve dynamic weighted feature fusion. This approach enhances the representation of small objects in complex backgrounds and solves the issue of feature detail loss encountered by the C2f module in remote-sensing scenes.
  • To address the limitations of the YOLOv8 detection head in detecting small objects in remote-sensing images, we propose SDDE-Module. This module introduces a coordinate enhancement layer to embed absolute coordinate information and adds a multibranch convolutional architecture comprising horizontal, vertical, and deformable convolution branches. This design improves the localization of dense small objects under interference conditions and overcomes the limitations of fixed sampling patterns in adapting to geometric deformations.

2. Related Work

2.1. Traditional Object Detection Methods

The development of traditional object-detection methods dates back to the early 2000s. Early approaches mainly relied on handcrafted features combined with classifiers. The Viola–Jones framework [15], which utilized Haar features and an AdaBoost classifier for face detection, was limited to specific object types and struggled in complex scenes. Subsequently, methods based on sliding windows and feature extraction became mainstream. The HOG descriptor proposed by Dalal et al. [9], when combined with a support vector machine (SVM) classifier, enhanced robustness in pedestrian detection. However, its limited feature representation capabilities hindered its ability to handle multiscale objects and complex backgrounds.
In 2014, Girshick et al. introduced region-based convolutional neural networks (R-CNN) [16], marking the beginning of the deep learning era in object detection. Its core pipeline included selective search for generating region proposals, independent CNN-based feature extraction, and classification with an SVM coupled with bounding box regression. Although it significantly improved accuracy on the Pascal VOC dataset, it suffered from computational redundancy and complex training. To address these issues, Girshick proposed Fast R-CNN [17], which introduced the region of interest (ROI) pooling layer. With Fast R-CNN, the entire image could be input into a CNN to generate shared feature maps and extract features of region proposals. In addition, the multitask loss function was employed to further improve training efficiency. In 2015, Ren et al. advanced the design with Faster R-CNN [18], replacing selective search with a region proposal network (RPN). Utilizing an anchor-box mechanism to directly generate region proposals, it enabled joint optimization of proposal generation and object detection. This end-to-end architecture set the standard for two-stage detection algorithms, although its computational complexity still limited real-time application. To improve detection precision and address sample imbalance, Cai et al. proposed Cascade R-CNN [19] in 2018. This method employed a multistage cascade of detectors to progressively increase the intersection-over-union (IoU) thresholds for candidate boxes and ground-truth objects. The iterative optimization reduced false-positive rates and achieved outstanding performance in complex backgrounds and densely packed small-object scenes. It proved particularly effective in detecting small objects such as vehicles and ships in remote-sensing images [8]. However, the inherent dependency on region proposal generation in two-stage methods limited their real-time applicability, prompting researchers to shift toward single-stage detection algorithms.
In 2016, Liu et al. proposed the single-shot multibox detector (SSD) [20], a representative single-stage detection algorithm. By predefining multiscale anchor boxes on feature maps at various levels, it directly predicted object classes and bounding box offsets. Using VGG16 as the backbone network and incorporating a multiscale feature fusion strategy, SSD achieved a balance between detection speed and recall rate for small objects. Nevertheless, its performance in detecting objects at extreme scales was constrained by the depth of the feature pyramid. To improve feature transmission and gradient flow, Huang et al. introduced DenseNet [21], where each layer was directly connected to all subsequent layers. This dense connectivity enhanced feature reuse and alleviated the vanishing gradients in deep networks, providing detailed features for small-object detection in high-resolution remote-sensing images. However, its high memory consumption limited its application in resource-constrained environments.

2.2. YOLO Series Object Detection Methods

The YOLO series, as a canonical representative of single-stage object-detection algorithms, has consistently maintained a technological edge in real-time detection through its efficient end-to-end architecture. In 2016, Redmon et al. introduced YOLOv1 [22], which reformulated object detection as a regression task. By dividing the input image into an S × S grid and directly predicting bounding box coordinates, confidence scores, and class probabilities, it achieved a breakthrough in detection speed. However, this version exhibited poor performance in detecting dense small objects and suffered from limited localization precision due to coarse grid partitioning. To overcome these limitations, YOLOv2 (YOLO9000) [23], released in 2017, introduced the anchor box mechanism. It utilized predefined multiscale priors to enhance multiscale object detection, adopted batch normalization and a multiscale training strategy, and achieved a favorable balance between accuracy and speed on the Pascal VOC and COCO datasets. In 2018, YOLOv3 [24] improved upon its predecessors by employing the enhanced Darknet-53 backbone with residual connections to mitigate vanishing gradients. It also integrated an FPN to facilitate cross-level feature fusion, increasing the average precision (AP) for small-object detection by approximately 15%.
Released in 2020, YOLOv4 [25] adopted the CSPDarknet53 backbone to reduce computational redundancy, introduced the Mish activation function to enhance nonlinear expressiveness, and utilized the CIoU loss function to refine bounding box regression accuracy. Around the same time, YOLOv5 [26] emerged with adaptive anchor-box computation and hybrid data augmentation strategies and demonstrated greater deployment flexibility in industrial applications. Its modular design supported multiple model sizes suitable for various hardware configurations. In 2021, YOLOX innovatively removed the anchor-box design and employed a decoupled detection head along with the SimOTA dynamic label assignment strategy, improving the detection precision of small objects on the COCO dataset. In 2022, YOLOv6 [27] implemented a RepVGG-style re-parameterization design and channel distillation technique, which reduced computational costs by 30% while maintaining accuracy. YOLOv7 [28] integrated dynamic label assignment with a cascaded feature scaling strategy. Its extended version, YOLOv7-X, achieved 46.8% mAP on the VisDrone2019 remote-sensing dataset, representing a 7.3% improvement over YOLOv5. In 2023, YOLOv8 [29] introduced the task-aligned assigner (TAA) and dynamic non-maximum suppression (NMS), excelling in dense object scenes. YOLOv9 [30] improved feature propagation paths using programmable gradient information (PGI) and enhanced multiscale feature interaction with the general efficient layer aggregation network (GELAN), effectively mitigating semantic degradation in deep networks. YOLOv10 [31], released in 2024, adopted a training paradigm without NMS. By applying a consistency loss function to suppress redundant prediction boxes directly, it significantly reduced the false-positive rate in remote sensing tasks.
In this study, YOLOv8 was selected as the baseline detection model for three primary reasons. First, it achieves an optimal balance between accuracy and efficiency on remote-sensing datasets. As shown in our baseline experiments (Section 4.4.1), YOLOv8 achieved a strong mAP50 performance of 75.0% on the DIOR dataset while maintaining a relatively low computational cost of 8.2 GFLOPs. Second, YOLOv8 provides a modular and extensible architecture that facilitates seamless integration of the three proposed modules: DyCoMF-Arch, MWA-Net, and SDDE-Module. This structural flexibility enables targeted enhancements in feature fusion, scale sensitivity, and spatial adaptability. Given these advantages, YOLOv8 serves as a practical and technically sound foundation for further enhancement in the remote-sensing object-detection domain.

2.3. Application of YOLO Series Algorithms in Remote-Sensing Image Detection

In remote-sensing image object detection, detecting small and multiscale objects presents multiple technical challenges, which mainly stem from limitations associated with object properties, imaging conditions, and algorithmic design. Small objects typically occupy only a minuscule proportion of image pixels. Their sparse feature information is easily overwhelmed by complex background noise [32]. In remote-sensing images, vehicles or small buildings often appear as blurry patches due to resolution constraints, lacking prominent texture or shape features [33]. Traditional CNNs extract semantic information through layer-by-layer down-sampling, but this process leads to the loss of small-object details. This issue is particularly prominent in deep networks, where low-level features of objects are gradually diluted, making it difficult for detectors to distinguish objects from the background [34].
Complex background interference is another key challenge for small-object detection. Natural scene elements, such as clouds, vegetation, and shadows, share similar color and texture characteristics with small objects, leading to rising false-positive rates [35]. Koyun et al. [36] demonstrated that in UAV images over farmland or forested areas, local textures of crops or trees may be misidentified as small man-made objects. Wu et al. [37] proposed the AAPW-YOLO network, which enhances the YOLOv8 backbone network by replacing standard convolution with AKConv to improve small-object feature extraction. In addition, they employed the Wise-IoU gradient enhancement strategy to balance the gradient contributions of anchor boxes of varying quality, thereby improving regression precision and reducing false-positive and false-negative rates caused by small object size. Ji et al. [38] proposed the YOLO-TLA network, which integrates a small-object detection layer into the neck network of YOLOv5 to improve small-object detection precision. Yin et al. [39] redefined the channel partitioning strategy in CSPNet and integrated it into the neck network of YOLOv4. By combining a bidirectional multiscale feature weighting mechanism, they significantly improved small-object detection performance. However, the method underperforms in rotated object detection.
The considerable scale variation among multiscale objects further complicates detection. Remote-sensing images cover extensive areas. The same scene may simultaneously contain buildings spanning several meters and sub-meter vehicles. As a result, object sizes may cover multiple orders of magnitude. Although FPNs enhance scale adaptability by fusing features across layers, their top-down feature propagation mechanism may cause the loss of small-object information from shallow features during fusion. Liu et al. [40] determined that when fusing high-resolution shallow features, FPNs are susceptible to strong responses from large objects, which weakens the feature representation of small objects. Furthermore, the dense distribution of objects in remote-sensing images exacerbates the complexity of multiscale detection. Occlusion and overlap among adjacent objects interfere with the detector’s localization and classification accuracy. To resolve the issue, Zhao et al. [41] proposed MS-YOLOv7, which introduces a fourth detection head based on YOLOv7 to extract features at various scales and enhance multiscale object-detection capability. However, this approach significantly increases the number of model parameters. Li et al. [42] proposed YOLO-DRS, which integrates a lightweight multiscale module (LEC) to fuse multiscale feature information and improve the model’s ability to extract and recognize multiscale objects.
These studies have contributed various improvements to YOLO-based object detection schemes. However, challenges remain in the algorithm improvement process, including insufficient preservation and utilization of spatial information, rigid feature fusion strategies, limited feature detection capabilities, and increased model complexity. To overcome these limitations, we propose HAF-YOLO in this study.

3. Methodology

To address the challenges identified in remote-sensing satellite image detection, we propose HAF-YOLO, based on the YOLOv8 architecture. This framework establishes a collaborative mechanism for multiscale feature capture, cross-level information fusion, and adaptive feature enhancement to improve multiscale feature representation and small-object detection capabilities. The overall architecture of the proposed method is illustrated in Figure 1. First, we designed DyCoMF-Arch to enhance responses to salient objects. This module performs multilevel spatial compression to capture cross-scale remote-sensing features and integrates shallow textures and deep semantics using BiFPN-Concat with weighted fusion. Then, to overcome the difficulty the C2f module in YOLOv8 has with small-object detection in remote-sensing scenes, we replaced it with MWA-Net, which constructs multiple feature extraction paths and dynamically fuses them to strengthen the representation of small objects in complex backgrounds. Finally, we introduced the SDDE-Module, positioned immediately before the detection head, which integrates multiple perception mechanisms to enhance the detector’s ability to identify objects in remote-sensing images.

3.1. DyCoMF-Arch

To address the challenges of significant object scale variation and high background complexity in remote-sensing data, we designed the DyCoMF-Arch. This architecture establishes a synergistic mechanism for multiscale feature capture, cross-level information fusion, and adaptive feature enhancement, thereby improving the network’s multiscale feature representation capabilities. The core design philosophy behind this architecture is to construct a hierarchical feature pyramid using multistage spatial compression and expansion operations, combined with a dynamic weight allocation mechanism for the precise extraction of salient features in remote-sensing images.
As illustrated in Figure 1, during the feature extraction phase, the network first constructs a multigranularity feature space through three stages of progressive sampling. Average pooling operations with strides of 8, 4, and 2 are applied to down-sample image layers of varying resolutions from the backbone network. Large-stride pooling captures broad contextual information, medium-stride pooling extracts regional structural features, and small-stride pooling preserves fine-grained local details. Concurrently, features from the spatial pyramid pooling fast (SPPF) layer are upsampled by a factor of two to align high-level semantic information with spatial details, forming a multiresolution feature map covering multiple scales. Cross-level feature interaction is achieved using BiFPN-Concat, which performs weighted concatenation of five heterogeneous feature pathways and dynamically adjusts the contribution weights of each scale. Compared to conventional FPNs that propagate information unidirectionally, this design enables shallow spatial details to retroactively refine deep semantic features while simultaneously guiding high-level semantics to enhance low-level representations. After feature interaction, a 1 × 1 convolution is applied to compress and rearrange information along the channel dimension, eliminating redundancies introduced by multiscale concatenation. Subsequently, we introduced the HiGA-Module of our method to further enhance the discriminative power of features. Figure 2 illustrates the network architecture of the HiGA-Module.
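Before turning to the HiGA-Module, the multigranularity sampling and weighted concatenation described above can be illustrated with a minimal PyTorch sketch. The pooling and upsampling factors, the number of fused paths, and the learnable-weight form of BiFPN-Concat shown here are simplifying assumptions chosen so that the tensor shapes align; they do not reproduce the exact configuration of DyCoMF-Arch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiFPNConcat(nn.Module):
    """Weighted concatenation: each incoming path receives a learnable,
    normalized non-negative weight before channel-wise concatenation."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)                 # normalized contribution weights
        return torch.cat([wi * f for wi, f in zip(w, feats)], dim=1)

class MultiGranularityFusion(nn.Module):
    """Toy version of the DyCoMF-Arch aggregation step: pooled backbone
    levels plus an upsampled SPPF feature are fused by BiFPN-Concat and
    compressed with a 1x1 convolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fuse = BiFPNConcat(num_inputs=4)
        self.proj = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, p2, p3, p4, sppf):
        # Factors below simply bring every path to p4's resolution in this toy
        # setting (p2/p3/p4 at strides 4/8/16, SPPF at stride 32); the paper's
        # exact stride configuration differs.
        f2 = F.avg_pool2d(p2, kernel_size=4, stride=4)            # fine local details
        f3 = F.avg_pool2d(p3, kernel_size=2, stride=2)            # regional structure
        f5 = F.interpolate(sppf, scale_factor=2, mode="nearest")  # high-level semantics
        return self.proj(self.fuse([f2, f3, p4, f5]))             # 1x1 conv removes redundancy
```

In the full architecture, the fused and compressed map is then further refined by the HiGA-Module described next.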
The core innovation of the HiGA-Module lies in establishing a synergistic optimization mechanism across both the channel and spatial dimensions, enabling balanced extraction of deep semantic and fine-grained spatial features in remote-sensing images. The module adopts a dual-path parallel architecture. In the main path, a channel compression module reduces dimensionality. This operation can be mathematically defined as
$$C = F_{\mathrm{compress}}(C_1) = \sigma\left(\mathrm{Conv}_{1 \times 1}(C_1)\right),$$
where $C_1 \in \mathbb{R}^{H \times W \times c_1}$ is the input feature map and $\sigma$ is the Hard–Swish activation function. This design compresses the channel dimension to one-quarter of the original input, reducing the computational complexity of subsequent operations. In contrast to traditional squeeze-and-excitation (SE) modules that reweight channels, this method utilizes linear dimensionality reduction in combination with a non-saturating activation function, thus preserving key feature information while mitigating gradient vanishing issues. It is suitable for handling high-dimensional feature spaces in high-resolution remote-sensing images. For spatial feature extraction, the module employs a decoupled depth-wise separable convolution architecture as follows:
$$S = \mathrm{Conv}_{1 \times 1}\left(\mathrm{DWConv}_{3 \times 3}(C)\right).$$
This module first uses grouped depth-wise convolution to extract spatial features, followed by 1 × 1 convolution to expand channels. The design enables localized modeling of spatial correlations by depth-wise convolution and global integration of channel information by pointwise convolution. It maintains a large receptive field while enhancing the network’s ability to represent geometric objects in remote-sensing images, such as building edges and road networks that have strong directional characteristics.
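A minimal sketch of this main path, assuming the 4× channel compression stated above and omitting normalization layers, is given below; module and variable names are illustrative.

```python
import torch.nn as nn

class HiGAMainPath(nn.Module):
    """Main path of the HiGA-Module as described above:
    C = sigma(Conv1x1(C1)) followed by S = Conv1x1(DWConv3x3(C))."""
    def __init__(self, c1):
        super().__init__()
        c = max(c1 // 4, 1)                              # compress channels to 1/4
        self.compress = nn.Sequential(
            nn.Conv2d(c1, c, kernel_size=1, bias=False),
            nn.Hardswish(),                              # non-saturating activation
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c, bias=False),  # DWConv 3x3
            nn.Conv2d(c, c1, kernel_size=1, bias=False),                      # point-wise expansion
        )

    def forward(self, x):
        return self.spatial(self.compress(x))
```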
To address the prevalent issues of significant intra-class variance and strong background interference in remote-sensing object detection, we further propose chunked gated adaptive attention (CGAA) based on the multihead self-attention (MHSA) framework [43]. This module employs a chunk-wise serial processing strategy to decompose the input feature x B × N × C ( N = H × W ) into m sub-blocks for parallel computation. Each sub-block corresponds to localized-global fusion across different spatial scales. The attention weight matrix is computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V,$$
where $Q, K, V \in \mathbb{R}^{m \times n \times d}$ (with $n = N/m$) are the query, key, and value matrices after chunking, respectively. This chunk-wise computation establishes a hierarchical attention field across scales, allowing small-scale objects to obtain stronger feature activation within localized chunks, whereas large-scale objects capture long-range dependencies through inter-chunk interactions. The final output is combined with the input by residual connections, ensuring training stability. Compared with existing feature fusion structures in the YOLO series, such as the PANet used in YOLOv7, DyCoMF-Arch introduces a more novel approach by constructing a multilevel spatial compression and expansion mechanism to dynamically establish a multiscale contextual representation. Furthermore, it innovatively incorporates BiFPN-Concat into remote-sensing object-detection tasks to achieve bidirectional feature stream interaction. This approach addresses the limitation of traditional single-channel or single-dimensional spatial attention mechanisms, allowing the network to retain critical edge information in densely distributed object regions while effectively suppressing background interference.
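Returning to the chunk-wise attention defined above, a minimal single-head sketch is given below; the gating branch of CGAA, the multihead layout, and explicit inter-chunk interaction are omitted, and the projection layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChunkedSelfAttention(nn.Module):
    """Chunk-wise scaled dot-product attention with a residual connection:
    the N = H*W tokens are split into m sub-blocks and attention is
    computed independently inside each block."""
    def __init__(self, dim, num_chunks=4):
        super().__init__()
        self.m = num_chunks
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, N, C)
        B, N, C = x.shape
        n = N // self.m                                    # tokens per chunk (assumes N % m == 0)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.reshape(B * self.m, n, C)
        k = k.reshape(B * self.m, n, C)
        v = v.reshape(B * self.m, n, C)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (C ** 0.5), dim=-1)
        out = (attn @ v).reshape(B, N, C)
        return x + self.proj(out)                          # residual keeps training stable
```

In this simplified form, each token attends only within its own chunk; the inter-chunk interactions that supply long-range context in CGAA are not modeled here.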

3.2. MWA-Net

The conventional C2f module employs a serially stacked bottleneck architecture for feature transformation. However, its single-path design tends to cause the progressive attenuation of shallow detail features during the propagation toward deeper layers. In remote-sensing images, small objects occupy a limited number of pixels and feature high texture complexity. As a result, conventional convolution with down-sampling operations exacerbates the loss of high-frequency information, which is manifested as blurred object boundaries and localization inaccuracies. We propose the MWA-Net to address these limitations of the C2f module in YOLOv8, namely, the loss of fine-grained details, inadequate multiscale modeling, and weakened semantic information for small objects. MWA-Net was specifically designed to enhance the representation of small objects under complex backgrounds by introducing mechanisms for high-frequency detail preservation, multigranularity feature interaction paths, and dynamic weighted fusion strategies. The overall architecture is illustrated in Figure 3.
The frequency-domain decomposition branch replaces conventional convolutional layers with wavelet-transform-based convolution (WTConv2d) [44], decomposing the input feature maps into low-frequency approximation and high-frequency detail components. This process can be formulated as a convolutional implementation of the two-dimensional discrete wavelet transform:
$$\left[\, WT(X)_{LL},\; WT(X)_{LH},\; WT(X)_{HL},\; WT(X)_{HH} \,\right] = W\left(X \ast \psi_{db1}\right),$$
where $\psi_{db1}$ is the Daubechies (db1) wavelet basis function. The LL component retains the macroscopic structural features of the object, and the LH, HL, and HH components capture vertical, horizontal, and diagonal high-frequency details, respectively. This explicit frequency-domain separation strategy allows the network to preserve edge oscillation characteristics of small objects at shallow layers, effectively mitigating the high-frequency information loss inherent to traditional convolution operations.
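For illustration, a single-level db1 (Haar) decomposition can be realized as a fixed, stride-2 depth-wise convolution. This is a generic sketch of the standard discrete wavelet transform, not the WTConv2d implementation of [44], and it assumes even spatial dimensions.

```python
import torch
import torch.nn.functional as F

def dwt_db1(x):
    """Single-level 2D db1 (Haar) wavelet decomposition of a feature map
    x of shape (B, C, H, W); returns the LL, LH, HL, HH sub-bands,
    each of shape (B, C, H/2, W/2). Assumes H and W are even."""
    B, C, H, W = x.shape
    lo = torch.tensor([1.0, 1.0]) / 2 ** 0.5              # db1 low-pass filter
    hi = torch.tensor([1.0, -1.0]) / 2 ** 0.5             # db1 high-pass filter
    bank = torch.stack([
        torch.outer(lo, lo),                              # LL: approximation
        torch.outer(hi, lo),                              # LH: directional detail
        torch.outer(lo, hi),                              # HL: directional detail
        torch.outer(hi, hi),                              # HH: diagonal detail
    ]).to(dtype=x.dtype, device=x.device)                 # (4, 2, 2)
    weight = bank.unsqueeze(1).repeat(C, 1, 1, 1)         # (4*C, 1, 2, 2), one group per channel
    y = F.conv2d(x, weight, stride=2, groups=C)           # (B, 4*C, H/2, W/2)
    y = y.reshape(B, C, 4, H // 2, W // 2)
    return y[:, :, 0], y[:, :, 1], y[:, :, 2], y[:, :, 3]
```

The high-frequency sub-bands (LH, HL, HH) carry the edge oscillation features that MWA-Net seeks to preserve for small objects.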
The feature refinement path employs a “compression–expansion” architecture in conjunction with depth-wise separable convolutions to strike a balance between computational efficiency and representational capacity. First, a 1 × 1 convolution expands the channel dimension to $p \times 2c$, forming a high-dimensional feature space to enhance nonlinear representational power. Next, a depth-wise separable convolution (DWConv) is applied for spatial feature extraction, which reduces the computational complexity $O(d_k^{2} C_{in} H W + C_{in} C_{out} H W)$ by a factor of approximately $1/C_{out}$ compared to standard convolution, significantly reducing the processing overhead for high-resolution remote-sensing images. Finally, another 1 × 1 convolution compresses the channels back to the original dimensionality, completing the feature distillation process. This “expansion–compression” mechanism strengthens local feature interaction through a temporary high-dimensional mapping space, while DWConv retains a 3 × 3 receptive field and captures edge oscillation patterns of small objects using per-channel convolution. The process is formalized as
$$F_{\mathrm{refine}}(X) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{DWConv}\left(\mathrm{Conv}_{1 \times 1}(X)\right)\right).$$
The nonlinear expansion path consists of multiple bottleneck modules, each constructed with dual 3 × 3 convolutional residual structures. By cascading multiple bottleneck blocks, the network constructs a deep nonlinear mapping space, thereby enhancing its discriminative ability in scenes with both complex backgrounds and co-existing small objects. The operation is defined by
$$F_{\mathrm{bottleneck}}(X) = X + \mathrm{Conv}_{3 \times 3}\left(\mathrm{Conv}_{3 \times 3}(X)\right).$$
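The refinement and bottleneck paths above can be expressed as compact PyTorch blocks such as the following; the expansion factor and the omission of normalization and activation layers are assumptions.

```python
import torch.nn as nn

class RefinePath(nn.Module):
    """'Expand-compress' refinement path: 1x1 expansion, 3x3 depth-wise
    convolution, then 1x1 compression back to the original channel count."""
    def __init__(self, c, expand=2):
        super().__init__()
        h = c * expand
        self.block = nn.Sequential(
            nn.Conv2d(c, h, kernel_size=1, bias=False),
            nn.Conv2d(h, h, kernel_size=3, padding=1, groups=h, bias=False),  # DWConv, 3x3 receptive field
            nn.Conv2d(h, c, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return self.block(x)

class ResidualBottleneck(nn.Module):
    """Nonlinear expansion path unit: F(x) = x + Conv3x3(Conv3x3(x))."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
            nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.conv(x)
```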
Features from all three paths are fused using BiFPN-Concat. During feature concatenation, the outputs from the WTConv decomposition (y01, y02), the feature refinement path (y1), the original feature partitions (y2, y3), and the n feature tensors from the bottleneck expansion are combined into a total of 5 + n feature tensors. Unlike equal-weighted concatenation (Concat), this mechanism utilizes gradient competition during backpropagation to autonomously strengthen the fusion weights of high-frequency detail features while suppressing interference from redundant background features. In the frequency-domain decomposition path, the high-frequency components generated by the wavelet transform carry edge oscillation information of small objects. Meanwhile, the original feature partition path retains more spatial contextual information. Using a cross-domain correlation model, the dynamic weighting strategy adaptively adjusts the contribution ratios of features across different frequency bands. This design overcomes the limitations of fixed-weight fusion where high-frequency features are prone to smoothing effects in deep networks, thereby providing more stable object-detection performance under complex backgrounds.
In contrast to methods such as AAPW-YOLO, which replaces standard convolution with AKConv to improve edge extraction, MWA-Net utilizes WTConv2d to perform explicit decomposition of feature maps into high- and low-frequency components, thus preserving the texture boundaries of small objects. Meanwhile, its dynamic weight allocation strategy mitigates the over-processing of high-frequency information typically encountered in Bi-FPN-based multifrequency fusion. The traditional C2f module suffers from limited capacity in multipath modeling and fine-grained semantic retention; MWA-Net addresses this issue with a three-path structure that facilitates comprehensive feature extraction across different representation levels.

3.3. SDDE-Module

We also propose the SDDE-Module to address the limitations of the YOLOv8 detection head in small-object detection within remote-sensing images, such as its insufficient spatial sensitivity and limited adaptability to deformation features. This module enhances the model’s adaptability to the spatial distribution of small objects by integrating spatial coordinate encoding, multidirectional convolutional feature extraction, and an adaptive deformation-aware mechanism. The architecture is illustrated in Figure 4. The detection head in YOLOv8 lacks explicit encoding of absolute coordinate information in remote sensing scenes, which makes the localization of dense small objects susceptible to interference. Furthermore, its fixed sampling patterns are inadequate for adapting to the geometric deformations inherent in object imaging. This module systematically addresses these limitations through a hierarchical feature enhancement strategy.
In remote-sensing images, the spatial distribution characteristics of objects are critical for detection tasks. Traditional convolution operations lack explicit positional encoding. As a result, the spatial information of small objects is diluted in deep networks. To address this issue, the SDDE-Module introduces a coordinate enhancement layer, whose structure is shown in Figure 5. This layer embeds absolute coordinate information into the feature map using a normalized grid generation algorithm. Specifically, given an input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, a normalized coordinate grid is first generated as follows:
$$x_{\mathrm{norm}} = \frac{2x}{W - 1} - 1, \quad x \in [0, W - 1]; \qquad y_{\mathrm{norm}} = \frac{2y}{H - 1} - 1, \quad y \in [0, H - 1].$$
Then, the polar radius $r = \sqrt{x_{\mathrm{norm}}^{2} + y_{\mathrm{norm}}^{2}}$ and azimuthal angle $\theta = \arctan(y_{\mathrm{norm}} / x_{\mathrm{norm}})$ are computed and concatenated with the original input to form the enhanced feature $X_{\mathrm{enhanced}} = \mathrm{Concat}(X, x_{\mathrm{norm}}, y_{\mathrm{norm}}, r, \theta)$. This design is well-suited for detecting densely arranged objects in remote-sensing images, such as windmills and ships: because these objects follow highly regular spatial distributions, the embedded coordinate priors can be exploited to improve localization accuracy, making the layer particularly effective for dense small objects.
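A minimal sketch of this coordinate enhancement step is given below; torch.atan2 is used in place of a plain arctangent for numerical robustness, and the channel ordering of the appended maps is an assumption.

```python
import torch

def coordinate_enhance(x):
    """Append normalized x/y grids plus polar radius and azimuth to a
    feature map x of shape (B, C, H, W), returning (B, C + 4, H, W)."""
    B, _, H, W = x.shape
    ys = torch.linspace(-1.0, 1.0, H, device=x.device, dtype=x.dtype)   # 2y/(H-1) - 1
    xs = torch.linspace(-1.0, 1.0, W, device=x.device, dtype=x.dtype)   # 2x/(W-1) - 1
    y_norm, x_norm = torch.meshgrid(ys, xs, indexing="ij")
    r = torch.sqrt(x_norm ** 2 + y_norm ** 2)                           # polar radius
    theta = torch.atan2(y_norm, x_norm)                                 # azimuthal angle
    coords = torch.stack([x_norm, y_norm, r, theta]).unsqueeze(0)       # (1, 4, H, W)
    return torch.cat([x, coords.expand(B, -1, -1, -1)], dim=1)
```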
To address the diversity of object shapes and interference from complex backgrounds, the SDDE-Module adopts a multibranch convolutional architecture. This targeted design enables the extraction of feature information from various directions and dimensions, thereby improving the model’s object recognition capabilities. The two branches in the architecture use (1 × 3) and (3 × 1) convolution kernels to extract directional features along the horizontal and vertical axes, enhancing the discriminability of spatial features. Their mathematical formulas are given as follows:
$$F_{\mathrm{horizontal}} = \mathrm{Conv}_{1 \times 3}(X_{\mathrm{enhanced}}),$$
$$F_{\mathrm{vertical}} = \mathrm{Conv}_{3 \times 1}(X_{\mathrm{enhanced}}),$$
where $F_{\mathrm{horizontal}}$ and $F_{\mathrm{vertical}}$ are the features extracted in the horizontal and vertical directions, respectively; $\mathrm{Conv}_{1 \times 3}$ and $\mathrm{Conv}_{3 \times 1}$ are the convolution operations with (1 × 3) and (3 × 1) kernels, respectively; and $X_{\mathrm{enhanced}}$ is the input feature map.
Another deformable convolution branch uses a two-layer convolutional network to effectively predict the sampling offset $\Delta p \in \mathbb{R}^{2k^{2}}$ and modulation factor $m \in [0, 1]^{k^{2}}$. This mechanism dynamically adjusts the receptive field according to the object’s shape and location, allowing for more accurate feature capture. Specifically, the branch first samples features at initial locations using a conventional kernel and then predicts the offset through the two-layer convolutional network. Afterwards, the sampling points are moved to better locations to dynamically adjust the receptive field:
$$F_{\mathrm{deform}} = \sum_{k} m_k \, X\left(p_0 + p_k + \Delta p_k\right),$$
where $p_k$ is the original sampling position of the standard convolution kernel and $\Delta p_k$ is the predicted offset. By dynamically adjusting the receptive field, this structure effectively captures irregular object contours, enhancing the model’s capability to represent deformed objects. Finally, the outputs from the multiple branches are fused through element-wise summation:
$$F_{\mathrm{out}} = \mathrm{SiLU}\left(\mathrm{BN}\left(F_{\mathrm{horizontal}} + F_{\mathrm{vertical}} + F_{\mathrm{deform}}\right)\right).$$
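The three-branch fusion can be sketched with torchvision's modulated deformable convolution as follows; the channel counts, the offset-predictor design, and the absence of per-branch normalization are assumptions rather than the module's exact configuration.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MultiBranchSDDE(nn.Module):
    """Horizontal (1x3) and vertical (3x1) directional branches plus a
    modulated deformable 3x3 branch, fused by element-wise summation
    followed by BatchNorm and SiLU."""
    def __init__(self, c):
        super().__init__()
        self.h_branch = nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1), bias=False)
        self.v_branch = nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0), bias=False)
        # For a 3x3 kernel: 2*9 = 18 offset channels and 9 modulation channels.
        self.offset_mask = nn.Sequential(
            nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
            nn.Conv2d(c, 27, kernel_size=3, padding=1),
        )
        self.deform = DeformConv2d(c, c, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c)
        self.act = nn.SiLU()

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = om[:, :18], om[:, 18:].sigmoid()     # offsets and modulation in [0, 1]
        fused = self.h_branch(x) + self.v_branch(x) + self.deform(x, offset, mask)
        return self.act(self.bn(fused))
```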
Unlike YOLO-TLA, which enhances small-object detection by simply adding a shallow detection head, the SDDE-Module introduces a more sophisticated design by jointly optimizing coordinate encoding and deformable convolutions. This is further augmented by a multibranch directional convolution mechanism, resulting in a unified structure that integrates spatial awareness and morphological flexibility. Unlike YOLO-DRS, which relies on lightweight attention mechanisms, the SDDE-Module embeds polar coordinate priors through the coordinate enhancement layer to impose explicit spatial constraints. This approach significantly improves localization precision in scenes with densely aligned objects. Additionally, the three-branch convolutional design—comprising horizontal, vertical, and deformable kernels—enables precise modeling of nonlinear deformations caused by viewpoint and angle variations, capturing detailed object structures more effectively.

4. Experiments and Results

4.1. Experimental Environment Configuration

We utilized PyTorch as the deep-learning framework and adopted YOLOv8n as the baseline YOLOv8 model for training. All experiments were conducted on a 64-bit Windows 10 system powered by an Intel Core i9 CPU and an NVIDIA GeForce RTX 3090 GPU. The implementation used Python 3.8 with PyTorch 2.0.0 and CUDA 11.6.
In this study, we adopted the YOLOv8 architecture and implemented it using the Ultralytics framework. During model training, the stochastic gradient descent (SGD) optimizer was utilized with an initial learning rate of 0.01, a cosine annealing learning rate schedule, and a weight decay of 0.0005. To improve training stability, a three-epoch warm-up phase was employed, during which the momentum was set to 0.8, and the initial bias learning rate was set to 0.1.
For the loss function, the default configuration was used, which includes the bounding box regression loss (box loss), classification loss (cls loss), and distribution focal loss (DFL). In addition, to enhance the diversity of the training data, various data augmentation techniques were applied, including hue variation, saturation adjustment, brightness variation, random translation, scaling, horizontal flipping, mosaic augmentation, and random erasing, to improve the robustness of the model. Mosaic augmentation was disabled after the first 10 epochs to achieve more stable convergence outcomes.
Table 1 summarizes the detailed training environment, and Table 2 presents the training parameters.
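For reference, the training setup described above maps roughly onto the Ultralytics training interface as sketched below. Values stated in the text are reproduced, whereas the dataset configuration path, image size, batch size, and epoch count are placeholders; note that the close_mosaic argument disables mosaic for the final epochs, so the exact way mosaic augmentation was switched off here is an assumption.

```python
from ultralytics import YOLO

# Hedged sketch of the training configuration; placeholder values are marked.
model = YOLO("yolov8n.yaml")
model.train(
    data="DIOR.yaml",            # hypothetical dataset config file
    optimizer="SGD",
    lr0=0.01,                    # initial learning rate
    cos_lr=True,                 # cosine annealing schedule
    weight_decay=0.0005,
    warmup_epochs=3,
    warmup_momentum=0.8,
    warmup_bias_lr=0.1,
    close_mosaic=10,             # Ultralytics disables mosaic for the final N epochs
    imgsz=640,                   # assumed
    batch=16,                    # assumed
    epochs=200,                  # assumed
)
```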

4.2. Experimental Datasets

To comprehensively evaluate the recognition capability of the proposed network across datasets of different scales, we selected the NWPU VHR-10 [45] dataset and the DIOR [46] dataset as our experimental benchmarks. The class labels for both datasets are illustrated in Figure 6.
The NWPU VHR-10 dataset, constructed by Northwestern Polytechnical University, is a medium-scale, high-resolution remote-sensing object-detection dataset comprising 800 optical remote-sensing images. Among them, 650 images form the “positive set” containing objects, while the remaining 150 images constitute the “negative set” with no objects. The dataset covers 10 classes of geospatial objects, such as airplanes, ships, storage tanks, stadiums, and transportation infrastructure, totaling 3651 instances. All objects were manually annotated using horizontal bounding boxes. The annotations were formatted as the coordinates of the top-left and bottom-right corners along with the corresponding class label. This dataset is particularly suitable for evaluating small-object detection performance and algorithm robustness.
The DIOR dataset is a large-scale optical remote-sensing object-detection dataset containing 23,463 images of 800 × 800 pixels and 192,472 instances. It covers 20 object classes, including airplanes, airports, bridges, and windmills, encompassing both transportation infrastructure and complex scenes. Data were collected from over 80 regions worldwide, capturing a wide variety of seasons, weather conditions, and imaging settings. The spatial resolution ranges from 0.5 m to 30 m, offering pronounced intra-class variance and inter-class similarity. Its large-scale diversity makes it a key benchmark for assessing deep-learning model performance in complex remote-sensing scenes.

4.3. Evaluation Metrics

We employed precision, recall, average precision (AP), and mean average precision (mAP) as the metrics to evaluate model performance. Model scale was quantified using the number of parameters (Para) and computational cost measured in GFLOPs. The mAP metric was further divided into mAP50(%) and mAP50–95(%): mAP50 was computed at an IoU threshold of 0.5 to reflect general detection capability, and mAP50–95 averaged AP values over multiple IoU thresholds from 0.5 to 0.95, offering a more rigorous assessment of model performance. These metrics are defined in terms of the following quantities: true positive (TP) refers to correctly predicted positive samples; false positive (FP) stands for negative samples incorrectly classified as positive; true negative (TN) represents correctly predicted negative samples; false negative (FN) denotes positive samples incorrectly classified as negative. The formulas for precision and recall are
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
Average precision (AP) is calculated as the area under the precision–recall curve (AUC) for a single class. Mean average precision (mAP) is the average of AP values across all classes, computed as
$$\mathrm{AP} = \int_{0}^{1} P(R)\, \mathrm{d}R,$$
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i.$$
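As a concrete illustration of these formulas, per-class AP can be computed numerically from a precision–recall curve and then averaged across classes. The monotonic precision envelope used below is a common convention; evaluation toolkits differ in interpolation details.

```python
import numpy as np

def average_precision(recall, precision):
    """AP for one class: area under the precision-recall curve,
    after enforcing a non-increasing precision envelope."""
    r = np.concatenate(([0.0], np.asarray(recall), [1.0]))
    p = np.concatenate(([1.0], np.asarray(precision), [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # precision envelope
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))    # integral of P(R) dR

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```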

4.4. Experimental Results and Analysis

4.4.1. Ablation Experiments

To verify the effectiveness of the three proposed modules (DyCoMF-Arch, MWA-Net, and SDDE-Module), ablation experiments were conducted using both the DIOR and NWPU VHR-10 datasets. The baseline YOLOv8 model was taken as the reference framework, and different combinations of the proposed modules were integrated into the network to investigate their mutual effects. The results of the ablation experiments on the DIOR and NWPU VHR-10 datasets are presented in Table 3 and Table 4, respectively. A check mark (√) indicates that the corresponding module was used, whereas a cross (×) denotes that it was not used.
According to the ablation results presented in Table 3 and Table 4, the proposed DyCoMF-Arch, MWA-Net, and SDDE-Module delivered significant performance gains in remote-sensing object-detection tasks and displayed synergistic optimization effects when combined. On the DIOR dataset, the baseline YOLOv8 model achieved an mAP50 of 75.0%. After the introduction of the DyCoMF-Arch module, the mAP50 increased to 76.2% (+1.2%) with only a 0.3 M parameter increment (from 3.0 M to 3.3 M), demonstrating high parameter efficiency. The MWA-Net module, utilizing wavelet-transform-based convolution to preserve high-frequency detail features, achieved an mAP50 of 76.1% (+1.1%) with a 0.7 M parameter increment (3.0 M to 3.7 M). The SDDE-Module enhanced spatial adaptability through a multibranch deformable convolution structure and reached 76.3% mAP50 (+1.3%) with just a 0.2 M increase in parameters (3.0 M to 3.2 M), indicating strong adaptability to geometric deformations. When two modules were combined, the joint integration of DyCoMF-Arch and SDDE-Module achieved an mAP50 of 77.5% (+2.5%) on the DIOR dataset, surpassing the linear summation of their individual contributions. This result suggests complementarity between multiscale feature fusion and deformation-aware mechanisms. When all three modules were jointly employed, the model attained mAP50 values of 78.1% and 85.0% on the DIOR and NWPU datasets, respectively, corresponding to improvements of 3.1% and 4.8% over the baseline. The associated computational cost increased from 8.2 GFLOPs to 11.8 GFLOPs (+44%). These results reflect a good balance between performance gains and computational cost.
Notably, different module combinations exhibited dataset-specific behaviors. On the DIOR dataset, the SDDE-Module alone yielded the highest individual performance boost (+1.3% mAP50). In contrast, on the NWPU dataset, the combined DyCoMF-Arch and SDDE-Module configuration achieved a 3.6% improvement in mAP50 (from 80.2% to 83.8%), indicating a greater demand for multiscale feature interaction under complex background conditions. Furthermore, the combination of MWA-Net and SDDE-Module achieved 77.2% mAP50 on DIOR (+2.2%) while requiring fewer additional parameters than the sum of each module individually, suggesting parameter sharing and joint optimization between modules. Overall, the architectural design of the proposed network delivered significant improvements in remote-sensing object-detection performance at an acceptable computational cost.

4.4.2. Comparative Analysis Between HAF-YOLO and YOLOv8

The confusion matrices before and after the improvements are shown in Figure 7, where (a) and (c) represent the confusion matrices of the original YOLOv8 model on the NWPU VHR-10 and DIOR datasets, respectively, and (b) and (d) represent the confusion matrices of the improved HAF-YOLO model on the NWPU VHR-10 and DIOR datasets, respectively. In each confusion matrix, the horizontal axis denotes the ground-truth class labels, and the vertical axis represents the predicted class labels. The values within the cells indicate the proportion of predicted labels corresponding to each ground-truth class. A deeper color indicates a higher proportion, and blank cells signify a value of zero.
By comparing the confusion matrices across the two datasets, it can be observed that matrices (b) and (d), representing the improved model, had a higher concentration of correct predictions along the diagonal. The detection precision for all object classes was enhanced. Moreover, the values in the bottom row of the improved model’s confusion matrices were markedly reduced, indicating a lower false-negative rate across all objects. This comparative analysis of confusion matrices confirms that the proposed model achieved higher accuracy in detecting diverse object classes while reducing both false-negative and false-positive rates.
Figure 8 and Figure 9 illustrate the training curves for precision, recall, mAP50, and mAP50–95 of both YOLOv8 and HAF-YOLO across the NWPU VHR-10 and DIOR datasets. From the analysis of these figures, it can be observed that, because of the smaller scale of the NWPU VHR-10 dataset, both YOLOv8 and HAF-YOLO had relatively large training fluctuations. Nonetheless, HAF-YOLO consistently maintained a slightly higher overall performance trend compared to the original YOLO model. On the larger DIOR dataset, training curves were more stable, and HAF-YOLO consistently demonstrated a much better trend in all metrics compared to YOLOv8. The training curves on both datasets proved the effectiveness of HAF-YOLO.
To comprehensively evaluate the performance of the proposed HAF-YOLO architecture in remote-sensing object-detection tasks, we selected a variety of test samples from two representative remote sensing image datasets: NWPU VHR-10 and DIOR. These samples cover scenes involving small objects, multiscale objects, and densely arranged objects, thereby enabling a thorough assessment of the model’s capabilities in multiple dimensions. Figure 10, Figure 11 and Figure 12 present visual comparisons of detection results between YOLOv8 and HAF-YOLO across these datasets. In the original images, red bounding boxes indicate the ground-truth objects with precise position and class information. In the post-detection results from both networks, circular markers highlight missed or falsely detected instances, offering an intuitive and detailed basis for analyzing model detection performance.
Figure 10 demonstrates the enhanced detection performance of the improved network in scenes with densely arranged objects. Two classes of typical, densely distributed objects were selected: (a) a high-density storage tank array characterized by regular grid-like geometric patterns, and (b) a cluster of airplanes with significant intra-class variance and tight spacing between instances. Quantitative analyses suggest that both the baseline and the improved models were capable of initially identifying dense objects. However, the improved model showed a marked advantage in terms of detection confidence. This performance gain was attributed to the proposed feature separation mechanism, which introduces a spatial attention module to enhance the model’s ability to resolve the edge features of overlapping objects.
Figure 11 validates the effectiveness of HAF-YOLO across three representative small-object detection scenes: (a) low-resolution vehicle detection, where the average object size in pixels is small; (b) windmill detection, characterized by strong directional features and background texture similarity; and (c) small-ship detection in navigational channels, where the object color closely resembles the background. The baseline YOLOv8 network consistently had missed detections in all three scenes, primarily due to feature attenuation effects from conventional convolution operations. MWA-Net introduced a wavelet-domain decomposition mechanism that explicitly separated input features into four sub-bands: LL, LH, HL, and HH. Among them, high-frequency components effectively preserved edge oscillation features of small objects. Within the feature refinement path, the use of a depth-wise separable convolution architecture maintained a 3 × 3 receptive field while enabling efficient extraction of high-frequency detail features. Experimental results suggest that HAF-YOLO achieved full detection of all samples in the three small-object scenes. Its detection confidence significantly outperformed the baseline. It was demonstrated that HAF-YOLO enhanced small-object representation capabilities.
Figure 12 shows a comparison of the performance of HAF-YOLO and YOLOv8 in multiscale object-detection scenes. The test images include object groups with significant scale differences. The baseline model experienced notable performance degradation. The main reason is that the unidirectional information flow in traditional feature pyramids limits effective cross-scale feature interaction. The proposed DyCoMF-Arch constructs a hierarchical feature space through three-stage progressive sampling and leverages BiFPN-Concat for cross-level feature weighted fusion, establishing a bidirectional optimization mechanism between deep semantic features and shallow texture details. The results show that HAF-YOLO successfully detected all scale levels of objects and surpassed YOLOv8 in detection precision.
To provide a more intuitive demonstration of the proposed modules’ effectiveness, we compared activation heatmaps of the baseline and improved networks for detecting small objects, densely packed objects, and multiscale objects. As shown in Figure 13, in representative remote sensing scenes (a) and (b), involving small objects such as ships and windmills, the baseline model had high-frequency information loss at early convolutional stages, leading to weak activation of object regions and susceptibility to false activations from background textures such as road patterns. In contrast, HAF-YOLO, using the wavelet decomposition mechanism in MWA-Net, achieved frequency-space decoupling and enhancement of low-frequency contour information and high-frequency edge oscillation features, boosting average activation strength in object regions. In (c), for the densely arranged storage tank detection, the SDDE-Module’s spatial coordinate encoding effectively established priors on object spatial distributions and reinforced positional consistency through polar coordinate information. In (d), within a mixed multiscale object scene, the HiGA-Module’s chunked gated attention mechanism demonstrated its hierarchical feature selection capabilities: for large-scale building clusters, global attention fields were used to capture long-range dependencies, whereas for small-scale vehicle objects, local attention focused on key texture regions, reducing interference between features of different scales. Notably, the improved network excelled in background suppression. It substantially reduced false activations in interference-prone areas such as cloud shadows and road networks. Therefore, the effectiveness of the dynamic weight fusion strategy in eliminating feature redundancy was fully verified.
To better assess HAF-YOLO’s detection improvements across categories, Table 5 and Table 6 report the per-class Precision, Recall, AP@50, and AP@50–95 of YOLOv8 and HAF-YOLO on the DIOR and NWPU VHR-10 datasets. Table 5 shows that HAF-YOLO consistently outperformed YOLOv8, especially for small or dense targets such as vehicle, windmill, airplane, airport, and storage tank, with AP@50 gains of 2.9%, 2.1%, 3.9%, 4.1%, and 1.8%, and AP@50–95 gains of 3.2%, 3.3%, 4.3%, 7.3%, and 1.5%, respectively. Categories with complex structures (e.g., bridge, overpass, dam) also saw marked improvements, indicating stronger feature representation and localization. Table 6 further confirms HAF-YOLO’s superiority on NWPU VHR-10, with notable AP@50–95 gains of 4.0%, 6.2%, 2.7%, and 4.1% for storage tank, basketball court, bridge, and vehicle, respectively. Combining these quantitative results with the visual analysis in Figure 10, Figure 11, Figure 12 and Figure 13, we can conclude that HAF-YOLO demonstrates stronger detection capabilities in scenarios involving small object sizes, high object densities, and large scale variations, further proving its practicality in the field of remote-sensing object detection.

4.4.3. Comparative Experiments with Other Models

We compared the proposed HAF-YOLO model with several classical models in the field of object detection. Specifically, we selected representative multistage detectors, including Fast R-CNN and Faster R-CNN, along with classical single-stage detectors such as SSD, and several versions of the YOLO series: YOLOv3, YOLOv5, YOLOv7, YOLOv9, YOLOv10, YOLOv11, and YOLOv12. Specifically, state-of-the-art YOLO variants optimized for remote-sensing image object detection were selected, as discussed in Section 2.3. These include AAPW-YOLO [37] (enhancing YOLOv8), YOLO-TLA [38] (based on YOLOv5), MS-YOLOv7 [41], and YOLO-DRS [42]. All experiments were conducted under the same experimental settings. Table 7 and Table 8 present the comparative results across models on the NWPU VHR-10 and DIOR datasets, respectively, in terms of parameters, floating-point operations (FLOPs), precision, recall, and mAP metrics.
According to the comparative results listed in Table 7 and Table 8, the proposed HAF-YOLO demonstrated a significant advantage in overall performance for object detection in remote-sensing images. On the NWPU VHR-10 dataset, HAF-YOLO achieved 87.6% precision, 79.8% recall, 85.0% mAP@50, and 50.2% mAP@50:95 with only 4.3 M parameters and a computational cost of 11.8 GFLOPs, outperforming all contrastive models across all four key metrics. Compared to the most recent version in the YOLO series, YOLOv11 (9.4 M parameters, 21.6 GFLOPs), HAF-YOLO not only achieved a +0.8% improvement in mAP@50:95 but also compressed the model size by 54.3% and reduced computation by 45.8%. These results verify that the DyCoMF-Arch effectively integrates cross-scale features while reducing computational redundancy by using multistage spatial compression and a BiFPN-based concatenated weighted fusion strategy.
On the large-scale DIOR remote-sensing dataset, HAF-YOLO attained an mAP@50:95 of 55.4%, a 1.3-point improvement over YOLOv11, while boosting precision by 2.7% to 87.1%. Despite the more complex background interference and the larger number of small-object instances in this dataset, HAF-YOLO maintained a recall of 71.9%, surpassing YOLOv11 by 3.0%. The SDDE-Module, through its integration of spatial coordinate encoding and deformable convolution, significantly enhanced spatial localization for dense small objects. Notably, the traditional two-stage detector Faster R-CNN achieved an mAP@50:95 of only 45.1% on DIOR, lagging behind HAF-YOLO by 10.3%. This result supports the applicability of single-stage detection frameworks combined with hierarchical adaptive fusion strategies to remote-sensing detection tasks.
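To make the role of spatial coordinate encoding more tangible, the following is a hedged, CoordConv-style sketch of a coordinate-enhancement step: normalized x/y coordinate channels are concatenated to the feature map before convolution so that absolute position becomes available to subsequent layers. The SDDE-Module’s actual layer, and its coupling with deformable sampling, may differ in detail.

```python
# CoordConv-style coordinate enhancement (illustrative, not the SDDE-Module's exact code).
import torch
import torch.nn as nn

class CoordEnhance(nn.Module):
    """Append normalized x/y coordinate channels to a (B, C, H, W) feature map."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

feat = torch.rand(1, 64, 80, 80)
out = CoordEnhance(64, 64)(feat)   # absolute position is now visible to the kernel
```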
From a model efficiency perspective, HAF-YOLO achieves a favorable balance between parameter count and computational complexity. Compared with YOLOv5 (2.5 M parameters, 7.1 GFLOPs), HAF-YOLO improved mAP@50:95 on the NWPU VHR-10 dataset by 3.1% while using roughly 1.7 times the parameters. This gain is largely attributable to the multigranularity feature-capture mechanism of DyCoMF-Arch. Compared with the lightweight design of YOLOv9 (1.9 M parameters, 7.6 GFLOPs), HAF-YOLO improved mAP@50:95 on the DIOR dataset by 2.1%. In addition, the computational cost of HAF-YOLO was constrained to 11.8 GFLOPs, significantly lower than that of comparable models such as YOLOv10 (21.4 GFLOPs) and YOLOv12 (21.2 GFLOPs).
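The efficiency figures above can be checked directly from the parameter counts reported in Tables 7 and 8 (values in millions).

```python
# Quick arithmetic check of the efficiency claims (parameter counts from Tables 7 and 8).
print(round(4.3 / 2.5, 2))              # 1.72 -> roughly 1.7x YOLOv5's parameter count
print(round(100 * (1 - 4.3 / 9.4), 1))  # 54.3 -> parameter reduction vs. YOLOv11 (%)
print(round(100 * (4.3 / 9.4), 1))      # 45.7 -> fraction of YOLOv11's parameters used (%)
```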
Compared with YOLOv11, HAF-YOLO achieved a 1.3% increase in mAP@50:95 on the DIOR dataset while using only 45.7% of the parameters. This result validates the effectiveness of the proposed synergistic optimization mechanism in balancing model efficiency and detection accuracy. Moreover, HAF-YOLO exhibited stronger generalization in cross-dataset evaluations: the degradation rate of mAP@50:95 from NWPU VHR-10 to DIOR was only 10.2%, significantly lower than that of YOLOv8 (14.7%) and YOLOv7 (22.8%). This result suggests that the multistage spatial compression and adaptive feature enhancement strategies substantially improve model robustness to scene variations in remote sensing.
To further evaluate the applicability of the proposed model in real-world environments, we analyzed the frames-per-second (FPS) performance of HAF-YOLO relative to the other models; the relevant data are shown in Table 7 and Table 8. On the NWPU VHR-10 dataset, HAF-YOLO achieved 161 FPS, outperforming high-precision models such as YOLOv11 (128 FPS) and YOLOv12 (90 FPS) and approaching YOLO-DRS (169 FPS). On the DIOR dataset, HAF-YOLO reached 164 FPS, surpassing the inference speed of most similar models while maintaining superior detection accuracy (mAP50:95 of 55.4%). Compared with YOLOv11, whose mAP50 of 76.8% on DIOR is among the highest of the baselines but whose speed is only 132 FPS, our model runs 32 FPS faster, increases mAP50:95 by 1.3%, and uses 54.3% fewer parameters. Similarly, compared with AAPW-YOLO (158 FPS) and YOLO-TLA (195 FPS), our method delivers clearly higher detection accuracy on NWPU VHR-10 at a comparable or only moderately lower speed.
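FPS values depend on the measurement protocol. The snippet below is an assumed, illustrative protocol (single-image batches at 640 × 640 with a GPU warm-up phase); the authors’ exact timing procedure is not specified in the text.

```python
# Illustrative FPS measurement protocol (assumed, not taken from the paper).
import time
import torch

@torch.no_grad()
def measure_fps(model, img_size=640, warmup=50, iters=300, device="cuda"):
    model = model.to(device).eval()
    x = torch.rand(1, 3, img_size, img_size, device=device)
    for _ in range(warmup):                # warm-up to stabilize clocks and caches
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```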
In addition to the standard YOLO-series detectors, comparisons with specialized remote-sensing-optimized models provide deeper insight. On the NWPU VHR-10 dataset, AAPW-YOLO achieved an mAP@50 of 83.6% owing to its adaptive convolution targeting small objects; HAF-YOLO surpassed it by 1.4%. Similarly, YOLO-TLA, which adds a detection layer tailored to small objects on top of YOLOv5, reached an mAP@50 of 82.1%, still 2.9% behind HAF-YOLO. MS-YOLOv7, which introduces an additional detection head to handle multiscale objects, achieved an mAP@50 of 81.2%, but at a higher computational cost (14.5 GFLOPs) than HAF-YOLO’s 11.8 GFLOPs. YOLO-DRS, which focuses on lightweight multiscale fusion, achieved an mAP@50 of 83.6% with fewer parameters and GFLOPs than HAF-YOLO, but its mAP@50 was still 1.4% lower and its mAP@50:95 0.6% lower. On the DIOR dataset, HAF-YOLO’s advantages over the specialized models were even more pronounced: AAPW-YOLO achieved an mAP@50 of 76.8%, which HAF-YOLO exceeded by 1.3%; MS-YOLOv7 reached 75.5%, below HAF-YOLO’s 78.1%; and YOLO-DRS achieved 76.8%, also 1.3% lower than HAF-YOLO.
The comparative experimental results on the two datasets validate the effectiveness of HAF-YOLO, and the visual comparison in Figure 14 further corroborates its superiority. The test images in Figure 14 contain one large-scale bridge target and four small-scale vehicle targets. HAF-YOLO successfully detected all four small vehicle targets, whereas several comparison networks, despite reporting higher confidence scores on individual targets, failed to detect all of them. These results highlight the advantages of HAF-YOLO in detecting small-scale targets and objects spanning multiple scales.
Figure 15 presents radar plots comparing the AP@50 values of the YOLO-series models and HAF-YOLO across object categories: Figure 15a shows results for the 20 object classes of the DIOR dataset, and Figure 15b shows results for the 10 categories of the NWPU VHR-10 dataset. Different colors represent different models, as indicated in the legend at the top of each plot, enabling a clear visual comparison across object types. The radar plots show that detection performance varies with object class. For large-scale objects (e.g., bridges, dams, and airports), HAF-YOLO yields clear accuracy improvements; for small objects such as vehicles and ships, it achieves the highest AP@50 values, confirming its enhanced small-object detection capability. Overall, HAF-YOLO consistently outperforms the other models across object classes and covers a broader area in the radar plots for both datasets.
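A radar plot of per-class AP@50 values in the style of Figure 15 can be generated with a few lines of matplotlib. The sketch below uses an illustrative three-class subset taken from the AP50 columns of Table 5 (DIOR) and is not the figure’s original plotting code.

```python
# Minimal radar-plot sketch for per-class AP@50 values (illustrative subset of Table 5).
import numpy as np
import matplotlib.pyplot as plt

classes = ["Airplane", "Ship", "Vehicle"]
scores = {"YOLOv8": [0.788, 0.901, 0.484], "HAF-YOLO": [0.827, 0.914, 0.513]}

angles = np.linspace(0, 2 * np.pi, len(classes), endpoint=False).tolist()
angles += angles[:1]                                  # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, vals in scores.items():
    ax.plot(angles, vals + vals[:1], label=name)
    ax.fill(angles, vals + vals[:1], alpha=0.15)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(classes)
ax.legend(loc="upper right")
plt.show()
```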

5. Conclusions

In this study, we proposed the HAF-YOLO model to address challenges in object detection for remote-sensing images, such as the vulnerability of small-object features to loss, complex background interference, diverse object scales, and densely distributed objects. Three core modules were proposed: DyCoMF-Arch, MWA-Net, and SDDE-Module, which respectively refine the conventional YOLOv8 framework from the perspectives of feature fusion, scale enhancement, and spatial adaptability.
DyCoMF-Arch constructs a hierarchical feature pyramid through multistage spatial compression and expansion operations and combines a dynamic weight allocation mechanism to achieve precise cross-scale feature fusion. MWA-Net employs wavelet-transform-based convolution to decompose features, explicitly separating low-frequency contours from high-frequency details; combined with a dynamic weighted fusion strategy, this enhances the feature representation of small objects and effectively mitigates the high-frequency information loss common in traditional convolution operations. The SDDE-Module integrates absolute coordinate information through a coordinate enhancement layer and couples it with multidirectional convolutions and a deformable sampling mechanism to enhance the model’s adaptability to the spatial distribution of small objects. Experiments on the NWPU VHR-10 and DIOR datasets show that HAF-YOLO achieved mAP50 scores of 85.0% and 78.1%, respectively, surpassing the baseline YOLOv8 by 4.8% and 3.1%, while keeping the computational cost at only 11.8 GFLOPs, lower than that of recent YOLO versions such as YOLOv10, YOLOv11, and YOLOv12. Comparative experiments show that HAF-YOLO achieves a better balance between parameter efficiency and detection accuracy. Visualization results further confirm that HAF-YOLO significantly improves detection confidence in scenes with dense, small, and multiscale objects while markedly reducing both the false-negative and false-positive rates. The model provides efficient and reliable technical support for remote-sensing image analysis in domains such as agricultural monitoring and urban planning.
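To make the wavelet-decomposition idea behind MWA-Net more concrete, the sketch below implements a single-level 2D Haar transform as a fixed-weight, stride-2 depthwise convolution, splitting each channel into one low-frequency and three high-frequency sub-bands. This is a generic illustration under our own assumptions; the actual MWA-Net layer (its wavelet choice, learnable filters, and dynamic weighted fusion) may differ.

```python
# Single-level 2D Haar decomposition via a fixed depthwise convolution (illustrative).
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """x: (B, C, H, W) with even H, W -> (LL, LH, HL, HH), each (B, C, H/2, W/2)."""
    b, c, _, _ = x.shape
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh])                  # (4, 2, 2)
    weight = kernels.repeat(c, 1, 1).unsqueeze(1).to(x)      # (4C, 1, 2, 2), depthwise
    out = F.conv2d(x, weight, stride=2, groups=c)            # (B, 4C, H/2, W/2)
    out = out.view(b, c, 4, *out.shape[-2:])
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]

ll, lh, hl, hh = haar_dwt(torch.rand(1, 16, 64, 64))         # high-frequency bands retain edge detail
```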
Although the HAF-YOLO network demonstrated superior detection performance in this investigation, it remains constrained by a relatively high parameter count. In future work, we will explore the integration of synthetic aperture radar (SAR) image datasets to improve model robustness under complex conditions through dual-modality data, and we will investigate more lightweight designs to better balance accuracy and detection speed, thereby maintaining strong performance in future aerial image-detection tasks.

Author Contributions

Conceptualization, J.L. and P.Z.; methodology, P.Z., J.L. and J.Z.; software, P.Z., Y.L. and J.S.; validation, P.Z., J.Z., Y.L. and J.S.; formal analysis, P.Z., Y.L. and J.L.; investigation, P.Z., J.Z., Y.L. and J.S.; resources, P.Z., J.L. and J.Z.; data curation, Y.L., P.Z. and J.S.; writing—original draft preparation, P.Z. and J.Z.; writing—review and editing, P.Z., J.L. and Y.L.; visualization, Y.L., P.Z. and J.S.; supervision, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to express our heartfelt gratitude to the authors of all the works cited in this paper. Their invaluable insights, rigorous research, and scholarly discussions have greatly enriched our understanding and helped shape the foundation of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AP: Average precision
CGAA: Chunked gated adaptive attention
CNN: Convolutional neural network
FLOP: Floating-point operation
FN: False negative
FP: False positive
FPN: Feature pyramid network
GELAN: General efficient layer aggregation network
HOG: Histogram of oriented gradients
IoU: Intersection over union
MHSA: Multihead self-attention
NMS: Non-maximum suppression
PGI: Programmable gradient information
ROI: Region of interest
RPN: Region proposal network
SAR: Synthetic aperture radar
SDDE: Spatial-deformable dynamic enhancement
SE: Squeeze-and-excitation
SPPF: Spatial pyramid pooling fast
SSD: Single-shot multibox detector
SVM: Support vector machine
TAA: Task-aligned assigner
TN: True negative
TP: True positive
UAV: Unmanned aerial vehicle
YOLO: You Only Look Once

References

  1. Zheng, Z.; Yuan, J.; Yao, W.; Kwan, P.; Yao, H.; Liu, Q.; Guo, L. Fusion of UAV-acquired visible images and multispectral data by applying machine-learning methods in crop classification. Agronomy 2024, 14, 2670. [Google Scholar] [CrossRef]
  2. Zhang, Z.; Yao, F.; Li, J. Dynamic Penetration Test Based on YOLOv5. In Proceedings of the 2022 3rd International Conference on Geology, Mapping and Remote Sensing (ICGMRS), Zhoushan, China, 22–24 April 2022. [Google Scholar]
  3. Morita, M.; Kinjo, H.; Sato, S.; Tansuriyavong, S.; Anezaki, T. Autonomous Flight Drone for Infrastructure (Transmission Line) Inspection. In Proceedings of the 2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 24–26 November 2017. [Google Scholar]
  4. Menkhoff, T.; Tan, E.K.B.; Ning, K.S.; Hup, T.G.; Pan, G. Tapping Drone Technology to Acquire 21st Century Skills: A Smart City Approach. In Proceedings of the 2017 IEEE SmartWorld, San Francisco, CA, USA, 4–8 August 2017. [Google Scholar]
  5. Ryoo, D.-W.; Lee, M.-S.; Lim, C.-D. Design of a Drone-Based Real-Time Service System for Facility Inspection. In Proceedings of the 2023 14th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 11–13 October 2023. [Google Scholar]
  6. Sun, X.; Zhang, W. Implementation of Target Tracking System Based on Small Drone. In Proceedings of the 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chengdu, China, 20–22 December 2019. [Google Scholar]
  7. Wang, Z.; Xia, F.; Zhang, C. FD_YOLOX: An Improved YOLOX Object Detection Algorithm Based on Dilated Convolution. In Proceedings of the 2023 IEEE 18th Conference on Industrial Electronics and Applications (ICIEA), Ningbo, China, 18–22 August 2023. [Google Scholar]
  8. Zong, H.; Pu, H.; Zhang, H.; Wang, X.; Zhong, Z.; Jiao, Z. Small Object Detection in UAV Image Based on Slicing Aided Module. In Proceedings of the 2022 IEEE 4th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 29–31 July 2022. [Google Scholar]
  9. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005. [Google Scholar]
  10. Lowe, D.G. Object Recognition from Local Scale-Invariant Features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Corfu, Greece, 20–27 September 1999. [Google Scholar]
  11. Pan, W.; Huan, W.; Xu, L. Improving High-Voltage Line Obstacle Detection with Multi-Scale Feature Fusion in YOLO Algorithm. In Proceedings of the 2024 6th International Conference on Electronics and Communication, Network and Computer Technology (ECNCT), Guangzhou, China, 19–21 July 2024. [Google Scholar]
  12. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  13. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  14. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  15. Viola, P.; Jones, M. Rapid Object Detection Using a Boosted Cascade of Simple Features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, HI, USA, 8–14 December 2001. [Google Scholar]
  16. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  17. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  18. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  19. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. arXiv 2017, arXiv:1712.00726. [Google Scholar]
  20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016. [Google Scholar]
  21. Huang, G.; Liu, Z.; Weinberger, K.Q. Densely connected convolutional networks. arXiv 2016, arXiv:1608.06993. [Google Scholar]
  22. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  23. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  24. Farhadi, A.; Redmon, J. YOLOv3: An Incremental Improvement. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  25. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  26. Jocher, G.; Stoken, A.; Chaurasia, A.; Borovec, J.; Kwon, Y.; Michael, K.; Changyu, L.; Fang, J.; Skalski, P.; Hogan, A.; et al. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 27 August 2024).
  27. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  28. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  29. Jocher, G.; Chaurasia, A.; Qiu, J. YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 27 August 2024).
  30. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  31. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  32. Hu, J.; Pang, T.; Peng, B.; Shi, Y.; Li, T. A small object detection model for drone images based on multi-attention fusion network. Image Vis. Comput. 2025, 155, 105436. [Google Scholar] [CrossRef]
  33. Liu, M.; Wang, X.; Zhou, A.; Fu, X.; Ma, Y.; Piao, C. UAV-YOLO: Small object detection on unmanned aerial vehicle perspective. Sensors 2020, 20, 2238. [Google Scholar] [CrossRef]
  34. Wang, X.; He, N.; Hong, C.; Wang, Q.; Chen, M. Improved YOLOX-X based UAV aerial photography object detection algorithm. Image Vis. Comput. 2023, 135, 104697. [Google Scholar] [CrossRef]
  35. Koyun, O.C.; Keser, R.K.; Akkaya, I.B.; Töreyin, B.U. Focus-and-detect: A small object detection framework for aerial images. Signal Process. Image Commun. 2022, 104, 116675. [Google Scholar] [CrossRef]
  36. Qu, J.; Tang, Z.; Zhang, L.; Zhang, Y.; Zhang, Z. Remote sensing small object detection network based on attention mechanism and multi-scale feature fusion. Remote Sens. 2023, 15, 2728. [Google Scholar] [CrossRef]
  37. Wu, Y.; Mu, X.; Shi, H.; Hou, M. An object detection model AAPW-YOLO for UAV remote sensing images based on adaptive convolution and reconstructed feature fusion. Sci. Rep. 2025, 15, 16214. [Google Scholar] [CrossRef]
  38. Ji, C.L.; Yu, T.; Gao, P.; Wang, F.; Yuan, R.Y. YOLO-TLA: An efficient and lightweight small object detection model based on YOLOv5. J. Real-Time Image Process. 2024, 21, 141. [Google Scholar] [CrossRef]
  39. Hu, W.; Jiang, X.; Tian, J.; Ye, S.; Liu, S. Land target detection algorithm in remote sensing images based on deep learning. Land 2025, 14, 1047. [Google Scholar] [CrossRef]
  40. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  41. Zhao, L.L.; Zhu, M.L. MS-YOLOv7: YOLOv7 based on multi-scale for object detection on UAV aerial photography. Drones 2023, 7, 188. [Google Scholar] [CrossRef]
  42. Liao, H.; Zhu, W. YOLO-DRS: A bioinspired object detection algorithm for remote sensing images incorporating a multi-scale efficient lightweight attention mechanism. Biomimetics 2023, 8, 458. [Google Scholar] [CrossRef] [PubMed]
  43. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  44. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet Convolutions for Large Receptive Fields. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  45. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  46. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and A New Benchmark. arXiv 2019, arXiv:1909.00133. [Google Scholar] [CrossRef]
Figure 1. Structure of HAF-YOLO.
Figure 2. Network structure of the HiGA-Module.
Figure 3. Architecture of the MWA-Net.
Figure 4. Architecture of the SDDE-Module.
Figure 5. Principle of the coordinate enhancement layer.
Figure 6. Data distribution: (a) DIOR dataset; (b) NWPU VHR-10 dataset.
Figure 7. Confusion matrix comparison: (a) Confusion matrix of YOLOv8 on the NWPU VHR-10 dataset, (b) confusion matrix of HAF-YOLO on the NWPU VHR-10 dataset, (c) confusion matrix of YOLOv8 on the DIOR dataset, and (d) confusion matrix of HAF-YOLO on the DIOR dataset.
Figure 8. Training curves on the NWPU VHR-10 dataset: (a) Precision, (b) recall, (c) mAP50, (d) mAP50–95.
Figure 9. Training curves on the DIOR dataset: (a) Precision, (b) recall, (c) mAP50, (d) mAP50–95.
Figure 10. Detection of scenes involving densely arranged objects: (a) Storage tanks, (b) airplanes.
Figure 11. Small-object detection scenes: (a) Vehicles, (b) windmills, and (c) ships.
Figure 12. (a–c) Multiscale object-detection scenes.
Figure 13. Detection heatmaps: (a) Small ship objects, (b) small windmill objects, (c) densely arranged objects, (d) multiscale objects.
Figure 14. Visualization of detection experiments using different networks: (a) Original images, (b) YOLOv3, (c) YOLOv5, (d) YOLOv6, (e) YOLOv7, (f) YOLOv8, (g) YOLOv9, (h) YOLOv10, (i) YOLOv11, (j) YOLOv12, (k) AAPW-YOLO, (l) YOLO-TLA, (m) MS-YOLOv7, (n) YOLO-DRS, and (o) HAF-YOLO.
Figure 15. Comparative analysis of mAP@50 (%) results across different models from experiments with the (a) DIOR dataset and (b) NWPU VHR-10 dataset.
Table 1. Experimental training environment.
Parameter | Configuration
CPU | Intel(R) Core(TM) i9-10940X
GPU | NVIDIA GeForce RTX 3090
System | Windows 10
Deep learning framework | PyTorch 2.0.0
GPU accelerator | CUDA 11.6
Integrated development environment | PyCharm
Scripting language | Python 3.8
Table 2. Training parameters.
Parameter | Configuration
Epochs | 150
Workers | 10
Batch | 10
lr0 | 0.01
Momentum | 0.937
weight_decay | 0.0005
Network optimizer | SGD
box loss weight | 7.5
cls loss weight | 0.5
dfl loss weight | 1.5
mosaic | 1.0
IOU | C-IOU
translate | 0.1
scale | 0.5
fliplr | 0.5
Table 3. Ablation experiments using the DIOR dataset.
YOLOv8 | DyCoMF-Arch | MWA-Net | SDDE-Module | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | GFLOPs
✓ | × | × | × | 84.6 | 66.9 | 75.0 | 52.1 | 3.0 | 8.2
✓ | ✓ | × | × | 85.4 | 68.8 | 76.2 | 53.3 | 3.3 | 9.0
✓ | × | ✓ | × | 85.5 | 68.0 | 76.1 | 53.2 | 3.7 | 10.4
✓ | × | × | ✓ | 85.2 | 67.8 | 76.3 | 53.7 | 3.2 | 8.7
✓ | ✓ | ✓ | × | 85.9 | 69.4 | 76.8 | 53.9 | 4.0 | 11.2
✓ | ✓ | × | ✓ | 86.2 | 69.8 | 77.5 | 54.2 | 3.6 | 9.6
✓ | × | ✓ | ✓ | 86.4 | 70.9 | 77.2 | 54.9 | 3.9 | 10.9
✓ | ✓ | ✓ | ✓ | 87.1 | 71.9 | 78.1 | 55.4 | 4.3 | 11.8
Table 4. Ablation experiments using the NWPU VHR-10 dataset.
YOLOv8 | DyCoMF-Arch | MWA-Net | SDDE-Module | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | GFLOPs
✓ | × | × | × | 84.9 | 75.1 | 80.2 | 47.4 | 3.0 | 8.2
✓ | ✓ | × | × | 85.6 | 77.1 | 81.4 | 47.9 | 3.3 | 9.0
✓ | × | ✓ | × | 85.7 | 76.8 | 81.5 | 47.5 | 3.7 | 10.4
✓ | × | × | ✓ | 85.4 | 76.6 | 81.7 | 48.1 | 3.2 | 8.7
✓ | ✓ | ✓ | × | 86.1 | 77.8 | 82.4 | 48.2 | 4.0 | 11.2
✓ | ✓ | × | ✓ | 86.2 | 78.1 | 83.8 | 48.7 | 3.6 | 9.6
✓ | × | ✓ | ✓ | 86.2 | 78.5 | 83.5 | 48.8 | 3.9 | 10.9
✓ | ✓ | ✓ | ✓ | 87.6 | 79.8 | 85.0 | 50.2 | 4.3 | 11.8
Table 5. Comparison of recognition performance for each category on the DIOR dataset (YOLOv8 / HAF-YOLO).
Category | P (%) | R (%) | AP50 (%) | AP50:95 (%)
All | 0.847 / 0.871 | 0.669 / 0.719 | 0.750 / 0.781 | 0.521 / 0.554
Airplane | 0.944 / 0.962 | 0.635 / 0.708 | 0.788 / 0.827 | 0.532 / 0.575
Airport | 0.805 / 0.832 | 0.803 / 0.876 | 0.857 / 0.898 | 0.585 / 0.658
Baseball field | 0.954 / 0.970 | 0.698 / 0.732 | 0.847 / 0.858 | 0.688 / 0.712
Basketball court | 0.924 / 0.945 | 0.845 / 0.873 | 0.889 / 0.902 | 0.757 / 0.794
Bridge | 0.739 / 0.768 | 0.349 / 0.405 | 0.432 / 0.487 | 0.239 / 0.290
Chimney | 0.970 / 0.968 | 0.725 / 0.745 | 0.765 / 0.792 | 0.645 / 0.682
Dam | 0.698 / 0.705 | 0.657 / 0.712 | 0.698 / 0.724 | 0.391 / 0.435
Expressway service area | 0.836 / 0.852 | 0.811 / 0.856 | 0.863 / 0.888 | 0.606 / 0.660
Expressway toll station | 0.908 / 0.912 | 0.561 / 0.625 | 0.655 / 0.696 | 0.500 / 0.545
Golf course | 0.773 / 0.825 | 0.786 / 0.848 | 0.821 / 0.869 | 0.587 / 0.685
Ground track field | 0.720 / 0.735 | 0.772 / 0.825 | 0.792 / 0.817 | 0.598 / 0.640
Harbor | 0.765 / 0.785 | 0.611 / 0.642 | 0.668 / 0.684 | 0.474 / 0.510
Overpass | 0.844 / 0.860 | 0.523 / 0.562 | 0.617 / 0.642 | 0.413 / 0.445
Ship | 0.929 / 0.942 | 0.832 / 0.865 | 0.901 / 0.914 | 0.549 / 0.575
Stadium | 0.843 / 0.885 | 0.595 / 0.652 | 0.783 / 0.802 | 0.604 / 0.625
Storage tank | 0.961 / 0.968 | 0.553 / 0.582 | 0.733 / 0.751 | 0.457 / 0.472
Tennis court | 0.950 / 0.958 | 0.860 / 0.875 | 0.911 / 0.923 | 0.764 / 0.788
Train station | 0.633 / 0.635 | 0.650 / 0.705 | 0.653 / 0.670 | 0.339 / 0.395
Vehicle | 0.889 / 0.905 | 0.334 / 0.368 | 0.484 / 0.513 | 0.268 / 0.300
Windmill | 0.866 / 0.902 | 0.781 / 0.828 | 0.844 / 0.865 | 0.425 / 0.458
Table 6. Comparison of recognition performance for each category on the NWPU VHR-10 dataset (YOLOv8 / HAF-YOLO).
Category | P (%) | R (%) | AP50 (%) | AP50:95 (%)
All | 0.846 / 0.876 | 0.751 / 0.798 | 0.804 / 0.850 | 0.475 / 0.502
Airplane | 0.936 / 0.952 | 0.936 / 0.965 | 0.977 / 0.988 | 0.581 / 0.591
Ship | 0.855 / 0.857 | 0.855 / 0.686 | 0.714 / 0.768 | 0.439 / 0.439
Storage tank | 0.825 / 0.901 | 0.825 / 0.874 | 0.760 / 0.871 | 0.384 / 0.424
Baseball diamond | 0.927 / 0.941 | 0.927 / 0.972 | 0.981 / 0.971 | 0.714 / 0.716
Tennis court | 0.888 / 0.863 | 0.888 / 0.904 | 0.877 / 0.922 | 0.512 / 0.545
Basketball court | 0.623 / 0.736 | 0.623 / 0.522 | 0.525 / 0.606 | 0.286 / 0.348
Ground track field | 0.954 / 0.948 | 0.954 / 0.943 | 0.972 / 0.971 | 0.732 / 0.736
Harbor | 0.860 / 0.854 | 0.860 / 0.903 | 0.923 / 0.925 | 0.468 / 0.510
Bridge | 0.715 / 0.817 | 0.715 / 0.477 | 0.546 / 0.633 | 0.209 / 0.236
Vehicle | 0.874 / 0.886 | 0.874 / 0.738 | 0.763 / 0.844 | 0.430 / 0.471
Table 7. Comparative results on the NWPU VHR-10 dataset.
Model | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | GFLOPs | FPS
Traditional Methods
Faster R-CNN | 85.2 | 76.8 | 82.3 | 46.8 | 26.5 | 45.6 | 41
SSD | 83.5 | 74.1 | 80.5 | 44.2 | 24.1 | 32.7 | 63
YOLO Series
YOLOv3 | 87.1 | 79.9 | 84.5 | 48.9 | 12.1 | 18.9 | 178
YOLOv5 | 82.7 | 76.3 | 81.5 | 47.1 | 2.5 | 7.1 | 198
YOLOv6 | 84.4 | 78.0 | 82.1 | 48.6 | 4.2 | 11.8 | 196
YOLOv7 | 81.1 | 78.9 | 81.2 | 42.9 | 6.0 | 13.3 | 175
YOLOv8 | 84.9 | 75.1 | 80.2 | 47.4 | 3.0 | 8.2 | 188
YOLOv9 | 83.8 | 72.6 | 78.7 | 46.3 | 1.9 | 7.6 | 168
YOLOv10 | 72.6 | 71.5 | 75.0 | 45.1 | 7.2 | 21.4 | 142
YOLOv11 | 87.6 | 78.2 | 83.6 | 49.4 | 9.4 | 21.6 | 128
YOLOv12 | 83.7 | 74.1 | 79.6 | 45.8 | 9.2 | 21.2 | 90
RS-Specific SOTA
AAPW-YOLO | 86.5 | 78.0 | 83.6 | 49.0 | 3.5 | 10.5 | 158
YOLO-TLA | 85.0 | 77.5 | 82.1 | 48.0 | 3.2 | 9.0 | 195
MS-YOLOv7 | 81.5 | 79.0 | 81.2 | 47.0 | 6.5 | 14.5 | 176
YOLO-DRS | 86.0 | 78.5 | 83.6 | 49.6 | 3.9 | 10.9 | 169
Ours
HAF-YOLO | 87.6 | 79.8 | 85.0 | 50.2 | 4.3 | 11.8 | 161
Table 8. Comparative results on the DIOR dataset.
Model | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | GFLOPs | FPS
Traditional Methods
Faster R-CNN | 84.7 | 65.9 | 70.8 | 45.1 | 26.5 | 45.6 | 43
SSD | 82.3 | 63.5 | 68.2 | 42.7 | 24.1 | 32.7 | 64
YOLO Series
YOLOv3 | 83.0 | 62.8 | 68.5 | 47.0 | 12.1 | 18.9 | 185
YOLOv5 | 83.4 | 65.3 | 72.7 | 49.2 | 2.5 | 7.1 | 194
YOLOv6 | 82.4 | 62.7 | 70.2 | 78.5 | 4.2 | 11.8 | 203
YOLOv7 | 85.1 | 67.2 | 75.5 | 52.6 | 6.0 | 13.3 | 182
YOLOv8 | 84.6 | 66.9 | 75.0 | 52.1 | 3.0 | 8.2 | 190
YOLOv9 | 83.8 | 66.9 | 74.9 | 53.3 | 1.9 | 7.6 | 163
YOLOv10 | 83.1 | 63.2 | 71.5 | 48.2 | 7.2 | 21.4 | 145
YOLOv11 | 84.4 | 68.9 | 76.8 | 54.1 | 9.4 | 21.6 | 132
YOLOv12 | 84.6 | 67.6 | 75.9 | 53.9 | 9.2 | 21.2 | 88
RS-Specific SOTA
AAPW-YOLO | 85.0 | 68.5 | 76.8 | 54.0 | 3.5 | 10.5 | 154
YOLO-TLA | 83.8 | 67.0 | 74.5 | 52.0 | 3.2 | 9.0 | 192
MS-YOLOv7 | 85.5 | 67.5 | 75.5 | 52.8 | 6.5 | 14.5 | 182
YOLO-DRS | 84.8 | 69.0 | 76.8 | 54.0 | 3.9 | 10.9 | 173
Ours
HAF-YOLO | 87.1 | 71.9 | 78.1 | 55.4 | 4.3 | 11.8 | 164
