The ability of the proposed model to detect small targets in remote sensing images was verified on two public datasets that specifically focus on small objects, namely the VEDAI and AI-TOD datasets. In addition, the scalability of the network was examined on the DOTA and NWPU VHR-10 public datasets. The selection of comparison methods in our experiments was designed to align with the characteristics of each dataset and the research focus of the existing literature. For VEDAI and AI-TOD, which emphasize small objects, we prioritized methods specifically optimized for small targets (e.g., Super-YOLO, FFCA-YOLO). For DOTA, a multi-scale dataset, we compared against models with strong multi-scale fusion capabilities (e.g., SPH-YOLO, STDL-YOLO).
For NWPU VHR-10, which contains both small and large objects, we included general-purpose SOTA detectors (e.g., MSF-SNET, KCFS-YOLO). This approach follows a common practice in remote sensing detection studies, where dataset-specific SOTA methods are selected to highlight domain-specific improvements. All experimental data were divided into a training set and a testing set at a ratio of 4:1. The proposed models were implemented in PyTorch 2.1.1 and deployed on a workstation equipped with an NVIDIA 4060Ti GPU. The Stochastic Gradient Descent (SGD) optimizer was used for training, with an initial learning rate of 0.01, a momentum of 0.937, and a weight decay of 0.0005; the batch size was set to 16, and all models were trained on the training set for 300 epochs. To evaluate the detection performance and lightweight effect of the proposed model, the mean average precision at an IoU threshold of 0.5 (mAP@0.5) was used to assess detection accuracy, while the computational floating-point operations (GFLOPs), number of parameters (Params), and inference speed (FPS) were used to evaluate the lightweight effect.
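For orientation, the training configuration described above can be summarized in the following minimal PyTorch sketch. It is only an illustration of the reported hyperparameters: the convolutional module and the in-memory batch list are stand-ins for the actual HFEF2-YOLO network, the 4:1 data split, and the YOLO composite loss, none of which are reproduced here.
```python
import torch
import torch.nn as nn

# Stand-ins so the sketch runs end to end; in practice these would be the
# HFEF2-YOLO network and a loader over the 4:1 training split described above.
model = nn.Conv2d(3, 16, 3, padding=1)                    # placeholder for HFEF2-YOLO
train_loader = [(torch.randn(16, 3, 64, 64), None)] * 4   # placeholder mini data loader (batch size 16)

# SGD settings reported in this section.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)

for epoch in range(300):                   # all models were trained for 300 epochs
    for images, _targets in train_loader:
        preds = model(images)
        loss = preds.abs().mean()          # stand-in for the YOLO box/objectness/class loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```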
4.1. Comparisons with Existing Methods
The proposed model's detection results obtained on the VEDAI dataset are shown in Table 1. The proposed model was compared with lightweight CNNs, the basic YOLOv5 model, higher YOLO versions (YOLOv8, YOLOv9), different YOLO-based models [2,9,10,15,45,46,47], and other representative detection methods [2,48,49,50]. The comparison results show that, compared to the basic YOLOv5 model, HFEF2-YOLO improved the mAP@0.5 by nearly 0.06 and the mAP@0.5:0.95 by nearly 0.02. In addition, there were clear improvements in detection precision and recall, indicating that the proposed HFEF2-YOLO had a significant advantage over the benchmark network in remote sensing small target detection. Moreover, compared to the higher-version YOLO base models and other YOLO-based network structures, HFEF2-YOLO achieved better performance. Owing to the lightweight design of the HF-FPN and DSAM, the computational floating-point operations (GFLOPs) of HFEF2-YOLO were slightly reduced compared to the base YOLOv5 model. After the introduction of the GhostModule, the GFLOPs of L-HFEF2-YOLO were reduced to 23.6, showing an excellent lightweight effect; its detection accuracy was slightly reduced but remained higher than that of most other models. Further, detection was conducted on the different target categories of the VEDAI dataset, as shown in Table 2. HFEF2-YOLO achieved improvements in mAP@0.5, precision, and recall.
Figure 7 shows the visualization comparison between HFEF2-YOLO and the basic YOLOv5 model. The YOLOv5 model often produced missed and misclassified detections in complex images, while the proposed network achieved more appropriate detection results. In addition, the confidence scores of the YOLOv5 model for the detected objects were lower than those of the proposed network, demonstrating that the proposed network is better suited to small target detection tasks than the YOLOv5 model.
It is worth noting that the VEDAI dataset has a class imbalance (e.g., more “cars” than “boats”). While HFEF2-YOLO does not explicitly introduce class-balancing techniques (e.g., resampling or class-weighted loss), its architecture inherently mitigates class imbalance through three mechanisms: (1) The hierarchical dynamic gating mechanism adaptively enhances discriminative features of underrepresented classes (e.g., “boats”). By generating channel-specific weights W_c and spatial attention weights W_s, the model prioritizes minority-class regions (e.g., small boat edges) during feature fusion, reducing bias toward dominant classes such as “cars”. (2) The Dynamic Spatial–Spectral Attention Module (DSAM) assigns a unique spatial weight map to each channel, suppressing background noise (e.g., water surfaces) that often obscures minority classes. This context-aware filtering amplifies subtle features of rare targets, compensating for their limited training samples. (3) The Global Feature Enhancement Module (GFEM) integrates global statistical priors (e.g., edge and texture patterns) through GAP/GMP, ensuring stable feature extraction even for low-frequency classes.
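To make the second mechanism concrete, the sketch below illustrates the general idea of per-channel spatial attention as described above. It is not the paper's exact DSAM implementation: the module name, the depth-wise convolution, and the kernel size are illustrative assumptions that only capture the "unique spatial weights per channel" behaviour.
```python
import torch
import torch.nn as nn

class ChannelwiseSpatialAttention(nn.Module):
    """Illustrative stand-in for the DSAM behaviour described above: each channel
    receives its own spatial weight map rather than one map shared by all channels."""

    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        # A depth-wise (grouped) convolution yields one attention map per channel, so
        # background regions (e.g., water surfaces) can be suppressed channel by channel.
        self.dw_conv = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2, groups=channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_s = self.sigmoid(self.dw_conv(x))   # per-channel spatial weights W_s in [0, 1]
        return x * w_s                        # re-weighted features emphasize rare-target cues


if __name__ == "__main__":
    feat = torch.randn(1, 64, 80, 80)                     # dummy P3-level feature map
    print(ChannelwiseSpatialAttention(64)(feat).shape)    # torch.Size([1, 64, 80, 80])
```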
In addition to the VEDAI dataset, HFEF2-YOLO also performed excellently on the AI-TOD dataset, another small target dataset.
Table 3 shows that, compared with the SOTA methods [9,51,52,53,54], HFEF2-YOLO achieves the best performance. On the test set, the mAP@0.5 reaches 0.621, and the mAP@0.5:0.95, mAP_vt, mAP_t, and mAP_s reach 0.280, 0.130, 0.251, and 0.319, respectively. These results demonstrate the excellent performance of HFEF2-YOLO for small object detection in remote sensing.
Figure 8 shows the visualization comparison between HFEF2-YOLO and the basic YOLOv5 model. The YOLOv5 model often yielded low-precision and misclassified detections in complex images, while the proposed network achieved more appropriate detection results.
Furthermore, HFEF2-YOLO also performed excellently on the DOTA dataset, which contains small and medium targets. All experiments on the DOTA dataset were conducted using the DOTA-v1.5 version, which includes enhanced annotations for small targets and an additional ‘container crane’ class. This version is widely recognized for benchmarking multi-scale detection tasks in remote sensing.
Table 4 presents the comparison results of the proposed HFEF2-YOLO model with the YOLOv5 base model, YOLOv8, YOLOv9, Fast R-CNN [2], RetinaNet [21], SSD [5], and other related YOLO-based models [11,14,31]. The results show that, compared to the YOLOv5 base model, HFEF2-YOLO improved the mAP@0.5 by nearly 0.012 and the F1 score by approximately 0.01. The detection results of HFEF2-YOLO are presented in Figure 9. For the small targets annotated in the ground truth, HFEF2-YOLO achieved effective identification and detection. At the same time, it could also accurately identify small targets not annotated in the ground truth, which further demonstrates its capability in small target detection.
The proposed HFEF2-YOLO model also performed excellently in the detection of larger targets, owing to the multi-scale feature expression capability of the HF-FPN and the efficient attention mechanism of the HF-DSA. Table 5 shows the comparison results of the HFEF2-YOLO model with the YOLOv5 base model, YOLOv8, YOLOv9, MSF-SNET [55], CBFF-SSD [56], and other YOLO-based frameworks [10,15,27,28,29,30,57,58] on the NWPU VHR-10 dataset. The results show that, compared to the YOLOv5 base model, HFEF2-YOLO improved the mAP@0.5 by nearly 0.008 and also achieved improvements over the other models.
The results presented in Table 5 show that HFEF2-YOLO achieved only a 0.008 improvement over the baseline model (YOLOv5) on the NWPU VHR-10 dataset, which may appear insignificant. To address this concern, we conducted a statistical significance validation: we randomly selected 100 images from the test set of the NWPU VHR-10 dataset and performed repeated inferences (n = 10) with both the baseline model (YOLOv5m) and the improved model (HFEF2-YOLO), recording the mAP@0.5 of each run, as shown in Table 6. The results show that HFEF2-YOLO achieves a mean mAP@0.5 of 0.969 (std = 0.0018), significantly outperforming YOLOv5m (mean = 0.961, std = 0.0025) with p = 0.012 (independent t-test). The 95% confidence interval of the mean difference (0.003–0.013) further confirms the stability of the improvement. It is worth noting that the improvement of HFEF2-YOLO on the NWPU VHR-10 dataset is relatively small, which may be attributed to the high baseline performance on this dataset (YOLOv5m already achieves an mAP@0.5 of 0.961), approaching the upper limit of the detection task. In this context, an absolute improvement of 0.8% still holds practical significance; for example, for critical targets in remote sensing images (e.g., ships, vehicles), this improvement could reduce the risk of missed detections.
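The significance test above can be reproduced with the short SciPy sketch below. The two arrays are synthetic stand-ins drawn with the reported means and standard deviations; in the actual analysis they would be replaced by the ten per-run mAP@0.5 values recorded in Table 6.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the per-run mAP@0.5 values of Table 6 (n = 10 each),
# drawn with the reported means/standard deviations; replace with the actual runs.
map_yolov5m = rng.normal(loc=0.961, scale=0.0025, size=10)
map_hfef2 = rng.normal(loc=0.969, scale=0.0018, size=10)

# Independent two-sample t-test between the two sets of runs.
t_stat, p_value = stats.ttest_ind(map_hfef2, map_yolov5m)

# 95% confidence interval of the mean difference (pooled df = n1 + n2 - 2 = 18).
diff = map_hfef2.mean() - map_yolov5m.mean()
se = np.sqrt(map_hfef2.var(ddof=1) / 10 + map_yolov5m.var(ddof=1) / 10)
ci_low, ci_high = stats.t.interval(0.95, 18, loc=diff, scale=se)

print(f"mean diff = {diff:.4f}, p = {p_value:.4f}, "
      f"95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```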
Figure 10 presents the comparison between the ground truth and the model’s detections. The results indicate that many objects not annotated in the ground truth were detected by the model, meaning that an object could be detected even when only part of it was visible, which demonstrates the proposed network’s effective use of internal object information. Moreover, for some categories, the bounding boxes produced by the proposed network fit the objects better than the ground truth boxes, reflecting the limitations of manually labeled datasets. Thus, the results demonstrate the ability of the proposed network to detect objects accurately, even with incomplete feature information.
4.3. Robustness Experiments
In the actual imaging process, remote sensing data are often subject to various degradations, such as blurring, noise, and illumination changes, which can cause target objects to blend into the background, especially small objects. To verify the robustness of the proposed HFEF2-YOLO model under image degradation, this study generated a series of test sets by simulating remote sensing image degradation. All test sets contained the same original images but under different degradation conditions: image blurring, Gaussian noise, and low-light conditions. The results obtained for the different degradation types are displayed in Figure 13, where it can be seen that image degradation significantly affected the features of small objects. The Peak Signal-to-Noise Ratio (PSNR) was used to evaluate the quality of the degraded images; as shown in Table 8, the image quality after low-light processing was the worst. The YOLOv5 base model and the proposed HFEF2-YOLO model were compared on the degraded datasets, as shown in Table 11. The results indicated that both models showed a significant decline in mAP@0.5 on the degraded datasets, with the most noticeable decline for the blurring and noise degradation types. Subsequently, the degraded datasets were added to the training set, and the models were retrained and tested again. The detection performance of both retrained models improved, but HFEF2-YOLO remained consistently better than the YOLOv5 base version.
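The sketch below illustrates how such degraded test sets and their PSNR values can be produced. The exact degradation parameters are not specified in this section, so the blur kernel size, Gaussian noise level, and gamma used for the low-light simulation are illustrative assumptions.
```python
import cv2
import numpy as np

def degrade(img: np.ndarray, mode: str) -> np.ndarray:
    """Simulate the three degradation types used to build the robustness test sets.
    Kernel size, noise sigma, and gamma below are illustrative assumptions."""
    if mode == "blur":
        return cv2.GaussianBlur(img, (7, 7), 0)
    if mode == "noise":
        noisy = img.astype(np.float32) + np.random.normal(0, 15, img.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)
    if mode == "low_light":
        return np.clip((img / 255.0) ** 2.2 * 255.0, 0, 255).astype(np.uint8)
    raise ValueError(f"unknown degradation mode: {mode}")

def psnr(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Peak Signal-to-Noise Ratio between the original and the degraded image."""
    mse = np.mean((reference.astype(np.float64) - degraded.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

if __name__ == "__main__":
    image = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)  # stand-in image
    for mode in ("blur", "noise", "low_light"):
        print(mode, round(psnr(image, degrade(image, mode)), 2))
```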
To further validate the model’s generalization ability on real-world hazy data, we conducted additional experiments on the RTTS (Real-world Task-driven Testing Set) dataset, a specialized test set designed for image dehazing and object detection. It is part of the RESIDE dataset and comprises 4322 real-world foggy images, which we manually annotated for testing the YOLO models. The performance of YOLOv5 and HFEF2-YOLO on this dataset is shown in Table 12 and Figure 14. The results show that HFEF2-YOLO improves the mAP@0.5 by 8% compared to the baseline model and does not exhibit the missed detections observed with the baseline; for example, YOLOv5 failed to detect some cars and people in hazy conditions.