1. Introduction
With the rise of deep-learning technology, remote sensing image detection has achieved significant improvements in precision and efficiency. This technology is widely applied in various fields, including national security [1], urban planning [2], earthquake prevention [3], and disaster mitigation. Remote sensing images are characterized by multiscale targets, complex backgrounds, and dense target distributions. While they offer a more precise representation of surface information, they also contain more interference and irrelevant information, which increases the complexity of detection tasks and poses new challenges for remote sensing object detection [4]. Traditional two-stage detectors, such as the region-based convolutional neural network (R-CNN) [5] and Faster R-CNN [6], struggle to balance detection speed and precision, making it difficult to meet the real-time and accuracy demands of remote sensing image detection.
The You Only Look Once (YOLO) series, as a representative achievement in single-stage object detection, has become a research hotspot in computer vision owing to its end-to-end detection framework and efficient inference speed. YOLOv1 [7] directly predicts object bounding boxes and categories, with an inference speed far exceeding that of the contemporary Faster R-CNN; however, its small-object detection precision is relatively low. YOLOv2 [8] introduced the anchor-box [9] mechanism, improving the precision of bounding-box predictions and alleviating missed detections of small objects. The backbone network of YOLOv3 [10] was upgraded to Darknet-53 with residual connections, enhancing deep feature extraction. YOLOv4 [11] further improved detection precision through techniques such as Mosaic data augmentation and the CIoU [12] loss function. YOLOv5 [13] adopted CSPDarknet as the backbone and an optimized PANet feature-fusion structure in the neck, enhancing detection accuracy for small objects and intricate environments. YOLOv6 [14] was specifically optimized for GPU inference, rendering it appropriate for real-time detection applications. The network architecture of YOLOv7 [15] was refined, improving model adaptability.
YOLOv8 adopted an optimized network architecture and efficient convolutional units, significantly improving inference speed, detection precision, and scalability in multitask scenarios while maintaining model stability. However, the YOLOv8 architecture has several key limitations that affect its effectiveness on remote sensing imagery. First, the original fixed upsampling module lacks adaptive channel-information adjustment during feature reconstruction, which can degrade intricate spatial information and thereby harm the detection accuracy of small objects and detailed structures. Second, the cross-stage partial bottleneck with two convolutions (C2f) module in the backbone relies on static receptive-field convolution, which cannot effectively adapt to large variations in object scale and complex background interference, resulting in suboptimal multiscale object detection performance.
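The first limitation can be made concrete with a minimal NumPy sketch of fixed nearest-neighbor upsampling (the default interpolation in YOLOv8's neck). The helper name `nearest_upsample` is ours, purely for illustration, not the model's actual code: every pixel is duplicated identically, with no regard for channel content, so no spatial detail is adaptively reconstructed.

```python
import numpy as np

def nearest_upsample(x: np.ndarray, scale: int = 2) -> np.ndarray:
    """Fixed nearest-neighbor upsampling: each pixel is duplicated
    scale x scale times, identically for every channel. The operation
    is content-agnostic, so fine structure cannot be recovered."""
    return x.repeat(scale, axis=-2).repeat(scale, axis=-1)

# A 1-channel 2x2 feature map with a sharp 0/1 edge.
feat = np.array([[[0.0, 1.0],
                  [0.0, 1.0]]])
up = nearest_upsample(feat)  # up.shape == (1, 4, 4)
```

Because the same duplication kernel is applied to every channel, the output carries no new information; adaptive approaches (such as the ACDU module proposed below) instead condition the upsampling on the feature content.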
To address these challenges, this study seeks to enhance the precision and reliability of YOLOv8 in identifying small-scale objects and densely clustered objects within complex remote sensing environments, while preserving the overall stability and computational efficiency of the model. To achieve this goal, this study presents an improved remote sensing object detection framework derived from YOLOv8, termed omnidirectional and adaptive YOLOv8 (OA-YOLOv8). This model integrates the following novel contributions:
- (1) An omnidirectional perception refinement (OPR) network is introduced into the backbone. OPR deeply integrates receptive-field attention convolution (RFAConv) with the triplet attention mechanism, significantly enhancing the network's perception and feature-extraction capabilities for multiscale small objects and complex background scenes.
- (2) An adaptive channel dynamic upsampling (ACDU) module is designed and incorporated into the neck. This module integrates DySample upsampling, the Haar wavelet transform, and a self-supervised equivariant attention mechanism (SEAM) to dynamically optimize channel weights during upsampling and effectively preserve fine-grained spatial features, thereby improving the fidelity of feature-map reconstruction and the accuracy of subsequent detection.
- (3) The OPR and ACDU modules are incorporated into the YOLOv8 architecture: the OPR module replaces the first three C2f layers of the backbone to strengthen perception and feature extraction for multiscale objects and complex backgrounds, while the ACDU module replaces the original upsampling structure, enabling flexible allocation of channel weights during feature upsampling while ensuring complete transmission of detail features. Experimental results show that these enhancements substantially improve detection precision for small-scale objects, intricate backgrounds, and multiscale objects in remote sensing scenes, while also improving overall model stability.
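Contribution (2) relies on the Haar wavelet transform to preserve fine-grained spatial features. The sketch below implements only the standard single-level 2D Haar decomposition in NumPy, as a self-contained illustration of why the transform loses nothing (it is exactly invertible); ACDU's internal wiring around it is described in Section 3 and is not reproduced here.

```python
import numpy as np

def haar2d(x: np.ndarray):
    """Single-level 2D Haar transform of an (H, W) array (H, W even).
    Returns (LL, LH, HL, HH) sub-bands, each of shape (H/2, W/2)."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Inverse Haar transform: reconstructs the input exactly,
    showing that the decomposition discards no detail."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

The high-frequency sub-bands (LH, HL, HH) isolate edges and textures, which is precisely the detail information a content-agnostic upsampler tends to smear away.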
The remainder of this manuscript is organized as follows: Section 2 reviews the literature on object detection, providing the background and context for the proposed improvements. Section 3 describes the overall network architecture of OA-YOLOv8 and elaborates on the design of the OPR and ACDU modules and their integration with the YOLOv8 framework. Section 4 evaluates the efficacy of the proposed method for remote sensing object detection through extensive comparative experiments and ablation analyses. Section 5 summarizes the study and discusses directions for future research.
2. Related Work
Conventional methods for object detection in remote sensing imagery can be classified into three categories: template matching [16], feature classification [17], and target localization based on regional prior knowledge [18]. These traditional methods typically suffer from limited generalization, poor scene adaptability, and high computational costs. In contrast, adaptive image enhancement methods employ adaptive algorithms to enhance images and highlight target features, making them clearer. For example, the saliency-guided adaptive random diffusion strategy (SG-ARD) [19] combines saliency-aware guidance with adaptive diffusion to enhance reconstruction and applies a spectral-awareness consistency loss to improve spectral fidelity, ensuring that the generated content aligns with the real spectral distribution; this enables the generation of high-fidelity, visually coherent remote sensing images. Zhou et al. [20] proposed the feature pyramid network with fusion coefficients (FC-FPN) for adaptive feature-map fusion, assigning a learnable fusion coefficient to each feature map so that the module selects the optimal features to fuse in multiscale object detection; it markedly improved the accuracy of maritime vessel detection. The MSMHC [21] algorithm combines a multiscale model (MSM) with histogram features (HC), effectively removing haze from remote sensing images and improving image quality and usability while preserving rich detail.
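The fusion-coefficient idea behind FC-FPN can be sketched in a few lines of NumPy. This is an illustrative toy under our own naming (`fuse`, `coeffs` are hypothetical, not FC-FPN's API): softmax-normalized weights decide how much each same-resolution feature map contributes, and in the real network those weights are learned jointly with the detector rather than fixed.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D coefficient vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse(feature_maps, coeffs):
    """Weighted fusion of same-shape feature maps with per-map
    coefficients (fixed here; learnable in FC-FPN)."""
    w = softmax(np.asarray(coeffs, dtype=float))
    return sum(wi * f for wi, f in zip(w, feature_maps))

f_deep = np.ones((4, 4))          # stand-in for an upsampled deep map
f_shallow = np.full((4, 4), 3.0)  # stand-in for a detail-rich shallow map
fused = fuse([f_deep, f_shallow], coeffs=[0.0, 0.0])  # equal weights
```

Raising one coefficient shifts the fused output toward that scale, which is how a learnable coefficient lets the network emphasize whichever resolution best matches the objects being detected.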
Furthermore, current mainstream detection algorithms typically process only one type of data, such as visible light or infrared, making it difficult to exploit the rich multiband information in remote sensing imagery and thereby limiting further performance improvement [22]. Using multimodal data for object recognition enables the fusion of spectral features from various remote sensing images, demonstrating significant potential in both academic research and practical applications. Sun et al. [9] incorporated fusion at the feature and decision levels, enhancing multimodal feature extraction; however, the increased model complexity leads to a high computational load and slower detection. Gao [23] introduced two lightweight fusion modules, PoolFuser and CSSA, into the Faster R-CNN backbone, achieving robust detection under low-light and hazy conditions and a 7.4% mAP improvement over single-modal methods on the FLIR dataset. Fusion-Mamba [24] employs a cross-modal feature-fusion module built on the Mamba architecture to construct a hidden state space with linear computational complexity, further enhanced by a gating mechanism that enables deeper and more expressive feature integration, yielding significant improvements in detection precision and speed on three datasets.
Over the past few years, numerous studies have significantly improved recognition precision for remote sensing images based on YOLO networks. Zhang et al. [25] improved YOLOv8 by introducing a multi-frequency attention downsampling (MFAD) module alongside a dynamic multiscale adaptive attention network, leveraging multiscale information more effectively and refining image-detail processing to enhance the network's multiscale object detection. Fan et al. [26] introduced the SaElayer module and an efficient spatial pyramid pooling fast (SPPF) structure and designed a Focaler minimum-point-distance intersection-over-union (MPDIoU) strategy to improve the YOLOv8 network; this approach shows clear advantages for small objects and complex scenes, markedly enhancing detection performance for unmanned aerial vehicle (UAV) targets. Wang et al. [27] improved YOLOv7-tiny by integrating a coordinate attention mechanism, a new loss function, and an improved C5 module, achieving an equilibrium between real-time efficiency and detection accuracy in remote sensing image analysis. Sharma [28] proposed YOLOrs, specifically designed for real-time object detection in multimodal remote sensing imagery; its smaller receptive field suits small objects, and it can predict target orientation, substantially enhancing detection performance for densely distributed small objects. Xiao et al. [29] built upon the latest YOLOv11 model, adding context anchor attention (CAA) and adaptive mixing strategy (ACmix) modules; they addressed class imbalance by adaptively adjusting contrast and mixing samples, improving detection precision for remote sensing of crops in complex scenes. Wan et al. [30] proposed a multihead strategy and a mixed attention block and integrated them into the YOLOv5 network, compensating for the lack of a hybrid attention mechanism, improving network resolution, and balancing detection effectiveness and speed.
Recent studies have demonstrated the effectiveness of YOLOv10 combined with advanced transformer backbones in construction-related tasks, such as UAV-based rebar counting [31] and safety helmet monitoring [32]. These studies show that YOLOv10 can achieve high detection accuracy when paired with carefully selected backbones and augmentation strategies, particularly in settings with sufficient computational resources. Nevertheless, YOLOv8 was adopted in this study for several reasons. First, YOLOv8 provides a more mature and stable framework with well-established training pipelines and broad community adoption, ensuring reproducibility and fair evaluation of the proposed modules. Second, YOLOv8 offers an advantageous balance between accuracy and computational efficiency, which is essential for real-time and resource-constrained deployment scenarios; in contrast, YOLOv10 often relies on transformer-based backbones to achieve peak performance, which may introduce higher computational overheads and limit practical applicability. Third, this study aims to assess the efficacy of the proposed mechanisms within a widely used one-stage detector, allowing performance gains to be attributed to the proposed method rather than to architectural changes in newer YOLO variants. Therefore, while YOLOv10 performs strongly in recent studies, YOLOv8 was selected as a robust, efficient, and widely accepted baseline for validating the proposed approach.
In summary, despite advances in remote sensing technology, improved image detection methods still face multiple challenges in practical applications owing to the distinctive resolutions and characteristics of different remote sensing scenes. First, remote sensing images cover wide areas with large variations in target scale, from ultra-large facilities to barely distinguishable small objects, making effective multiscale feature fusion challenging. Second, the backgrounds of remote sensing images are often highly complex with similar textures: different categories of ground objects may share highly similar visual features, while objects within the same category may vary considerably in shape, orientation, and lighting, increasing the difficulty of feature extraction and category discrimination. Thus, achieving multiscale, high-precision, and fast remote sensing object detection in complex backgrounds remains a challenge, and improved models that balance detection precision and efficiency while adapting to the unique characteristics of remote sensing data are required.
5. Conclusions
This study proposed an improved model, OA-YOLOv8, tailored to the difficulties of object detection in remote sensing images, including small targets, densely distributed targets, and targets embedded in complex, low-resolution backgrounds. The model enhances multiscale feature extraction and perception in complex backgrounds through the designed OPR module, which integrates RFAConv with a triplet attention mechanism and replaces the C2f layers in the backbone. Meanwhile, the ACDU module, combining DySample upsampling, the wavelet transform, and SEAM, was developed to improve detail restoration and spatial feature focusing, significantly enhancing detection precision. Extensive experiments on the SIMD dataset, together with comparative evaluations against existing YOLO-based models, demonstrate that OA-YOLOv8 achieves consistently superior performance and verify that the introduced improvements enhance detection reliability and adaptability in remote sensing applications. These findings illustrate the ability of the proposed model to process targets of varying scales and high spatial density in challenging remote sensing environments.
Nevertheless, the proposed OA-YOLOv8 exhibits a modest increase in computational complexity relative to the baseline model. Although the performance gains justify this trade-off in many application scenarios, further optimization remains necessary. Subsequent research will concentrate on developing more lightweight architectural improvements and exploring model-acceleration techniques, such as quantization, pruning, and knowledge distillation, to reduce model complexity and inference latency while maintaining detection accuracy, thereby better supporting real-time remote sensing applications. In addition, we plan to expand the dataset by incorporating more complex and extreme scenes through additional data collection and data synthesis, with the aim of further enhancing robustness and generalization. Moreover, the proposed framework will be extended toward multimodal remote sensing data fusion by integrating complementary information from optical, synthetic aperture radar (SAR), and hyperspectral imagery to improve detection precision, robustness, and generalizability in more challenging remote sensing environments.