1. Introduction
With the rise of deep-learning technology, remote sensing image detection has achieved significant improvements in precision and efficiency. This technology is widely applied in various fields, including national security [1], urban planning [2], earthquake prevention [3], and disaster mitigation. Remote sensing images are characterized by multiscale targets, complex backgrounds, and dense target distributions. While they offer a more precise representation of surface information, they also contain more interference and irrelevant information, which increases the complexity of detection tasks and poses new challenges for remote sensing object detection [4]. Traditional two-stage detectors, such as the region-based convolutional neural network (R-CNN) [5] and Faster R-CNN [6], struggle to balance detection speed and precision, making it difficult to meet the real-time and accuracy demands of remote sensing image detection.
The You Only Look Once (YOLO) series, as a representative achievement in single-stage object detection, has become a research hotspot in computer vision owing to its end-to-end detection framework and efficient inference speed. YOLOv1 [7] directly predicts object bounding boxes and categories, with an inference speed far exceeding that of the contemporary Faster R-CNN; however, its small-object detection precision is relatively low. YOLOv2 [8] introduced the anchor-box [9] mechanism, improving the precision of bounding-box predictions and alleviating missed detections of small objects. The backbone network of YOLOv3 [10] was upgraded to Darknet-53 with residual connections, enhancing deep feature extraction. YOLOv4 [11] further improved detection precision through techniques such as Mosaic data augmentation and the CIoU [12] loss function. YOLOv5 [13] adopted CSPDarknet as the backbone and an optimized PANet feature-fusion structure in the neck, enhancing detection accuracy for small objects and intricate environments. YOLOv6 [14] was specifically optimized for GPU inference, rendering it appropriate for real-time detection applications. The network architecture of YOLOv7 [15] was refined, improving model adaptability.
YOLOv8 adopted an optimized network architecture and efficient convolutional units, significantly improving inference speed, detection precision, and scalability in multitask scenarios while maintaining model stability. However, the YOLOv8 architecture has several key limitations that affect its effectiveness on remote sensing imagery. First, the original fixed upsampling module lacks adaptive channel-information adjustment during feature reconstruction, which can degrade intricate spatial information and thereby harm the detection accuracy of small objects and detailed structures. Second, the cross-stage partial bottleneck with two convolutions (C2f) module in the backbone relies on static receptive-field convolution, which cannot effectively adapt to large variations in object scale and complex background interference, resulting in suboptimal multiscale object detection performance.
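The first limitation can be made concrete with a minimal NumPy sketch of fixed nearest-neighbor upsampling (the default interpolation in YOLOv8's neck). The helper name `nearest_upsample` is ours, purely for illustration, not the model's actual code: every pixel is duplicated identically, with no regard for channel content, so no spatial detail is adaptively reconstructed.

```python
import numpy as np

def nearest_upsample(x: np.ndarray, scale: int = 2) -> np.ndarray:
    """Fixed nearest-neighbor upsampling: each pixel is duplicated
    scale x scale times, identically for every channel. The operation
    is content-agnostic, so fine structure cannot be recovered."""
    return x.repeat(scale, axis=-2).repeat(scale, axis=-1)

# A 1-channel 2x2 feature map with a sharp 0/1 edge.
feat = np.array([[[0.0, 1.0],
                  [0.0, 1.0]]])
up = nearest_upsample(feat)  # up.shape == (1, 4, 4)
```

Because the same duplication kernel is applied to every channel, the output carries no new information; adaptive approaches (such as the ACDU module proposed below) instead condition the upsampling on the feature content.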
To address these challenges, this study seeks to enhance the precision and reliability of YOLOv8 in identifying small-scale objects and densely clustered objects within complex remote sensing environments, while preserving the overall stability and computational efficiency of the model. To achieve this goal, this study presents an improved remote sensing object detection framework derived from YOLOv8, termed omnidirectional and adaptive YOLOv8 (OA-YOLOv8). This model integrates the following novel contributions:
- (1) An omnidirectional perception refinement (OPR) network is introduced into the backbone. OPR deeply integrates receptive-field attention convolution (RFAConv) with the triplet attention mechanism, significantly enhancing the network's perception and feature-extraction capabilities for multiscale small objects and complex background scenes.
- (2) An adaptive channel dynamic upsampling (ACDU) module is designed and incorporated into the neck. This module integrates DySample upsampling, the Haar wavelet transform, and a self-supervised equivariant attention mechanism (SEAM) to dynamically optimize channel weights during upsampling and effectively preserve fine-grained spatial features, thereby improving the fidelity of feature-map reconstruction and the accuracy of subsequent detection.
- (3) The OPR and ACDU modules are incorporated into the YOLOv8 architecture: the OPR module replaces the first three C2f layers of the backbone to strengthen perception and feature extraction for multiscale objects and complex backgrounds, while the ACDU module replaces the original upsampling structure, enabling flexible allocation of channel weights during feature upsampling while ensuring complete transmission of detail features. Experimental results show that these enhancements substantially improve detection precision for small-scale objects, intricate backgrounds, and multiscale objects in remote sensing scenes, while also improving overall model stability.
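Contribution (2) relies on the Haar wavelet transform to preserve fine-grained spatial features. The sketch below implements only the standard single-level 2D Haar decomposition in NumPy, as a self-contained illustration of why the transform loses nothing (it is exactly invertible); ACDU's internal wiring around it is described in Section 3 and is not reproduced here.

```python
import numpy as np

def haar2d(x: np.ndarray):
    """Single-level 2D Haar transform of an (H, W) array (H, W even).
    Returns (LL, LH, HL, HH) sub-bands, each of shape (H/2, W/2)."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Inverse Haar transform: reconstructs the input exactly,
    showing that the decomposition discards no detail."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

The high-frequency sub-bands (LH, HL, HH) isolate edges and textures, which is precisely the detail information a content-agnostic upsampler tends to smear away.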
The remainder of this manuscript is organized as follows: Section 2 reviews the literature on object detection, providing the background and context for the proposed improvements. Section 3 describes the overall network architecture of OA-YOLOv8 and elaborates on the design of the OPR and ACDU modules and their integration with the YOLOv8 framework. Section 4 evaluates the efficacy of the proposed method for remote sensing object detection through extensive comparative experiments and ablation analyses. Section 5 summarizes the study and discusses directions for future research.
2. Related Work
Conventional methods for object detection in remote sensing imagery can be classified into three categories: template matching [16], feature classification [17], and target localization based on regional prior knowledge [18]. These traditional methods typically suffer from limited generalization, poor scene adaptability, and high computational costs. In contrast, adaptive image enhancement methods employ adaptive algorithms to enhance images and highlight target features, making them clearer. For example, the saliency-guided adaptive random diffusion strategy (SG-ARD) [19] combines saliency-aware guidance with adaptive diffusion to enhance reconstruction and applies a spectral-awareness consistency loss to improve spectral fidelity, ensuring that the generated content aligns with the real spectral distribution; this enables the generation of high-fidelity, visually coherent remote sensing images. Zhou et al. [20] proposed the feature pyramid network with fusion coefficients (FC-FPN) for adaptive feature-map fusion, assigning a learnable fusion coefficient to each feature map so that the module selects the optimal features to fuse in multiscale object detection; it markedly improved the accuracy of maritime vessel detection. The MSMHC [21] algorithm combines a multiscale model (MSM) with histogram features (HC), effectively removing haze from remote sensing images and improving image quality and usability while preserving rich detail.
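The fusion-coefficient idea behind FC-FPN can be sketched in a few lines of NumPy. This is an illustrative toy under our own naming (`fuse`, `coeffs` are hypothetical, not FC-FPN's API): softmax-normalized weights decide how much each same-resolution feature map contributes, and in the real network those weights are learned jointly with the detector rather than fixed.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D coefficient vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse(feature_maps, coeffs):
    """Weighted fusion of same-shape feature maps with per-map
    coefficients (fixed here; learnable in FC-FPN)."""
    w = softmax(np.asarray(coeffs, dtype=float))
    return sum(wi * f for wi, f in zip(w, feature_maps))

f_deep = np.ones((4, 4))          # stand-in for an upsampled deep map
f_shallow = np.full((4, 4), 3.0)  # stand-in for a detail-rich shallow map
fused = fuse([f_deep, f_shallow], coeffs=[0.0, 0.0])  # equal weights
```

Raising one coefficient shifts the fused output toward that scale, which is how a learnable coefficient lets the network emphasize whichever resolution best matches the objects being detected.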
Furthermore, current mainstream detection algorithms typically process only one type of data, such as visible light or infrared, making it difficult to exploit the rich multiband information in remote sensing imagery and thereby limiting further performance improvement [22]. Using multimodal data for object recognition enables the fusion of spectral features from various remote sensing images, demonstrating significant potential in both academic research and practical applications. Sun et al. [9] incorporated fusion at the feature and decision levels, enhancing multimodal feature extraction; however, the increased model complexity leads to a high computational load and slower detection. Gao [23] introduced two lightweight fusion modules, PoolFuser and CSSA, into the Faster R-CNN backbone, achieving robust detection under low-light and hazy conditions and a 7.4% mAP improvement over single-modal methods on the FLIR dataset. Fusion-Mamba [24] employs a cross-modal feature-fusion module built on the Mamba architecture to construct a hidden state space with linear computational complexity, further enhanced by a gating mechanism that enables deeper and more expressive feature integration, yielding significant improvements in detection precision and speed on three datasets.
Over the past few years, numerous studies have significantly improved recognition precision for remote sensing images based on YOLO networks. Zhang et al. [25] improved YOLOv8 by introducing a multi-frequency attention downsampling (MFAD) module alongside a dynamic multiscale adaptive attention network, leveraging multiscale information more effectively and refining image-detail processing to enhance the network's multiscale object detection. Fan et al. [26] introduced the SaElayer module and an efficient spatial pyramid pooling fast (SPPF) structure and designed a Focaler minimum-point-distance intersection-over-union (MPDIoU) strategy to improve the YOLOv8 network; this approach shows clear advantages for small objects and complex scenes, markedly enhancing detection performance for unmanned aerial vehicle (UAV) targets. Wang et al. [27] improved YOLOv7-tiny by integrating a coordinate attention mechanism, a new loss function, and an improved C5 module, achieving an equilibrium between real-time efficiency and detection accuracy in remote sensing image analysis. Sharma [28] proposed YOLOrs, specifically designed for real-time object detection in multimodal remote sensing imagery; its smaller receptive field suits small objects, and it can predict target orientation, substantially enhancing detection performance for densely distributed small objects. Xiao et al. [29] built upon the latest YOLOv11 model, adding context anchor attention (CAA) and adaptive mixing strategy (ACmix) modules; they addressed class imbalance by adaptively adjusting contrast and mixing samples, improving detection precision for remote sensing of crops in complex scenes. Wan et al. [30] proposed a multihead strategy and a mixed attention block and integrated them into the YOLOv5 network, compensating for the lack of a hybrid attention mechanism, improving network resolution, and balancing detection effectiveness and speed.
Recent studies have demonstrated the effectiveness of YOLOv10 combined with advanced transformer backbones in construction-related tasks, such as UAV-based rebar counting [31] and safety helmet monitoring [32]. These studies show that YOLOv10 can achieve high detection accuracy when paired with carefully selected backbones and augmentation strategies, particularly in settings with sufficient computational resources. Nevertheless, YOLOv8 was adopted in this study for several reasons. First, YOLOv8 provides a more mature and stable framework with well-established training pipelines and broad community adoption, ensuring reproducibility and fair evaluation of the proposed modules. Second, YOLOv8 offers an advantageous balance between accuracy and computational efficiency, which is essential for real-time and resource-constrained deployment scenarios; in contrast, YOLOv10 often relies on transformer-based backbones to achieve peak performance, which may introduce higher computational overheads and limit practical applicability. Third, this study aims to assess the efficacy of the proposed mechanisms within a widely used one-stage detector, allowing performance gains to be attributed to the proposed method rather than to architectural changes in newer YOLO variants. Therefore, while YOLOv10 performs strongly in recent studies, YOLOv8 was selected as a robust, efficient, and widely accepted baseline for validating the proposed approach.
In summary, despite advances in remote sensing technology, improved image detection methods still face multiple challenges in practical applications owing to the distinctive resolutions and characteristics of different remote sensing scenes. First, remote sensing images cover wide areas with large variations in target scale, from ultra-large facilities to barely distinguishable small objects, making effective multiscale feature fusion challenging. Second, the backgrounds of remote sensing images are often highly complex with similar textures: different categories of ground objects may share highly similar visual features, while objects within the same category may vary considerably in shape, orientation, and lighting, increasing the difficulty of feature extraction and category discrimination. Thus, achieving multiscale, high-precision, and fast remote sensing object detection in complex backgrounds remains a challenge, and improved models that balance detection precision and efficiency while adapting to the unique characteristics of remote sensing data are required.
5. Conclusions
This study proposed an improved model, OA-YOLOv8, tailored to the difficulties of object detection in remote sensing images, including small targets, densely distributed targets, and targets embedded in complex, low-resolution backgrounds. The model enhances multiscale feature extraction and perception in complex backgrounds through the designed OPR module, which integrates RFAConv with a triplet attention mechanism and replaces the C2f layers in the backbone. Meanwhile, the ACDU module, combining DySample upsampling, the wavelet transform, and SEAM, was developed to improve detail restoration and spatial feature focusing, significantly enhancing detection precision. Extensive experiments on the SIMD dataset, together with comparative evaluations against existing YOLO-based models, demonstrate that OA-YOLOv8 achieves consistently superior performance and verify that the introduced improvements enhance detection reliability and adaptability in remote sensing applications. These findings illustrate the ability of the proposed model to process targets of varying scales and high spatial density in challenging remote sensing environments.
Nevertheless, the proposed OA-YOLOv8 exhibits a modest increase in computational complexity relative to the baseline model. Although the performance gains justify this trade-off in many application scenarios, further optimization remains necessary. Subsequent research will concentrate on developing more lightweight architectural improvements and exploring model-acceleration techniques, such as quantization, pruning, and knowledge distillation, to reduce model complexity and inference latency while maintaining detection accuracy, thereby better supporting real-time remote sensing applications. In addition, we plan to expand the dataset by incorporating more complex and extreme scenes through additional data collection and data synthesis, with the aim of further enhancing robustness and generalization. Moreover, the proposed framework will be extended toward multimodal remote sensing data fusion by integrating complementary information from optical, synthetic aperture radar (SAR), and hyperspectral imagery to improve detection precision, robustness, and generalizability in more challenging remote sensing environments.