Article

Optimization of Object Detection Network Architecture for High-Resolution Remote Sensing

1 College of Engineering, Inner Mongolia Minzu University, Tongliao 028000, China
2 School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(9), 537; https://doi.org/10.3390/a18090537
Submission received: 16 July 2025 / Revised: 16 August 2025 / Accepted: 20 August 2025 / Published: 23 August 2025
(This article belongs to the Section Combinatorial Optimization, Graph, and Network Algorithms)

Abstract

(1) Objective: This study addresses key problems in remote-sensing image target detection, such as insufficient detection accuracy for small targets and interference from complex backgrounds. (2) Methods: By optimizing the YOLOv10x model architecture, the YOLO-KRM model is proposed. First, a new backbone network structure is constructed: the C2f module in the third layer of the backbone is replaced by a Kolmogorov–Arnold network, which improves the model's ability to approximate complex nonlinear functions in high-dimensional space. The C2f module in the fifth layer of the backbone is then replaced by receptive field attention convolution, which strengthens the model's ability to capture global contextual information. In addition, the C2f and C2fCIB structures used in the upsampling path of the neck network are replaced by the mixed local channel attention module, which significantly improves the feature representation ability of the model. (3) Results: To validate the effectiveness of the YOLO-KRM model, detailed experiments were conducted on two remote-sensing datasets, RSOD and NWPU VHR-10. Compared with the original YOLOv10x model, the mAP@50 of YOLO-KRM on the two datasets increases by 1.77% and 2.75%, respectively, and mAP@50:95 increases by 3.82% and 5.23%, respectively. (4) Conclusions: The improved model successfully enhances the accuracy of target detection in remote-sensing images, and the experimental results verify its effectiveness in dealing with complex backgrounds and small targets, especially in high-resolution remote-sensing imagery.

1. Introduction

With the rapid development of remote-sensing technology, the application of remote-sensing image target detection in a variety of fields, such as land resources monitoring, environmental monitoring, urban planning, and military reconnaissance, has become increasingly widespread and in-depth [1,2]. High-resolution remote-sensing images provide rich ground-object information, making the accuracy and efficiency of target detection a key issue. However, the task of remote-sensing image target detection faces many challenges, such as complex backgrounds and a wide variety of targets with significant size differences. In particular, the detection of small targets and targets in complex backgrounds has always been a key and difficult point in this research field [3,4].
Object detection models are mainly divided into two categories: one-stage and two-stage [5]. The two-stage method first generates candidate regions (region proposals) and then performs classification and regression. Although its detection accuracy is high, its computational complexity is also high, making it difficult to meet real-time requirements [6]. The one-stage method (such as the YOLO series and SSD) directly predicts the target category and location on the image without region proposals, offering higher detection speed and better suitability for real-time applications [7]. In recent years, deep learning target detection models based on convolutional neural networks (CNNs) [8] have made significant progress. Among them, the YOLO series has attracted widespread attention due to its fast detection speed and high detection accuracy. However, when faced with challenges unique to remote-sensing images, such as small-target detection and complex background interference, the existing YOLO models still need further optimization.
To address these challenges, researchers have adopted a series of methods to improve the performance of the YOLO model. For the problem of small-target detection in remote-sensing images, reference [9] proposed the SOD-YOLOv10 model, which enhances global perception through a Transformer backbone, combines attention-based feature fusion with a new IoU loss function, and achieves significant performance improvements on four mainstream datasets. Reference [10] proposed a DINO model for remote-sensing image detection; although it can extract fine-grained features and search a wide space, its multi-layer encoder–decoder structure leads to a significant increase in space and computational complexity, which hinders fast inference. In view of the imaging characteristics of remote-sensing images, reference [11] designed a context-aware multi-receptive-field fusion network; through a receptive field expansion module, a high-level feature aggregation module, and a feature-refinement region proposal network, it effectively improves the detection accuracy for remote-sensing targets of different types and sizes. Although these methods achieve certain performance improvements, the problems of insufficient small-target detection accuracy and complex background interference in remote-sensing image target detection have not been completely solved.
To overcome these challenges, early YOLO versions, such as YOLOv5, improved small-target detection through the Focus structure and CSPNet, but their performance in complex background suppression was limited [12]. The C2f module introduced in YOLOv8 enhanced feature reuse but brought additional computational overhead [13]. YOLOv9 significantly improved feature extraction through programmable gradient information (PGI) and the generalized efficient layer aggregation network (GELAN) [14], but its architectural complexity limited its deployment efficiency on embedded devices. In contrast, YOLOv10 [15] has significant advantages in lightweight, NMS-free design and modular architecture, but a trade-off between performance and deployment efficiency remains. Results reported in [16] on the SIDD dataset show that the mAP@0.5 of YOLOv10n is 83%, lower than that of YOLOv9-C (87%). In addition, although the best-performing YOLOv10x has higher accuracy (mAP@0.5 of 87.6%), its model size (64.1 MB) and FPS (88) are both inferior to YOLOv9-C (30.1 MB, 102 FPS), which limits its deployment and real-time performance on resource-constrained devices. Reference [17] points out that, on the NWPU VHR-10 dataset, YOLOv10's mAP50:95 (49.4%) is 7.7% and 6.3% lower than those of YOLOv8n (56.1%) and YOLOv9t, and its mAP50 is also the lowest among the compared algorithms. On the VEDAI dataset, YOLOv10's mAP50:95 (26.7%) is lower than YOLOv6n (28.3%), YOLOv8n (30.60%), and YOLOv9t (31.6%), and its mAP50 (43.7%) is also lower, highlighting its limited multi-scale feature extraction capability in complex scenes. YOLOv10 performs better on the DIOR dataset. Overall, YOLOv10 still has obvious shortcomings in small-target detection accuracy and robustness in complex scenes.
Therefore, this paper selects the best-performing YOLOv10x model from the YOLOv10 series as its baseline. Despite its outstanding accuracy, YOLOv10x still faces challenges in remote-sensing image detection, such as insufficient accuracy for small targets and complex background interference. This paper addresses these issues by performing targeted optimizations on the YOLOv10x model and proposing the YOLO-KRM model. The main contributions of this paper include the following:
  • A new backbone network structure is constructed, and the third layer C2f of the original backbone network is optimized to the Kolmogorov–Arnold Network (KAN); and the fifth layer C2f is optimized to the Receptive-Field Attention (RFA) convolution, which significantly improves the feature extraction accuracy and overall feature extraction efficiency of small objects.
  • The feature pyramid part is optimized, and the C2f and C2fCIB structures in the upsampling process are optimized to the Mixed Local Channel Attention (MLCA) module. Through its unique channel and spatial attention mechanism, it can more effectively process multi-scale features and significantly enhance the model’s robustness to complex backgrounds.
  • The effectiveness of the optimized YOLOv10x model in remote-sensing image target detection tasks is verified through experiments on two public remote-sensing datasets.

2. Related Work

Accurately and efficiently recognizing and localizing feature objects is at the heart of numerous crucial applications in remote-sensing image object recognition [18,19]. The challenges posed by complicated backdrops, object diversity, and scale variations, particularly for small objects and complex settings, place high expectations on the feature extraction capabilities of detection models [20,21,22]. Several researchers have proposed innovative approaches to tackle these issues. For example, the multi-scale context-aware network proposed in [23] addresses these challenges by improving feature extraction at different scales, reducing information loss for small objects through a multi-scale guidance module, and suppressing complex background interference with a context-aware downsampling module. Similarly, a multi-scale feature context aggregation network [24] enhances the detection of rotating objects in complex backgrounds. This is achieved by using a feature context aggregation module to boost small object feature extraction, a feature context information enhancement module to improve multi-dimensional sensing, and multi-scale feature fusion to prevent feature loss. These works highlight a common strategy: using multi-scale fusion and context-aware mechanisms to improve model performance in challenging remote-sensing environments. Moving beyond shallow convolution, the development of deep learning has led to more expressive network structures. For instance, [25] introduces a multi-stage deep enhancement network that improves the detection of tiny objects by using a center region-based label assignment strategy and a gated context aggregation module. This contrasts with earlier methods by specifically addressing the issue of tiny objects and enhancing feature representation through selective aggregation. Other research, like the work in [26], focuses on optimizing network architecture to achieve faster and more efficient fine-grained object detection. These studies demonstrate a clear trend toward deeper, more specialized network designs for enhanced performance. In light of these advancements, this study leverages KAN (Kolmogorov–Arnold Network), a novel architecture that excels at handling complex nonlinear relationships in high-dimensional data through its powerful function approximation capabilities. Our goal is to enhance the model’s ability to express the intricate features of remote-sensing images, building upon the principles of deeper and more specialized network structures seen in the literature.
Objects in remote-sensing image identification typically exhibit a wide range of intricate morphologies and orientations [27,28]. To effectively extract these features with geometric variations, researchers have proposed various receptive field adjustment and attention mechanisms. For example, a lightweight remote-sensing object detection method [29] addresses multi-scale object detection by dynamically adjusting the receptive field using a multi-sensory field fusion module and a combination of dilated convolutions. This differs from other approaches by focusing on a lightweight solution, a critical consideration for resource-constrained applications. Similarly, the method in [30] uses a multi-sensory field attention mechanism to achieve efficient detection by combining multi-scale feature enhancement and a joint attention module. Another notable approach is the receptive field and direction-induced attention network [31], which strengthens small object feature expression by designing a multi-directional attention mechanism and enhancing feature diversity through a multi-scale receptive field convolutional layer. These works collectively demonstrate the importance of dynamically adjusting the receptive field and applying attention to specific areas to improve detection performance. This study builds on these ideas by proposing Receptive Field Attention Convolutional Operations (RFACO). Our approach aims to enhance the model’s local contextual awareness and improve its robustness to changes in object geometry by dynamically assigning attention weights to the receptive fields of each convolutional kernel, thereby integrating the benefits of receptive field and attention mechanisms.
The widespread application of channel-attention and spatial-attention processes has significantly improved the performance of convolutional neural networks, especially in object identification tasks [32,33]. Effectively combining channel and spatial information is crucial for distinguishing objects from complex backgrounds in remote-sensing scenes. Several studies highlight this synergy. For instance, [34] introduces a new YOLO-based method for detecting wheat blast disease that uses a parallel channel–space attention module to strengthen disease feature expression, effectively handling challenges like overexposure and complex backgrounds. In a different domain, [35] proposes a spatial-channel attention transformer for remote-sensing image text retrieval, which uses a dual-domain attention mechanism to model both spatial and channel dependencies simultaneously, thereby greatly enhancing fine-grained retrieval performance. Meanwhile, the work in [36] uses a multi-scale, multi-channel convolutional neural network with a channel–space attention mechanism to perform deep extraction of spatial–spectral features from hyperspectral images. These examples demonstrate that different combinations of attention mechanisms can be tailored to specific challenges, from disease detection to text retrieval and hyperspectral analysis. Building on this, the lightweight multimodal remote-sensing object detection framework in [37] achieves complementary fusion of visible and infrared data through an innovative channel–space exchange module, which enhances the detection of small objects. This emphasizes the value of fusing information from multiple modalities. All these works show that considering both local and global channel and spatial information helps a model better understand scene content. Our study adopts the MLCA (Mixed Local Channel Attention) module, which fuses the advantages of local channel attention, global channel attention, and spatial attention. This aims to capture information in the feature map more comprehensively and enhance the model’s adaptability to multi-scale objects and complex backgrounds in remote-sensing images.

3. YOLO-KRM Model

The overall architecture of the YOLO-KRM model is shown in Figure 1. The model is based on the YOLOv10x [15] algorithm and incorporates several innovative modifications to improve its performance in remote-sensing image detection tasks.
First, the C2f modules in the third and fifth layers of the original YOLOv10x backbone network are replaced by the KAN convolution module and the RFACO module, respectively. The KAN convolution module, based on the Kolmogorov–Arnold representation theorem, efficiently captures complex features in remote-sensing images. It not only adaptively adjusts the network structure and parameters but also enhances object-related features based on key-point information in the image, which significantly raises the model's object recognition accuracy. The RFACO module, in turn, strengthens the model's ability to capture global contextual information by dynamically generating attention weights for each receptive field; this dynamic attention mechanism deepens the model's understanding of image context and thereby improves object identification accuracy and robustness.
In addition, the YOLOv10x neck network is optimized: the MLCA module replaces the original C2f and C2fCIB modules in the upsampling path. By integrating local channel attention, global channel attention, and spatial attention, the MLCA module greatly improves feature representation. This multi-scale attention mechanism allows the model to capture detailed image information more accurately, so it performs better in complicated scenarios.
The definitions of C2f, C2fCIB, SCDown, SPPF, PSA, and MHSA in Figure 1 are as follows:
C2f: After splitting the input features along the channel, a portion is processed by several lightweight bottleneck layers before being concatenated with bypass features and fused via 1 × 1 convolutions to enhance gradient flow and feature reuse.
C2fCIB: Within the C2f framework, a depthwise separable inverted residual structure replaces the traditional bottleneck layer, further reducing the number of parameters and computation.
Spatial-Channel Decoupled Downsampling (SCDown): 1 × 1 convolutions are used to adjust the number of channels, followed by 3 × 3 depthwise convolutions for spatial downsampling. This decouples spatial resolution reduction from channel transformation, reducing computation while preserving detail.
Spatial Pyramid Pooling-Fast (SPPF): A lightweight multi-scale pooling module that serially applies three max-pooling layers and concatenates and fuses the features from the resulting receptive fields. This preserves the multi-scale information of Spatial Pyramid Pooling (SPP) while reducing the number of parameters and computation.
Partial Self-Attention (PSA): After grouping channels, it performs multi-head self-attention on only a subset of them, capturing global context with linear complexity. The attention is then fused with the unprocessed portion, achieving a balance between accuracy and efficiency.
Multi-Head Self-Attention (MHSA): The input features are split into multiple heads, and attention weights are calculated for each head before concatenating the outputs.
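To make two of these building blocks concrete, the following is a minimal PyTorch sketch of SCDown and SPPF as described above. It is an illustrative reconstruction rather than the exact YOLOv10 source: the class layout follows the textual definitions, while details such as the activation choice and the `c1`/`c2` channel arguments are assumptions.

```python
import torch
import torch.nn as nn

class SCDown(nn.Module):
    """Spatial-Channel Decoupled Downsampling (illustrative sketch).
    A 1x1 conv adjusts channels, then a 3x3 depthwise conv with stride 2
    reduces spatial resolution, decoupling the two transformations."""
    def __init__(self, c1, c2):
        super().__init__()
        self.pw = nn.Sequential(nn.Conv2d(c1, c2, 1, bias=False),
                                nn.BatchNorm2d(c2), nn.SiLU())
        self.dw = nn.Sequential(nn.Conv2d(c2, c2, 3, stride=2, padding=1,
                                          groups=c2, bias=False),
                                nn.BatchNorm2d(c2))

    def forward(self, x):
        return self.dw(self.pw(x))

class SPPF(nn.Module):
    """Spatial Pyramid Pooling-Fast: three serial 5x5 max-poolings whose
    outputs are concatenated with the input and fused by a 1x1 conv."""
    def __init__(self, c1, c2, k=5):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = nn.Conv2d(c1, c_, 1)
        self.cv2 = nn.Conv2d(c_ * 4, c2, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```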

3.1. Backbone Improvement

The intricacy of high-resolution images raises the bar for object recognition models in the field of remote-sensing image object detection. This research designs a new backbone feature-extraction network to address the problems of multi-scale object identification, complicated background interference, and low detection accuracy of small objects. This network significantly improves the model’s ability to express complex features and improves detection accuracy by using KAN and RFACO modules. This new backbone network not only enhances the model’s adaptability to small objects and complex backgrounds, but also improves the overall detection efficiency, offering a precise and effective method for detecting objects in remote-sensing images.

3.1.1. Kolmogorov–Arnold Network

To enhance the model's ability to represent complex features in remote-sensing imagery and improve detection accuracy, this study introduces a KAN module into the third C2f layer of the YOLOv10x backbone network, replacing the existing C2f architecture. This improvement builds on the Kolmogorov–Arnold representation theorem and its efficient approximation of multivariate continuous functions, and exploits the design advantages of the KAN [38] in function approximation and feature representation, significantly enhancing remote-sensing target detection performance.
The Kolmogorov–Arnold representation theorem is the theoretical basis of KAN. The theorem states that any multivariate continuous function f defined on a bounded domain can be written as a finite composition of continuous univariate functions and the binary operation of addition. For smooth $f : [0,1]^n \to \mathbb{R}$,
$$f(\mathbf{X}) = f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right), \quad (1)$$
where $\phi_{q,p} : [0,1] \to \mathbb{R}$ and $\Phi_q : \mathbb{R} \to \mathbb{R}$ are one-dimensional continuous functions. This theorem reduces the complexity of multidimensional functions, since any such function can be constructed from one-dimensional functions and addition, which offers a fresh concept for neural network architecture design. To construct a deep KAN, a KAN layer with $n_{in}$-dimensional input and $n_{out}$-dimensional output can be defined as a matrix of one-dimensional functions, as in Equation (2):
$$\Phi = \{\phi_{q,p}\}, \quad p = 1, 2, \ldots, n_{in}, \quad q = 1, 2, \ldots, n_{out}, \quad (2)$$
where the $\phi_{q,p}$ are trainable parameterized functions. In the theorem, the inner functions form a KAN layer with $n_{in} = n$ and $n_{out} = 2n+1$, and the outer functions form a KAN layer with $n_{in} = 2n+1$ and $n_{out} = 1$. As a result, the Kolmogorov–Arnold representation in Equation (1) is simply a composition of two KAN layers; deeper Kolmogorov–Arnold representations stack more KAN layers. The shape of a KAN is represented by an array of integers:
$$[n_0, n_1, \ldots, n_L], \quad (3)$$
where $n_i$ is the number of nodes in the $i$-th layer of the computation graph. In this work, the $i$-th neuron in the $l$-th layer is denoted $(l, i)$, and its activation value is denoted $x_{l,i}$. Between layer $l$ and layer $l+1$ there are $n_l n_{l+1}$ activation functions; the activation function connecting $(l, i)$ and $(l+1, j)$ is
$$\phi_{l,j,i}, \quad l = 0, \ldots, L-1, \quad i = 1, \ldots, n_l, \quad j = 1, \ldots, n_{l+1}. \quad (4)$$
The pre-activation of $\phi_{l,j,i}$ is simply $x_{l,i}$; its post-activation is denoted $\tilde{x}_{l,j,i} \equiv \phi_{l,j,i}(x_{l,i})$. The activation value of neuron $(l+1, j)$ is the sum of all incoming post-activations:
$$x_{l+1,j} = \sum_{i=1}^{n_l} \tilde{x}_{l,j,i} = \sum_{i=1}^{n_l} \phi_{l,j,i}(x_{l,i}), \quad j = 1, \ldots, n_{l+1}. \quad (5)$$
In matrix form,
$$\mathbf{X}_{l+1} = \begin{pmatrix} \phi_{l,1,1}(\cdot) & \phi_{l,1,2}(\cdot) & \cdots & \phi_{l,1,n_l}(\cdot) \\ \phi_{l,2,1}(\cdot) & \phi_{l,2,2}(\cdot) & \cdots & \phi_{l,2,n_l}(\cdot) \\ \vdots & \vdots & & \vdots \\ \phi_{l,n_{l+1},1}(\cdot) & \phi_{l,n_{l+1},2}(\cdot) & \cdots & \phi_{l,n_{l+1},n_l}(\cdot) \end{pmatrix} \mathbf{X}_l, \quad (6)$$
where the matrix in Equation (6) is the function matrix $\Phi_l$ corresponding to the $l$-th KAN layer. A general KAN is a composition of $L$ layers: given an input vector $\mathbf{x}_0 \in \mathbb{R}^{n_0}$, the output of the KAN is
$$\mathrm{KAN}(\mathbf{X}) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0)(\mathbf{X}). \quad (7)$$
To write Equation (7) in a form closer to Equation (1), assume the output dimension $n_L = 1$ and define $f(\mathbf{X}) \equiv \mathrm{KAN}(\mathbf{X})$:
$$f(\mathbf{X}) = \sum_{i_{L-1}=1}^{n_{L-1}} \phi_{L-1, i_L, i_{L-1}}\left( \sum_{i_{L-2}=1}^{n_{L-2}} \cdots \left( \sum_{i_1=1}^{n_1} \phi_{1, i_2, i_1}\left( \sum_{i_0=1}^{n_0} \phi_{0, i_1, i_0}(x_{i_0}) \right) \right) \cdots \right). \quad (8)$$
The original Kolmogorov–Arnold representation in Equation (1) corresponds to a two-layer KAN with shape $[n, 2n+1, 1]$. All operations are differentiable, so KANs can be trained with back-propagation.
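As a concrete illustration of Equations (5) and (7), the following is a minimal PyTorch sketch of a single KAN layer with learnable edge functions. It is a simplified reconstruction under stated assumptions: the original KAN parameterizes each ϕ with B-splines, whereas this sketch uses a SiLU base term plus a learnable combination of Gaussian radial basis functions as a lightweight surrogate; class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """Minimal KAN layer implementing Eq. (5): every edge (i -> j) carries its
    own learnable 1-D function phi_{j,i}, and node j sums the post-activations.
    Each phi is parameterized here as a SiLU base term plus a learnable
    combination of Gaussian radial basis functions (a spline surrogate)."""
    def __init__(self, n_in, n_out, n_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        centers = torch.linspace(grid_range[0], grid_range[1], n_basis)
        self.register_buffer("centers", centers)            # fixed RBF centers
        self.inv_width = n_basis / (grid_range[1] - grid_range[0])
        self.base_weight = nn.Parameter(torch.randn(n_out, n_in) * 0.1)
        self.spline_coef = nn.Parameter(torch.randn(n_out, n_in, n_basis) * 0.1)

    def forward(self, x):                                    # x: (batch, n_in)
        base = torch.nn.functional.silu(x) @ self.base_weight.t()
        # RBF basis evaluated per input coordinate: (batch, n_in, n_basis)
        rbf = torch.exp(-((x.unsqueeze(-1) - self.centers) * self.inv_width) ** 2)
        spline = torch.einsum("bik,oik->bo", rbf, self.spline_coef)
        return base + spline                                 # (batch, n_out)

# A two-layer KAN of shape [n, 2n+1, 1] mirrors Equation (1), here with n = 4.
kan = nn.Sequential(KANLayer(4, 9), KANLayer(9, 1))
y = kan(torch.randn(32, 4))
```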
The KAN is superior to the conventional Multi-Layer Perceptron (MLP) in the following ways:
  • Nonlinear activation: KAN applies a learnable nonlinear activation function at each pixel point, which increases the expression ability of the model.
  • Efficient parameters: Although the number of parameters has increased, the overall parameters are more efficient due to their unique design.
  • Easy to use: Provides a simple and easy-to-import KAN Convolutional Layer class to facilitate integration into existing projects.
In YOLOv10, the backbone network is responsible for feature extraction, and the KAN convolution module is applied to this part of the network. Through efficient parameter utilization and nonlinear activation, the KAN convolution module can process detail information in high-resolution images more effectively, and techniques such as learnable nonlinear activations and dynamic updating of the spline grids help it cope with object detection in complex backgrounds. Combining the advantages of the KAN convolution module with the efficient YOLOv10 detection pipeline further increases detection speed and accuracy, satisfying the real-time and precision requirements of remote-sensing image object recognition.

3.1.2. Receptive Field Attention Convolutional Operations

The variety of object scales and the intricacy of the background environment present significant obstacles to feature extraction in remote-sensing image object recognition. To further improve the performance of YOLOv10x in this task, this study replaces the fifth-layer C2f of its backbone network with RFACO. By improving the feature extraction capability and the sensitivity to object position changes, RFACO significantly improves the model's ability to characterize complex features and its detection accuracy. By dynamically generating receptive-field spatial features and building an attention map, RFACO optimizes the feature representation and adapts more effectively to the complicated backgrounds and object diversity of remote-sensing images.
The goal of Receptive-Field Attention (RFA) [39] is to prioritize the spatial aspects of the receptive field and highlight the significance of various features in the receptive field slider. RFACO achieves this goal through the following steps: For receptive field spatial features, traditional convolution operations rely on fixed convolution kernel size and parameter sharing and are insensitive to information differences caused by position changes. RFACO solves this problem by dynamically generating the spatial features of the receptive field. Each receptive field slider’s features are dynamically extracted based on the convolution kernel’s size (such as 3 × 3).
Using $X \in \mathbb{R}^{C \times H \times W}$ as the input, the expansion operation transforms it into a $9 \times C \times H \times W$ tensor, where $C$, $H$, and $W$ denote the number of channels, the input height, and the input width, respectively. To avoid the computational cost of the conventional expansion method and improve efficiency, RFACO uses a fast technique to extract the spatial features of the receptive field. Each 3 × 3 window in the receptive-field spatial map is processed independently to extract feature representations, as shown in Figure 2.
To enhance network performance, RFACO interacts with receptive field feature information to learn attention mappings. The global information of each receptive field feature is aggregated using Avgpool to reduce the number of parameters and computational cost. Softmax is then utilized to highlight the significance of each feature in the receptive field features, and 1 × 1 grouping convolution operation is employed for information interaction. The following is an expression for the RFA calculation:
$$F = \mathrm{Softmax}\big(g^{1 \times 1}(\mathrm{AvgPool}(X))\big) \times \mathrm{ReLU}\big(\mathrm{Norm}(g^{k \times k}(X))\big) = A_{rf} \times F_{rf}, \quad (9)$$
where $g^{i \times i}$ denotes a grouped convolution with an $i \times i$ kernel, $k$ is the convolution kernel size, $\mathrm{Norm}$ denotes normalization, $X$ is the input feature map, and $F$ is the attention map $A_{rf}$ multiplied by the transformed receptive-field spatial features $F_{rf}$.
RFA significantly enhances the extraction of object features by dynamically generating an attention map for each receptive-field feature. The traditional convolutional operation, which relies on shared parameters and is insensitive to the information differences caused by object position changes, limits the performance of the convolutional neural network to some extent. By highlighting the significance of the various features within the receptive field and improving the feature representation, RFACO resolves this issue. Replacing C2f, the fifth layer of the YOLOv10x backbone network, with RFACO not only significantly improves the model's adaptability to small- and medium-scale objects and complex backgrounds in remote-sensing images, but also increases detection accuracy and robustness. In addition, the fast receptive-field spatial feature extraction method adopted by RFACO avoids the computational burden of the traditional method and further improves the overall efficiency of the model. These enhancements give RFACO notable benefits in remote-sensing image object detection, offering solid support for high-precision and high-efficiency detection.
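The following is a minimal PyTorch sketch of the receptive-field attention idea in Equation (9). It is an approximation for illustration: a grouped k × k convolution produces the receptive-field features F_rf, a pooled branch with a grouped 1 × 1 convolution and softmax produces the attention A_rf, and their product is projected to the output channels. The published RFAConv rearranges the weighted features into a k-times-larger spatial map before a stride-k convolution; the 1 × 1 projection here is a simplification, and all names are illustrative.

```python
import torch
import torch.nn as nn

class RFAConv(nn.Module):
    """Illustrative sketch of Receptive-Field Attention convolution (Eq. (9))."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        self.k = k
        # F_rf branch: k*k spatial features for every channel and position
        self.features = nn.Sequential(
            nn.Conv2d(c_in, c_in * k * k, k, stride=stride, padding=k // 2,
                      groups=c_in, bias=False),
            nn.BatchNorm2d(c_in * k * k), nn.ReLU())
        # A_rf branch: pooled context -> grouped 1x1 conv -> softmax over k*k slots
        self.attention = nn.Sequential(
            nn.AvgPool2d(kernel_size=k, stride=stride, padding=k // 2),
            nn.Conv2d(c_in, c_in * k * k, 1, groups=c_in, bias=False))
        # Final projection of the re-weighted receptive-field map (simplification)
        self.project = nn.Conv2d(c_in * k * k, c_out, 1)

    def forward(self, x):
        b, c = x.shape[:2]
        f_rf = self.features(x)                         # (b, c*k*k, H', W')
        h, w = f_rf.shape[2:]
        a = self.attention(x).view(b, c, self.k * self.k, h, w)
        a_rf = torch.softmax(a, dim=2).view(b, c * self.k * self.k, h, w)
        return self.project(a_rf * f_rf)                # Eq. (9): A_rf x F_rf
```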

3.2. FPN Improvement

Optimizing the feature pyramid network (FPN) [40] is essential for increasing the precision and effectiveness of remote-sensing image object recognition. FPN fuses multi-scale features through upsampling and downsampling operations to enhance the model’s ability to detect objects of different sizes. In order to further enhance the performance of YOLOv10x in remote-sensing image object detection, this paper optimizes the feature pyramid network by replacing the original C2f and C2fCIB layers in the upsampling with MLCA. The model’s capacity to recognize multi-scale objects is much improved by increasing the richness and accuracy of the feature representations. This also increases the model’s adaptability to complicated backdrops and detection accuracy.
This work aims to enhance the performance of the YOLOv10x object detection model on remote-sensing images, particularly the precise localization and identification of objects in complex scenes. To this end, the feature pyramid network of YOLOv10x is optimized: the C2f and C2fCIB modules in the upsampling path of its neck network are replaced with MLCA. Traditional channel attention mechanisms (Squeeze-and-Excitation (SE) [41] and Efficient Channel Attention (ECA) [42]) mainly focus on modeling the global dependencies between channels while ignoring the spatial context information within each channel. Owing to this restriction, some important local feature information may be overlooked or lost when channel weights are assigned. MLCA [33] captures richer feature representations by integrating the channel attention mechanism with a local spatial attention mechanism, which in turn significantly improves the detection performance of the model. The principle of MLCA is depicted in Figure 3.
The definitions of LAP, GAP, and UNAP in Figure 3 are as follows:
Local Adaptive Pooling (LAP): LAP divides the feature map into multiple blocks, independently computes the mean for each block, and outputs local statistics. This preserves spatial locality and avoids the blurring of critical regions caused by global pooling.
Global Average Pooling (GAP): GAP averages all spatial locations within each channel. While providing global channel attention, it loses spatial detail.
Unpooling with Average (UNAP): UNAP restores the low-resolution output of LAP to its original size (C, H, W) by interpolation or replication, allowing element-by-element fusion with the original features to restore spatial resolution.
The core idea of the MLCA module is to combine channel attention and local spatial attention to capture richer feature information. After the input is converted into a $1 \times C \times ks \times ks$ tensor (where $ks$ is the local pooling size), local pooling is used to retrieve the local spatial information. Two branches process the input features: one extracts global channel information, and the other extracts local spatial information. After both branches are processed by one-dimensional convolution, the original resolution is restored by unpooling. Finally, the outputs of the two branches are fused to obtain the feature map after mixed attention. The MLCA module uses an adaptive convolution kernel size so that it can react to inputs with different channel numbers. The convolution kernel size $k$ is related to the channel dimension $C$ by the following formula:
$$k = \Phi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{odd}, \quad (10)$$
where $C$ is the number of channels, $k$ is the convolution kernel size, $\gamma$ and $b$ are hyperparameters with a default value of 2, and $|\cdot|_{odd}$ indicates that $k$ must be odd: if $k$ is even, 1 is added.
To optimize the FPN of YOLOv10x, this paper replaces the upsampling-path C2f and C2fCIB of its neck network with MLCA. By combining channel attention and local spatial attention, MLCA captures richer feature information and enhances the model's ability to recognize and localize objects in remote-sensing images. In complex scenes, objects may be affected by background interference, occlusion, or illumination changes; the MLCA module improves detection performance by emphasizing important features and suppressing irrelevant information. Although it contains two branches and multiple operation steps, the MLCA module maintains a computational cost similar to traditional attention modules through reasonable parameter settings and efficient computation, thus considerably enhancing the model's robustness and detection accuracy without appreciably raising its computational complexity.
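The following is a minimal PyTorch sketch of the MLCA idea described above, combining the Equation (10) kernel-size rule with a local (LAP) and a global (GAP) branch, each passed through a 1-D convolution over the channel axis and fused after restoring the local branch to the input resolution. It is an illustrative reconstruction, not the reference implementation; the fusion weight `alpha` and other names are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def adaptive_kernel_size(channels, gamma=2, b=2):
    """Eq. (10): odd 1-D kernel size grows with log2 of the channel count."""
    k = int(abs(math.log2(channels) / gamma + b / gamma))
    return k if k % 2 else k + 1

class MLCA(nn.Module):
    """Illustrative sketch of Mixed Local Channel Attention: a local branch
    (LAP over ks x ks blocks) and a global branch (GAP) each pass through a
    1-D convolution along the channel axis; the local gate is restored to the
    input resolution (UNAP, here bilinear interpolation) and both re-weight x."""
    def __init__(self, channels, ks=5, gamma=2, b=2, alpha=0.5):
        super().__init__()
        k = adaptive_kernel_size(channels, gamma, b)
        self.ks, self.alpha = ks, alpha
        self.conv_local = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)
        self.conv_global = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        bsz, c, h, w = x.shape
        local = F.adaptive_avg_pool2d(x, self.ks)            # LAP: (b, c, ks, ks)
        glob = F.adaptive_avg_pool2d(x, 1)                   # GAP: (b, c, 1, 1)
        g = self.conv_global(glob.view(bsz, c, 1).transpose(1, 2))
        g = torch.sigmoid(g.transpose(1, 2).view(bsz, c, 1, 1))
        l = local.permute(0, 2, 3, 1).reshape(bsz * self.ks * self.ks, 1, c)
        l = self.conv_local(l).reshape(bsz, self.ks, self.ks, c).permute(0, 3, 1, 2)
        l = torch.sigmoid(F.interpolate(l, size=(h, w), mode="bilinear",
                                        align_corners=False))  # UNAP
        return x * (self.alpha * l + (1.0 - self.alpha) * g)
```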

4. Experiment

Extensive tests were carried out on two popular remote-sensing datasets, RSOD and NWPU VHR-10, which are described in detail in this section. This section covers the metrics used to assess model performance, a comprehensive introduction to the datasets, and a number of comparative tests that demonstrate the model's superiority over current mainstream approaches. In addition, ablation experiments explore the specific contribution of the key techniques KAN, RFACO, and MLCA to model performance. Finally, the experimental findings confirm how well the proposed model handles challenging remote-sensing scenarios.

4.1. Experimental Configuration

Table 1 summarizes the experimental configuration in detail. The hardware comprises a server-side V100 GPU and a Windows workstation with an RTX 3070Ti GPU, 32 GB of DDR5 memory, and a powerful CPU, providing a solid foundation for training. On the software side, Torch 1.12.0 and CUDA 11.3 are used for deep learning on the Windows 11 system, and Matplotlib and other libraries are used for data analysis. For the training parameters, the input size is 640 × 640, the learning rate is 0.01, the batch size is 4, and a variety of data augmentation methods are used to improve model performance.
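For reference, the settings in Table 1 could be expressed as a training call in the style below. This is a hedged sketch assuming an Ultralytics-style training interface; the `yolo-krm.yaml` and `rsod.yaml` paths are hypothetical placeholders, since the paper does not release configuration files.

```python
from ultralytics import YOLO

# Hypothetical model definition file for YOLO-KRM; the actual architecture file
# is not released with the paper, so this path is illustrative only.
model = YOLO("yolo-krm.yaml")

# Training settings mirroring Table 1 (Ultralytics-style argument names assumed).
model.train(
    data="rsod.yaml",                     # dataset config (hypothetical path)
    imgsz=640,                            # input size 640 x 640
    epochs=100,
    batch=4,
    lr0=0.01,                             # initial learning rate
    optimizer="AdamW",
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,    # HSV augmentation
    translate=0.1, scale=0.5,             # translation and scaling
    fliplr=0.5, mosaic=1.0,               # horizontal flip and mosaic
)
```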

4.2. Dataset

Both the RSOD and NWPU VHR-10 datasets focus on object detection in remote-sensing imagery; Figure 4 and Figure 5 show examples from each. Specifically, the RSOD dataset covers four object categories, namely airplanes, oil tanks, playgrounds, and overpasses, with 446, 189, 176, and 165 images, respectively. The NWPU VHR-10 dataset contains richer object categories, with a total of 800 very-high-resolution remote-sensing images covering ten categories: airplanes, ships, storage tanks, baseball fields, tennis courts, basketball courts, athletic fields, harbors, bridges, and vehicles, providing researchers with more comprehensive data resources.

4.3. Evaluating Indicator

Precision measures the proportion of samples predicted as positive that are truly positive, indicating how reliable the model's positive predictions are; the higher the precision, the more trustworthy those predictions.
$$Precision = \frac{TP}{TP + FP}$$
Recall indicates the proportion of actual positive samples that the model correctly detects. A recall close to 1 means the model captures nearly all positive samples, i.e., the missed-detection rate is very low.
$$Recall = \frac{TP}{TP + FN}$$
Here, True Positive (TP) denotes the number of samples the model correctly classifies as positive, False Positive (FP) denotes the number of samples the model incorrectly classifies as positive, and False Negative (FN) denotes the number of positive samples the model incorrectly classifies as negative.
One metric used to assess the model’s overall performance on multi-category tasks is the mean Average Precision (mAP). The average precision (AP) of each category is first determined, and the mAP is then obtained by averaging the AP values of all categories. The model performs better on multi-category jobs when its mAP value is larger.
When the IoU threshold is 0.5, the model's mAP value is denoted mAP@50. Intersection over Union (IoU) is one of the most important measures for assessing object detection models; it quantifies the overlap between the predicted bounding box and the ground-truth box.
The model's average precision over the range of IoU thresholds from 0.5 to 0.95 (with a step size of 0.05) is denoted mAP@50:95. By integrating performance under various IoU criteria, it reflects the accuracy and robustness of the model more thoroughly; a higher mAP@50:95 indicates that the model retains good detection performance across a range of IoU thresholds.
FPS (frames per second) measures the computational efficiency of the model in the inference phase; that is, the number of image frames that the model can process per second. The higher the FPS value, the stronger the real-time processing ability of the model, which is crucial for actual deployment. An efficient model can achieve faster inference speed without sacrificing too much accuracy, thus achieving a good balance between performance and computational overhead.
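To make the metric definitions above concrete, the following sketch computes IoU and per-image Precision/Recall by greedy matching at a single IoU threshold. The function names are illustrative, and a full mAP evaluation would additionally sweep confidence thresholds to build a precision–recall curve per class before averaging.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(pred_boxes, gt_boxes, iou_thr=0.5):
    """Greedy one-to-one matching of predictions (sorted by confidence) to
    ground truths, then Precision = TP/(TP+FP) and Recall = TP/(TP+FN)."""
    matched, tp = set(), 0
    for p in pred_boxes:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gt_boxes):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_iou >= iou_thr:
            tp += 1
            matched.add(best_j)
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

# mAP@50:95 averages AP over the IoU thresholds 0.50, 0.55, ..., 0.95
iou_thresholds = np.arange(0.5, 1.0, 0.05)
```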

4.4. Contrast Experiments

This subsection details comprehensive comparison experiments of the YOLO-KRM model on the RSOD and NWPU VHR-10 datasets. Several YOLO-series models, including YOLOv10x, are selected as comparison objects alongside the YOLO-KRM model developed in this paper. By analyzing the five evaluation metrics of Precision (P), Recall (R), mAP@50, mAP@50:95, and FPS, this paper comprehensively and systematically evaluates the strengths and weaknesses of each model in the object detection task.
Table 2 compares the performance of the YOLO-KRM model with several comparison models on the RSOD remote-sensing dataset. The baseline YOLOv10x model demonstrates strong detection capabilities, achieving a Precision of 92.37%, a Recall of 92.13%, an mAP@50 of 95.91%, and an mAP@50:95 of 75.95%. YOLO-KRM achieves comprehensive improvements, particularly in the comprehensive metric mAP@50:95, where it significantly outperforms all other comparison models, showing that it maintains stable detection performance across different IoU thresholds. YOLO-KRM also leads clearly in mAP@50, indicating superior detection accuracy at the standard IoU threshold. Comparing Precision and Recall, YOLO-KRM not only achieves the highest precision, reflecting higher confidence in its predictions, but also excels in Recall, effectively reducing missed detections. Compared with the YOLOv10x model, YOLO-KRM achieves a 3.11% improvement in Precision to 95.48%, a 2.37% increase in Recall to 94.5%, a 1.77% increase in mAP@50 to 97.68%, and a significant 3.82% increase in mAP@50:95 to 79.77%. YOLO-KRM surpasses all competing models in all four core metrics, demonstrating exceptional performance on the RSOD dataset and significantly improving detection accuracy and robustness in complex remote-sensing scenarios. However, YOLO-KRM achieves an FPS of 21.8, which is lower than most comparison models (e.g., YOLOv8 at 121.5 FPS and YOLOv10x at 27.4 FPS), indicating a trade-off between detection performance and inference speed.
As shown in Table 3, the performance on the NWPU VHR-10 dataset is comprehensively evaluated, and this paper similarly evaluates the performance of a variety of YOLO family models as well as the YOLO-KRM model. As can be seen from the data in the table, the YOLO-KRM model outperforms the benchmark model YOLOv10x and all other comparison models in all four metrics, Precision, Recall, mAP@50, and mAP@50:95. This comprehensive performance advantage not only proves the YOLO-KRM model’s excellence in a single metric, but also shows its balance and leadership in comprehensive performance.
Compared with the benchmark model YOLOv10x, YOLO-KRM improves Precision by 3.07% to 92.99% and Recall by 4.81% to 94.34%. This indicates that the YOLO-KRM model can effectively identify most objects while ensuring detection accuracy, thus achieving a better balance between reducing false alarms and controlling missed detections. In terms of the comprehensive evaluation indicators, compared with YOLOv10x, the mAP@50 of the YOLO-KRM model increases by 2.75% to 97.59%, and mAP@50:95 increases by 5.23% to 77.00%. This shows that the YOLO-KRM model maintains high detection accuracy under different IoU thresholds, further proving its robustness and adaptability in dealing with complex backgrounds and small-object detection. In terms of inference speed, YOLO-KRM achieves an FPS of 22.1. While this is comparable to some larger variants, such as YOLOv10C (23.7 FPS), it remains significantly slower than lighter models, such as YOLOv8 (123.8 FPS), and even the baseline YOLOv10x (33.3 FPS). In conclusion, the YOLO-KRM model also performs very well on the NWPU VHR-10 dataset: it achieves excellent results on the key indicators and has significant advantages over the other models, though at the cost of reduced inference speed.

4.5. Ablation Experiments

In order to ensure the reliability of the results, we conducted systematic ablation experiments on the RSOD and NWPU VHR-10 datasets. The evaluation indicators include Precision, Recall, mAP@50, and mAP@50:95. Each configuration was run three times independently and the average performance indicators were reported. In order to further evaluate the statistical significance, we conducted a paired t-test. The results show that, compared with the baseline YOLOv10x, the performance improvement of the proposed YOLO-KRM model is statistically significant (p < 0.01), indicating that these enhancements are not caused by random variation. Table 4 shows the results of ablation experiments in detail, showing the changes in performance after each addition of an improved module. ‘√’ indicates that the module is used, and ‘×’ indicates that it is not used.
The ablation results on the RSOD dataset show that each improvement brings a certain performance gain. The first row of the table, with no improvements applied, corresponds to the benchmark model YOLOv10x. When the third-layer C2f in the backbone network is replaced by KAN, the four indicators improve markedly: Precision, Recall, mAP@50, and mAP@50:95 increase by 2.07%, 1.26%, 1.08%, and 1.59%, respectively. When the fifth-layer C2f of the YOLOv10x backbone network is further replaced by RFACO, the four indicators improve again, by a further 0.75%, 1.03%, 0.36%, and 2.23%, respectively. Finally, the upsampling-path C2f and C2fCIB of the YOLOv10x neck network are replaced by MLCA, yielding the YOLO-KRM model; the four indicators increase by a further 0.29%, 0.08%, 0.33%, and 0.3%, reaching 95.48%, 94.5%, 97.68%, and 80.07%. Overall, the YOLO-KRM model has the best comprehensive performance.
The efficacy of each enhancement is also confirmed by the ablation tests on the NWPU VHR-10 dataset. The benchmark model YOLOv10x's performance is shown in the first row. When C2f, the third layer of the backbone network, is replaced by the KAN module, the four metrics improve by 0.71%, 0.1%, 0.92%, and 0.39%, respectively. When the fifth-layer C2f of the backbone network is further replaced by the RFACO module, the performance of the model improves substantially, with the four indicators rising by 0.89%, 1.74%, 1.43%, and 3.75%, respectively. Finally, the upsampling-path C2f and C2fCIB of the neck network are replaced by the MLCA module, yielding the YOLO-KRM model; the four indicators increase by 1.47%, 2.97%, 0.4%, and 1.09%, respectively, to 92.99%, 94.34%, 97.59%, and 77.00%. This result shows that the YOLO-KRM model achieves the best comprehensive performance on the NWPU VHR-10 dataset.

4.6. Experimental Results

Figure 6 and Figure 7 clearly illustrate the performance comparison of the YOLO-KRM algorithm and its baseline/ablations on the remote-sensing image object detection task. Specifically, Figure 6 shows the detection results on the RSOD dataset, while Figure 7 presents the results on the NWPU VHR-10 dataset. The figures compare the following: (a) the baseline YOLOv10x, (b) YOLOv10x-K (YOLOv10x with KAN innovation), (c) YOLOv10x-KR (YOLOv10x-K with additional RFACO innovation), and (d) the proposed YOLO-KRM model.
The results in Figure 6 and Figure 7 demonstrate the YOLO-KRM algorithm’s capacity to achieve high-precision object detection on both datasets. Notably, the progressive improvements from (a) to (d) validate the effectiveness of each introduced innovation (KAN, RFACO, and the final integrated KRM). These results amply validate the algorithm’s viability and dependability in the field of remote-sensing image object detection. The clear detection effect diagrams strongly support the research and application of remote-sensing image object detection and further validate the benefits of the proposed YOLO-KRM model, particularly in handling small objects and complex backgrounds.

5. Conclusions and Outlook

The YOLO-KRM model presented in this study shows a notable performance improvement in remote-sensing image object detection tasks. Based on the YOLOv10x model, by using KAN and RFACO to construct a new backbone network and MLCA to improve the neck network, the model not only enhances the approximation of complex nonlinear functions, but also mitigates the insufficient feature response caused by shared convolutional kernel parameters, and effectively combines local and global features as well as channel and spatial information. The experimental results show that, compared with the original YOLOv10x model, the YOLO-KRM model improves Precision and Recall by 3.11% and 2.37% on the RSOD dataset and by 3.07% and 4.81% on the NWPU VHR-10 dataset, and improves mAP@50 and mAP@50:95 by 1.77% and 3.82% on RSOD and by 2.75% and 5.23% on NWPU VHR-10. These improvements demonstrate how well the YOLO-KRM model handles small objects and complicated backgrounds, particularly in high-resolution remote-sensing images. Although the YOLO-KRM model has achieved remarkable results in remote-sensing image object detection, there is still room for further optimization. In the future, its detection performance and application scope can be further improved through in-depth research on model lightweighting, multi-scale feature fusion, contextual information utilization, and cross-domain generalization.

Author Contributions

Conceptualization, X.B. and C.B.; methodology, H.S. and C.B.; software, X.B.; validation, H.S., X.B. and C.B.; formal analysis, H.S.; investigation, H.S., X.B. and C.B.; resources, H.S.; data curation, X.B.; writing—original draft preparation, C.B.; writing—review and editing, X.B.; visualization, C.B.; supervision, H.S. and C.B.; project administration, X.B. and H.S.; funding acquisition, H.S. and C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Key R & D and achievement transformation in Inner Mongolia Autonomous Region (project No. 2023YFDZ0060).

Data Availability Statement

The data are available upon request from the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327. [Google Scholar] [CrossRef]
  2. Wang, L.; Zhang, M.; Gao, X.; Shi, W. Advances and challenges in deep learning-based change detection for remote sensing images: A review through various learning paradigms. Remote Sens. 2024, 16, 804. [Google Scholar] [CrossRef]
  3. Wei, W.; Cheng, Y.; He, J.; Zhu, X. A review of small object detection based on deep learning. Neural Comput. Appl. 2024, 36, 6283–6303. [Google Scholar] [CrossRef]
  4. Wen, L.; Cheng, Y.; Fang, Y.; Li, X. A comprehensive survey of oriented object detection in remote sensing images. Expert Syst. Appl. 2023, 224, 1199–1231. [Google Scholar] [CrossRef]
  5. Hua, W.; Chen, Q. A survey of small object detection based on deep learning in aerial images. Artif. Intell. Rev. 2025, 58, 1–67. [Google Scholar] [CrossRef]
  6. Wang, X.; Wang, A.; Yi, J.; Song, Y.; Chehri, A. Small object detection based on deep learning for remote sensing: A comprehensive review. Remote Sens. 2023, 15, 3265. [Google Scholar] [CrossRef]
  7. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote sensing image super-resolution and object detection: Benchmark and state of the art. Expert Syst. Appl. 2022, 197, 116793–116814. [Google Scholar] [CrossRef]
  8. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 723–751. [Google Scholar] [CrossRef]
  9. Sun, H.; Yao, G.; Zhu, S.; Zhang, L.; Xu, H.; Kong, J. SOD-YOLOv10: Small Object Detection in Remote Sensing Images Based on YOLOv10. IEEE Geosci. Remote Sens. Lett. 2025, 18, 235–257. [Google Scholar] [CrossRef]
  10. Zhang, W.F.; Zhang, H.W.; Mei, Y.; Xiao, N. A DINO remote sensing target detection algorithm combining efficient hybrid encoder and structural reparameterization. J. Beijing Univ. Aeronaut. Astronaut. 2025, 3, 1–13. [Google Scholar]
  11. Yao, T.T.; Zhao, H.X.; Feng, Z.H.; Hu, Q. A Context-Aware Multiple Receptive Field Fusion Network for Oriented Object Detection in Remote Sensing Images. J. Electron. Inf. Technol. 2025, 47, 233–243. [Google Scholar]
  12. Liu, X.; Gong, W.; Shang, L.; Li, X.; Gong, Z. Remote Sensing Image Target Detection and Recognition Based on YOLOv5. Remote Sens. 2023, 15, 4459. [Google Scholar] [CrossRef]
  13. Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A Review on yolov8 and Its Advancements. In International Conference on Data Intelligence and Cognitive Informatics; Springer: Singapore, 2024; pp. 529–545. [Google Scholar]
  14. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 1–21. [Google Scholar]
  15. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  16. Nguyen, P.T.; Nguyen, G.L.; Bui, D.D. LW-UAV–YOLOv10: A lightweight model for small UAV detection on infrared data based on YOLOv10. Geomatica 2025, 77, 100049. [Google Scholar] [CrossRef]
  17. Fan, K.; Li, Q.; Li, Q.; Zhong, G.; Chu, Y.; Le, Z.; Xu, Y.; Li, J. YOLO-remote: An object detection algorithm for remote sensing targets. IEEE Access 2024, 12, 155654–155665. [Google Scholar] [CrossRef]
  18. Xu, D.Q.; Wu, Y.Q. Progress of research on deep learning algorithms for object detection in optical remote sensing images. Natl. Remote Sens. Bull. 2024, 28, 3045–3073. [Google Scholar] [CrossRef]
  19. Zhang, T.; Zhang, X.; Zhu, P.; Jia, X.; Tang, X.; Jiao, L. Generalized few-shot object detection in remote sensing images. ISPRS J. Photogramm. Remote Sens. 2023, 195, 353–364. [Google Scholar] [CrossRef]
  20. Jiang, H.; Peng, M.; Zhong, Y.; Xie, H.; Hap, Z.; Lin, J.; Ma, X. A survey on deep learning-based change detection from high-resolution remote sensing images. Remote Sens. 2022, 14, 1552. [Google Scholar] [CrossRef]
  21. Qu, J.; Tang, Z.; Zhang, L.; Zhang, Y.; Zhang, Z. Remote sensing small object detection network based on attention mechanism and multi-scale feature fusion. Remote Sens. 2023, 15, 2728. [Google Scholar] [CrossRef]
  22. Zheng, X.; Wang, H.; Shang, Y.; Gang, C.; Suhua, Z.; Quanbo, Y. Starting from the structure: A review of small object detection based on deep learning. Image Vis. Comput. 2024, 10, 105054–105073. [Google Scholar]
  23. Zhou, H.; Liu, W.; Sun, K.; Wu, J.; Wi, T. MSCANet: A multi-scale context-aware network for remote sensing object detection. Earth Sci. Inform. 2024, 17, 5521–5538. [Google Scholar] [CrossRef]
  24. Jiang, H.; Luo, T.; Peng, H.; Zhang, G. MFCANet: Multiscale Feature Context Aggregation Network for Oriented Object Detection in Remote-Sensing Images. IEEE Access 2024, 32, 10324–10357. [Google Scholar] [CrossRef]
  25. Zhang, T.; Zhang, X.; Zhu, X.; Wang, G.; Han, X.; Tang, X.; Jiao, L. Multistage enhancement network for tiny object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12. [Google Scholar] [CrossRef]
  26. Xie, Q.; Zhou, D.; Tang, R.; Feng, H. A deep cnn-based detection method for multi-scale fine-grained objects in remote sensing images. IEEE Access 2024, 12, 15622–15630. [Google Scholar] [CrossRef]
  27. Zhang, Y.; Guo, W.; Wu, C.; Li, W.; Tao, R. FANet: An arbitrary direction remote sensing object detection network based on feature fusion and angle classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11. [Google Scholar] [CrossRef]
  28. Tang, W.; He, F.; Bashir, A.K.; Shao, X.; Cheng, Y.; Yu, K. A remote sensing image rotation object detection approach for real-time environmental monitoring. Sustain. Energy Technol. Assess. 2023, 57, 103270. [Google Scholar] [CrossRef]
  29. Wang, W.; Chen, Y.; Lin, M. MFLD: Lightweight object detection with multi-receptive field and long-range dependency in remote sensing images. Int. J. Intell. Comput. Cybern. 2024, 17, 805–823. [Google Scholar] [CrossRef]
  30. Pan, Z.; Xu, L.; Liang, C.; Pan, K.; Zhao, M.; Lu, M. Lightweight spatial sliced-concatenate-multireceptive-field enhance and joint channel attention mechanism for infrared object detection. IEEE Access 2022, 10, 55508–55521. [Google Scholar] [CrossRef]
  31. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  32. Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the synergistic effects between spatial and channel attention. Neurocomputing 2025, 634, 129866–129884. [Google Scholar] [CrossRef]
  33. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 2023, 123, 106442. [Google Scholar] [CrossRef]
  34. Bao, W.; Huang, C.; Hu, G.; Su, B.; Yang, X. Detection of Fusarium head blight in wheat using UAV remote sensing based on parallel channel space attention. Comput. Electron. Agric. 2024, 217, 108630–1086662. [Google Scholar] [CrossRef]
  35. Wu, D.; Li, H.; Hou, X.; Xu, C.; Cheng, G.; Guo, L.; Liu, H. Spatial-Channel Attention Transformer with Pseudo Regions for Remote Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 261, 1371–1395. [Google Scholar] [CrossRef]
  36. Zhao, R.; Zhang, C.; Xue, D. A multi-scale multi-channel CNN introducing a channel-spatial attention mechanism hyperspectral remote sensing image classification method. Eur. J. Remote Sens. 2024, 57, 2353290–2353319. [Google Scholar] [CrossRef]
  37. Nan, G.; Zhao, Y.; Fu, L.; Ye, Q. Object detection by channel and spatial exchange for multimodal remote sensing imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 72, 630–652. [Google Scholar] [CrossRef]
  38. Somvanshi, S.; Javed, S.A.; Islam, M.M.; Pandit, D.; Das, S. A survey on Kolmogorov-Arnold network. ACM Comput. Surv. 2024, 68, 331–352. [Google Scholar] [CrossRef]
  39. Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Soong, Y. RFAConv: Innovating spatial attention and standard convolutional operation. arXiv 2023, arXiv:2304.03198. [Google Scholar]
  40. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  41. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  42. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Figure 1. YOLO-KRM model.
Figure 2. The detailed structure of RFACO.
Figure 3. MLCA principle.
Figure 4. RSOD dataset.
Figure 5. NWPU VHR-10 dataset.
Figure 6. RSOD dataset detection effect comparison chart.
Figure 7. NWPU dataset detection effect comparison chart.
Table 1. Experimental hardware and software configuration.
Configuration | Setting
Input size | 640 × 640
Learning rate | 0.01
Optimizer | AdamW
Batch size | 4
Epochs | 100
Data augmentation | HSV (H: 0.015, S: 0.7, V: 0.4), Translation (0.1), Scaling (0.5), Horizontal Flip (0.5), Mosaic (1.0)
Plotting and analysis libraries | Matplotlib 3.7.4, Seaborn 0.13.2, TensorBoard 2.13.0
Tools | Torch 1.12.0, TorchVision 0.13, CUDA 11.3
Operating system | Windows 11 Home Edition
Disk | 512 GB SSD
GPU | Server V100 and Windows RTX 3070Ti
Memory | DDR5 32 GB
CPU | 2 × 718532C
Video memory | 16 GB
Table 2. Comparative experimental model results for RSOD remote-sensing images.
Models | P (%)↑ | R (%)↑ | mAP@50 (%)↑ | mAP@50:95 (%)↑ | FPS↑
YOLOv6 | 87.61 | 91.19 | 91.38 | 64.29 | 117.2
YOLOv5x | 89.04 | 88.63 | 91.51 | 63.29 | 119.1
YOLOv8 | 91.20 | 87.75 | 90.64 | 64.29 | 121.5
YOLOv10x | 92.37 | 92.13 | 95.91 | 75.95 | 27.4
YOLOv10C | 88.03 | 91.23 | 93.06 | 74.03 | 23.8
YOLOv10D | 93.90 | 93.26 | 96.35 | 75.47 | 32.9
YOLOv10G | 92.88 | 91.23 | 95.98 | 75.49 | 31.8
YOLOv10GD | 93.70 | 92.05 | 95.82 | 76.73 | 32.1
YOLOv10GM | 93.04 | 92.88 | 96.71 | 76.32 | 30.1
YOLOv10GS | 92.52 | 92.05 | 95.81 | 74.76 | 31.4
YOLOv10M | 92.27 | 92.66 | 96.21 | 75.04 | 31.3
YOLO-KRM | 95.48 | 94.50 | 97.68 | 79.77 | 21.8
“↑” means the higher the value, the better.
Table 3. Comparative experimental model results for NWPU VHR-10 remote-sensing images.
Models | P (%)↑ | R (%)↑ | mAP@50 (%)↑ | mAP@50:95 (%)↑ | FPS↑
YOLOv6 | 88.42 | 84.44 | 89.12 | 58.62 | 40.5
YOLOv5x | 86.28 | 82.53 | 87.23 | 55.55 | 42.4
YOLOv5s | 91.51 | 88.54 | 91.45 | 59.71 | -
YOLOv5n | 92.01 | 84.63 | 90.86 | 59.37 | -
YOLOv7-tiny | 90.93 | 86.7 | 91.22 | 58.73 | -
YOLOv8 | 90.64 | 86.03 | 91.99 | 61.10 | 123.8
YOLOv10x | 89.92 | 89.53 | 94.84 | 71.77 | 33.3
YOLOv10n | 87.58 | 85.14 | 91.90 | 59.55 | -
YOLOv10C | 92.38 | 91.66 | 96.79 | 73.56 | 23.7
YOLOv10D | 90.08 | 89.53 | 94.84 | 71.82 | 33.3
YOLOv10G | 91.13 | 90.94 | 96.22 | 71.77 | 24.9
YOLOv10GD | 91.67 | 90.86 | 95.95 | 71.95 | 32.8
YOLOv10GM | 90.66 | 90.34 | 95.69 | 71.65 | 31.0
YOLOv10GS | 91.06 | 90.16 | 94.68 | 70.36 | 31.5
YOLOv10M | 88.19 | 87.54 | 93.70 | 69.21 | 31.2
YOLOv11n | 90.10 | 86.80 | 92.45 | 60.76 | -
YOLO-KRM | 92.99 | 94.34 | 97.59 | 77.00 | 22.1
“↑” means the higher the value, the better.
Table 4. Results of ablation experiments.
KAN | RFACO | MLCA | RSOD P↑ | RSOD R↑ | RSOD mAP@50↑ | RSOD mAP@50:95↑ | NWPU P↑ | NWPU R↑ | NWPU mAP@50↑ | NWPU mAP@50:95↑
× | × | × | 92.37 | 92.13 | 95.91 | 75.95 | 89.92 | 89.53 | 94.84 | 71.77
√ | × | × | 94.44 | 93.39 | 96.99 | 77.54 | 90.63 | 89.63 | 95.76 | 72.16
√ | √ | × | 95.19 | 94.42 | 97.35 | 79.77 | 91.52 | 91.37 | 97.19 | 75.91
√ | √ | √ | 95.48 | 94.50 | 97.68 | 80.07 | 92.99 | 94.34 | 97.59 | 77.00
All metric values are in %; “↑” means the higher the value, the better.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

