Article

Land Target Detection Algorithm in Remote Sensing Images Based on Deep Learning

1 College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu 610059, China
2 Department of Computer Science and Engineering, Hanyang University, Ansan 15577, Republic of Korea
3 School of Artificial Intelligence, Guangzhou Huashang College, Guangzhou 511300, China
4 Department of Modelling, Simulation, and Visualization Engineering, Old Dominion University, Norfolk, VA 23529, USA
* Author to whom correspondence should be addressed.
Land 2025, 14(5), 1047; https://doi.org/10.3390/land14051047
Submission received: 1 April 2025 / Revised: 25 April 2025 / Accepted: 8 May 2025 / Published: 11 May 2025
(This article belongs to the Special Issue GeoAI for Land Use Observations, Analysis and Forecasting)

Abstract

Remote sensing technology plays a crucial role across various sectors, such as meteorological monitoring, city planning, and natural resource exploration. A critical aspect of remote sensing image analysis is land target detection, which involves identifying and classifying land-based objects within satellite or aerial imagery. However, despite advancements in both traditional detection methods and deep-learning-based approaches, detecting land targets remains challenging, especially when dealing with small and rotated objects that are difficult to distinguish. To address these challenges, this study introduces an enhanced model, YOLOv5s-CACSD, which builds upon the YOLOv5s framework. Our model integrates the coordinate attention (CA) mechanism, CARAFE, and Shape-IoU to improve detection accuracy while employing depthwise separable convolution to reduce model complexity. The proposed architecture was evaluated systematically on the DOTAv1.0 dataset, and our results show that YOLOv5s-CACSD achieved a 91.0% mAP@0.5, marking a 2% improvement over the original YOLOv5s. Additionally, it reduced model parameters and computational complexity by 0.9 M and 2.9 GFLOPs, respectively. These results demonstrate the enhanced detection performance and efficiency of the YOLOv5s-CACSD model, making it suitable for practical applications in land target detection for remote sensing imagery.

1. Introduction

Remote sensing images offer extensive coverage and provide rich, high-resolution land information, which supports various fields, including earth science. Owing to the progressive development of geospatial technologies, research in remote sensing has consistently achieved new breakthroughs and has gradually developed into a comprehensive technical system. Currently, remote sensing technology is widely applied in areas such as meteorological observation [1,2], agricultural monitoring [3,4], urban planning [5,6], natural resource assessment [7,8], and military applications [9]. It has increasingly become a vital tool for addressing global and regional ecological challenges, promoting regional economic development, and enhancing public services [10].
Object detection in geospatial imagery analysis aims to delineate and precisely categorize specific entities within satellite-acquired visual data through advanced algorithmic processing. Such detection frameworks provide critical functionality across multiple application domains, particularly in self-navigating vehicle systems [11], post-disaster rescue [12], and security and traffic monitoring, among other scenarios [13], through real-time environmental perception capabilities. The progressive refinement of satellite-based Earth observation capabilities has generated exponentially increasing geospatial data volumes, and it is particularly noteworthy that domestically developed systems now deliver sub-decimeter imaging granularity through enhanced multispectral sensor configurations [14]. Automatic and rapid detection and recognition of ground targets using the detailed texture information in remote sensing images therefore presents a significant challenge. Conventional computational approaches for visual target recognition are broadly stratified into two principal methodological frameworks distinguished by their feature extraction mechanisms: algorithms based on template matching and algorithms based on artificial feature modeling. Although the underlying theory is well developed, such methods struggle to fully represent the features of various complex task scenarios and targets. Moreover, operations such as sliding windows limit their efficiency, resulting in low recognition accuracy, low efficiency, and susceptibility to background interference, making it difficult to meet current needs [15,16].
Advancements in cognitive computing systems have revolutionized visual pattern recognition through hierarchical neural architectures, particularly in automating target identification processes [17]. In contemporary implementations, these computational architectures predominantly manifest as two distinct methodological frameworks: region-proposal-driven approaches utilizing selective search mechanisms and direct-regression-based paradigms employing anchor box generation techniques, commonly categorized as multi-phase versus single-pass detection systems [18]. Multi-phase frameworks employ sequential refinement mechanisms involving initial region of interest identification followed by precise feature extraction, whereas unified architectures execute holistic processing through integrated bounding box regression and classification modules. Empirical evaluations consistently demonstrate that two-stage detection methodologies achieve superior localization precision, while one-stage detection methodologies exhibit significantly reduced computational latency in practical deployment scenarios. Object detection algorithms based on candidate regions divide the problem into two stages: initially conducting coarse region of interest (ROI) identification through selective search mechanisms, followed by discriminative classification and spatial coordinate refinement utilizing region-based convolutional operations within localized feature maps. Within this methodological spectrum, the R-CNN architectural paradigm and its evolutionary derivatives have established themselves as quintessential implementations. Girshick [19] proposed the R-CNN algorithm in 2014, which first generates candidate regions through selective search algorithms and then normalizes their scales and sends them to the CNN for feature extraction. Finally, the discriminative analysis phase implements multi-task learning frameworks through kernel-optimized support vector machines (SVMs) coupled with backpropagation-regulated coordinate adjustment modules, both operating on the refined feature embeddings obtained from preceding convolutional layers. During this process, feature extraction is applied to each candidate box obtained through selective search, followed by separate classification and box regression training, resulting in computational redundancy and suboptimal processing efficiency. The year 2015 marked a significant methodological advancement with Girshick’s introduction of the Fast R-CNN architecture [20], which uses a feature extraction network to sample the convolutional features of candidate boxes of different sizes into fixed-size features through ROI pooling. Then, fully connected layers are used to complete classification and regression tasks, thereby enabling integrated feature abstraction and categorical regression within a consolidated architecture to accelerate inference. Advancing on prior methodologies, Ren introduced the Faster R-CNN framework in 2017 [21]. This approach unified region proposal generation and detection processes within a single network, establishing a seamless workflow that enhanced both computational efficiency and recognition precision. In contrast, regression-analysis-based object detection algorithms do not require separate candidate region generation; they directly regress and analyze the bounding boxes and categories of targets from multiple positions in the input image, resulting in faster processing speeds than candidate-region-based algorithms.
Redmon’s 2015 [22] breakthrough introduced the YOLO architecture, which transformed object detection into a regression problem. After years of development, the YOLO series has now been updated to version v10 [23]. Compared with traditional object detection algorithms, YOLO series models have excellent performance, a wide practical application range, and more convenient operation, gradually becoming the mainstream method of current object detection. To improve the speed of YOLO, Kumar et al. [24] changed the original backbone to a mobile network and removed the last fully connected layer and SoftMax layer, thereby significantly improving the training efficiency of the network. Gong et al. [25] optimized model complexity through separable convolutions and cross-layer shortcuts, effectively minimizing redundant network parameters, ensuring the real-time detection of small target images without compromising the detection accuracy. Li et al. [26] also attempted to introduce the dynamic anchor box mechanism into the original detection network, further improving the accuracy. Zhu et al. [27] enhanced the baseline architecture by incorporating residual blocks, enabling hierarchical feature map generation through modified convolutional layers, which can simultaneously extract the category and position information. Good detection results can be achieved for dense- and mixed-distribution target detection scenes. Zhang et al. [28] developed an efficient detection framework, termed PGYOLOv5, incorporating pyramid-structured attention mechanisms and linear feature mapping. This architecture enhances recognition precision while maintaining deployability on resource-constrained devices.
Nevertheless, owing to the inherent constraints of remote sensing images, characterized by substantial capture distances, the resolution is usually low, and the number of land targets to be detected is small. In addition, most of the targets are arranged in a disorderly and densely distributed manner. In remote sensing image land target detection, the majority of existing object detection algorithms struggle to obtain optimal detection results and even have some degree of false detection and missed detection issues. Kim et al. [29] enhanced aerial imagery analysis by modifying the YOLOv5 architecture with a channel attention pyramid mechanism, improving the recognition of diminutive targets while optimizing computational overhead. Arunnehru et al. [30] enhanced YOLO for UAV image detection via target box clustering, multi-scale training, and candidate box optimization, achieving a 79.5% detection accuracy and >84% positioning accuracy with reduced errors in real-time multi-scale scenarios. Yin et al. [31] redefined the channel partitioning of CSPNet and integrated it into the neck of YOLOv4. This redesign, combined with a bidirectional multi-scale feature weighting mechanism, significantly improved small target detection capabilities. However, there are shortcomings in detecting rotating targets. Zhang et al. [32] designed a pyramid attention network that incorporates edge information, which improved the detection accuracy of remote sensing land targets in multi-scale and complex scenes. However, the model contains a substantial number of parameters and cannot effectively detect land targets with a wide extension range. Zhao et al. [33] proposed an MS-YOLOv7 model that utilized multiple detection heads and CBAM to capture features at various scales. However, there remains potential for enhancing land target detection performance for heavily occluded remote sensing images. Qu et al. [34] improved the feature extractor and feature fusion device and replaced the attention-based dynamic detection head to make their model more suitable for vehicle detection in remote sensing land targets while also having fewer parameters and less computational complexity. However, they did not achieve ideal detection results for dense rotating targets on land. Cao et al. [35] modified the horizontal detection box to a rotation detection box by adding angle parameters and introducing circular smooth labels, achieving good results in detecting land rotation targets in remote sensing images. However, compared with other network structures, there remains potential for improvement in terms of algorithm accuracy. Zheng et al. [36] proposed the Distance-IoU (DIoU) Loss, a more efficient and geometrically informed alternative to traditional IoU-based losses, which considers both the overlap area and the distance between box centers to accelerate convergence and improve localization accuracy. By integrating DIoU into detection frameworks, models can better focus on spatial alignment, thereby enhancing the detection precision of small or weak targets in complex remote sensing scenarios. Bi et al. [37] proposed SPDC-YOLO, modifying the backbone to preserve small-target features, introducing SPC-FPN with selective boundary aggregation for multi-scale fusion, replacing detection heads with Dyhead-DCNv4 using attention mechanisms, and implementing contextual downsampling and patch-aware attention modules, demonstrating superior performance in complex UAV scenarios.
YOLOv5s is the optimal choice for remote sensing land target detection due to its lightweight efficiency, balancing high accuracy with fast inference—ideal for large-scale real-time processing. Its compact architecture enables seamless edge deployment, while adaptive multi-scale detection handles varying target sizes effectively, ensuring robust and efficient performance compared to other YOLO variants. However, there are three core limitations of the baseline YOLOv5s in remote sensing, including limited small-object sensitivity due to weak shallow feature representation, feature aliasing in dense areas from fixed upsampling kernels, and rotation-insensitive CIoU Loss leading to angular errors. Thus, this study enhances the YOLOv5s model and proposes a new model, YOLOv5s-CACSD, aimed at improving land target detection accuracy in remote sensing images while achieving model lightweighting. This study makes the following four fundamental contributions:
(1)
The YOLOv5s framework was enhanced by integrating a coordinate attention mechanism, enabling more effective extraction of key features from remote sensing images characterized by intricate backgrounds.
(2)
To refine the quality of features, the original nearest-neighbor interpolation was replaced with CARAFE, a lightweight and versatile upsampling operator, significantly improving reconstruction quality.
(3)
The conventional CIoU Loss function for bounding box regression was substituted with Shape-IoU, mitigating the impact of varying bounding box dimensions and geometries on the regression accuracy.
(4)
For real-time performance optimization, the model was lightweighted by using depthwise separable convolution.
The structure of this study is outlined below. Section 2.1 introduces the architecture of the YOLOv5 framework. In Section 2.2, a new YOLOv5s-CACSD remote sensing image land target detection model is presented. Section 2.3 introduces the dataset used to train the new model, the specific experimental methods, and the indicators used for evaluating the experimental results. Section 3 provides a quantitative and qualitative analysis of the ablation results and comparative experiments. Sections 4 and 5 discuss and summarize the performance of the YOLOv5s-CACSD remote sensing image land target detection model.

2. Materials and Methods

2.1. YOLOv5 Detection Algorithm

The YOLO algorithm is a single-stage deep-learning-based algorithm for object detection. It directly predicts the spatial coordinates and class probabilities of the target from the inputs, thereby achieving faster detection speed. Introduced by Ultralytics in mid-2020, YOLOv5 has undergone rapid iterations, establishing itself as a state-of-the-art solution due to its exceptional inference speed and precision. The framework offers five scalable variants (n, s, m, l, x), where network depth and channel width progressively expand across models [38]. These configurations provide a trade-off between accuracy and processing speed, allowing for deployment across different application requirements. Their structural principles are roughly the same, as presented in Figure 1.
The YOLOv5 algorithm comprises three primary components: The backbone serves as its foundational feature extraction module, which includes the input and various other modules, such as CBS, C3_1, and SPPF. In this section, convolutional networks extract object information from an image to create a feature pyramid for further object detection. The second part is the neck, which is primarily responsible for integrating and combining multi-scale features of feature maps through FPN + PAN and transmitting these features to the head for prediction. The final output layer, the head, performs two critical functions: computing bounding box regression loss for coordinate refinement and implementing NMS to eliminate redundant detections.

2.1.1. Backbone

(1)
CBS
The CBS module integrates three sequential operations: a convolutional layer, batch normalization (BN), and a SiLU activation function. Figure 2 presents its detailed architecture.
In the backbone, the stride of the CBS is 2, and the kernel size is 3. Therefore, CBS halves the spatial dimensions of the feature map along both the width and height during each processing step and extracts the target features while downsampling the feature map. BatchNorm2d (BN) is a batch normalization layer that normalizes the data for each batch. SiLU combines advantageous properties from both Sigmoid and ReLU, which aids in faster convergence of the network during training and surpasses ReLU in deep models, offering enhanced representational ability.
(2)
C3_1
C3_1 consists of three CBS modules and one BottleNeck1, hence its name. The three CBS modules in C3 consist of 1 × 1 convolutions, which are primarily used for dimensionality reduction or enhancement rather than feature extraction. The Bottleneck uses residual connections and incorporates two CBS modules with distinct functions: (1) a 1 × 1 convolutional layer that compresses the channel depth by 50%, followed by (2) a 3 × 3 convolutional layer that expands the channel dimension twofold. The complete architecture is depicted in Figure 3; a code sketch of the CBS and C3_1 blocks is provided at the end of this subsection.
(3)
SPPF
The spatial pyramid pooling module, abbreviated as SPPF, is an essential component of the backbone network and a key technology for achieving excellent detection performance. The SPPF architecture is presented in Figure 4. This module extracts feature representations with multi-scale contextual information through spatial pyramid pooling and feature fusion, which demonstrates superior adaptability for multi-scale target detection, thereby enhancing overall model performance in object recognition tasks.
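To make the backbone components above concrete, the following PyTorch sketch shows one plausible arrangement of the CBS and C3_1 blocks; the channel counts, class names, and split ratio are illustrative assumptions rather than code taken from the YOLOv5 repository.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU; stride 2 with a 3x3 kernel halves the spatial size."""
    def __init__(self, c_in, c_out, k=3, s=2):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)   # per-batch normalization
        self.act = nn.SiLU()              # SiLU(x) = x * sigmoid(x)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual bottleneck: a 1x1 CBS halves the channels, a 3x3 CBS restores them."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = CBS(c, c // 2, k=1, s=1)
        self.cv2 = CBS(c // 2, c, k=3, s=1)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))  # residual connection

class C3_1(nn.Module):
    """Three 1x1 CBS modules plus one Bottleneck, following the description of Figure 3."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = CBS(c, c // 2, k=1, s=1)
        self.cv2 = CBS(c, c // 2, k=1, s=1)
        self.cv3 = CBS(c, c, k=1, s=1)
        self.m = Bottleneck(c // 2)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

x = torch.randn(1, 3, 640, 640)
y = CBS(3, 32)(x)   # (1, 32, 320, 320): width and height halved by the stride-2 convolution
z = C3_1(32)(y)     # (1, 32, 320, 320): channel count and resolution preserved
```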

2.1.2. Neck

In YOLOv5, the neck receives the feature maps output from the backbone and further processes them to obtain higher-level feature representations. FPN and PAN are used in the neck module [39], and their structures are shown in Figure 5.
The FPN architecture employs a top-down pathway with lateral connections, in which higher-level semantic features are progressively upsampled and merged, and is primarily utilized to convey semantic features. In contrast, PAN is a bottom-up pathway that uses downsampling to propagate low-level details, such as shape and position, to the higher-level features, and is primarily employed to convey localization features. The feature pyramid formed by combining the two sampling directions fuses information between the different detection layers, ultimately producing the predicted feature maps.
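As an illustration of how the two pathways interact, the toy sketch below fuses three feature levels with a top-down (FPN) pass followed by a bottom-up (PAN) pass; the plain convolutions, additive fusion, and channel counts are simplifying assumptions and do not reproduce the exact YOLOv5 neck.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPNPAN(nn.Module):
    """Toy FPN + PAN fusion over three backbone levels at 1/8, 1/16, and 1/32 resolution."""
    def __init__(self, channels=(128, 256, 512), out=128):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out, 1) for c in channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out, out, 3, padding=1) for _ in channels])
        self.down = nn.ModuleList([nn.Conv2d(out, out, 3, stride=2, padding=1) for _ in range(2)])

    def forward(self, c3, c4, c5):
        # FPN: top-down pathway conveys semantic features via upsampling.
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode='nearest')
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode='nearest')
        p3, p4, p5 = self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)
        # PAN: bottom-up pathway conveys localization features via downsampling.
        n3 = p3
        n4 = p4 + self.down[0](n3)
        n5 = p5 + self.down[1](n4)
        return n3, n4, n5

c3, c4, c5 = torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)
n3, n4, n5 = SimpleFPNPAN()(c3, c4, c5)   # three fused maps passed on to the detection head
```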

2.1.3. Head

The head layer is the detection module, which undertakes the final target localization and classification tasks in YOLOv5.
The YOLOv5 objective function integrates three distinct loss components: classification loss, regression loss for bounding box localization, and confidence loss. The calculation formula is shown in Equation (1).
$$L_{v5} = \sum_{i}^{N}\left(\lambda_1 L_{cls} + \lambda_2 L_{box} + \lambda_3 L_{obj}\right) = \sum_{i}^{N}\left(\lambda_1 \sum_{j}^{B_i} L_{cls}^{j} + \lambda_2 \sum_{j}^{B_i} L_{CIoU}^{j} + \lambda_3 \sum_{j}^{S_i \times S_i} L_{obj}^{j}\right)$$
where $N$ is the number of detection layers, $B_i$ is the number of targets matched to prior boxes in the $i$-th layer, $S_i \times S_i$ is the number of grid cells into which the features are segmented at that scale, $L_{box}$ is the bounding box regression loss, $L_{obj}$ is the confidence loss of the target object, $L_{cls}$ is the classification loss of the target object, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weights of the three types of losses. The target object classification loss and confidence loss are calculated by using the BCE loss function, as shown in Equations (2) and (3).
$$L_{BCE} = -\frac{1}{n}\sum_{i}^{n}\left[y_i \log \sigma\left(x_i\right) + \left(1 - y_i\right)\log\left(1 - \sigma\left(x_i\right)\right)\right]$$
$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$
where $n$ is the number of samples, $y_i$ is the true label of the $i$-th sample (0 or 1), and $\sigma(x_i)$ is the predicted probability of the $i$-th sample (i.e., the probability that the model considers the sample to belong to the positive class).
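To make Equations (2) and (3) concrete, the short check below evaluates the BCE loss by hand on made-up logits and labels and compares it with PyTorch's built-in function; the numbers are purely illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.2, -0.7, 2.3, 0.1])   # illustrative logits x_i
y = torch.tensor([1.0, 0.0, 1.0, 0.0])    # illustrative binary labels y_i

sigma = torch.sigmoid(x)                                                      # Equation (3)
bce_manual = -(y * torch.log(sigma) + (1 - y) * torch.log(1 - sigma)).mean()  # Equation (2)
bce_builtin = F.binary_cross_entropy_with_logits(x, y)

print(bce_manual.item(), bce_builtin.item())  # the two values agree
```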

2.2. YOLOv5s-CACSD Land Target Detection Algorithm

Considering both speed and accuracy, this study selects the YOLOv5s one-stage object detection method to detect land targets in remote sensing images. YOLOv5s utilizes three detection layers of different sizes, which can effectively balance targets of different sizes. This study designs the YOLOv5s-CACSD model by improving the network structure and optimizing network parameters, which can better adapt to complex remote sensing image land target detection scenarios while ensuring its rate.

2.2.1. Introducing Attention Mechanisms

In the training process of neural networks, the number of model parameters typically exhibits a positive correlation with both the representational capacity and the information storage potential. As the total amount of information increases, the proportion of important information decreases. By introducing attention mechanisms, the model can automatically identify more crucial features when processing intricate input data, thus enhancing its performance and efficiency. Modern deep learning architectures frequently employ three predominant attention mechanisms: (1) Squeeze-and-Excitation (SE) networks [40], (2) Convolutional Block Attention Module (CBAM) [41], and (3) coordinate attention (CA) [42]. Unlike SE and CBAM, CA independently applies global average pooling along the vertical and horizontal axes, thereby maintaining critical positional information. CA outperforms other attention mechanisms by explicitly modeling long-range spatial dependencies through coordinate transformation, which is critical for sparse, directional targets in remote sensing. Unlike SE (channel-only) or CBAM (local-convolution-limited), CA efficiently captures large-scale spatial correlations with lightweight computation, enabling precise localization ideal for real-time models like YOLOv5s. Figure 6 illustrates this distinctive architectural approach.
The coordinate attention mechanism operates through two sequential processes: coordinate information embedding and attention generation. The embedding phase transforms standard global pooling into separate one-dimensional encoding operations. Each channel is encoded by using pooling kernels along the horizontal and vertical coordinates. Channel c in the h -height dimension can be represented by Equation (4).
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$
where $W$ is the width of the feature map.
Likewise, channel $c$ in the $w$-width dimension can be represented by Equation (5).
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$
where H is the height of the feature map.
These operations aggregate features along two axes, generating complementary feature maps. Subsequently, the coordinate attention generation phase begins by concatenating these directional feature maps. The combined features are then transformed through a convolutional layer followed by nonlinear activation, as shown in Equation (6).
$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$
$\left[z^h, z^w\right]$ denotes the channel-wise concatenation of the two directional feature maps while preserving their spatial dimensions, $\delta$ applies a nonlinear transformation to the input features, and $F_1$ is a convolutional transformation operation. Then, $f$ undergoes spatial division, producing two distinct output tensors, through convolution and activation functions, and its output value is finally expanded to obtain the output $y$ of CA, as presented in Equation (7).
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
The horizontal ($g_c^h$) and vertical ($g_c^w$) attention weights are element-wise multiplied with the input features to strengthen the feature representation.
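A minimal PyTorch sketch of the coordinate attention block described by Equations (4)–(7) is given below; the reduction ratio and the choice of SiLU for the nonlinearity $\delta$ are assumptions, not settings reported in the paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal coordinate attention sketch following Equations (4)-(7)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # F1 in Equation (6)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.SiLU()                                    # delta: nonlinear transformation
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Equations (4) and (5): direction-wise global average pooling.
        z_h = x.mean(dim=3, keepdim=True)                       # (b, c, h, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (b, c, w, 1)
        # Equation (6): concatenate, transform with a 1x1 convolution, and activate.
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                        # (b, c, h, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))    # (b, c, 1, w)
        # Equation (7): reweight the input feature map along both directions.
        return x * g_h * g_w

x = torch.randn(1, 128, 40, 40)
print(CoordinateAttention(128)(x).shape)   # same shape as the input: (1, 128, 40, 40)
```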

2.2.2. Improving Upsampling Methods

YOLOv5’s fusion module utilizes nearest-neighbor upsampling, which simply copies the nearest pixel value to fill in new pixel positions; however, it cannot provide more detailed information. To address this limitation, we introduce CARAFE [43], a lightweight and adaptive upsampling operator. Compared to conventional interpolation, CARAFE significantly improves feature map quality. The details of CARAFE are presented in Figure 7.
CARAFE primarily comprises an upsampling kernel prediction module and a feature recombination module. The upsampling kernel prediction module can be divided into channel compression, content encoding, and kernel normalization. Firstly, the channel compression part uses a 1 × 1 convolution to reduce the input feature map’s channel dimension, reducing subsequent computational complexity. Next, the content encoding part uses convolutional layers of size $k_{encoder} \times k_{encoder}$ to predict the upsampling kernel and then unfolds it. The channels are divided into $\sigma^2$ blocks, each with $k_{up}^2$ layers, and this arrangement is rearranged to obtain an upsampling kernel with a shape of $\sigma H \times \sigma W \times k_{up}^2$, where $k_{up}$ represents the size of the upsampling kernel and $\sigma$ represents the upsampling ratio. The final processing stage employs SoftMax normalization to stabilize the predicted upsampling kernels. Subsequently, the feature reassembly module executes a content-aware mapping operation by projecting each output feature position to its corresponding receptive field in the inputs, extracting the local feature neighborhood of size $k_{up} \times k_{up}$, and computing a weighted summation through a dot product between the extracted features and the normalized prediction kernel.
CARAFE outperforms nearest-neighbor interpolation by dynamically predicting adaptive convolution kernels based on local semantic context, enabling the precise reconstruction of details and edges—unlike mechanical pixel replication that causes aliasing and blurring in traditional methods. This content-aware approach significantly enhances small target and edge clarity in object detection while maintaining computational efficiency, with only a marginal overhead increase. The combination of improved upsampling quality and preserved real-time performance makes CARAFE particularly effective for geometry-critical applications like remote sensing target detection.
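For reference, the sketch below implements the two CARAFE stages described above (kernel prediction, then content-aware reassembly); k_encoder = 3 and k_up = 5 follow the settings given later in Section 2.2.5, while the compressed channel width and the upsampling ratio of 2 are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Minimal CARAFE sketch: content-aware kernel prediction followed by feature reassembly."""
    def __init__(self, channels, compressed=64, k_encoder=3, k_up=5, scale=2):
        super().__init__()
        self.k_up, self.scale = k_up, scale
        # Kernel prediction: channel compression, content encoding, kernel normalization.
        self.compress = nn.Conv2d(channels, compressed, kernel_size=1)
        self.encode = nn.Conv2d(compressed, scale * scale * k_up * k_up,
                                kernel_size=k_encoder, padding=k_encoder // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # Predict one k_up x k_up kernel per output location and normalize it with SoftMax.
        kernels = self.encode(self.compress(x))            # (b, scale^2 * k_up^2, h, w)
        kernels = F.pixel_shuffle(kernels, self.scale)     # (b, k_up^2, scale*h, scale*w)
        kernels = F.softmax(kernels, dim=1)
        # Reassembly: gather each k_up x k_up input neighbourhood and take the weighted sum.
        patches = F.unfold(x, kernel_size=self.k_up, padding=self.k_up // 2)   # (b, c*k_up^2, h*w)
        patches = patches.view(b, c * self.k_up * self.k_up, h, w)
        patches = F.interpolate(patches, scale_factor=self.scale, mode='nearest')
        patches = patches.view(b, c, self.k_up * self.k_up, self.scale * h, self.scale * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)  # (b, c, scale*h, scale*w)

x = torch.randn(1, 128, 40, 40)
print(CARAFE(128)(x).shape)   # torch.Size([1, 128, 80, 80])
```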

2.2.3. Improving the Bounding Box Regression Loss Function

YOLOv5 uses CIoU Loss as the bounding box regression loss function. CIoU Loss considers penalties for distance, area, and incomplete overlap of the target box center point, but it fails to consider the geometric properties of bounding boxes, potentially leading to suboptimal localization accuracy. Considering this issue, this article proposes using Shape-IoU as the bounding box regression loss. The formulas are presented in Equations (8)–(13).
$$L_{Shape\text{-}IoU} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape}$$
$$IoU = \frac{\left|B \cap B^{gt}\right|}{\left|B \cup B^{gt}\right|}$$
$$distance^{shape} = hh \times \frac{\left(x_c - x_c^{gt}\right)^2}{c^2} + ww \times \frac{\left(y_c - y_c^{gt}\right)^2}{c^2}$$
$$ww = \frac{2 \times \left(w^{gt}\right)^{scale}}{\left(w^{gt}\right)^{scale} + \left(h^{gt}\right)^{scale}}, \qquad hh = \frac{2 \times \left(h^{gt}\right)^{scale}}{\left(w^{gt}\right)^{scale} + \left(h^{gt}\right)^{scale}}$$
$$\Omega^{shape} = \sum_{t = w, h}\left(1 - e^{-\omega_t}\right)^{\theta}, \quad \theta = 4$$
$$\omega_w = hh \times \frac{\left|w - w^{gt}\right|}{\max\left(w, w^{gt}\right)}, \qquad \omega_h = ww \times \frac{\left|h - h^{gt}\right|}{\max\left(h, h^{gt}\right)}$$
$IoU$ represents the traditional intersection over union ratio. $B$ represents the prediction box, and $B^{gt}$ represents the ground truth box. $distance^{shape}$ is a distance term that considers the shape factor of the box. $x_c$ and $y_c$ are the center coordinates of the predicted box, and $x_c^{gt}$ and $y_c^{gt}$ are the center coordinates of the label box. $c$ is related to the sizes of the label box and the predicted box and the coordinates of their center points. $ww$ and $hh$ are the weight coefficients in the horizontal and vertical directions, whose values are related to the shape of the label box. $\Omega^{shape}$ represents the shape loss term. $\omega_w$ and $\omega_h$ are the shape loss factors for the width and height. $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth box. $\theta$ is a fixed proportionality coefficient (usually taken as 4).
Shape-IoU advances upon CIoU by dynamically adapting loss weights to target geometry (e.g., aspect ratio, rotation) and decoupling geometric feature optimization, eliminating CIoU’s rigid constraints that cause regression conflicts. This shape-aware approach achieves superior bounding box fitting, particularly for complex targets in remote sensing, while maintaining computational efficiency, perfectly aligning with modern detection frameworks prioritizing adaptive precision.
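The function below is a minimal sketch of Equations (8)–(13) for axis-aligned boxes in (x1, y1, x2, y2) format; interpreting $c^2$ as the squared diagonal of the smallest enclosing box is our reading of the definition, and scale = 1 follows the setting reported in Section 2.2.5.

```python
import torch

def shape_iou_loss(pred, target, scale=1.0, theta=4.0, eps=1e-7):
    """Shape-IoU loss sketch; pred and target are (N, 4) tensors of (x1, y1, x2, y2) boxes."""
    # Box widths, heights, and center coordinates.
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    xc1, yc1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    xc2, yc2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # Equation (9): intersection over union.
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Squared diagonal c^2 of the smallest enclosing box (assumed reading of Equation (10)).
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Equation (11): shape-dependent weights derived from the ground-truth box.
    ww = 2 * w2 ** scale / (w2 ** scale + h2 ** scale)
    hh = 2 * h2 ** scale / (w2 ** scale + h2 ** scale)

    # Equation (10): shape-weighted center distance.
    dist_shape = hh * (xc1 - xc2) ** 2 / c2 + ww * (yc1 - yc2) ** 2 / c2

    # Equations (12) and (13): shape loss term.
    omega_w = hh * (w1 - w2).abs() / torch.max(w1, w2)
    omega_h = ww * (h1 - h2).abs() / torch.max(h1, h2)
    omega_shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    # Equation (8): final Shape-IoU loss.
    return 1 - iou + dist_shape + 0.5 * omega_shape

pred = torch.tensor([[10., 10., 50., 60.]])
gt = torch.tensor([[12., 8., 48., 58.]])
print(shape_iou_loss(pred, gt))
```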

2.2.4. Depthwise Separable Convolution

To optimize computational efficiency, we replace the standard convolution operations in the YOLOv5s backbone’s C3 with depthwise separable convolution [44]. This technique decomposes conventional convolution into two sequential operations: a depthwise convolution, followed by a pointwise convolution. This approach exhibits robust representational capabilities while effectively preserving model performance. The detailed implementation is depicted in Figure 8.
In contrast to traditional convolution, the depthwise stage applies a separate, smaller kernel to each input channel for an independent convolution operation, while the pointwise stage uses a 1 × 1 convolution kernel to linearly combine the depthwise results across channels. By drastically lowering the model’s parameter load, this convolutional approach can make the model lighter by lowering its computational complexity and storage requirements.
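The following sketch shows a depthwise separable replacement for a standard convolution; wrapping it with BN and SiLU in the CBS style is an assumption about how the C3-DSConv modules are organized. For a 3 × 3 convolution with 128 input and 128 output channels, this drops the weight count from 128 × 128 × 9 = 147,456 to 128 × 9 + 128 × 128 = 17,536.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one kernel per channel) followed by a 1x1 pointwise convolution."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: groups == in_channels, so each channel is filtered independently.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_channels, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 128, 40, 40)
print(DepthwiseSeparableConv(128, 128)(x).shape)   # torch.Size([1, 128, 40, 40])
```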

2.2.5. Improved Model Structure

As illustrated in Figure 9, the proposed YOLOv5s-CACSD architecture introduces three key modifications to the baseline model: (1) Backbone: replacement of standard C3 modules with C3-DSConv and the incorporation of CA attention mechanisms after the third C3-DSConv. (2) Neck: substitution of the original upsampling method with CARAFE, where k_enc = 3 and k_up = 5, as well as an additional CA attention layer after the first C3 module in PAN. (3) Head: the adoption of Shape-IoU as the bounding box regression loss function, where the scale is set to 1. All of the CA attention mechanisms in these cases use Sigmoid as their activation function.

2.3. Experimental Design

2.3.1. Description of the Dataset

The study utilized the DOTAv1.0 dataset, a large-scale collection specifically designed for object detection tasks. This dataset comprises 2806 images acquired from multiple sensor platforms, with image dimensions varying from 800 × 800 to 4000 × 4000 pixels. The annotated corpus contains 188,282 instances across 15 object categories: plane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer field, and swimming pool. Sample images are shown in Figure 10.
The original images in the DOTAv1.0 dataset are too large, which is not conducive to model training. Therefore, in the image preprocessing stage, this article used a sliding window to uniformly crop the images to a size of 1024 × 1024 pixels. After cutting, a total of 45,929 images were selected. Among them, 7000 were allocated for training, 2000 for validation, and 1000 for testing at a 7:2:1 ratio. This 7:2:1 split ensured effective model development: the 70% training data enabled robust feature learning, the 20% validation set optimized the hyperparameters and prevented overfitting, while the 10% test set provided unbiased performance evaluation. This industry-standard ratio balances training efficiency and assessment reliability, particularly suited for data-intensive tasks, like remote sensing object detection, that require rigorous validation.
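A sliding-window cropping routine along these lines might look like the following sketch; the overlap between adjacent tiles is an illustrative assumption, since the paper does not state it.

```python
import numpy as np

def sliding_window_crop(image, tile=1024, overlap=200):
    """Crop a large image into fixed-size tiles with a sliding window (overlap is illustrative)."""
    h, w = image.shape[:2]
    step = tile - overlap
    crops = []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            # Clamp the window so that tiles at the right/bottom edges stay inside the image.
            y0, x0 = min(y, max(h - tile, 0)), min(x, max(w - tile, 0))
            crops.append(image[y0:y0 + tile, x0:x0 + tile])
    return crops

# Example: a 4000 x 4000 DOTA image yields a grid of overlapping 1024 x 1024 tiles.
tiles = sliding_window_crop(np.zeros((4000, 4000, 3), dtype=np.uint8))
print(len(tiles))
```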

2.3.2. Evaluation Indicators

The evaluation employed mAP as the primary metric for assessing the detection accuracy. Defined as the mean of the per-category AP scores at specified IoU thresholds, mAP provides a comprehensive measure of model performance. In particular, mAP50 can simultaneously evaluate the model’s ability in terms of target localization and classification, which offers a more holistic view of detection capability. The mathematical formulations for mAP and AP50 are provided in Equations (14) and (15), respectively.
$$mAP = \frac{1}{n}\sum_{j=1}^{n} AP_j$$
$$AP_{50} = \frac{1}{n}\sum_{i=1}^{n} P_i^{IoU=0.5}\, R_i^{IoU=0.5}$$
where $n$ is the number of categories, $R$ is the recall rate, and $P$ is the precision rate.
To evaluate the model’s efficiency, we employed two key metrics: FLOPs and parameters. FLOPs serve as a quantitative measure of computational complexity, directly impacting hardware resource requirements and processing latency. Parameters reflect the model’s capacity, with higher counts typically indicating greater representational power but also demanding more extensive training data, a larger memory footprint, and enhanced computational resources.

2.3.3. Experimental Methods

The configuration of the parameters and training specifications for our experimental setup is detailed in Table 1.

3. Results

3.1. Results of Ablation Experiments

To systematically measure the efficacy of the YOLOv5s-CACSD framework and investigate the individual contributions of each architectural modification to the model’s performance, ablation experiments were conducted in this section. The specific experimental configurations are shown in Table 2, where “√” indicates that the corresponding improvement method was introduced and “×” indicates that it was not.
The quantitative results are shown in Table 3.
A comparison between the different models is presented in Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15.
Through a comprehensive evaluation of multiple network models, as documented in Table 3 and Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15, our proposed YOLOv5s-CA model demonstrated consistent performance improvements over the baseline YOLOv5s, achieving a 2.3% higher precision, 0.4% greater recall, and 1.4% increased mAP. This suggests that adding an attention mechanism can successfully enhance the model’s capacity to recognize land target images, but it also results in a rise in parameters and FLOPs. YOLOv5s-C demonstrated superior P and R values over YOLOv5s, with gains of 2.7% and 1%, respectively; its effect was more significant than that of YOLOv5s-CA, although it also had a larger impact on the parameters and FLOPs. The YOLOv5s-S model maintained constant computational and parameter needs while achieving a modest 0.4% gain in mAP. YOLOv5s-CAC, YOLOv5s-CAS, and YOLOv5s-CS merged the three approaches in a pairwise fashion. It is evident that enhancing the upsampling technique and adding a CA mechanism improved the performance more than enhancing the bounding box loss function. The enhanced YOLOv5s-CACS model in this paper obtained P, R, and mAP values of 92.7%, 87.5%, and 91.8%, respectively, as can be seen from the different indicators in the table. When combined, these modules facilitate better information fusion. The combined effectiveness of CA, CARAFE, and Shape-IoU comes from their complementary roles in feature processing: CA provides spatial guidance for CARAFE, CARAFE enhances boundary accuracy for Shape-IoU, and Shape-IoU refines CA’s attention by penalizing misalignments. Together, they result in a more robust model, which translates to significant improvements in some performance metrics. Although the model’s detection capability was significantly enhanced, its parameter count and computational cost also increased by 0.3 M and 0.4 GFLOPs, respectively, rendering it insufficiently lightweight.
To address the critical need for land target detection models in practical scenarios, this paper attempted to use depthwise separable convolutions to replace the convolutions in the C3 modules at different positions in the YOLOv5s-CACS model, and it explored its effect on model lightweighting, which can be seen in Table 4.
Comparison charts of the different models are presented in Figure 16, Figure 17, Figure 18, Figure 19 and Figure 20.
The aforementioned table and histograms demonstrate that the parameters and FLOPs dropped by 1.2 M and 3.5 GFLOPs, respectively, when depthwise separable convolution was used to substitute the C3 convolutions in the backbone; similarly, the parameters and FLOPs dropped by 0.7 M and 2.1 GFLOPs, respectively, when depthwise separable convolution replaced the C3 convolutions in the neck. Both of these approaches accomplish model lightweighting while keeping the decline in average detection accuracy, precision, and recall within a reasonable error range. Replacing the C3 modules in the backbone produced a more noticeable lightweighting effect than replacing those in the neck, with only a small difference in accuracy between the two. When the convolution operations in all C3 modules were replaced, the lightweighting effect was optimal; however, the detection accuracy decreased by 4.5%, which is a significant change. In summary, this article opted to replace the convolution operations of the C3 modules in the backbone with depthwise separable convolution and proposes the final improved model, YOLOv5s-CACSD.

3.2. Results of Comparative Experiments

To comprehensively evaluate the land target detection results of YOLOv5s-CACSD, we conducted rigorous comparative experiments against five representative detection architectures: Faster R-CNN, YOLOv4, YOLOv5s, YOLOv8, and YOLOv10, spanning both established and state-of-the-art object detectors. Faster R-CNN represents a classical approach to object detection with region proposals, serving as a strong baseline for accuracy-focused models. YOLOv4 and YOLOv5s are widely recognized for their real-time performance and efficiency, with YOLOv5s being a lightweight variant ideal for speed. YOLOv8, despite performing worse than YOLOv5s, was included to assess its innovative features, particularly its anchor-free design. However, this design may not perform as well as the YOLOv5s anchor box method in certain situations, especially in remote sensing image land target detection scenes with significant changes in target size and scale. YOLOv10, though introducing an NMS-free end-to-end design for efficient detection, exhibited limited adaptability to the extreme scale variations typical in remote sensing imagery. The quantitative results in Table 5 and Table 6 demonstrate our model’s advantages.
Figure 21, Figure 22 and Figure 23 present a comprehensive comparative analysis of the model performance across three key evaluation dimensions.
From the table and line charts, it is evident that Faster R-CNN achieved high detection accuracy; however, it also exhibited significant drawbacks, including considerable architectural parameters and prohibitive computational requirements. In the experiment, the mAP of YOLOv4 was 86.4%, but it suffered from comparatively low detection accuracy, a large model size, and high computational demands. The mAP values for YOLOv5s and YOLOv8 were not significantly different; however, YOLOv5s had 4.1 M fewer parameters and 12.5 GFLOPs lower computational complexity than YOLOv8. This indicates that YOLOv5s has considerable advantages in remote sensing image land object detection tasks. The proposed YOLOv5s-CACSD architecture achieved a 2.0% improvement in mAP@0.5 over the baseline YOLOv5s while simultaneously reducing the parameters and computational complexity by 0.9 M and 2.9 GFLOPs, respectively. In contrast with YOLOv8, YOLOv5s-CACSD achieved an mAP that was 2.5% higher, with 5 M fewer parameters and 15.6 GFLOPs lower computational complexity. In contrast with YOLOv10, YOLOv5s-CACSD achieved an mAP that was 0.7% higher, with 1.1 M fewer parameters and 8.8 GFLOPs lower computational complexity. In summary, the YOLOv5s-CACSD algorithm presented in this study demonstrated higher accuracy than the other algorithms, along with fewer parameters and lower computational complexity, providing distinct advantages in remote sensing image land object detection tasks.
The different performance values in Table 6 demonstrate that the YOLOv5s-CACSD architecture outperformed the other four models across six critical land target categories, including ship, storage tank, tennis court, basketball court, large vehicle, and small vehicle, achieving superior average precision scores in all cases. Especially in the detection of small and rotating targets, the advantages were significant: the helicopter results improved by 7.3 percentage points compared to YOLOv5s, the bridge results by 5.6 percentage points, and the small vehicle results by 4.6 percentage points. These improvements effectively enhance the model’s ability to capture complex targets. The detection accuracy for the remaining nine types was also at an elevated level, validating the efficacy of our methodological enhancements.
The difference in the detection accuracy between the different target categories was due to multiple factors, including the size, shape, texture complexity, and background interference in the image. For example, regular, large-scale targets such as tennis courts typically have a higher detection accuracy because their features are more prominent and easier to distinguish. Targets such as small vehicles and helicopters are relatively small and irregular in shape, making their detection difficult and resulting in a lower accuracy. In addition, the complexity of the background and the similarity between targets can also affect the accuracy. For example, ships and planes may have a lower detection accuracy due to their complex structure and background environment.
Comparing the loss curves of the YOLOv5s-CACSD with those of YOLOv5s, the improved model demonstrated enhancements in three key areas: bounding box regression loss, classification loss, and confidence loss.
As illustrated in Figure 24, the bounding box regression loss and classification loss of the two models in the first 10 epochs were close, and the loss curves were approximately overlapping. After that, the differences gradually became apparent, especially in the bounding box regression loss. The loss values steadily decreased during subsequent training until they stabilized after 175 epochs. Notably, the bounding box regression loss and confidence loss of YOLOv5s-CACSD were noticeably lower. After 200 epochs, the bounding box regression losses of YOLOv5s and YOLOv5s-CACSD reached 0.03 and 0.025, and the confidence losses reached 0.022 and 0.019, indicating that our model can locate objects more accurately and with higher prediction confidence. Although there was no significant difference in the classification loss curves, the loss curve of our model was generally smooth and exhibited high stability, demonstrating that the improved scheme has an optimization effect.
Figure 25 presents a comprehensive visual comparison between the detection outputs of our YOLOv5s-CACSD framework and the five baseline architectures.
From the actual detection results of the five models listed above, YOLOv4 exhibited the poorest detection performance, particularly concerning small vehicles, where missed detections were notably severe. There was a discernible discrepancy between the target positions detected by YOLOv8 and those in the original images. The actual results of Faster R-CNN and YOLOv5s were not significantly different; both models performed better overall than YOLOv4 and YOLOv8, although they still experienced a certain degree of missed detections. The detection results of YOLOv10 were the closest to our model, but there was still a problem of missed detection of small vehicle targets.
The YOLOv5s-CACSD model proposed in this article significantly reduces the occurrences of missed detections and misalignments seen in other mainstream models, achieving the highest detection accuracy and effectively ensuring precise detection of land targets in remote sensing images.

4. Discussion

This article investigated a YOLOv5s-CACSD architecture for land target detection in remote sensing images. Through systematic experimentation and evaluation, we demonstrated the model’s viability for real applications. Our modifications to the baseline YOLOv5s framework include three key enhancements: First, we integrated a coordinate attention module to strengthen the feature extraction capabilities from complex remote sensing data. Second, we replaced conventional upsampling with an advanced feature reassembly technique that better preserves spatial details during resolution enhancement. Third, we implemented Shape-IoU to overcome geometric limitations in traditional bounding box regression. To optimize computational efficiency, we incorporated depthwise separable convolutions throughout the network. A comprehensive evaluation on the DOTA dataset showed that our model achieved a 91.0% detection accuracy while maintaining real-time processing speeds, significantly outperforming existing approaches in both precision and efficiency.
While the proposed YOLOv5s-CACSD framework demonstrates significant improvements in land target detection, there remains significant potential for enhancement. Firstly, the DOTAv1.0 dataset presents three key limitations for land target detection: extreme scale variation between dense small targets and large objects challenges single-model accuracy; horizontal bounding boxes fail to precisely represent rotation-sensitive targets; the imbalanced category distribution leads to overfitting of high-frequency categories in the model while significantly reducing the detection performance of low-frequency categories, affecting the overall generalization ability. Meanwhile, limited datasets risk overfitting, as models may memorize noise/specifics rather than learning generalizable features, causing high training accuracy but poor test performance. Future work should incorporate more suitable datasets to validate the proposed model’s robustness. Secondly, to maintain the model’s lightweight nature and meet real-time requirements in practical applications, the YOLOv5s-CACSD model proposed in this paper compromises on the detection accuracy to some extent. Subsequent research should prioritize optimizing the model’s computational efficiency while preserving detection accuracy. Thirdly, YOLOv5s-CACSD encounters three key deployment challenges: computational constraints, accuracy–speed trade-offs, and environmental adaptability. These limitations necessitate focused research on optimized compression techniques and adaptive inference methods.

5. Conclusions

Land target detection in remote sensing images is a crucial component of remote sensing image analysis technology. This paper presents an enhanced model, YOLOv5s-CACSD, which is based on YOLOv5s. By incorporating a CA mechanism, a lightweight universal upsampling operator (CARAFE), and a Shape-IoU loss function, the detection accuracy is significantly improved. Additionally, depthwise separable convolution is employed for lightweight processing, addressing the requirements for land target detection in practical scenarios.
The efficacy of the YOLOv5s-CACSD was demonstrated through experiments, establishing its effectiveness in land target detection. Specifically, the mAP value of the YOLOv5s-CACSD algorithm, applied to the DOTAv1.0 dataset, exceeded that of YOLOv5s by a margin of 2.0%. The proposed enhancements demonstrated consistent performance improvements across multiple object categories, with notably enhanced accuracy for small and rotated objects. Additionally, the model achieved a reduction in parameter count and computational complexity of 0.9 M and 2.9 GFLOPs, respectively, contributing to its lightweight design.
Future enhancements could focus on multi-scale detection and rotated bounding box representations to address scale and rotation challenges. Handling class imbalance through oversampling or loss function adjustments and optimizing computational efficiency through pruning, quantization, and knowledge distillation are key steps. Additionally, exploring compression techniques and adaptive inference methods will improve deployment flexibility and performance.

Author Contributions

Conceptualization, J.T., S.Y. and S.L.; Methodology, W.H. and X.J.; Software, X.J.; Validation, X.J. and S.Y.; Formal analysis, W.H., J.T. and S.Y.; Resources, S.Y.; Data curation, X.J.; Writing—original draft, W.H. and J.T.; Writing—review & editing, S.L.; Supervision, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The DOTAv1.0 dataset, which serves as a valuable resource for validating the results obtained in this study, is publicly accessible at the following location: https://captain-whu.github.io/DOTA/dataset.html (accessed on 1 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hui, C.; Yong, X.; Xin, X. Design of disaster meteorological observation data monitoring system based on multi-source satellite remote sensing. Comput. Meas. Control 2023, 31, 24–29. [Google Scholar]
  2. Shuchen, Z.; Kun, L.; Yi, Z.; Wu, X. The Great Transformation of Intelligent Remote Sensing Monitoring Technology. Computer News, 8 January 2024. [Google Scholar] [CrossRef]
  3. Yang, C. Application examples of remote sensing and precision agriculture technology in crop disease detection and management. Engineering 2020, 6, 102–112. [Google Scholar] [CrossRef]
  4. Wei, C.; Du, Y.; Cheng, Z.; Zhou, Z.; Gu, X. Study on yield estimation of winter wheat covered with plastic film based on drone remote sensing vegetation index optimization. J. Agric. Mach. 2024, 5, 1–14. [Google Scholar]
  5. Bao, S.; Lu, L. The impact of spatial evolution guided by urban planning in Hefei on the spatiotemporal evolution of land prices. J. Geogr. 2015, 70, 906–918. [Google Scholar]
  6. Kuffer, M.; Pfeffer, K.; Persello, C. Special issue “remote-sensing-based urban planning indicators”. Remote Sens. 2021, 13, 1264. [Google Scholar] [CrossRef]
  7. Chen, J.; Wu, H.; Zhang, J.; Liao, A.; Liu, W.; Zhang, J.; Miao, Q.; Feng, W.; Lu, W. The direction and tasks of constructing a natural resource survey and monitoring technology system. J. Geogr. 2022, 77, 1041–1055. [Google Scholar]
  8. Guo, D.; Li, S.; Chen, Z.; Wang, L. Evaluation of Demand Satisfaction for High Resolution Satellite Natural Resource Survey. J. Remote Sens. 2022, 26, 579–587. [Google Scholar]
  9. Wang, Z.; Kang, Q.; Xun, Y.; Shen, Z.Q.; Cui, C.B. Military reconnaissance application of high-resolution optical satellite remote sensing. In Proceedings of the International Symposium on Optoelectronic Technology and Application 2014: Optical Remote Sensing Technology and Applications, Beijing, China, 13–15 May 2014; Volume 9299, pp. 301–305. [Google Scholar]
  10. Zhu, W.; Xie, B.; Wang, T.; Shen, J.; Zhu, H. Review of Aircraft Target Detection Technology in Optical Remote Sensing Images. Comput. Sci. 2020, 47, 165–171+182. [Google Scholar]
  11. Hu, H.; Zuo, J.; Lu, Y.; Zhao, R. Remote sensing image road network detection method for autonomous driving. Chin. J. Highw. 2022, 35, 310–317. [Google Scholar] [CrossRef]
  12. Gao, Y.; Lei, R. Progress and Prospects of Multi source Remote Sensing Forest Fire Monitoring. J. Remote Sens. 2024, 28, 1854–1869. [Google Scholar]
  13. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
  14. Guo, H.; Liang, D.; Liu, G. Progress of Earth Observation in China. Chin. J. Space Sci. 2020, 40, 908–919. [Google Scholar] [CrossRef]
  15. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  16. Hui, Y.; You, S.; Hu, X.; Yang, P.; Zhao, J. SEB-YOLO: An Improved YOLOv5 Model for Remote Sensing Small Target Detection. Sensors 2024, 24, 2193. [Google Scholar] [CrossRef]
Figure 1. Structure of YOLOv5s.
Figure 2. Structure of CBS.
Figure 3. Structure of C3_1.
Figure 4. Structure of SPPF.
Figure 5. Structure of FPN and PAN.
Figure 6. Structure of CA.
Figure 7. Structure of CARAFE.
Figure 8. Depthwise separable convolution.
Figure 9. Improved network structure.
Figure 10. DOTAv1.0 dataset images.
Figure 11. mAP comparison across model variants.
Figure 12. Precision (P) comparison across model variants.
Figure 13. Recall (R) comparison across model variants.
Figure 14. Parameter (M) comparison across model variants.
Figure 15. FLOPs comparison across model variants.
Figure 16. Model-lightweighting experiment: mAP trend.
Figure 17. Model-lightweighting experiment: Precision (P) trend.
Figure 18. Model-lightweighting experiment: Recall (R) trend.
Figure 19. Model-lightweighting experiment: Parameter (M) comparison.
Figure 20. Model-lightweighting experiment: FLOPs comparison.
Figure 21. Cross-model comparison: mAP.
Figure 22. Cross-model comparison: Parameter (M).
Figure 23. Cross-model comparison: FLOPs.
Figure 24. Loss curve comparison. (a) box_loss; (b) cls_loss; (c) obj_loss.
Figure 25. Comparison between the detection outputs.
Table 1. Training settings.
Parameter | Setting
Initial learning rate | 0.01
Learning rate momentum | 0.937
Optimizer | SGD
Learning rate adjustment strategy | Cosine annealing
Training epochs | 200
Data-loading workers | 8
Batch size | 64
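As a reproducibility aid, the sketch below shows one way the settings in Table 1 could be wired together in PyTorch: SGD with momentum 0.937 and an initial learning rate of 0.01, cosine-annealing scheduling over 200 epochs, a batch size of 64, and 8 data-loading workers. The model, dataset, and loss are placeholders rather than the YOLOv5s-CACSD network or the DOTA data, so this is an illustration of the training schedule only.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; they stand in for the detection network and remote sensing images.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
dataset = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 2, (256,)))

EPOCHS = 200                                                 # training epochs (Table 1)
loader = DataLoader(dataset, batch_size=64, num_workers=8)   # batch size 64, 8 data-loading workers

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)          # initial LR, momentum
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)   # cosine annealing

for epoch in range(EPOCHS):
    for images, _ in loader:
        optimizer.zero_grad()
        loss = model(images).mean()   # placeholder objective; the detection losses are omitted here
        loss.backward()
        optimizer.step()
    scheduler.step()                  # decay the learning rate once per epoch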
Table 2. Improvement method corresponding to network names.
Network Names | CA | CARAFE | Shape-IoU
YOLOv5s | × | × | ×
YOLOv5s-CA | ✓ | × | ×
YOLOv5s-C | × | ✓ | ×
YOLOv5s-S | × | × | ✓
YOLOv5s-CAC | ✓ | ✓ | ×
YOLOv5s-CAS | ✓ | × | ✓
YOLOv5s-CS | × | ✓ | ✓
YOLOv5s-CACS | ✓ | ✓ | ✓
Table 3. Experimental results.
Network Names | mAP/% | P/% | R/% | Params/M | FLOPs/G
YOLOv5s | 87.3 | 85.6 | 83.3 | 7.0 | 15.9
YOLOv5s-CA | 88.7 | 87.9 | 83.7 | 7.1 | 16.0
YOLOv5s-C | 88.9 | 88.3 | 84.3 | 7.2 | 16.1
YOLOv5s-S | 87.7 | 86.3 | 83.0 | 7.0 | 15.8
YOLOv5s-CAC | 90.3 | 90.1 | 86.4 | 7.3 | 16.3
YOLOv5s-CAS | 89.2 | 88.5 | 85.8 | 7.1 | 16.0
YOLOv5s-CS | 89.6 | 89.8 | 84.6 | 7.2 | 16.1
YOLOv5s-CACS | 91.8 | 92.7 | 87.5 | 7.3 | 16.3
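The mAP values in Table 3 (and in the comparisons that follow) are computed per class by matching predicted boxes to ground-truth boxes at an IoU threshold of 0.5, integrating the resulting precision-recall curve, and averaging the per-class results. The NumPy sketch below illustrates single-class AP@0.5 in a simplified form; it is not the evaluation code used in this study.

import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ap_at_50(preds, gts):
    """Single-class AP@0.5. preds: list of (score, box); gts: list of boxes."""
    preds = sorted(preds, key=lambda p: -p[0])          # rank detections by confidence
    tp, fp, matched = np.zeros(len(preds)), np.zeros(len(preds)), set()
    for i, (_, box) in enumerate(preds):
        ious = [iou(box, g) for g in gts]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= 0.5 and j not in matched:
            tp[i] = 1; matched.add(j)                   # first match at IoU >= 0.5 is a true positive
        else:
            fp[i] = 1
    rec = np.cumsum(tp) / max(len(gts), 1)
    prec = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    prec = np.maximum.accumulate(prec[::-1])[::-1]      # monotone (interpolated) precision
    ap, prev_r = 0.0, 0.0
    for r, p in zip(rec, prec):
        ap += (r - prev_r) * p; prev_r = r              # area under the interpolated PR curve
    return ap

preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.3, (100, 100, 110, 106))]
gts = [(1, 1, 10, 10), (100, 100, 110, 110)]
print(round(ap_at_50(preds, gts), 3))                   # 0.833 for this toy example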
Table 4. Model lightweighting experiment.
Replace Position | mAP/% | P/% | R/% | Params/M | FLOPs/G
Backbone | 91.0 | 92.6 | 87.1 | 6.1 | 12.8
Neck | 89.8 | 93.2 | 85.7 | 6.6 | 14.2
All | 87.3 | 87.3 | 82.9 | 5.4 | 11.9
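The "replace position" in Table 4 refers to swapping standard convolutions for depthwise separable convolutions, which factor a convolution into a per-channel spatial filter followed by a 1x1 pointwise projection and thereby cut parameters and FLOPs. The PyTorch sketch below is a generic block of this kind; the exact normalization, activation, and placement used in YOLOv5s-CACSD are assumptions here.

import torch
from torch import nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one filter per input channel) followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()   # SiLU mirrors YOLOv5's standard conv block; the paper's choice may differ

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Shape check: 64 -> 128 channels at unchanged spatial resolution, with far fewer weights
# (64*3*3 + 64*128) than a standard 3x3 convolution (64*128*3*3).
x = torch.randn(1, 64, 80, 80)
print(DepthwiseSeparableConv(64, 128)(x).shape)   # torch.Size([1, 128, 80, 80])

In Table 4, replacing the convolutions in the backbone alone gives the best trade-off (91.0% mAP at 6.1 M parameters and 12.8 GFLOPs), whereas replacing them throughout the network reduces complexity further but costs accuracy.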
Table 5. Comparison with other detection models.
Network Names | mAP/% | Params/M | FLOPs/G
Faster R-CNN | 89.1 | 93.6 | 198.7
YOLOv4 | 86.4 | 74.3 | 153.2
YOLOv5s | 89.0 | 7.0 | 15.9
YOLOv8 | 88.5 | 11.1 | 28.4
YOLOv10 | 90.3 | 7.2 | 21.6
YOLOv5s-CACSD (ours) | 91.0 | 6.1 | 12.8
Table 6. Accuracy of various categories (%).
Target Category | Faster R-CNN | YOLOv4 | YOLOv5s | YOLOv8 | YOLOv10 | YOLOv5s-CACSD
plane | 84.6 | 87.4 | 83.7 | 82.6 | 87.9 | 86.7
ship | 81.3 | 81.2 | 81.6 | 80.9 | 82.2 | 82.5
storage tank | 83.7 | 87.5 | 85.2 | 84.3 | 88.6 | 89.3
baseball diamond | 93.1 | 93.3 | 90.4 | 92.7 | 94.8 | 94.1
tennis court | 95.6 | 89.7 | 95.8 | 94.9 | 96.1 | 96.4
basketball court | 94.5 | 90.3 | 89.9 | 91.3 | 94.5 | 94.9
ground track field | 97.4 | 92.4 | 88.1 | 89.1 | 94.3 | 93.6
harbor | 94.1 | 88.5 | 85.8 | 87.2 | 89.5 | 88.8
bridge | 70.3 | 90.1 | 79.4 | 75.5 | 84.4 | 85.0
large vehicle | 88.6 | 86.9 | 84.6 | 85.6 | 88.7 | 89.3
small vehicle | 79.7 | 80.0 | 77.5 | 78.4 | 81.3 | 82.1
helicopter | 82.3 | 83.0 | 75.5 | 75.0 | 81.5 | 82.8
roundabout | 77.9 | 90.2 | 86.8 | 84.9 | 87.8 | 86.3
soccer field | 92.7 | 93.6 | 84.2 | 86.8 | 94.0 | 92.1
swimming pool | 76.4 | 88.7 | 79.2 | 78.8 | 87.2 | 87.4