DPCSANet: Dual-Path Convolutional Self-Attention for Small Ship Detection in Optical Remote Sensing Images

Chen, Jiajie; Tian, Xin; Du, Chong

doi:10.3390/electronics14061225

by Jiajie Chen ¹,²,*, Xin Tian ¹ and Chong Du ¹

¹ Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
² University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.

Electronics 2025, 14(6), 1225; https://doi.org/10.3390/electronics14061225

Submission received: 28 February 2025 / Revised: 18 March 2025 / Accepted: 18 March 2025 / Published: 20 March 2025

Abstract

Detecting small ships in optical remote sensing images (RSIs) is challenging. Due to resolution limitations, the texture and edge information of many ship targets are blurred, making feature extraction difficult and thereby reducing detection accuracy. To address this issue, we propose a novel dual-path convolutional self-attention network, DPCSANet, for ship detection. The model first incorporates a dual-path convolutional self-attention module to enhance its ability to extract local and global features and strengthen target features. This module integrates two parallel branches to process features extracted by convolution and attention mechanisms, respectively, thereby mitigating the potential conflicts between local and global information. Additionally, a high-dimensional hybrid spatial pyramid pooling module is introduced into the model to expand the scale range of feature extraction. This enables the model to fully utilize background contextual features to compensate for weak feature representations of the target. To further improve the detection accuracy for small ships, we developed a focal complete intersection over union loss function. This regression loss guides the model to focus on weak targets during training by increasing the contribution of low-accuracy prediction boxes to the loss. Experimental results demonstrate that the proposed method effectively enhances the model’s detection ability for small ships. On the LEVIR-ship, OSSD, and DOTA-ship datasets, DPCSANet achieves an average precision improvement of 0.9% to 11.4% over the baseline, outperforming other state-of-the-art object detection models.

Keywords: optical RS; small object detection; ship; self-attention

1. Introduction

Continuous and accurate monitoring of ships is of significant importance for marine supervision, territorial security, and other areas [1,2]. Ship detection aims to accurately localize and identify ship targets in the given remote sensing images (RSIs), which is a crucial part of ship monitoring. In recent years, due to the development of remote sensing (RS) technology, a large amount of RS observational image data have been generated, leading to rapid development in the research of optical RS ship detection (ORSSD) [3]. RSIs usually have large fields of view, making them ideal for monitoring vast areas. Additionally, the advancement of deep learning has significantly propelled the development of this area [4], particularly through the application of convolutional neural networks (CNNs), which have demonstrated robust performance in ship detection tasks [5,6,7,8,9].

However, current ship detection methods still struggle to meet accuracy demands, especially when dealing with smaller ship targets. Due to resolution limitations, RSIs contain numerous small ship targets, as shown in Figure 1. For example, in a medium-resolution (MR) image (e.g., resolution of 5 m), the majority of ship targets are smaller than 16 × 16 in size. Even in images with a 1–3 m resolution, small ships may appear smaller than 32 × 32, which classifies them as typical small-target problems in object detection tasks. According to the COCO dataset’s guidelines, objects smaller than 32 × 32 pixels are classified as small targets. See more details at https://cocodataset.org/ (accessed on 17 March 2025). Furthermore, restricted scales cause the texture and edge information of many ship targets to be blurred, making it difficult to extract meaningful features from the ship targets. Moreover, in optical RSIs, background interferences such as sea waves, lighting, and fog often appear simultaneously, further increasing the difficulty of ship detection.

Previous research has highlighted the complexity of small target detection in RS and proposed various improvement strategies. One direct approach to enhance ship target feature extraction is to introduce prior knowledge into the detection model. This includes transforming the predicted bounding box to adapt to ships with arbitrary orientations and large aspect ratios [10], designing angle-dependent loss functions based on the linear diffusion of ship wake trails [11], or altering the convolutional method to enhance ship feature extraction [12]. However, these methods rely on clear texture and edge information, which is often unavailable in most images. Other studies have attempted to optimize the model structure by guiding the model to focus on important features or extract more representative features to improve the detection accuracy of small ship targets. These approaches include the introduction of attention mechanisms [13], dilated convolutions [14], or multiscale feature fusion [15,16]. For instance, DRENet [17] introduced a multi-head self-attention (MHSA) mechanism with position encoding [18] into the detection model, improving the average ship detection accuracy in optical RS. However, self-attention mechanisms tend to focus more on global contextual information rather than important local features, limiting their ability to extract features of small targets. To address this, CLFT [19] integrated convolution into the self-attention mechanism to enhance small target detection in RSIs. While this approach balances the extraction of local and global features, the integration of convolution and self-attention mechanisms can lead to interference between global and local features. Additionally, the limited scale range of feature extraction may adversely affect the model’s ability to distinguish between target and background features.

In terms of model training, existing research primarily focuses on optimizing regression loss functions, such as DIoU and CIoU [20], EIoU [21], and Alpha IoU [22]. Although these methods improve the overall detection accuracy by optimizing the fit between predicted and ground-truth boxes, they do not specifically focus on difficult targets during training, resulting in limited improvement in small ship detection.

In response to these challenges, we propose DPCSANet, a novel RS small ship detection model that significantly advances the state of the art. Specifically, the major contributions of this paper are as follows.

Dual-path convolutional self-attention module (DPCSA): DPCSA is proposed to address the challenge of capturing fine-grained details on small targets. Unlike existing self-attention mechanisms that primarily focus on global contextual information, DPCSA employs a simple dual-path structure to independently process convolutional and self-attention features. Considering that different features extracted in the same channel may suppress or hinder each other, this method solves the problems of feature contention and interference, enabling the model to capture both local texture details and global contextual information simultaneously, ultimately improving the performance of feature extraction for small targets.
High-dimensional hybrid spatial pyramid pooling module (HHSPP): HHSPP integrates multilevel pooling operations to expand the receptive field and enhance feature representation. This enables the model to fully utilize background contextual features to compensate for weak feature representations of the target.
Focal complete intersection over union loss (Focal CIoU): We introduce a novel regression loss function to further improve small ship detection performance. By incorporating a focal penalty term that automatically adjusts during training, Focal CIoU increases the contribution of low-accuracy and weak targets to the loss. This enhances the model’s ability to focus on difficult ship targets, overcoming the limitations of existing loss functions that prioritize high-accuracy predictions.

The remainder of this paper is organized as follows. Section 2 provides a brief review of the related work. Section 3 describes the details of our proposed methods. Section 4 presents the experimental setup and results, and Section 5 summarizes our contributions and future research directions.

2. Related Work

2.1. CNN-Based Model in RS Object Detection

In recent years, numerous detectors based on CNNs have been proposed, which can be categorized into two-stage and one-stage methods. Two-stage methods, such as Faster R-CNN [23], generate region proposals and then classify and regress these proposals. In contrast, one-stage methods, including the you only look once (YOLO) series [24,25,26,27] and SSD [28], directly produce detection results and offer faster detection speeds. For RS object detection, TPH-YOLO [29] integrates transformer encoder blocks into the backbone to enhance global context, while FE-YOLO [30] uses deformable convolutions in the neck to fuse features from different scales. Additionally, Liu et al. [31] designed small ship detection models based on YOLO version 5 (YOLOv5), improving accuracy through auxiliary supervision and attention mechanisms. In summary, the YOLO architecture stands out for its superior scalability and efficiency, making it highly suitable for RS tasks that demand real-time performance. Therefore, we select the YOLO series model as our foundational framework and introduce specifically designed modules to enhance the representation of small ship features.

2.2. Remote Sensing Small Ship Detection Methods

2.2.1. Feature Enhancement

Small objects typically have weaker feature representations, making feature enhancement a promising method for improving detection accuracy. To enhance the model’s attention to important features during feature extraction and fusion, previous studies have introduced attention mechanisms such as squeeze and excitation (SE) [32], the convolutional block attention module (CBAM) [33], and coordinate attention [34] to enhance target features. For example, Yin [35] and Zhao [36] integrated the CBAM into their model to automatically learn the weighted values of channel and feature map positions, guiding the model to focus on important features. CCSA [37] was then introduced to further reduce information loss by combining SE and CBAM mechanisms. However, these mechanisms primarily focus on enhancing local features and may not fully capture global dependencies. To extract global information, self-attention mechanisms, like MHSA [18], are widely used in this field. Cui [38] designed a group space self-attention module (SSGEA) to enhance the model’s attention to target features. These mechanisms establish global feature dependency relationships by calculating the inner product of feature vectors, thereby allocating feature importance. However, self-attention mechanisms mainly focus on global background information. For small object detection, we believe that the model should emphasize both global features and local key features around the target. To this end, CLFT [19] integrates convolution into the self-attention mechanism. Although this method can utilize the advantages of convolution to extract local features, stacking different feature extraction methods inevitably forces features with different semantics to compete within the same channels. In contrast, we propose a dual-path convolutional self-attention mechanism, which uses two independent branches to process the features extracted by convolution and self-attention, respectively. By design, this structure avoids the competition and interference between features with different semantics.

In addition to attention mechanisms, expanding the receptive field or using multiscale feature fusion are also common methods for feature enhancement. Liu [39] introduced dilated convolutions with varying dilation rates to effectively increase the receptive field. Li et al. [40] designed a composite convolution module using large-scale convolutions to improve detection accuracy in RS tasks. However, stacking large-scale convolutions significantly increases computational cost and reduces detection speed. To this end, the spatial pyramid pooling (SPP) module [41] was introduced, which has a lower computational cost compared to convolution modules with the same receptive field size. Despite these improvements, existing methods have limitations. The receptive field has not been fully expanded to the global level, potentially leading to insufficient background feature representation. Liu [31] introduced a layer of average pooling before SPP and connected two layers of pooling in series to expand the receptive field of the entire module, but the model structure is relatively complex. In addition, average pooling may not be suitable for small object detection because the feature information of small objects is already sparse. Average pooling may further weaken these key features, leading to a decrease in the model’s detection ability for small objects.

2.2.2. Regression Loss Function Design

Bounding box regression is commonly used to locate objects, with IoU (intersection over union) [42] being a widely used metric for assessing detection accuracy. However, IoU, which measures the ratio of intersection to union areas between predicted and ground-truth boxes, is highly sensitive to positional deviations for small objects, often leading to insufficient positive samples for small targets. To address this, some studies have optimized regression loss functions from a geometric perspective. YOLOv4 [24] incorporated DIoU to enhance detection accuracy, while Liu et al. [31] used EIoU to improve the learning of high-accuracy predictions in ship detection. Additionally, Liang [10] proposed a keypoint-based ship detection method for orientation issues, using midpoints of edges and object centers to measure prediction accuracy and improve heading angle prediction. However, these optimization methods do not account for the interference of labeling errors on bounding box regression, potentially causing overfitting and also failing to address the weak contribution of hard targets to the loss value, neglecting some potential detections.
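The sensitivity of IoU to positional deviations can be illustrated with a small worked example. The sketch below is not taken from the paper: it assumes square, axis-aligned boxes in (x1, y1, x2, y2) format and a hypothetical diagonal offset of 2 pixels, and the function names are illustrative only.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(size, shift):
    """IoU between a size x size box and the same box moved diagonally by `shift` pixels."""
    ground_truth = (0, 0, size, size)
    prediction = (shift, shift, size + shift, size + shift)
    return iou(ground_truth, prediction)


# Same 2-pixel error, different target sizes (sizes chosen for illustration).
for size in (8, 16, 32, 64):
    print(f"{size:>2} x {size:<2} target: IoU = {shifted_iou(size, 2):.3f}")
```

Under these assumptions, the 8 × 8 target falls to an IoU of about 0.39, below the 0.5 threshold commonly used to assign positives, while the 64 × 64 target stays near 0.88 for the same 2-pixel error.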

3. Methodology

This section first outlines the overall architecture of the proposed DPCSANet model, as illustrated in Figure 2. It then provides a detailed discussion of the model’s key components, including DPCSA, HHSPP, and Focal CIoU, explaining the principles behind each component and their contributions to enhancing the model’s performance.

3.1. Overall Network Architecture

To facilitate model deployment and application, DPCSANet is designed based on YOLOv5s, like FFCA-YOLO [43]. We chose YOLOv5 as the baseline model for several reasons. First, its widespread adoption in recent RS studies validates its suitability as a robust foundation for comparison and improvement. Second, YOLOv5’s lightweight architecture enables real-time detection, balancing accuracy and efficiency across diverse hardware platforms. In particular, YOLOv5s combines a lightweight CSPDarknet-53 backbone with a PANet-based bidirectional feature fusion structure, achieving a balance between high detection accuracy and low computational complexity. Third, its modular design allows seamless integration of advanced techniques such as attention mechanisms and feature fusion modules. Together, these properties make it an ideal choice for real-time small object detection tasks, including RS ship detection.

The overall structure of DPCSANet, as shown in Figure 2, consists of three main components: the backbone feature extraction network, the feature fusion neck, and the detection head. The backbone network remains CSPDarknet-53 for feature extraction, spanning five layers ($C_{1}$ to $C_{5}$). To counter the information loss caused by downsampling during feature extraction, the PAN bidirectional feature fusion structure is employed to transfer semantic and spatial information across multiscale features at different extraction stages, resulting in richer fused features. Considering that additional detection branches may introduce excessive prediction boxes, leading to false positives, and retaining shallow features could hinder the model’s ability to learn high-level information, DPCSANet maintains three detection branches for predictions. Additionally, to improve detection speed, DPCSANet retains the Focus module and performs downsampling of features in $C_{5}$ using only convolutional operations.

3.2. DPCSA

In optical RSIs, the relative position and other feature information of small ships are easily lost during the downsampling process of feature extraction. In addition, in the subsequent feature fusion process, target features may be confused with the background features, which is not conducive to small object detection. Traditional CNNs focus on local features (Figure 3a) but struggle with background information, while self-attention mechanisms emphasize global context over target-specific features (Figure 3b), reducing local feature extraction and background-foreground distinguishability [44,45]. To enhance local feature preservation, CRM [45] uses a sparse self-attention mechanism to suppress background features and highlight the target area. However, this impairs the extraction of background information, which is crucial for small object detection. Some studies, like CLFT [19], embed convolution into self-attention mechanisms to integrate local and global features, as shown in Figure 3c. However, this approach extracts local features only to use them in subsequent global dependency calculations, failing to truly emphasize local features. Essentially, it still leads to feature competition and interference during fusion, especially for small targets, where local features may be overshadowed by global features, diminishing the model’s detection ability.

In this context, we introduce DPCSA. Unlike previous approaches that either focus solely on local or global features or merely concatenate convolutional and self-attention features, DPCSA adopts a dual-path structure to independently process convolutional and self-attention features. This design enables the model to fully capture both local texture details and global contextual information, effectively alleviating feature contention and reducing the interference between global and local features, especially for small targets. As shown in Figure 3d, by extracting features through separate branches, DPCSA generates a feature map that integrates richer local and global features. This unique integration enhances the model’s ability to detect small targets in complex RSIs, overcoming the limitations of existing methods.

Specifically, as shown in Figure 4, for the input feature map $X \in \mathbb{R}^{H \times W \times C}$, the left branch of DPCSA uses a standard self-attention module to capture the similarity between all feature positions, building global feature dependencies. This process generates global feature map $Y_{SA} \in \mathbb{R}^{H \times W \times C/2}$, expressed as follows:

$$Y_{SA} = \mathrm{Softmax}(Q K^{T} + Q R^{T}) V. \tag{1}$$

In Equation (1), Q, K, and V represent intermediate features, which are extracted from the input feature X using separate 1 × 1 convolutions, and the number of output channels was set to half of the input. R represents the additional relative position encoding, which is obtained by adding two learnable 1D vectors, $R_{w}$ and $R_{h}$, after they are expanded via broadcasting, i.e., $R = R_{w} + R_{h}$.

Different from CLFT and CRM, we introduce an additional convolutional branch to independently extract local features G, emphasizing local key information while fully capturing background details. Furthermore, a learnable variable $\eta$ was applied to weight the features, yielding the regional feature map $Y_{conv} \in \mathbb{R}^{H \times W \times C/2}$, represented as follows:

$$G = \mathrm{Conv}_{3 \times 3}(X), \tag{2}$$

$$Y_{conv} = \eta \cdot G, \tag{3}$$

where $\mathrm{Conv}_{3 \times 3}(X)$ represents a 3 × 3 convolution, including batch normalization and a ReLU activation function.

Finally, we adjusted the channel number of the fusion features through a 1 × 1 convolution, obtaining the final output feature map $Y \in \mathbb{R}^{H \times W \times C}$, represented as follows:

$$Y = \mathrm{Conv}_{1 \times 1}(Y_{SA} + Y_{conv}). \tag{4}$$
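To make the data flow of Equations (1)–(4) concrete, the following is a minimal pure-Python sketch of the two paths and their fusion. It is an illustration under stated assumptions rather than the authors' implementation: feature maps are flattened to H·W rows of channel vectors, the 1 × 1 convolutions are plain linear maps without bias, batch normalization is omitted, the weights are random, and all function and parameter names (`dpcsa`, `make_params`, `wq`, `w3`, and so on) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def conv3x3_relu(x, h, w, weight):
    """3x3 convolution (stride 1, zero padding) followed by ReLU.
    x holds h*w rows of C channels; weight[o][c] is a 3x3 kernel.
    Batch normalization is omitted in this sketch."""
    c_in = len(x[0])
    out = []
    for i in range(h):
        for j in range(w):
            vec = []
            for kernel in weight:
                acc = 0.0
                for ky in range(3):
                    for kx in range(3):
                        yy, xx = i + ky - 1, j + kx - 1
                        if 0 <= yy < h and 0 <= xx < w:
                            pix = x[yy * w + xx]
                            acc += sum(kernel[c][ky][kx] * pix[c] for c in range(c_in))
                vec.append(max(acc, 0.0))
            out.append(vec)
    return out


def make_params(c, h, w, seed=0):
    """Random stand-ins for the learnable weights (illustrative only)."""
    rng, d = random.Random(seed), c // 2
    return {
        "wq": rand_matrix(rng, c, d), "wk": rand_matrix(rng, c, d),
        "wv": rand_matrix(rng, c, d), "wo": rand_matrix(rng, d, c),
        "r_h": rand_matrix(rng, h, d), "r_w": rand_matrix(rng, w, d),
        "w3": [[rand_matrix(rng, 3, 3) for _ in range(c)] for _ in range(d)],
        "eta": 1.0,
    }


def dpcsa(x, h, w, p):
    # Self-attention path, Equation (1): Softmax(Q K^T + Q R^T) V.
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    # R = R_w + R_h via broadcasting: position (i, j) receives r_h[i] + r_w[j].
    r = [[a + b for a, b in zip(p["r_h"][i], p["r_w"][j])]
         for i in range(h) for j in range(w)]
    qk, qr = matmul(q, transpose(k)), matmul(q, transpose(r))
    attn = softmax_rows([[a + b for a, b in zip(ra, rb)] for ra, rb in zip(qk, qr)])
    y_sa = matmul(attn, v)
    # Convolution path, Equations (2) and (3): Y_conv = eta * Conv3x3(X).
    g = conv3x3_relu(x, h, w, p["w3"])
    y_conv = [[p["eta"] * val for val in row] for row in g]
    # Fusion, Equation (4): 1x1 convolution restores C channels.
    fused = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(y_sa, y_conv)]
    return matmul(fused, p["wo"]), attn


H, W, C = 4, 4, 4
features = rand_matrix(random.Random(1), H * W, C)
output, attention = dpcsa(features, H, W, make_params(C, H, W))
print(len(output), len(output[0]))  # number of positions, number of channels
```

The point of the sketch is structural: the two paths read the same input but never share intermediate channels, and they meet only at the element-wise sum before the final 1 × 1 projection.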

To reduce the computational load from the simultaneous calculation of the self-attention mechanism and convolution, we encapsulate DPCSA within DPBlock (see Figure 4). We use two layers of $1 \times 1$ convolution to reduce the number of channels of input features to one-fourth of the original. Furthermore, we use two layers of skip connection to merge the original input features to retain the original semantic information. Additionally, to maintain focus on the relative positional information of the targets throughout the feature fusion process, the DPBlock is used to replace the C3 module of YOLOv5 after each feature layer fusion, as shown in Figure 2.

3.3. HHSPP

In RSIs, small ship targets occupy only a limited number of pixels, making their features highly susceptible to being overshadowed by complex background information. In this situation, background features play an important role in the detection process as they provide essential semantic context for detection. To further expand the model’s receptive field and capture richer multi-scale features, we designed HHSPP and integrated it into the top layer of the feature extraction backbone network (see Figure 2). Compared with the original SPP module in YOLOv5, HHSPP demonstrates several significant advantages. First, by expanding the scale range and enriching the feature representation of pooling operations, HHSPP significantly enhances the model’s ability to extract multi-scale features. This improvement enables the model to perform more effectively in detecting small objects against complex backgrounds, thereby substantially increasing detection accuracy and robustness. Second, HHSPP employs a series-parallel pooling approach that effectively balances the module’s computational load. This design allows the module to maintain high efficiency while delivering stronger feature extraction capabilities. Third, unlike HSPP [31], which uses average pooling, HHSPP employs max pooling exclusively. Max pooling selects the maximum value within each pooling window, emphasizing high-activation regions and discarding less significant features. This characteristic makes max pooling particularly effective for highlighting key features and edges, which are crucial for detecting small targets in complex backgrounds.

The structure of HHSPP is shown in Figure 5. For the input feature map $X \in \mathbb{R}^{H \times W \times C}$, a $1 \times 1$ convolutional layer is introduced to reduce the number of channels, producing the intermediate feature map $X'$, i.e.,

$$X' = \mathrm{Conv}_{1 \times 1}(X), \tag{5}$$

where $\mathrm{Conv}_{1 \times 1}$ is defined as a 1 × 1 convolution with batch normalization and a SiLU activation function. Then, a 5 × 5 max pooling layer is applied to generate a finer max value feature representation $F_{1}$. Based on this, a set of parallel pooling layers with pooling sizes of 5 × 5, 9 × 9, and 13 × 13 is applied to $F_{1}$ to obtain different coarse max value features, represented as $F_{2}$, $F_{3}$, and $F_{4}$, respectively. We set the output channel of all pooling operations to be consistent with the input, so $X', F_{1}, F_{2}, F_{3}, F_{4} \in \mathbb{R}^{H \times W \times C/2}$. $F_{1}$ to $F_{4}$ are defined as follows:

$$F_{1} = \mathrm{MaxPool}_{k=5,\, s=1,\, p=2}(X'), \tag{6}$$

$$F_{2} = \mathrm{MaxPool}_{k=5,\, s=1,\, p=2}(F_{1}), \tag{7}$$

$$F_{3} = \mathrm{MaxPool}_{k=9,\, s=1,\, p=4}(F_{1}), \tag{8}$$

$$F_{4} = \mathrm{MaxPool}_{k=13,\, s=1,\, p=6}(F_{1}), \tag{9}$$

where k, s, and p represent kernel size, stride, and padding, respectively. Finally, all the obtained max pooling representations are concatenated along the channel dimension and passed through a 1 × 1 convolution to generate the multiscale fused output feature map $Y \in \mathbb{R}^{H \times W \times C}$, represented as follows:

$$Y = \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(F_{1}, F_{2}, F_{3}, F_{4})). \tag{10}$$

Through the serial and parallel pooling group, the receptive field of the module is expanded to 15 × 15, which is approximately equal to the size of the C5 feature map (16 × 16).
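The series-parallel pooling of Equations (6)–(9) can be sketched for a single channel as follows. This is an illustration only, not the authors' code: it works on a plain 2D list, leaves out the 1 × 1 convolutions, and ignores out-of-range cells at the border, which is equivalent to padding with negative infinity. The names `max_pool` and `hhspp_pooling` are hypothetical.

```python
import random


def max_pool(fmap, k):
    """Max pooling with stride 1 and padding k // 2 on a 2D list.
    Cells outside the map are skipped (same effect as -inf padding),
    so the output has the same height and width as the input."""
    h, w = len(fmap), len(fmap[0])
    p = k // 2
    return [[max(fmap[yy][xx]
                 for yy in range(max(0, y - p), min(h, y + p + 1))
                 for xx in range(max(0, x - p), min(w, x + p + 1)))
             for x in range(w)]
            for y in range(h)]


def hhspp_pooling(x_prime):
    """Return [F1, F2, F3, F4] for one channel of the reduced feature map X'."""
    f1 = max_pool(x_prime, 5)   # serial stage, Equation (6)
    f2 = max_pool(f1, 5)        # parallel stage, Equation (7)
    f3 = max_pool(f1, 9)        # parallel stage, Equation (8)
    f4 = max_pool(f1, 13)       # parallel stage, Equation (9)
    return [f1, f2, f3, f4]


# A 16 x 16 map, matching the C5 size quoted in the text; values are random.
rng = random.Random(0)
x_prime = [[rng.random() for _ in range(16)] for _ in range(16)]
pooled = hhspp_pooling(x_prime)

# Every branch keeps the spatial size, so the maps can be stacked channel-wise.
print([(len(f), len(f[0])) for f in pooled])
# Chaining two 5 x 5 poolings covers the same window as one 9 x 9 pooling.
print(pooled[1] == max_pool(x_prime, 9))
```

Because each parallel branch reads $F_{1}$ rather than $X'$, its effective window on $X'$ is larger than its own kernel, which is how the serial stage widens the range of scales at little extra cost.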

3.4. Focal CIoU

Existing object detection models typically rely on IoU-based regression loss for assigning positive and negative labels to predicted boxes and calculating prediction accuracy. However, IoU loss is more sensitive to small targets because of their smaller area. Even minor prediction deviations can significantly alter the IoU value, amplifying the penalty for errors. Additionally, annotation errors may result in the model acquiring biased positional information, causing overfitting of high-accuracy prediction boxes and ultimately reducing detection accuracy.

To address these issues, this paper proposes a new regression loss function, Focal CIoU. Inspired by the focal loss concept in EIoU [21], this method introduces an exponentially weighted IoU penalty term, Focal, in addition to the baseline CIoU loss to emphasize the regression of low-accuracy predicted boxes. Unlike other IoU optimization methods, the added penalty term does not compromise the original regression performance of the loss function. Instead, it dynamically adjusts the focus of the loss function on predicted boxes of varying accuracy based on the regression progress. Furthermore, this method increases the number of valid predicted boxes, thereby improving the recall rate for ship detection.

The original CIoU loss is defined as follows:

$$Loss_{CIoU} = 1 - CIoU, \tag{11}$$

$$CIoU = IoU - \left( \frac{\rho^{2}(b, b_{gt})}{c^{2}} + \lambda v \right). \tag{12}$$

In Equation (12), IoU is used to measure the overlap between the predicted and ground truth boxes. $\rho$ represents the distance between the centers of the predicted box $b$ and the ground truth box $b_{gt}$, while $c$ denotes the diagonal distance of the smallest enclosing box that covers these two boxes. $v$ is used to measure the aspect ratio difference between the predicted and ground truth boxes, with $\lambda$ being a positive trade-off weight. $\lambda$ and $v$ are defined as follows:

$$\lambda = \frac{v}{(1 - IoU) + v}, \tag{13}$$

$$v = \frac{4}{\pi^{2}} \left( \arctan \frac{w_{gt}}{h_{gt}} - \arctan \frac{w}{h} \right)^{2}. \tag{14}$$

Here, $(w, h)$ and $(w_{gt}, h_{gt})$ represent the width and height of the predicted and ground truth boxes, respectively. It is important to note that in EIoU, the focal penalty term increases the weight of high-accuracy prediction boxes in the loss, which contrasts with the need to emphasize low-accuracy prediction boxes. In Focal CIoU, the exponentially weighted IoU penalty term is defined as follows:

$$Focal_{IoU} = (1 - IoU)^{\gamma}, \tag{15}$$

where $\gamma > 0$ is a model hyperparameter. Finally, the Focal CIoU loss is defined as follows:

$$\begin{aligned} Loss_{FocalCIoU} &= Focal_{IoU} \cdot Loss_{CIoU} \\ &= (1 - IoU)^{\gamma} (1 - CIoU). \end{aligned} \tag{16}$$

According to Equation (16), as the IoU increases, i.e., when the accuracy of the predicted box improves, the penalty term decreases. This results in a smaller contribution of high-accuracy prediction boxes to the overall loss compared to low-accuracy prediction boxes, forcing the model to focus on low-accuracy prediction boxes. This helps mitigate overfitting to high-accuracy prediction boxes. Moreover, unlike EIoU, the penalty term in Focal CIoU has the same gradient direction as IoU during backpropagation. This causes the model’s IoU to converge more quickly in the early stages of training, further promoting the regression of prediction boxes towards the ground truth boxes and improving the model’s detection accuracy. In addition, we conduct experiments to compare the effects of different exponential weighting parameters $\gamma$ on the model’s detection performance, selecting the optimal parameter as the final penalty exponent for Focal CIoU, as shown in Section 4.5.4.
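Equations (11)–(16) can be evaluated for a single pair of boxes with the short sketch below. It is a scalar illustration under stated assumptions, not the authors' training code: boxes are axis-aligned (x1, y1, x2, y2) tuples with positive width and height, the function names are hypothetical, and the value of γ used in the example is arbitrary rather than the one selected in the paper.

```python
import math


def ciou_terms(pred, gt):
    """Return (IoU, CIoU) for two boxes in (x1, y1, x2, y2) format."""
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    iou = inter / (w * h + w_gt * h_gt - inter)
    # Squared distance between box centers (rho^2).
    rho2 = ((pred[0] + pred[2] - gt[0] - gt[2]) ** 2
            + (pred[1] + pred[3] - gt[1] - gt[3]) ** 2) / 4.0
    # Squared diagonal of the smallest enclosing box (c^2).
    c2 = ((max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2
          + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2)
    # Aspect ratio term v, Equation (14), and its weight lambda, Equation (13).
    v = 4.0 / math.pi ** 2 * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    denom = (1.0 - iou) + v
    lam = v / denom if denom > 0 else 0.0
    return iou, iou - (rho2 / c2 + lam * v)  # Equation (12)


def focal_ciou_loss(pred, gt, gamma):
    """Focal CIoU loss, Equation (16): (1 - IoU)^gamma * (1 - CIoU)."""
    iou, ciou = ciou_terms(pred, gt)
    return (1.0 - iou) ** gamma * (1.0 - ciou)


gt_box = (0.0, 0.0, 10.0, 10.0)
good_pred = (1.0, 0.0, 11.0, 10.0)  # small offset, high IoU
poor_pred = (6.0, 0.0, 16.0, 10.0)  # large offset, low IoU
gamma = 0.5                         # illustrative value only

for name, box in (("good", good_pred), ("poor", poor_pred)):
    iou, ciou = ciou_terms(box, gt_box)
    weight = (1.0 - iou) ** gamma
    print(f"{name}: IoU={iou:.3f}  CIoU loss={1 - ciou:.3f}  "
          f"focal weight={weight:.3f}  Focal CIoU={focal_ciou_loss(box, gt_box, gamma):.3f}")
```

In this example the focal weight is larger for the low-IoU box than for the high-IoU box, so the poorly localized prediction keeps a relatively larger share of the loss. Setting γ to 0 recovers the plain CIoU loss.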

4. Experiments and Analysis

In this section, we evaluate the proposed method through comparative and ablation experiments. Comparative experiments are conducted on the three datasets (LEVIR-ship, OSSD, and DOTA-ship) to validate the effectiveness of the proposed model (DPCSANet) against the state-of-the-art methods in small ship detection. Meanwhile, ablation experiments are conducted on the LEVIR-ship dataset to analyze the impact of three different components (DPCSA, HHSPP, and Focal CIoU) of our proposed method and validate their necessity. These experiments provide a thorough understanding of the strengths and contributions of our approach in small ship detection using optical RSIs.

4.1. Datasets

The LEVIR-ship dataset [17] is specifically curated for medium-resolution ORSSD with a resolution of 16 m. The dataset comprises 512 × 512-pixel optical RSIs extracted from the GaoFen-1 and GaoFen-6 satellite imagery. With target ships typically smaller than 20 × 20 pixels, this dataset presents a challenge for detection algorithms. It contains 2320 images for training and 788 images each for validation and testing. Notably, LEVIR-ship includes a significant number of images with cloud interference, which adds to the complexity and realism of the dataset. This feature makes it more representative of real-world conditions compared to other datasets that are typically clean and cloud-free. Figure 6a shows sample images from the LEVIR-ship dataset, highlighting the presence of cloud interference and other challenging conditions.

The OSSD dataset [31] has a resolution of 5 m and dimensions of 512 × 512 pixels. The targets are captured using only the RGB channels. The dataset is structured with 6383 images allocated for training, 710 for validation, and 3040 for testing. Figure 6b illustrates sample images from the OSSD dataset, highlighting its focus on small ship detection in complex maritime environments.

The DOTA dataset [46] is a large-scale benchmark for object detection in aerial images, containing diverse scenes captured from various platforms and sensors. It includes images with a wide range of resolutions and sizes, from 800 × 800 pixels to as large as 20,000 × 20,000 pixels. We extracted the samples containing ship targets from the DOTA dataset, cropped them into 640 × 640 images with a 200-pixel overlap, and re-labeled the annotations to create a new dataset named DOTA-ship, as shown in Figure 7. The dataset was then divided into training, validation, and testing sets in a 6:2:2 ratio, resulting in 2659 images for training, 886 for validation, and 886 for testing.
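The cropping step can be sketched as a sliding window with a 640-pixel tile and a 200-pixel overlap, i.e., a stride of 440 pixels. The paper does not say how image borders are handled, so the sketch below assumes that the last tile on each axis is shifted back to sit flush with the border. The function names are hypothetical, and re-labeling the annotations for each tile is not shown.

```python
def tile_origins(length, tile=640, overlap=200):
    """Start offsets of the tiles along one image axis.
    Assumption: the final tile is aligned with the image border."""
    stride = tile - overlap
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)
    return origins


def tile_boxes(width, height, tile=640, overlap=200):
    """Crop windows (x1, y1, x2, y2) covering a width x height image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]


# Example with a hypothetical 1520 x 1000 source image.
print(tile_origins(1520))           # offsets along the width
print(tile_origins(1000))           # offsets along the height
print(len(tile_boxes(1520, 1000)))  # number of crops
```

With this border rule, the overlap between the last two tiles on an axis can exceed 200 pixels. Other conventions, such as padding the image, are equally plausible.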

We used the training and validation sets for model training and the test set for performance evaluation. We also conducted a statistical analysis of ship target sizes in the three datasets. As shown in Figure 8, many ship targets occupy less than 32 × 32 pixels. Specifically, 97.73% of LEVIR-ship instances, 31.83% of OSSD instances, and 45.18% of DOTA-ship instances are small targets.
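For illustration, the share of small targets can be computed from the annotated box sizes as follows. The sketch assumes the common area-based convention (a target is small if its box area is below 32 × 32 = 1024 pixels); the text does not say whether the authors used area or side length, and `small_target_ratio` is a hypothetical helper name.

```python
def small_target_ratio(box_sizes, max_side=32):
    """Fraction of (width, height) boxes whose area is below max_side ** 2."""
    if not box_sizes:
        return 0.0
    limit = max_side * max_side
    small = sum(1 for w, h in box_sizes if w * h < limit)
    return small / len(box_sizes)


if __name__ == "__main__":
    # Toy annotation sizes in pixels, not taken from any of the datasets.
    sizes = [(10, 10), (40, 40), (20, 30), (32, 32)]
    print(f"{100 * small_target_ratio(sizes):.2f}% small targets")
```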

4.2. Evaluation Metrics

To comprehensively evaluate our model’s performance, we assess it based on three key aspects: model complexity, detection accuracy, and detection speed. Model complexity is measured by the number of parameters (Params) and floating-point operations (FLOPs), which reflect the model’s computational efficiency and resource requirements. Detection accuracy is evaluated using precision (P), recall (R), and average precision (AP), defined as follows:

P = \frac{TP}{TP + FP}

(17)

R = \frac{TP}{TP + FN}

(18)

AP = \int_{0}^{1} P(R) \, dR

(19)

Here, TP, FP, and FN denote true positives, false positives, and false negatives, respectively. AP integrates precision over recall across different confidence thresholds and therefore serves as the key metric for overall accuracy. All accuracy metrics are calculated at an IoU threshold of 0.5. Detection speed is measured by inference time (Time) and frames per second (FPS) with a batch size of 1, indicating the model’s efficiency in real-time applications. Together, these metrics characterize the model’s efficiency, accuracy, and suitability for practical deployment.
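As a worked illustration of Equations (17)–(19), the sketch below matches detections to ground-truth boxes at an IoU threshold of 0.5 and integrates the precision-recall curve. It is a simplified, single-image version under stated assumptions: greedy matching in descending score order, all-point interpolation of the curve, and P and R computed over every supplied detection. The function names are hypothetical, and the authors' evaluation code may differ in these details.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(detections, ground_truths, iou_thr=0.5):
    """detections: [(score, box)], ground_truths: [box]. Returns (P, R, AP)."""
    matched = set()
    flags = []  # True for a true positive, False for a false positive
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(ground_truths):
            if j not in matched:
                overlap = iou(box, gt)
                if overlap > best_iou:
                    best_iou, best_j = overlap, j
        is_tp = best_j >= 0 and best_iou >= iou_thr
        if is_tp:
            matched.add(best_j)  # each ground truth is matched at most once
        flags.append(is_tp)

    n_gt = len(ground_truths)
    precisions, recalls, tp = [], [], 0
    for rank, is_tp in enumerate(flags, start=1):
        tp += is_tp
        precisions.append(tp / rank)                # Eq. (17): TP / (TP + FP)
        recalls.append(tp / n_gt if n_gt else 0.0)  # Eq. (18): TP / (TP + FN)
    precision = precisions[-1] if precisions else 0.0
    recall = recalls[-1] if recalls else 0.0

    # Make precision non-increasing in recall, then integrate (Eq. (19)).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return precision, recall, ap


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
    print(evaluate(dets, gts))
```

In a full evaluation, the true/false positive flags would be pooled over all test images before the curve is built.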

4.3. Experimental Setup

To ensure fair comparisons, all experiments were conducted on two NVIDIA RTX 2080Ti GPUs with 11 GB of memory. The code was implemented using the PyTorch 2.0 framework. All ResNet backbone networks in the comparison models were pre-trained on ImageNet for 300 epochs, and models with pre-trained ResNet backbones were fine-tuned for fewer than 200 epochs on the respective datasets, as these models typically converged within this limit. All other models were trained from scratch. Specifically, models for the DOTA-ship dataset were trained for 300 epochs, while those for the LEVIR-ship and OSSD datasets were trained for 500 epochs. The shorter training duration for DOTA-ship is primarily due to the presence of many duplicate ship samples in the dataset, which allows for faster model convergence.

The initial learning rate was set to 0.01 with a decay rate of 0.2 using the One Cycle Learning Rate scheduler. The SGD optimizer was configured with a momentum of 0.937 and a weight decay of 5 × 10⁻⁴. The warm-up period was set to 3 epochs, with an initial warm-up momentum of 0.8 and a warm-up bias learning rate of 0.1. During inference and testing, the confidence threshold for non-maximum suppression (NMS) was set to 0.001, and the IoU threshold was set to 0.6.
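To make the schedule concrete, the sketch below evaluates a cosine one-cycle curve of the kind used in the YOLOv5 code base. It rests on an assumption: the "decay rate of 0.2" is read here as the ratio between the final and the initial learning rate, which the text does not state explicitly. Warm-up is omitted, and `one_cycle_lr` is a hypothetical helper name.

```python
import math


def one_cycle_lr(epoch, epochs, lr0=0.01, lrf=0.2):
    """Learning rate at a given epoch, decaying from lr0 to lr0 * lrf."""
    # Cosine interpolation between a factor of 1 (start) and lrf (end).
    factor = ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1
    return lr0 * factor


if __name__ == "__main__":
    total_epochs = 500  # LEVIR-ship and OSSD setting from the text
    for epoch in (0, 125, 250, 375, 500):
        print(epoch, round(one_cycle_lr(epoch, total_epochs), 5))
```

Under this reading, the learning rate falls smoothly from 0.01 at the start of training to 0.002 at the final epoch.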

4.4. State-of-the-Art Comparison

To demonstrate the effectiveness and advantages of the proposed DPCSANet, we conducted a series of experiments comparing its performance against both baseline models and state-of-the-art methods on the three datasets described above. The results consistently validate the superior performance of our approach in detecting small ships in optical RSIs.

4.4.1. Results on the LEVIR-Ship Dataset

We compared DPCSANet with state-of-the-art models on the LEVIR-ship dataset. The compared models comprise YOLO series models (YOLOv5s [25], YOLOv7 [26], YOLOv8, and YOLOv9 [27]); high-precision detection models (faster RCNN [47], CenterNet [48], and cascade RCNN [49]); transformer-based models (RT-DETR [50] and DINO [51]); FFCA-YOLO [43] and Imyolov8 [52]; and two RS small ship detection models, DRENet [17] and improved YOLOv5s [31]. Table 1 shows the results of the comparative experiments.

DPCSANet, with 6.20 M parameters and 9.6 G FLOPs, achieved precision, recall, and AP of 66.1%, 88.2%, and 85.0%, respectively, outperforming the baseline YOLOv5s by 7.4%, 6.3%, and 11.4%. DPCSANet exhibits higher recall and AP than all the compared state-of-the-art models, including high-precision detection models pre-trained on ImageNet, such as cascade RCNN-ResNet101 (recall: 86.3%, AP: 81.3%) and DINO (AP: 83.8%). Moreover, DPCSANet surpasses the small ship detection models DRENet and ImYOLOv5 in all three metrics. Although DPCSANet lags slightly behind some high-precision models in terms of precision, missed detections (low recall) are generally more problematic than false positives in RS detection tasks. Small ships in RSIs often have limited feature information, and background interference significantly affects detection results, so improving recall is crucial for practical applications. Additionally, DPCSANet maintains real-time detection capability, achieving an inference speed of 149 frames per second (FPS) on the experimental hardware. This speed is comparable to that of other YOLO-based models, which makes the model suitable for real-time applications. Some detection results are shown in Figure 9, where DPCSANet successfully detects small ship targets in RSIs, even under cloud and fog interference.

4.4.2. Results on the OSSD Dataset

To further validate the small ship detection capability of the proposed model, comparisons between DPCSANet, YOLOv5s, YOLOv7, EfficientDet [53], faster RCNN, DRENet, FFCA-YOLO [43], Imyolov8 [52], and improved YOLOv5s were conducted on the OSSD dataset, and the results are presented in Table 2. According to the experimental results, DPCSANet achieved precision, recall, and AP of 93.6%, 95.9%, and 95.9%, respectively, outperforming the baseline by 11.3%, 10.0%, and 8.9%. Specifically, DPCSANet’s recall and AP are superior to those of all other comparison models. These results further demonstrate DPCSANet’s superiority in small ship detection and validate its stability across different RS detection tasks. Some visualized detection results are shown in Figure 10. DPCSANet performs well even on 5-m resolution images, with all ships in the images being accurately detected.

4.4.3. Results on the DOTA-Ship Dataset

To validate the generalization capability of DPCSANet in detection tasks, we conducted comparative experiments on the DOTA-ship dataset, which contains images of various resolutions. The results show that DPCSANet achieved a detection AP of 95.8%, a 0.9% improvement over the baseline model and comparable to other state-of-the-art general object detection models (Table 3). This indicates that DPCSANet not only performs well on dedicated small ship detection datasets but also adapts to detection tasks at different resolutions and in complex scenes, as DOTA-ship contains many more large targets than LEVIR-ship or OSSD. In addition, as shown in Figure 11, our model produces fewer false positives and missed detections, especially when detecting small ships, which further demonstrates the effectiveness of the proposed method.

4.5. Ablation Studies

In this section, we conducted a series of ablation experiments to systematically evaluate the contributions of the key components of the proposed DPCSANet model. The primary objective was to validate the effectiveness of DPCSA, HHSPP, and Focal CIoU in enhancing the detection performance for small ships in optical RSIs. By isolating and comparing the impact of each component, we aimed to demonstrate their individual and combined contributions to the overall performance improvement and to illustrate the rationality and necessity of the proposed module design. All ablation experiments were carried out on the LEVIR-ship dataset.

4.5.1. Effectiveness of Each Proposed Module

Table 4 summarizes the ablation results of the proposed optimization methods. The baseline model alone yields a precision, recall, and AP of 58.7%, 81.9%, and 73.6%, respectively. Introducing DPCSA alone, by replacing the original C3 module with DPBlock, improves these metrics to 65.3%, 85.5%, and 83.0%, representing gains of 6.6%, 3.6%, and 9.4%. This highlights DPCSA’s effectiveness in capturing precise target positions and critical background features. As shown in Figure 12, the feature maps with DPCSA exhibit brighter regions around the targets, indicating enhanced attention to target localization. Replacing the original SPP with HHSPP expands the model’s receptive field, improving precision, recall, and AP by 2.2%, 0.6%, and 2.7%, respectively, which indicates that larger-scale multiscale features enhance the model’s understanding of targets in complex backgrounds. Combining DPCSA and HHSPP further boosts these metrics to 65.5%, 86.1%, and 83.5%, showing complementary benefits. The feature maps in Figure 12 validate these observations. Although HHSPP increases the size of the attention region, it also enhances the model’s focus on potentially difficult targets, reducing the likelihood of missed detections. The Focal CIoU loss function independently increases precision, recall, and AP by 0.4%, 2.5%, and 4.1%, respectively, by focusing on low-fitting prediction boxes and detecting more challenging targets. As shown in Figure 12, Focal CIoU further refines the attention by narrowing the focus to precise target locations, enhancing detection accuracy. Integrating all proposed components yields the best performance, with precision, recall, and AP reaching 66.1%, 88.2%, and 85.0%, respectively.

In addition, the ablation experiments highlight the impact of each module on DPCSANet’s performance. The DPCSA module reduces the model parameters to 6.07 M by introducing DPBlock design, effectively reducing the computational complexity of the model. However, due to the need for DPBlock to handle both convolution and self-attention mechanisms simultaneously, the inference time increases to 6.6 ms. Nevertheless, this increase is reasonable as DPCSA significantly improves the detection accuracy and feature representation capability of the model. HHSPP, while increasing computational cost and model parameters to 7.18 M, only adds 0.1 ms to inference time, a negligible increase considering its significant enhancement to feature representation. Focal CIoU has minimal impact on complexity or speed, further refining detection accuracy. Collectively, these modules enable DPCSANet to achieve high detection accuracy with fast inference, processing a 512 × 512 image in just 6.7 ms.
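The relation between the reported per-image latency and the frame rate is simple arithmetic: at a batch size of 1, 1000 / 6.7 ms ≈ 149 FPS. The sketch below shows this conversion together with a generic wall-clock timing loop. The helper names are hypothetical and the loop is not the authors' benchmarking code; when timing a GPU model, the device would also have to be synchronized before each clock reading (for example with `torch.cuda.synchronize()` in PyTorch).

```python
import time


def measure_latency_ms(fn, warmup=10, runs=100):
    """Average wall-clock time of fn() in milliseconds after a warm-up phase."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the measurement
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs


def fps_from_latency_ms(latency_ms):
    """Frames per second for single-image (batch size 1) inference."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    print(round(fps_from_latency_ms(6.7)))       # latency reported in the text
    print(measure_latency_ms(lambda: sum(range(1000))))  # stand-in workload
```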

4.5.2. Ablation Studies of DPCSA

(1) Comparison of similar self-attention methods: We compared the baseline, CA [34], SA [56], MHSA [18], and CRM [45] with the proposed DPCSA. All methods were integrated into the neck of the baseline model in the same manner, with experimental results shown in Table 5. The results indicate that DPCSA yields higher precision, recall, and AP than the other self-attention mechanisms. Specifically, MHSA improved detection precision by 2.9% compared to SA, with improvements also observed in recall and AP, suggesting that positional encoding helps retain relative positional information during feature fusion. Furthermore, DPCSA, building on MHSA, adds extra convolutional branches to extract regional features, enabling the model to focus on both global and regional information. This led to further improvements in precision, recall, and AP of 1.5%, 0.7%, and 1.4%, respectively. This indicates that emphasizing the correlation between spatially proximate features through convolution during global feature extraction helps the model focus on important features, thereby enhancing detection accuracy. Compared with CRM, DPCSA has an advantage in detection accuracy. Although CRM also considers the impact of global background feature information on local key features, it reduces the proportion of global background features through sparse self-attention, which may hinder the full extraction of background information for small vessel detection. It is worth noting that the inference speed of the DPCSA module is comparable to that of CRM, with an increase of only 0.1 ms in inference time, indicating little difference in computational complexity between the two modules.

4.5.3. Ablation Studies of HHSPP

(1) Comparison of different spatial pyramid pooling methods: We compared the high-dimensional hybrid spatial pyramid pooling (HHSPP) with the baseline SPP, SPPF (used in YOLOv8), and HSPP [31] for multiscale feature integration. The test results are shown in Table 6. SPP utilizes a set of parallel pooling layers to extract multiscale features, with a maximum receptive field of 13 × 13. SPPF reduces computational load by replacing SPP’s parallel structure with three serial 5 × 5 pooling layers, which, however, narrows the receptive field and slightly decreases the model’s detection recall by 0.2%. HSPP expands the receptive field by adding a serial average pooling layer to SPP, thereby enhancing recall and AP by 0.4% and 0.5%, respectively. Unlike HSPP, HHSPP employs both serial and parallel pooling structures, achieving a maximum receptive field of 15 × 15, which nearly matches the feature map size output by the backbone network (16 × 16). To address the issue of average pooling blurring the relative positions of important targets, HHSPP uses max pooling to expand the receptive field. Experimental results show that HHSPP achieved precision, recall, and AP of 60.9%, 82.5%, and 76.3%, respectively, which are improvements of 2.2%, 0.6%, and 2.7% over the baseline (SPP). These results surpass those of comparable pooling methods, including SPPF and HSPP, without significantly impacting the model’s inference speed. The findings suggest that the receptive field range significantly influences the model’s ability to fit small targets and background features, and that expanding the receptive field to integrate features over a broader area can improve the model’s detection accuracy for RS ship targets.

(2) Comparison of different structures of HHSPP: A comparative analysis of HHSPP structures was conducted, and the results are shown in Table 7. Introducing larger-scale pooling layers (5 × 5, 9 × 9, 13 × 13) improved precision, recall, and AP by 4.3%, 0.3%, and 2.0% over the baseline, respectively. However, adding residual connections to fuse original and pooled features (with equal channel distribution) reduced performance, with precision, recall, and AP dropping to 58.1%, 81.5%, and 73.9%. These findings confirm the benefits of larger-scale pooling for small ship detection. The optimal pooling structure was selected based on these results.

4.5.4. Ablation Studies of Focal CIoU

(1) Comparison of different IoU-based regression losses: The impact of different loss functions on DPCSANet’s training performance was compared, with the results summarized in Table 8. When CIoU was used as the regression loss (same as the baseline), DPCSANet achieved precision, recall, and AP of 65.5%, 86.1%, and 83.5%, respectively, outperforming DIoU and EIoU. This suggests that the aspect ratio of the predicted box is a more suitable metric for small ship detection. Using EIoU, which replaces the aspect ratio error with absolute errors in width and height, led to a decline in all three metrics, with recall and AP dropping below those of DIoU. This indicates that focusing on absolute error exacerbates the overfitting of accurate predicted boxes, harming overall detection performance. Alpha CIoU, which adds an exponential weighting factor to CIoU, resulted in a precision of 68.7%, but recall and AP were both lower (82.2% and 79.3%), showing that prioritizing high-accuracy boxes does not aid in detecting challenging targets. Training with Focal CIoU improved precision, recall, and AP by 0.6%, 2.1%, and 1.5%, respectively, compared to CIoU. This confirms that the focal penalty, which increases the weight of poorly predicted boxes, enhances model performance by focusing on difficult targets during training.

In summary, the Focal CIoU loss function yielded the best results for DPCSANet, improving overall detection accuracy, particularly for challenging targets. Therefore, the final model selected was trained using this loss function.

(2) Comparison of different values of γ: A comparative analysis was conducted to determine the optimal value of γ for small ship detection, with the results presented in Table 9. When γ increased from 1 to 4, the recall rate dropped to 85.0%, suggesting that focusing too much on high-accuracy boxes can cause the model to overlook potential targets. Reducing γ from 1 to 0.5 improved precision, recall, and AP to 66.1%, 88.2%, and 85.0%, respectively, compared to the baseline. However, further decreasing γ to 0.25 led to a decline in both precision and recall, likely because the model started overemphasizing low-accuracy boxes, which were classified as negative samples due to the confidence threshold. The results indicate that the optimal detection accuracy for DPCSANet occurs at γ = 0.5, which was ultimately selected as the weighting exponent for the regression penalty term.
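The role of γ can be illustrated with a small numerical sketch. The exact Focal CIoU formulation is defined earlier in the paper and is not reproduced in this section, so the weighting used below (the CIoU loss multiplied by IoU^γ, in the style of Focal-EIoU [21]) is an assumption made only for illustration. Under that assumption, a smaller γ gives low-IoU boxes a larger relative weight, which is consistent with the trend described above. The function names are hypothetical.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """Return (IoU, CIoU loss) for two boxes given as (x1, y1, x2, y2)."""
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Squared distance between the two box centres.
    rho2 = ((pred[0] + pred[2] - gt[0] - gt[2]) ** 2
            + (pred[1] + pred[3] - gt[1] - gt[3]) ** 2) / 4.0
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / (gh + eps))
                                - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou, 1.0 - iou + rho2 / c2 + alpha * v


def focal_ciou_loss(pred, gt, gamma=0.5):
    """ASSUMED form: IoU ** gamma * CIoU loss (Focal-EIoU-style weighting)."""
    iou, loss = ciou_loss(pred, gt)
    return (iou ** gamma) * loss


if __name__ == "__main__":
    gt_box = (0.0, 0.0, 10.0, 10.0)
    poor_box = (5.0, 0.0, 15.0, 10.0)  # IoU of about 0.33 with gt_box
    for gamma in (0.25, 0.5, 1.0, 4.0):
        print(gamma, round(focal_ciou_loss(poor_box, gt_box, gamma), 4))
```

Note that under this assumed form a prediction with no overlap at all receives zero weight, so the sketch should not be taken as a drop-in training loss.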

4.6. Additional Experiments Under Diverse Environmental Conditions

To comprehensively evaluate the robustness of our model, we conducted experiments under three typical environmental scenarios: clear scenes, fractus or mist conditions, and thick fog or thick clouds, as shown in Figure 13. These scenarios were all derived from the LEVIR-ship dataset. Specifically, the clear scene scenario includes 135 images, the fractus or mist scenario includes 480 images, and the thick fog or thick clouds scenario includes 253 images. The experiments compared our model with a baseline model and several state-of-the-art models using the same evaluation metrics as in Section 4.2. The experimental results in Table 10 demonstrate the robustness of our proposed DPCSANet model across different environmental conditions.

In clear scene conditions, DPCSANet achieved an AP of 84.4%, significantly outperforming all other models, including YOLOv5s (77.4%), YOLOv7 (79.0%), and YOLOv9 (67.1%). This indicates that our model maintains superior performance even under ideal conditions.
Under fractus or mist conditions, DPCSANet maintained an AP of 83.2%, outperforming YOLOv5s (69.8%), YOLOv7 (65.9%), and YOLOv9 (73.3%). This highlights the model’s ability to handle light fog or mist interference effectively.
In the most challenging scenario of thick fog or thick clouds, DPCSANet achieved an AP of 68.7%, which is higher than YOLOv5s (62.8%), YOLOv7 (68.2%), and YOLOv9 (64.4%). This result underscores the model’s robustness in low-visibility conditions.

In general, the consistent performance improvements of DPCSANet in clear, misty, and foggy conditions validate its superior robustness compared to state-of-the-art models.

4.7. Error Analysis

To provide a comprehensive understanding of the model’s limitations and areas for improvement, we conducted an in-depth error analysis, particularly focusing on cases where DPCSANet produces false positives or fails to detect small ships.

False positive: False positives primarily occur in scenarios with significant background interference, such as cloud cover and fog. As shown in Figure 14, DPCSANet occasionally misclassifies fragmented clouds or ground structures covered by clouds as ships. This is particularly evident when these structures resemble the elongated shapes of ships. Additionally, narrow piers and other linear features in the image can also be mistakenly identified as ship targets due to their similar geometric properties. These errors arise because the model may learn to associate certain shapes and textures with ships, leading to misclassifications when similar patterns appear in the background.
False negative: False negatives mainly occur when the ship target is very small, as shown in Figure 14. When the target is smaller than 16 × 16 pixels, its limited feature representation and weak texture information make it difficult for the model to detect. In many such cases, even human observers find it challenging to detect these small targets directly, and the wake trailing the stern may be the only visual cue to their location. This highlights the challenge of detecting small objects in complex backgrounds, as models may struggle to distinguish targets from surrounding noise.

In summary, the detection of small ships is susceptible to various interference factors, with the most significant being the small size of the targets and the complex background environment. These factors highlight areas where our model can still be improved to enhance detection accuracy and robustness.

5. Conclusions

To enhance the detection accuracy of small ships in optical RSIs, this study proposes a novel detection model, DPCSANet, based on the YOLOv5s architecture. The model incorporates a dual-path convolutional self-attention (DPCSA) module and a high-dimensional hybrid spatial pyramid pooling (HHSPP) module, and is trained with a Focal CIoU loss function. These enhancements significantly improve the detection performance for small ships. Experiments conducted on the LEVIR-ship and OSSD ship detection datasets demonstrate that DPCSANet achieves substantial improvements in AP compared to the baseline model. Moreover, DPCSANet outperforms other state-of-the-art models in terms of AP, thereby validating the effectiveness of the proposed approach. Additional experiments on the DOTA-ship dataset show that DPCSANet’s AP is comparable or superior to that of other state-of-the-art models, further confirming its generalization ability across different resolutions. Furthermore, experiments conducted under various weather conditions, including clear scenes, fractus or mist conditions, and thick fog or clouds, highlight the robustness and generalization capability of DPCSANet in diverse environmental scenarios.

Specifically, the DPCSA module is integrated throughout the feature fusion process of the model. It employs two parallel branches to separately process convolutional and self-attention features, effectively mitigating the conflict between local and global information and enhancing the feature representation of target objects. Additionally, the HHSPP module is implemented at the top of the feature extraction backbone to extract features from a larger receptive field, leveraging the supplementary role of background feature information in small ship detection. Experimental results indicate that expanding the receptive field improves the model’s ability to extract background features, ultimately leading to higher detection accuracy. Moreover, training the model with the Focal CIoU regression loss guides the model to focus on weakly predicted targets, thereby effectively improving the detection performance of challenging targets.

Despite these advancements, DPCSANet still encounters some missed or false detections in scenarios with occluded ultra-small ship targets. Future research will focus on exploring super-resolution representations of images and data augmentation techniques specifically designed for small targets to address these limitations.

The code will be available at https://github.com/chenjiajiechen/DPCSANet-2025 (accessed on 17 March 2025).

Author Contributions

Conceptualization, J.C.; methodology, J.C.; software, J.C.; validation, J.C.; formal analysis, J.C.; investigation, J.C.; resources, J.C.; data curation, J.C.; writing—original draft preparation, J.C.; writing—review and editing, J.C. and X.T.; visualization, J.C.; supervision, C.D.; project administration, C.D.; funding acquisition, C.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the conclusions of this article are available from the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Zhao, T.; Wang, Y.; Li, Z.; Gao, Y.; Chen, C.; Feng, H.; Zhao, Z. Ship Detection with Deep Learning in Optical Remote-Sensing Images: A Survey of Challenges and Advances. Remote Sens. 2024, 16, 1145.
2. Li, B.; Xie, X.; Wei, X.; Tang, W. Ship detection and classification from optical remote sensing images: A survey. Chin. J. Aeronaut. 2021, 34, 145–163.
3. Zhang, C.; Zhang, X.; Gao, G.; Lang, H.; Liu, G.; Cao, C.; Song, Y.; Guan, Y.; Dai, Y. Development and application of ship detection and classification datasets: A review. IEEE Geosci. Remote Sens. Mag. 2024, 12, 12–45.
4. Zhang, L.; Yin, H. Research on ship detection method of optical remote sensing image based on deep learning. In Proceedings of the 2022 International Conference on Sensing, Measurement & Data Analytics in the Era of Artificial Intelligence (ICSMD), Harbin, China, 30 November–2 December 2022; pp. 1–6.
5. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22.
6. Li, Q.; Mou, L.; Liu, Q.; Wang, Y.; Zhu, X.X. HSF-Net: Multiscale deep feature embedding for ship detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 7147–7161.
7. Li, X.; Deng, J.; Fang, Y. Few-shot object detection on remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14.
8. Ren, Z.; Tang, Y.; He, Z.; Tian, L.; Yang, Y.; Zhang, W. Ship detection in high-resolution optical remote sensing images aided by saliency information. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16.
9. Zhang, C.; Lam, K.M.; Wang, Q. Cof-net: A progressive coarse-to-fine framework for object detection in remote-sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17.
10. Liang, Y.; Feng, J.; Zhang, X.; Zhang, J.; Jiao, L. MidNet: An anchor-and-angle-free detector for oriented ship detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13.
11. Ding, K.; Yang, J.; Lin, H.; Wang, Z.; Wang, D.; Wang, X.; Ni, K.; Zhou, Q. Towards real-time detection of ships and wakes with lightweight deep learning model in Gaofen-3 SAR images. Remote Sens. Environ. 2023, 284, 113345.
12. Hu, Q.; Hu, S.; Liu, S. BANet: A balance attention network for anchor-free ship detection in SAR images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12.
13. Yang, X.; Zhang, X.; Wang, N.; Gao, X. A robust one-stage detector for multiscale ship detection with complex background in massive SAR images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–12.
14. Tian, L.; Cao, Y.; He, B.; Zhang, Y.; He, C.; Li, D. Image enhancement driven by object characteristics and dense feature reuse network for ship target detection in remote sensing imagery. Remote Sens. 2021, 13, 1327.
15. Yang, S.; An, W.; Li, S.; Wei, G.; Zou, B. An improved FCOS method for ship detection in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8910–8927.
16. Wu, Y.; Ma, W.; Gong, M.; Bai, Z.; Zhao, W.; Guo, Q.; Chen, X.; Miao, Q. A coarse-to-fine network for ship detection in optical remote sensing images. Remote Sens. 2020, 12, 246.
17. Chen, J.; Chen, K.; Chen, H.; Zou, Z.; Shi, Z. A degraded reconstruction enhancement-based method for tiny ship detection in remote sensing images with a new large-scale dataset. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
18. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16519–16529.
19. Pan, P.; Wang, H.; Wang, C.; Nie, C. ABC: Attention with bilinear correlation for infrared small target detection. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2381–2386.
20. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000.
21. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157.
22. He, J.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.S. alpha-IoU: A family of power intersection over union losses for bounding box regression. Adv. Neural Inf. Process. Syst. 2021, 34, 20230–20242.
23. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
24. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
25. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Wong, C.; Yifu, Z.; Montes, D.; et al. ultralytics/yolov5: v6.2 - YOLOv5 classification models, Apple M1, reproducibility, ClearML and Deci.ai integrations. Zenodo 2022.
26. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
27. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616.
28. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer: Berlin, Germany, 2016; pp. 21–37.
29. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788.
30. Wang, M.; Yang, W.; Wang, L.; Chen, D.; Wei, F.; KeZiErBieKe, H.; Liao, Y. FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection. J. Vis. Commun. Image Represent. 2023, 90, 103752.
31. Liu, Z.; Zhang, W.; Yu, H.; Zhou, S.; Qi, W.; Guo, Y.; Li, C. Improved YOLOv5s for small ship detection with optical remote sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–15.
32. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
33. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
34. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
35. Yin, Y.; Cheng, X.; Shi, F.; Liu, X.; Huo, H.; Chen, S. High-order spatial interactions enhanced lightweight model for optical remote sensing image-based small ship detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16.
36. Zhao, Y.; Zhao, L.; Xiong, B.; Kuang, G. Attention receptive pyramid network for ship detection in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 2738–2756.
37. Hayat, M. Squeeze & Excitation joint with Combined Channel and Spatial Attention for Pathology Image Super-Resolution. Frankl. Open 2024, 8, 100170.
38. Cui, Z.; Wang, X.; Liu, N.; Cao, Z.; Yang, J. Ship detection in large-scale SAR images via spatial shuffle-group enhance attention. IEEE Trans. Geosci. Remote Sens. 2020, 59, 379–391.
39. Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400.
40. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16794–16805.
41. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
42. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520.
Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
Xue, Y.; Jin, G.; Shen, T.; Tan, L.; Wang, N.; Gao, J.; Wang, L. Consistent representation mining for multi-drone single object tracking. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 10845–10859. [Google Scholar] [CrossRef]
Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef] [PubMed]
Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
Ning, T.; Wu, W.; Zhang, J. Small object detection based on YOLOv8 in UAV perspective. Pattern Anal. Appl. 2024, 27, 103. [Google Scholar] [CrossRef]
Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]

Figure 1. Optical RS ship detection examples at 16 m resolution. The yellow box denotes the annotated bounding box, while the red box delineates the actual ship boundary. As can be observed, due to the limited image resolution, ship targets occupy only a few pixels. Additionally, there are discrepancies between the manually annotated bounding boxes and the actual ship boundaries.

Figure 2. The overall structure of DPCSANet. Backbone: CSPDarknet53 serves as the backbone network for efficient feature extraction, and the high-dimensional hybrid multiscale pooling module expands the feature fusion range. Neck: A simplified bidirectional feature fusion network (PAN) is used, and DPBlock, which contains a dual-path convolutional self-attention mechanism, is introduced to enhance the model’s ability to extract contextual information about the target and its surrounding features. To balance computational complexity against prediction recall, three detection branches, $C_{3}$, $C_{4}$, and $C_{5}$, are used to predict ships of different sizes. Head: The Focal CIoU regression loss function is used to improve the detection recall rate.

Figure 3. Illustration of the perceptual capabilities and feature extraction paths of CNNs, self-attention mechanisms, self-attention with convolution, and the proposed dual-path convolutional self-attention (DPCSA). Different colored arrows indicate independent feature extraction paths for each mechanism. CNNs capture local features through limited receptive fields, focusing primarily on local textures. Self-attention mechanisms capture global dependencies, allowing each position to interact with all other positions in the input. Self-attention with convolution combines both local and global feature extraction but often concatenates these features in the same channel, potentially leading to interference. In contrast, DPCSA parallelizes convolution and self-attention mechanisms, processing them independently to avoid interference. This design maximizes the extraction of both local and global features within a single feature map, enhancing the model’s ability to distinguish between target and background features.

Figure 4. The overall structure of DPBlock and the principle of DPCSA. DPCSA consists of two branches that use a multi-head self-attention mechanism [18] and two-dimensional convolution to extract global and local features, respectively. The local features are weighted by the learnable parameter $η$ and then fused with the global features extracted by the self-attention mechanism.

Figure 5. The structure of HHSPP.

Figure 6. Samples of (a) LEVIR-ship and (b) OSSD datasets.

Figure 7. Schematic illustration of the process for extracting and cropping ship samples from the DOTA dataset. The original images are segmented into 640 × 640 patches with a 200-pixel overlap, and only those containing ship targets are retained and re-labeled to form the DOTA-ship dataset.
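
The cropping step described in the Figure 7 caption can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the function names `tile_origins` and `crop_ship_patches` are hypothetical, and two details are assumptions not stated in the caption, namely that the last patch along each axis is shifted back to end at the image border and that a ship is assigned to a patch only when its box lies fully inside that patch. Only the patch size (640) and the overlap (200 pixels, giving a stride of 440) come from the caption.

```python
def tile_origins(length, patch=640, overlap=200):
    """Return the start offsets of patches along one image axis."""
    stride = patch - overlap  # 440 px for the values given in the caption
    if length <= patch:
        # Image is not larger than one patch; a single patch starts at 0.
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    # Assumption: add a final patch shifted back to end at the image border.
    if starts[-1] + patch < length:
        starts.append(length - patch)
    return starts


def crop_ship_patches(width, height, boxes, patch=640, overlap=200):
    """Map each patch origin (x0, y0) to the ship boxes it contains.

    boxes: list of (xmin, ymin, xmax, ymax) in image pixel coordinates.
    Patches without any ship are dropped; kept boxes are re-labeled in
    patch-local coordinates.
    """
    kept = {}
    for y0 in tile_origins(height, patch, overlap):
        for x0 in tile_origins(width, patch, overlap):
            local = []
            for xmin, ymin, xmax, ymax in boxes:
                # Assumption: keep a ship only if its box is fully inside.
                inside = (xmin >= x0 and ymin >= y0 and
                          xmax <= x0 + patch and ymax <= y0 + patch)
                if inside:
                    local.append((xmin - x0, ymin - y0, xmax - x0, ymax - y0))
            if local:
                kept[(x0, y0)] = local
    return kept


if __name__ == "__main__":
    # One small ship near the right edge of a 1000 x 640 image.
    print(crop_ship_patches(1000, 640, [(700, 100, 720, 110)]))
```

Other policies for ships that straddle a patch border (for example, clipping the box or assigning by box centre) are equally plausible; the caption does not say which one was used.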

Figure 8. Distribution of ship label length and width in different datasets. Different colors represent varying probability values, with lighter colors indicating higher probabilities of ship label occurrence. (a) LEVIR-ship, (b) OSSD, and (c) DOTA-ship.

Figure 9. Some detection results of the proposed DPCSANet on LEVIR-ship. The red boxes represent the ground truth annotations, while the blue boxes indicate the predicted bounding boxes.

Figure 10. Some detection results of the proposed DPCSANet on OSSD. The red boxes represent the ground truth annotations, while the blue boxes indicate the predicted bounding boxes.

Figure 11. Comparison of the detection results on DOTA-ship for different methods. (a) Ground truth; (b) YOLOv5s (baseline); (c) YOLOv7; (d) YOLOv9; (e) Cascade RCNN; (f) RetinaNet; (g) DINO; (h) DPCSANet (ours).

Figure 12. Influence of DPCSA, HHSPP, and Focal CIoU on feature extraction. Brighter colors indicate areas to which the model pays more attention.

Figure 13. Samples of three different scenes: clear scene, fractus or mist conditions, and thick fog or thick cloud conditions.

Figure 14. Examples of false positives and false negatives in DPCSANet testing. Red marks indicate false positives caused by cloud interference and misclassified ground structures. Yellow marks indicate missed detections (false negatives): some small ship targets were not detected because of their limited size and weak feature representation.

Table 1. Comparison with state-of-the-art methods on the LEVIR-ship dataset. The optimal result is represented in bold font.

Model	Backbone	Params/M	FLOPs/G	FPS	P/%	R/%	AP/%
YOLOv5s [25]	-	7.05	10.4	208	58.7	81.9	73.6
YOLOv7tiny [26]	-	6.01	13	256	71.8	65.9	65.2
YOLOv7l	-	36.48	103.2	127	77.4	71.6	69.8
YOLOv7x	-	70.78	188	78	76.9	63.2	70.4
YOLOv8n	-	3.01	8.1	222	73.3	63.6	67.4
YOLOv8s	-	11.13	28.4	185	79.2	71.9	75.1
YOLOv8m	-	25.84	78.7	96	78	72.6	72.0
YOLOv8l	-	43.61	164.8	54	77.2	71.4	74.0
YOLOv9c [27]	-	50.70	236.6	33	76.7	71.7	73.1
YOLOv9e	-	68.55	240.7	26	78.8	75.9	78.5
Faster RCNN [23]	VGG16	136.7	299.2	34	-	-	70.8
CenterNet [48]	Hourglass-104	191.2	584.6	35	-	-	77.7
RT-DETR [50]	ResNet50	42	136	29	-	-	70.9
	ResNet101	76	259	22	-	-	73.9
Cascade RCNN [49]	ResNet50	69.15	162	22	-	-	81.2
	ResNet101	88.14	209	18	-	-	81.3
DINO [51]	ResNet50	47.54	179	16	-	-	83.8
DRENet [17]	-	4.79	8.4	169	52.7	86.0	81.7
ImYOLOv5 [31]	-	8.64	16.7	139	42.1	86.5	78.9
FFCA-YOLO [43]	-	71.22	51.2	113	75.9	63.8	62.8
Imyolov8 [52]	-	5.31	55.0	135	75.2	68.2	67.6
DPCSANet (Ours)	-	6.20	9.6	149	66.1	88.2	85.0

Table 2. Comparison with the state-of-the-art methods on the OSSD dataset. The optimal result is represented in bold font.

Model	Backbone	P/%	R/%	AP/%
YOLOv5s [25]	-	82.3	85.9	87.0
YOLOX [54]	-	96.3	91.9	91.5
YOLOv7 [26]	-	84.1	85.6	89.7
YOLOv7x	-	87.1	85.4	91.0
EfficientDet [53]	ResNet50	-	-	79.4
Faster RCNN [23]	ResNet50	-	-	85.7
DRENet [17]	-	95.2	93.9	93.9
ImYOLOv5 [31]	-	93.4	91.9	91.5
FFCA-YOLO [43]	-	91.2	83.8	91.4
Imyolov8 [52]	-	97.9	93.5	95.3
DPCSANet (Ours)	-	93.6	95.9	95.9

Table 3. Comparison with the state-of-the-art methods on the DOTA-ship dataset. The optimal result is represented in bold font.

Model	Backbone	P/%	R/%	AP/%
YOLOv5s [25]	-	93.2	91.6	94.9
YOLOv7 [26]	-	92.4	92.8	94.0
YOLOv9 [27]	-	91.1	90.9	95.7
SSD [28]	VGG16	-	-	92.5
Faster RCNN [23]	ResNet50	-	-	82.9
Cascade RCNN [49]	ResNet50	-	-	91.5
RetinaNet [55]	ResNet50	-	-	72.9
DINO [51]	ResNet50	-	-	95.4
FFCA-YOLO [43]	-	90.7	90.4	94.6
Imyolov8 [52]	-	91.7	92.7	95.4
DPCSANet (Ours)	-	92.6	93.0	95.8
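
The AP gains over the baseline can be recomputed directly from Tables 1–3. The snippet below is a small arithmetic check, assuming YOLOv5s is the baseline (as labeled in Figure 11); the dictionary `AP` and the function `ap_gains` are illustrative names, and the values are copied from the AP/% columns of the three tables.

```python
# AP/% values copied from Tables 1-3: (YOLOv5s baseline, DPCSANet).
AP = {
    "LEVIR-ship": (73.6, 85.0),
    "OSSD": (87.0, 95.9),
    "DOTA-ship": (94.9, 95.8),
}


def ap_gains(table):
    """Absolute AP gain in percentage points, rounded to one decimal."""
    return {name: round(ours - base, 1) for name, (base, ours) in table.items()}


gains = ap_gains(AP)
for name, gain in gains.items():
    print(f"{name}: +{gain} AP")
```

The smallest and largest gains (DOTA-ship and LEVIR-ship, respectively) correspond to the 0.9% to 11.4% range quoted in the abstract.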

Table 4. Ablation experiments for each proposed module in DPCSANet on the LEVIR-ship test set. The optimal result is represented in bold font.

Baseline	DPCSA	HHSPP	Focal CIoU	Params/M	FLOPs/G	P/%	R/%	AP/%	Time/ms
√				7.05	10.4	58.7	81.9	73.6	4.8
√	√			6.07	9.5	65.3	85.5	83.0	6.6
√		√		7.18	10.5	60.9	82.5	76.3	4.9
√			√	7.05	10.4	59.1	84.4	77.7	4.8
√	√	√		6.20	9.6	65.5	86.1	83.5	6.7
√	√	√	√	6.20	9.6	66.1	88.2	85.0	6.7
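
The Time/ms column in Table 4 can be related to the FPS column in Table 1 by a simple reciprocal. The sketch below assumes that Time/ms is the per-image inference latency and that FPS is derived from it as 1000 / time; the source does not state this explicitly, and `fps_from_latency` is a hypothetical helper name.

```python
def fps_from_latency(time_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / time_ms


# Latencies copied from Table 4 (first and last rows).
for name, ms in (("YOLOv5s baseline", 4.8), ("DPCSANet", 6.7)):
    print(f"{name}: {round(fps_from_latency(ms))} FPS")
```

Under this assumption, the rounded values agree with the FPS reported for YOLOv5s and DPCSANet in Table 1 (208 and 149).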

Table 5. Ablation studies of different self-attention methods. The optimal result is represented in bold font.

Method	P/%	R/%	AP/%	Time/ms
Baseline	58.7	81.9	73.6	4.8
+ CA [34]	57.9	83.7	77.8	5.1
+ SA [56]	60.9	84.4	78.7	5.6
+ MHSA [18]	63.8	84.8	81.6	6.0
+ CRM [45]	63.2	84.9	82.5	6.5
+ DPCSA	65.3	85.5	83.0	6.6

Table 6. Comparison results of different spatial pyramid pooling methods. The optimal result is represented in bold font.

Method	P/%	R/%	AP/%
SPP (baseline)	58.7	81.9	73.6
SPPF	60.2	81.7	74.3
HSPP [31]	56.5	82.3	74.1
HHSPP	60.9	82.5	76.3

Table 7. Ablation studies of different HHSPP structures. “With residual”: half of the channels of the input features pass through a residual branch and are merged with the pooled features. The optimal result is represented in bold font.

Construction		Params/M	FLOPs/G	P/%	R/%	AP/%	Time/ms
without residual	max 5 + 5	6.92	10.4	56.6	82.1	74.3	4.6
	max 5 + (5, 9)	7.05	10.4	58.7	82.5	74.5	4.7
	max 5 + (5, 9, 13)	7.18	10.5	60.9	82.5	76.3	4.8
with residual	max 5 + (5, 9, 13)	7.05	10.4	58.1	81.5	73.9	4.6

Table 8. Comparison of different IoU regression losses on DPCSANet. The optimal result is represented in bold font.

Methods	P/%	R/%	AP/%
CIoU [20] (baseline)	65.2	87.1	83.7
DIoU [20]	59.2	84.8	81.1
EIoU [21]	61.6	83.0	79.6
$α$-CIoU [22] ($α = 3$)	68.7	82.2	79.3
Focal CIoU	66.1	88.2	85.0

Table 9. Ablation studies of Focal CIoU with different $γ$ values. The optimal result is represented in bold font.

$γ$	4	2	1	0.5	0.25
P/%	67.3	65.0	67.0	66.1	63.6
R/%	85.0	85.9	86.6	88.2	85.1
AP/%	82.7	82.3	83.6	85.0	81.4

Table 10. Comparison of AP for various models in different scenes. The optimal result is represented in bold font.

Scenes	YOLOv5s [25]	YOLOv7 [26]	YOLOv9 [27]	FFCA-YOLO [43]	Imyolov8 [52]	DPCSANet
Clear	77.4	79.0	67.1	64.1	64.9	84.4
Fractus or Mist	69.8	65.9	73.3	63.9	67.1	83.2
Thick Fog or Thick Clouds	62.8	68.2	64.4	53.6	58.3	68.7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
