3.2. Proposed C2f_DCNv3
The C2f_DCNv3 structure proposed in this paper is shown in Figure 3. C2f_DCNv3 first processes the input image using a 1 × 1 convolution (Conv1), doubling the number of channels in the output feature map to enhance the model’s feature representation capability. The input feature map is then split into two parts by the Split module. One part enters the Bottleneck module, while the other part is directly involved in subsequent concatenation. After splitting, the feature map is processed layer by layer through multiple Bottleneck modules, where deformable convolution (DCNv3) is used to extract deeper features. The Bottleneck uses shortcut connections to enhance gradient propagation and information flow. Finally, a convolution (Conv2) is used to compress the number of channels in the concatenated feature map to the desired output channels. One key issue in small target ship detection in UAV aerial images of the ocean is how to adapt to geometric variations in object scale, posture, and deformation, and how to distinguish ship targets from the marine background, such as waves.
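For clarity, a minimal PyTorch-style sketch of the C2f_DCNv3 block described above is given below. It assumes a DCNv3 operator is available (here replaced by a plain convolution placeholder so the sketch runs); the class names, channel sizes, and number of bottlenecks are illustrative and do not correspond to the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class DCNv3Layer(nn.Module):
    """Placeholder for a DCNv3 operator; in practice this would come from an
    existing DCNv3 implementation. A regular conv keeps the sketch runnable."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        return self.conv(x)

class Bottleneck_DCNv3(nn.Module):
    """Bottleneck with a DCNv3 layer and an optional shortcut connection."""
    def __init__(self, channels, shortcut=True):
        super().__init__()
        self.cv1 = nn.Conv2d(channels, channels, 1)
        self.dcn = DCNv3Layer(channels, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.dcn(self.cv1(x))
        return x + y if self.add else y   # shortcut aids gradient propagation

class C2f_DCNv3(nn.Module):
    """C2f-style block: Conv1 expands channels, Split, n DCNv3 bottlenecks,
    concatenation, then Conv2 compresses to the desired output channels."""
    def __init__(self, c_in, c_out, n=2, shortcut=True):
        super().__init__()
        self.c = c_out // 2                                  # hidden channels per branch
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1)            # Conv1: channel expansion
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1)     # Conv2: channel compression
        self.m = nn.ModuleList(Bottleneck_DCNv3(self.c, shortcut) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).split((self.c, self.c), dim=1))  # Split into two parts
        for m in self.m:
            y.append(m(y[-1]))                                 # each bottleneck feeds the next
        return self.cv2(torch.cat(y, dim=1))                   # concatenate, then compress
```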
Figure 4a illustrates the rigid feature mapping of standard convolution. The blue vertical lines represent the fixed receptive fields of conventional convolutional kernels. The upper-layer red dots indicate the predefined anchor points of the standard convolution kernels, while the lower-layer red dots represent the expanded receptive field regions. The geometric centers of the receptive fields in standard convolution strictly align with the grid coordinates.
Figure 4b shows a regular square grid (referred to as the base grid), where the uniformly distributed black dots denote the fixed sampling locations of standard convolution (e.g., a 3 × 3 grid). The red dots represent the deformed sampling positions obtained by applying learnable offsets to the original black dots. For instance, if the coordinate of a black dot is (0, 0) and the offset is Δp = (0.2, −0.3), then the corresponding sampling location for deformable convolution becomes (0.2, −0.3). These offsets Δp are generated by the network through learning.
Figure 4c illustrates the dynamic adaptability of deformable convolution. The upper-layer red dots represent the predefined anchor points of the deformable convolution kernel, while the lower-layer red dots reflect the receptive field locations that adaptively shift according to the target.
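To make the offset example concrete, the following sketch performs bilinear interpolation at a fractionally offset sampling location, which is how deformable convolution reads feature values at non-integer positions such as (0.2, −0.3). The helper function is purely illustrative, and clamping out-of-bounds coordinates to the border is one common choice, not a requirement of the method.

```python
import torch

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a feature map feat (H x W) at fractional (y, x)."""
    H, W = feat.shape
    # Out-of-bounds coordinates are clamped to the border here (one common choice).
    y = max(0.0, min(float(y), H - 1.0))
    x = max(0.0, min(float(x), W - 1.0))
    y0, x0 = int(y), int(x)                          # top-left integer neighbor
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dy, dx = y - y0, x - x0
    # Weighted combination of the four surrounding grid values.
    return ((1 - dy) * (1 - dx) * feat[y0, x0] +
            (1 - dy) * dx       * feat[y0, x1] +
            dy       * (1 - dx) * feat[y1, x0] +
            dy       * dx       * feat[y1, x1])

feat = torch.arange(25, dtype=torch.float32).reshape(5, 5)
# Base sampling point (0, 0) plus a learned offset of (0.2, -0.3):
# the deformable kernel reads the feature at (0.2, -0.3) instead of (0, 0).
value = bilinear_sample(feat, 0.0 + 0.2, 0.0 - 0.3)
print(value)
```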
To more intuitively demonstrate the effectiveness of deformable convolution in enhancing the feature extraction of small ship targets, we visualized the feature maps of the 2nd, 4th, and 6th layers in the backbone of YOLO-ssboat during dense small ship detection, corresponding to layers C2, C3, and C4 in Figure 2.
Figure 5a shows the feature extraction heatmaps using standard convolution, while Figure 5b presents the results obtained with C2f_DCNv3. In the shallow layers with higher feature resolution (i.e., the 2nd and 4th layers), the standard convolution heatmaps reveal large blue areas in the background, such as ocean waves. In contrast, the use of C2f_DCNv3 significantly reduces the blue background regions and enhances the contrast between the targets and the background. In the deeper layers with lower resolution (e.g., the 6th layer), standard convolution tends to yield dispersed feature responses, whereas deformable convolution maintains concentrated yellow response regions due to its adaptive receptive field, indicating stronger capability in capturing long-range features of deformable targets. The more concentrated distribution of yellow areas in the deformable convolution heatmaps reflects improved target localization accuracy. Notably, in the complex scenes represented by the 6th layer, the target edges and local features still exhibit high-contrast responses. This characteristic enables the network to maintain robust feature extraction performance even when ship targets undergo deformation or occlusion.
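In principle, heatmaps such as those in Figure 5 can be produced by registering forward hooks on the backbone layers of interest and plotting the channel-averaged activations. The sketch below outlines this procedure; the layer indexing (model.model[idx]) and the plotting choices are assumptions rather than the exact visualization code used here.

```python
import torch
import matplotlib.pyplot as plt

def visualize_layer_heatmaps(model, image, layer_indices=(2, 4, 6)):
    """Capture feature maps at the given backbone layer indices and plot
    channel-mean activation heatmaps (image: 1 x 3 x H x W tensor)."""
    captured, hooks = {}, []
    for idx in layer_indices:
        layer = model.model[idx]  # placeholder: indexing depends on the model definition
        hooks.append(layer.register_forward_hook(
            lambda m, inp, out, idx=idx: captured.__setitem__(idx, out.detach())))
    with torch.no_grad():
        model(image)
    for h in hooks:
        h.remove()
    fig, axes = plt.subplots(1, len(layer_indices))
    for ax, idx in zip(axes, layer_indices):
        heat = captured[idx][0].mean(dim=0)                      # average over channels
        heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
        ax.imshow(heat.cpu().numpy(), cmap="jet")
        ax.set_title(f"layer {idx}")
        ax.axis("off")
    plt.show()
```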
3.3. Proposed MSWPN
Small target ships in the ocean occupy very few pixels, and as they pass through multiple convolution layers, their features may be progressively lost. As illustrated in the heatmap in Figure 6a, the resolution decreases from 160 × 160 at layer C2 to 20 × 20 at layer C5. As the number of convolution layers increases, the features corresponding to small target ships are gradually diminished. In deeper layers, the activation of small target ships becomes less prominent, while background noise, such as ocean waves, is more likely to be activated. From the heatmap, it is evident that the C2 layer exhibits stronger activation for small target ships than the other layers. To enhance the detection of small target ships, the C2 detection layer is integrated into the feature fusion network. The fusion of feature maps at different scales is crucial for small-target ship detection. The proposed Multi-Scale Weighted Fusion Network (MSWPN), as depicted in Figure 6, addresses this by incorporating the C2 feature layer and simultaneously connecting inputs and outputs at the same scale. For instance, P4(1), P4(2), and P4(3) are feature maps with a resolution of 40 × 40, and these are fused together to aggregate more features without significantly increasing computational cost. In the typical approach of fusing feature maps from different sources, the maps are usually resized to a uniform resolution before being added together. However, the contribution of each feature map to the final output is not necessarily uniform. As shown in the heatmap in Figure 6a, lower-level feature maps clearly provide more significant contributions for small target detection, so the conventional approach may not be optimal. To address this limitation, we assign a weight to each feature map, and the output is given by:
O = Σ_i ( w_i / (ε + Σ_j w_j) ) · I_i

In the equation, I_i represents the i-th input feature map, w_i denotes the weight assigned to each feature map, and ε = 0.0001 is used to maintain numerical stability. This formulation ensures that each normalized weight lies between 0 and 1. For instance, in Figure 6, node P4(2) has two inputs, P5(1) and P4(1), to which weights w_1 and w_2 are assigned, respectively. The features of this node can be expressed by the following equation:

P4(2) = ( w_1 · Resize(P5(1)) + w_2 · P4(1) ) / ( w_1 + w_2 + ε )

Here, Resize(·) denotes resizing P5(1) to the resolution of P4(1) before fusion.
The heatmap visualization of features fused at different scales using MSWPN is shown in Figure 6b. After the multi-scale weighted feature fusion, the activation of the small targets is significantly enhanced, while the activation in the background, particularly the ocean waves, is notably reduced.
In MSWPN, the weights are learnable parameters that are continuously adjusted during training, and their values may vary across different datasets. The weights are initialized with small random values and constrained to be non-negative through the ReLU activation function, ensuring w_i ≥ 0 so that negative weights cannot disrupt the rationality of the feature fusion. During training, the weights are optimized via gradient descent to adaptively balance the contributions of features at different resolutions. The constant ε is fixed at 0.0001 to avoid division by zero when all weights approach zero and to stabilize gradient computation, thereby preventing numerical explosion during backpropagation. To ensure reproducibility and consistency with the original implementation, the weights must be initialized with small random values, non-negativity must be enforced, and ε = 0.0001 must be used in the denominator without omission or arbitrary modification.
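A minimal PyTorch sketch of a single weighted-fusion node following the equations above is shown below (learnable weights, ReLU non-negativity constraint, and ε = 0.0001 in the denominator). The module name, initialization scale, and resizing choice are illustrative assumptions rather than the exact implementation used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse n feature maps of the same shape with learnable normalized weights."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        # Small random initial weights, kept non-negative via ReLU in forward().
        self.w = nn.Parameter(torch.rand(n_inputs) * 0.1)
        self.eps = eps

    def forward(self, feats):
        w = F.relu(self.w)                        # enforce w_i >= 0
        w = w / (w.sum() + self.eps)              # normalize; eps avoids division by zero
        return sum(wi * fi for wi, fi in zip(w, feats))

# Example: fuse P4(1) with a resized P5(1) to form P4(2) (shapes are illustrative).
p4_1 = torch.randn(1, 256, 40, 40)
p5_1 = torch.randn(1, 256, 20, 20)
p5_up = F.interpolate(p5_1, scale_factor=2, mode="nearest")   # resize to P4 resolution
fuse = WeightedFusion(n_inputs=2)
p4_2 = fuse([p5_up, p4_1])
```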
3.4. Proposed Dyhead_DCNv3
Traditional detection heads, when processing objects of varying sizes, are typically designed for specific target dimensions at each scale, and predictions at each position are generated independently. In addition, the original YOLOv8 detection head lacks dynamic learning capabilities, which significantly limits the detection of multi-scale objects, particularly small targets. The variation in object scale is closely tied to features at different hierarchical levels, and enhancing the representational learning of the feature tensor F at these levels significantly improves scale perception, particularly for small vessel detection. The geometric transformations of vessels of varying shapes are closely associated with spatial positional features across different levels, and enhancing the representational learning of F at diverse spatial locations contributes significantly to improved spatial awareness in object detection. Moreover, distinct object representations and their associated tasks are frequently linked to features in different channels, so enhancing the representational learning of F across channels significantly benefits task-specific object detection.
The improved Dyhead_DCNv3 proposed in this paper is shown in Figure 7. It unifies scale-aware attention π_L(·), spatial-aware attention π_S(·), and task-aware attention π_C(·), thereby better integrating contextual information. By utilizing DCNv3, spatial-aware attention enhances the model’s ability to perceive the position of small target vessels. Given the feature tensor F ∈ R^(L×S×C), the attention applied by the detection head can be expressed as:

W(F) = π(F) · F

where π(·) represents the attention function, and L, S, C correspond to the scale, spatial, and task dimensions, respectively. In Dyhead_DCNv3, the attention function is decomposed into three consecutive attention mechanisms, each focusing on a single dimension:

W(F) = π_C( π_S( π_L(F) · F ) · F ) · F

In the equation, π_L(·), π_S(·), and π_C(·) are three distinct attention functions, each operating on the L, S, and C dimensions, respectively.
- 1. Scale Attention
First, scale-aware attention is introduced to dynamically fuse features based on the semantic importance of different scales:

π_L(F) · F = σ( f( (1/(SC)) Σ_(S,C) F ) ) · F

In the equation, f(·) is a linear function approximated by a 1 × 1 convolutional layer, and σ(x) = max(0, min(1, (x + 1)/2)) is a hard sigmoid function.
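A sketch of the scale-aware attention consistent with the equation above is given below: the feature tensor is averaged over the spatial and channel dimensions, passed through a 1 × 1 convolution approximating f(·), and gated by the hard sigmoid σ. Applying the convolution along the level dimension is an implementation assumption of this sketch.

```python
import torch
import torch.nn as nn

def hard_sigmoid(x):
    # sigma(x) = max(0, min(1, (x + 1) / 2))
    return torch.clamp((x + 1.0) / 2.0, min=0.0, max=1.0)

class ScaleAwareAttention(nn.Module):
    """Scale-aware attention over a feature tensor F of shape (B, L, S, C),
    where L = number of levels, S = H*W spatial positions, C = channels."""
    def __init__(self):
        super().__init__()
        # f(.) approximated by a 1x1 convolution acting on the level dimension.
        self.f = nn.Conv1d(1, 1, kernel_size=1)

    def forward(self, feat):
        pooled = feat.mean(dim=(2, 3))                                # average over S and C -> (B, L)
        gate = hard_sigmoid(self.f(pooled.unsqueeze(1))).squeeze(1)   # per-level gate (B, L)
        return gate[..., None, None] * feat                           # pi_L(F) * F

# Example: 3 levels, 40*40 spatial positions, 256 channels
out = ScaleAwareAttention()(torch.randn(2, 3, 1600, 256))
```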
- 2. Spatial Attention
The spatial module is decomposed into two steps. First, deformable convolution DCNv3 is employed to induce sparsity in the attention-learning process. Then, features are aggregated across levels at the same spatial location:

π_S(F) · F = (1/L) Σ_(l=1…L) Σ_(k=1…K) w_(l,k) · F(l; p_k + Δp_k; c) · Δm_k

Here, K denotes the number of sparse sampling positions, p_k + Δp_k represents the position shifted by the self-learned spatial offset Δp_k to focus on a discriminative region, and Δm_k is the self-learned importance scalar at position p_k. Both are learned from the median-level input features of F.
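The sketch below illustrates the structure of this spatial step. Since an off-the-shelf DCNv3 operator is not assumed here, torchvision’s deform_conv2d (a DCNv2-style modulated deformable convolution) is used as a stand-in, with offsets and modulation masks predicted from the median-level features and the outputs averaged across levels; this mirrors the structure of the equation but is not the exact DCNv3 implementation used in this paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class SpatialAwareAttention(nn.Module):
    """Spatial step of the dynamic head: deformable sampling per level,
    then aggregation across levels at the same spatial location."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        k = kernel_size
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)
        # Offsets (2 per sampling point) and modulation masks (1 per point)
        # are predicted from the median-level features.
        self.offset_mask = nn.Conv2d(channels, 3 * k * k, 3, padding=1)
        self.k = k

    def forward(self, feats):
        # feats: list of L feature maps, each (B, C, H, W), already resized to one scale
        mid = feats[len(feats) // 2]
        om = self.offset_mask(mid)
        offset = om[:, :2 * self.k * self.k]
        mask = om[:, 2 * self.k * self.k:].sigmoid()
        outs = [deform_conv2d(f, offset, self.weight, padding=1, mask=mask) for f in feats]
        return torch.stack(outs).mean(dim=0)      # aggregate across levels

# Example: three levels resized to 40x40 with 256 channels
feats = [torch.randn(1, 256, 40, 40) for _ in range(3)]
out = SpatialAwareAttention(256)(feats)
```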
- 3. Task Attention
To achieve joint learning and generalize diverse object representations, task-aware attention is deployed in the final stage. It dynamically toggles feature channels ON and OFF to support different tasks:

π_C(F) · F = max( α¹(F) · F_c + β¹(F), α²(F) · F_c + β²(F) )

In the equation, F_c is the slice of the feature tensor at the c-th channel, and [α¹, α², β¹, β²] = θ(·) is a hyper-function that learns to control the activation thresholds. First, global average pooling is performed over the L × S dimensions to reduce dimensionality. This is followed by two fully connected layers and a normalization layer. Finally, a shifted sigmoid function is applied to normalize the output to the range [−1, 1].
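A sketch of the task-aware attention consistent with the description above is given below: global average pooling over the L × S dimensions, two fully connected layers with a normalization layer, a shifted sigmoid producing parameters in [−1, 1], and a per-channel maximum of the two resulting affine responses. The hidden width and parameter ordering are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TaskAwareAttention(nn.Module):
    """Task-aware attention on F of shape (B, L, S, C): learns per-channel
    parameters [alpha1, alpha2, beta1, beta2] and keeps the stronger response."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.theta = nn.Sequential(                   # hyper-function theta(.)
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 4 * channels),
            nn.LayerNorm(4 * channels),
        )

    def forward(self, feat):
        B, L, S, C = feat.shape
        pooled = feat.mean(dim=(1, 2))                # global average pool over L x S -> (B, C)
        params = self.theta(pooled)
        params = 2.0 * torch.sigmoid(params) - 1.0    # shifted sigmoid into [-1, 1]
        a1, a2, b1, b2 = params.view(B, 4, C).unbind(dim=1)
        a1, a2, b1, b2 = (p[:, None, None, :] for p in (a1, a2, b1, b2))
        # Keep, per channel, the stronger of two learned affine responses.
        return torch.max(a1 * feat + b1, a2 * feat + b2)

out = TaskAwareAttention(256)(torch.randn(2, 3, 1600, 256))
```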