4.2. Specific Setup and Evaluation Metrics
All experiments in this paper were performed on a system running Ubuntu 20.04 with an NVIDIA GeForce RTX 4090 GPU. For a fair comparison on the SSDD+ and RSDD-SAR datasets, we adopted the image allocation methods of [4] and [5], respectively, for model training and comparison, and the input images were uniformly resized to 512 × 512 pixels. For the HRSID dataset, we adopted the image allocation method of [6], and the input images were uniformly resized to 800 × 800 pixels. During training, RRandomFlip was the only online augmentation strategy applied. RRandomFlip includes horizontal, vertical, and diagonal flipping, which together increase the diversity of ship orientations in the training data. The optimizer was AdamW, with an initial learning rate of 0.0002, a momentum of 0.9, and a weight decay of 0.005. At the start of training, a learning rate warm-up strategy was adopted for the first 300 iterations, with the warm-up starting learning rate set to one-third of the initial learning rate. After warm-up, the learning rate followed a linear decay schedule. Considering hardware limitations and model performance, the batch size was set to 8, and training ran for 200 epochs.
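The warm-up and decay schedule described above can be illustrated with a minimal sketch. The base learning rate, the 300 warm-up iterations, and the one-third starting ratio come from the text; the linear shape of the warm-up, the decay endpoint of zero, and the function name `learning_rate` and its `total_steps` argument are assumptions made for illustration only, since the text does not specify them.

```python
def learning_rate(step, total_steps, base_lr=2e-4,
                  warmup_steps=300, warmup_ratio=1.0 / 3.0):
    """Learning rate at a given iteration.

    Assumptions (not stated in the text): the warm-up ramp is linear,
    and the linear decay ends at zero on the last iteration.
    """
    if step < warmup_steps:
        # Ramp linearly from warmup_ratio * base_lr up to base_lr.
        frac = step / warmup_steps
        return base_lr * (warmup_ratio + (1.0 - warmup_ratio) * frac)
    # Linear decay from base_lr (end of warm-up) down to 0 (end of training).
    decay_steps = max(total_steps - warmup_steps, 1)
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / decay_steps


# Hypothetical run length; the real value depends on dataset size,
# the batch size of 8, and the 200 training epochs.
total = 10_000
for step in (0, 150, 300, 5_000, 10_000):
    print(step, f"{learning_rate(step, total):.2e}")
```

In this sketch the rate starts at one-third of 0.0002, reaches 0.0002 at iteration 300, and then falls linearly for the rest of training.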
To assess the performance of the various methods, we used the following metrics: Precision (P), Recall (R), and Mean Average Precision (mAP). Recall reflects the ability of the model to detect positive samples, and precision reflects the classification ability of the model, that is, the fraction of its detections that are correct. Neither metric alone reflects the overall performance of the model, so mAP combines the two into a single comprehensive measure. Precision and recall are defined as
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$
where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. The average precision (AP) is calculated as the area under the Precision–Recall (PR) curve:
$$AP = \int_{0}^{1} P(R)\, dR.$$
We used the AP to comprehensively evaluate the model’s performance at various Intersection over Union (IoU) thresholds, with the default threshold set to 0.5. In our detection task, the mean average precision (mAP) is the same as the AP, given that the task pertains to a single object class. Additionally, we computed the mAP for both the inshore and offshore test sets to evaluate the efficacy of our method under diverse conditions.
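As a concrete illustration of these metrics, the following sketch computes precision and recall from TP, FP, and FN counts, and AP as the area under the PR curve for a ranked list of detections. It assumes each detection has already been marked as a true or false positive at the chosen IoU threshold, and it uses all-point interpolation with a monotone precision envelope, which is one common convention; the text does not state which interpolation was used. The function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(scores, is_tp, num_gt):
    """Area under the PR curve (all-point interpolation, an assumption).

    scores: confidence of each detection
    is_tp:  whether each detection matched a ground-truth box
    num_gt: total number of ground-truth boxes
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from left to right (envelope).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Toy example: four detections, four ground-truth ships, one ship missed.
p, r = precision_recall(tp=3, fp=1, fn=1)
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], num_gt=4)
print(p, r, ap)
```

Because the task has a single object class, the mAP in this setting is simply this AP value.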
4.3. Ablation Study
To validate the contribution of each component and the overall efficacy of the proposed method, we conducted a series of ablation experiments on the SSDD+ dataset. To ensure a valid comparison, all experiments used consistent settings. The Oriented RepPoints network with a ResNet-50 backbone served as the baseline. The first module replaced the ResNet-50 backbone with the MSTFE backbone, and the remaining modules were then added to the baseline network progressively; the overall results are shown in Table 2. Compared to the baseline, our detection network achieved significant improvements in the evaluation metrics across both inshore and offshore scenarios. Specifically, it improved the AP50 by 14.1%, 1.9%, and 2.1% in the inshore, offshore, and hybrid scenarios, respectively. In addition, we achieved 3.8% and 29.4% improvements in recall and precision, respectively, in the hybrid scenario. These results show that our method improves the detection coverage and localization accuracy of ship targets in inshore and offshore scenarios while reducing false positives. Next, we analyze the contribution of each component in detail.
Table 2 and Table 3 present the precision and recall metrics only for the hybrid scenario. However, offshore scenes usually contain a large number of small ships, whereas inshore scenes contain many large and medium-sized ships with interference from the surrounding land, so evaluating each scene separately better reflects the contribution of each component to ship detection in different scenes. Table 4, Table 5, Table 6, and Table 7 therefore present the precision and recall metrics for the offshore and inshore scenes.
The Impact of MSTFE: The proposed MSTFE backbone focuses on mitigating the feature annihilation problem for small objects while extracting multiscale ship features. In this experiment, we changed only the backbone of the Oriented RepPoints model and kept the rest of the network unchanged. As shown in Table 4, compared to ResNet-50 and Swin Transformer, the MSTFE backbone produced strong results in both the inshore and offshore scenarios, which indicates that it can effectively extract multiscale ship target features. To handle the large variation in ship size, MSTFE extracts multiscale ship features through the MKHP module, which uses parallel multikernel convolution; it improved precision by 26.7% and 17.4% in the inshore and offshore scenes, respectively, compared to the baseline backbone. To address the insufficient feature extraction of tiny ships in offshore scenes, MSTFE introduces a triple-attention module that enhances the key features of tiny targets, which contributed to the 17.4% precision gain in offshore scenes. MSTFE thus not only significantly improves the detection and localization of medium and large ship targets in inshore areas, but also improves the detection of small targets in offshore areas. These results demonstrate the detection capabilities and advantages of the MSTFE backbone in multiscale scenes.
Table 5 demonstrates the advantages of triple attention (TA) over other attention mechanisms in small-ship feature extraction. We conducted comparative experiments between TA, channel attention (CA), and spatial attention (SA). The results show that, in inshore scenarios, although the precision of TA decreased, its recall and AP for small ships remained comparable to those of CA and SA. In offshore scenarios, both the recall and AP of TA exceeded those of CA and SA, indicating that TA performs better in offshore small-ship detection. Therefore, although TA is slightly weaker in inshore scenarios, its detection capability in offshore scenarios meets our requirements.
To further examine the contribution of each component of the MSTFE backbone to feature extraction, we also conducted ablation experiments on its three modules: the multikernel heterogeneous perception module (MKHP), the triple-attention module (TA), and the convolutional perceptron module (CPM). As shown in Table 6, when only the MKHP was used for feature extraction, the R, P, and AP metrics in both the inshore and offshore scenarios improved significantly, with precision increasing by 28% and 11.4%, respectively, which indicates that the model is better able to capture multiscale features. Next, the TA module was added to the MKHP, and the recall and precision in the offshore scene, which contains a large number of small targets, increased by 2.0% and 3.8%, respectively, indicating that TA effectively improves the detection coverage and accuracy for small targets. However, although the effect of its multiple attention branches is significant in the offshore scene, there is a risk of overfitting in the inshore scene, which affects the discrimination ability of the model. Then, after removing the TA module and introducing the CPM module, the inshore AP improved greatly to 77.7% compared to the baseline, and the offshore AP remained stable at 90.7%. Meanwhile, the precision in the inshore and offshore scenarios improved by 9.1% and 19.0%, respectively, compared to the configuration without CPM, which suggests that the nonlinear transformations in CPM enhance the model’s discriminative ability. Finally, when the three components were combined, the inshore AP reached 74.1% and the offshore AP reached 90.0%, indicating that the modules are complementary. These experiments verify the effectiveness of the MSTFE backbone and the role of each component in the feature extraction process.
In addition, for a more intuitive comparison, we visualized the features extracted by the backbone network. As shown in Figure 7, compared to the ResNet-50 backbone of the baseline network, MSTFE extracted the features of ship targets at different scales more accurately while also effectively suppressing clutter noise and other interference. The Swin Transformer backbone, which is limited to the perception of local details by its window partitioning and shifting mechanism, also extracted a large number of unrelated but similar features, resulting in considerable redundancy. Our method improves on the Swin Transformer backbone with multiscale, multiattention, and deep multiple architectures, extracting features more accurately and with less redundancy. The visualized backbone features illustrate the accuracy of the features extracted by our method.
The Impact of CAT2D: CAT2D first performs adaptive feature selection for each task by introducing a channel attention mechanism, which highlights the key features relevant to that task. The filtered features are then fed into the respective task-adaptive networks for prediction, giving the detection head a high degree of flexibility to cope with the differing demands of each task and improving overall performance. According to the experimental results in Table 7, CAT2D improved precision by 31.0% and 17.5% in the inshore and offshore scenarios, respectively, compared to the baseline network. This significant improvement highlights both the strong performance of CAT2D in complex backgrounds and its efficiency in feature selection. Meanwhile, CAT2D also improved recall by 4.9% and 2.0% in the inshore and offshore scenarios, respectively. Furthermore, it achieved these detection results with only a small number of additional parameters. These results demonstrate that CAT2D has significant advantages in feature selection and adaptation between tasks, which greatly improves the model’s ability to detect ship targets.
The Impact of SAB: The size-aware balanced loss addresses the model-fitting imbalance caused by data imbalance during network training. Based on the distribution of ship sizes in the two public datasets shown in Table 1, we conducted comparative experiments with different settings of the three modulation factors in Equation (20). To guarantee the effectiveness and stability of the model, the sum of the three factors was set to 1. The results are shown in Table 3, where the row with all three factors equal to 0 indicates that no SAB was introduced, that is, the baseline method. The SAB notably improved the recall of the model and steadily enhanced the precision and average precision compared to the baseline. The best model accuracy was achieved when the three factors were set to 0.69, 0.30, and 0.02, respectively. Therefore, selecting appropriate modulation factors is crucial for enhancing the performance of the model in SAR ship detection, particularly when dealing with size-distribution imbalance and multiscale targets. These factors increase the weight of large targets in the loss function, allowing the model to concentrate more on these often-overlooked targets and thus improving its detection capability. This study is the first to deeply integrate a physical scale prior into loss function design; through the theory-driven weight allocation mechanism, we not only overcome the size distribution imbalance problem in SAR ship detection but also establish a general framework that can be extended to multiscale target detection. Compared to the method without SAB, the recall and precision improved by 4.6% and 10.2%, respectively, under the same computational resources, which offers a new approach to target detection in data imbalance scenarios.
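To make the idea of size-dependent modulation factors concrete, the sketch below shows one simple way such factors could reweight per-object losses. It is not a reproduction of Equation (20), whose exact form is not given in this section. The area thresholds (32² and 96² pixels, a COCO-style convention), the function names, and the example factor values are all assumptions chosen for illustration, and the assignment of factors to size groups here is hypothetical.

```python
def size_category(area, small_max=32 * 32, medium_max=96 * 96):
    """Return 0 (small), 1 (medium), or 2 (large).

    The thresholds are assumed COCO-style values, not taken from the paper.
    """
    if area < small_max:
        return 0
    if area < medium_max:
        return 1
    return 2


def size_weighted_loss(losses, areas, factors):
    """Mean of per-object losses, each scaled by its size group's factor.

    factors: (small, medium, large) modulation factors. They are
    renormalized to sum to 1, mirroring the constraint in the text.
    """
    if not losses:
        return 0.0
    total_factor = sum(factors)
    weights = [f / total_factor for f in factors]
    weighted = sum(weights[size_category(a)] * l for l, a in zip(losses, areas))
    return weighted / len(losses)


# Hypothetical example: equal raw losses for a small, medium, and large ship.
# The factor values are illustrative only and favor large targets.
loss = size_weighted_loss(
    losses=[1.0, 1.0, 1.0],
    areas=[100, 2_000, 20_000],
    factors=(0.2, 0.3, 0.5),
)
print(round(loss, 4))
```

With equal raw losses, the large ship contributes the most to the total in this example, which is the qualitative effect the text attributes to the modulation factors.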
4.4. Comparison with Other Advanced Methods
To assess the performance of the proposed method more comprehensively, we compared it with other advanced methods based on convolutional neural networks. Nine methods were selected for comparison: Oriented R-CNN [13], RoI Transformer [12], Gliding Vertex [53], ReDet [14], R3Det [15], S2A-Net [54], Rotated Retina [51], Rotated FCOS [55], and Oriented RepPoints [18]. These experiments were conducted on the SSDD+ dataset and covered both two-stage and one-stage algorithms. We used three evaluation metrics (precision, recall, and average precision) across the inshore and offshore scenarios. The experimental results are presented in Table 8.
As can be seen from Table 8, the two-stage algorithms (Oriented R-CNN, RoI Transformer, Gliding Vertex, and ReDet) achieved good precision while maintaining recall; however, their drawbacks include high computational costs and large parameter counts. Oriented R-CNN, for example, has 41.35 MB of parameters and a computational load of 63.28 GFLOPs. In contrast, the single-stage algorithms (R3Det, S2A-Net, Rotated Retina, Rotated FCOS, and Oriented RepPoints) generally have fewer parameters but exhibited lower precision while maintaining recall. Our method has only 24.36 MB of parameters, the fewest among all compared algorithms. Its computational complexity is 45.85 GFLOPs, which is significantly lower than that of the two-stage algorithms (such as Oriented R-CNN at 63.28 GFLOPs and RoI Transformer at 77.15 GFLOPs) and also lower than that of some single-stage algorithms (such as R3Det at 82.17 GFLOPs and S2A-Net at 49.07 GFLOPs). This indicates that the model has a significant advantage in lightweight design, with lower memory and computational overhead during inference, making it suitable for resource-constrained scenarios. Within this computational budget, our method achieved a recall of 96.3%, a precision of 85.1%, and an AP50 of 91.3%, significantly outperforming the other two-stage and single-stage algorithms in recall and average precision. For example, compared to the Oriented RepPoints method, which has a similar scale (36.60 MB of parameters and a computational load of 48.56 GFLOPs), our method reduces the parameter total by 33.4% and the computational complexity by 5.6% while improving recall by 3.8% and AP50 by 2.1%, validating the joint optimization of computational efficiency and detection performance.
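The relative reductions quoted above follow directly from the parameter and GFLOPs figures in the text. The short check below reproduces the arithmetic; the helper name `relative_reduction` is introduced here only for illustration.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


# Figures quoted in the text: Oriented RepPoints vs. the proposed method.
params_cut = relative_reduction(36.60, 24.36)   # parameters, MB
flops_cut = relative_reduction(48.56, 45.85)    # computational load, GFLOPs
print(f"{params_cut:.1f}% fewer parameters, {flops_cut:.1f}% fewer GFLOPs")
# prints "33.4% fewer parameters, 5.6% fewer GFLOPs"
```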
Additionally, compared to the baseline method, our method demonstrated improvements in specific scenarios: in inshore scenarios, recall improved by 9.5%, precision by 34.2%, and AP50 by 14.1%; in mixed scenarios, recall improved by 3.8%, precision by 29.4%, and AP50 by 2.1%. These results indicate that our method achieves higher detection coverage and detection accuracy with fewer parameters and reduced computational complexity, and that it significantly enhances ship detection performance in inshore scenarios.
Furthermore, we visualized the detection results of several methods to compare the algorithms qualitatively. As shown in Figure 8, the selected test images cover various scenarios, including large, medium, and small ship targets; inshore and offshore scenes; and dense and sparse distributions. In the figure, green boxes represent ground-truth boxes, red boxes represent detected targets, yellow boxes represent missed detections, and blue boxes represent false positives. The results indicate that the proposed method effectively detects ship targets under various conditions. In the first two rows, which show complex inshore scenarios and densely distributed scenes, our method demonstrates strong detection performance, whereas other algorithms suffer from significant missed detections. In the third to fifth rows, our method also shows strong detection coverage for small targets, whereas other algorithms exhibit severe missed detections. This demonstrates the robustness and effectiveness of our method for detecting ship targets under a wide range of complex conditions. Overall, compared to the other methods, our approach shows superior detection capabilities.
To further validate the generalization and robustness of the model, we conducted complementary experiments on the RSDD-SAR and HRSID datasets; the results are presented in Table 9 and Table 10. On the RSDD-SAR dataset, our method improved the AP in the inshore, offshore, and hybrid scenarios by 5.2%, 0.8%, and 1.4%, respectively, compared to the baseline method. Compared to the other state-of-the-art methods, our method achieved 93.9% and 81.8% in recall and precision, respectively. On the HRSID dataset, our method improved the AP in the inshore, offshore, and mixed scenes by 5.6%, 1.5%, and 1.9%, respectively, compared to the baseline method. Our method improves precision while maintaining recall, and it also addresses the poor detection performance of other methods in inshore scenes. These experimental results further support the generalization ability of our method, showing that it maintains stable performance across different datasets and real-world scenarios.
Figure 9 presents the detection results of various methods on different maritime regions in the RSDD-SAR dataset. The results indicate that our method can effectively detect all types of ships, including those in inshore regions with complex backgrounds and offshore ships with smaller target sizes. With the size-aware balanced loss function, our method learns large and medium-sized ships more effectively than other methods, ensuring their accurate detection while significantly reducing false detections. These findings underscore the reliability and efficacy of our approach in identifying ships within intricate and varied settings and in overcoming the challenges posed by ship target imbalance.