In this section, we evaluate the detection performance of the proposed method through experiments. We first introduce the datasets and experimental settings, and then describe the evaluation criteria. Subsequently, we conduct comparative experiments on the HRSID and SSDD datasets, contrasting our method with mainstream weakly supervised and fully supervised rotation detection methods to validate its effectiveness.
4.1. Datasets and Settings
Our method aims to train an OBB detector using only HBB labels while improving detection performance and inference efficiency. To validate its effectiveness, the experimental datasets must contain both OBB and HBB annotations; accordingly, we select the HRSID and SSDD datasets. Both are high-resolution SAR image datasets designed for ship detection, semantic segmentation, and instance segmentation tasks. HRSID comprises 5604 images of size 800 × 800 with 16,951 annotated ship instances, covering various resolutions, polarization modes, sea conditions, and maritime scenes. SSDD contains 1160 images with side lengths ranging from 500 to 600 pixels and 2456 annotated ship targets, encompassing diverse sea conditions, ship types, and sizes. In our experiments, HRSID is split into a 65% training set and a 35% testing set, while SSDD follows a 7:3 split ratio. Throughout the experiments, the input image size is uniformly set to 800 × 800. The partitioning of the training and testing sets strictly follows the protocols established in the mainstream literature, so the reported detection metrics are directly and quantitatively comparable.
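For reproducibility, such a partition can be made deterministic with a fixed seed. A minimal sketch using the image counts stated above (illustrative only; in practice the published split files from the literature define the partition):

```python
import random

def split_dataset(image_ids, train_ratio, seed=0):
    """Deterministically split a list of image ids into train/test subsets."""
    ids = sorted(image_ids)      # sort first so the split is reproducible
    rng = random.Random(seed)    # fixed seed -> same split on every run
    rng.shuffle(ids)
    n_train = round(len(ids) * train_ratio)
    return ids[:n_train], ids[n_train:]

# HRSID: 65%/35% split of 5604 images; SSDD: 7:3 split of 1160 images
hrsid_train, hrsid_test = split_dataset(range(5604), 0.65)
ssdd_train, ssdd_test = split_dataset(range(1160), 0.70)
```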
All experiments were conducted on two NVIDIA GeForce RTX 4090 GPUs using the MMRotate and Ultralytics detection frameworks. For the MMRotate implementation, we employed the AdamW optimizer and trained for 72 epochs with a batch size of 4; the three terms of the first-stage loss and one term of the second-stage loss were assigned fixed weights. Under the Ultralytics framework, training was performed using the SGD optimizer with momentum for 200 epochs with a batch size of 32.
4.2. Evaluation Metrics
The detection performance of OBB detection models is commonly assessed using precision, recall, and mAP50. The precision and recall are calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

where TP, FP, and FN represent true positives, false positives, and false negatives, respectively. Precision indicates the proportion of accurate detections among all predictions, while recall measures the proportion of detected targets relative to the total number of targets. Both metrics are influenced by the confidence and IoU thresholds. The mAP50, which fixes the IoU threshold at 0.5, is calculated by integrating the precision–recall curve:

$$\mathrm{mAP50} = \int_{0}^{1} p(r)\,\mathrm{d}r,$$

where $r$ is the recall and $p(r)$ is the precision at recall level $r$. The range of mAP50 is 0 to 1, with higher values indicating better detection performance. To eliminate the impact of the confidence threshold, we use mAP50 to evaluate detection performance.
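In practice the integral is evaluated as a discrete sum over the ranked detections. A minimal single-class sketch, assuming detections have already been matched to ground truth at IoU ≥ 0.5 (MMRotate's evaluator additionally handles multiple classes and the IoU matching itself):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP at a fixed IoU threshold by integrating the precision-recall curve.

    scores : confidence of each detection
    is_tp  : True if the detection matched a ground-truth box
    num_gt : total number of ground-truth targets (TP + FN)
    """
    order = np.argsort(scores)[::-1]              # rank detections by confidence
    matched = np.asarray(is_tp)[order]
    tp = np.cumsum(matched)                       # running true positives
    fp = np.cumsum(~matched)                      # running false positives
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # make precision monotonically non-increasing, then integrate over recall
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

# three detections, two correct, three ground-truth ships
ap = average_precision([0.9, 0.8, 0.3], [True, False, True], num_gt=3)
```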
In practical deployment, the parameter scale and computational consumption are crucial for cost estimation. We therefore report the number of model parameters and the floating-point operations (in GFLOPs) required to process a single image, quantitatively measuring the model’s inference cost.
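To make the two cost measures concrete, here is a rough hand count for a single convolution layer (a simplification: framework profilers sum such counts over every layer of the network):

```python
def conv2d_cost(c_in, c_out, k, h, w, bias=True):
    """Parameter count and FLOPs of one k x k convolution applied to a
    c_in x h x w feature map (stride 1, 'same' padding)."""
    params = (k * k * c_in + (1 if bias else 0)) * c_out
    flops = 2 * k * k * c_in * h * w * c_out   # factor 2: multiply + accumulate
    return params, flops

# first 3x3 conv of a detector on an 800 x 800 RGB image, 64 output channels
params, flops = conv2d_cost(3, 64, 3, 800, 800)
print(params, flops / 1e9)   # ~1.8k parameters, ~2.2 GFLOPs
```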
4.3. Experimental Results
To validate the effectiveness of our method, we compared its performance with mainstream weakly supervised and fully supervised methods on the HRSID and SSDD datasets. The weakly supervised methods include H2Rbox, H2Rbox-v2, and the Unit Circle Resolver (UCR) [72]. Among them, UCR, the state-of-the-art weakly supervised detector with the best reported performance, is adopted as the weakly supervised baseline in our comparative experiments. The fully supervised single-stage detection methods based on convolutional structures are YOLOv8, RetinaNet, S2ANet, FCOS, and RTMDet; YOLOv12, a fully supervised single-stage method based on attention mechanisms, is also included. The fully supervised two-stage detection methods are Faster-RCNN, Oriented-RCNN, R3Det, ReDet, and ROI-Transformer. The training and testing results of YOLOv8 and YOLOv12 are computed under the Ultralytics framework, whereas those of all other models are obtained within the MMRotate framework. To ensure that numerical discrepancies faithfully reflect inter-model performance differences, all detection performance metrics are uniformly calculated based on the MMRotate framework.
Table 1 shows the detection performance of different methods in the offshore and inshore scenes and on the entire HRSID test set. The last two columns give the number of model parameters and the computational cost required to infer an 800 × 800 image. Our method achieves the best detection performance among all compared weakly supervised methods with the smallest parameter and computation budget. Compared with the second-best weakly supervised method, it improves mAP50 by 13.178% on the entire HRSID test set, 18.357% in inshore scenes, and 7.416% in offshore scenes, while reducing computational cost by about 30% and parameter scale by about 90%. The gain in inshore scenes is attributed to integrating Swin-TransformerV2 in parallel with the CNN, which supplies both global semantics and local details and thus strengthens the model’s interference resistance. The improvement in offshore scenes is more modest, likely because some targets lose their high-quality pseudo-labels during filtering, reducing the model’s attention to those targets and increasing the miss rate. The compared weakly supervised methods (H2Rbox, H2Rbox-v2, UCR) all use FCOS with a ResNet50 backbone, in which the backbone alone accounts for 73.188% of the parameters and 41.692% of the computation. Our backbone instead uses a small number of c2f and Swin-TransformerV2 blocks, achieving a lightweight design while maintaining performance. Our approach therefore enhances both detection performance and efficiency in SAR ship detection tasks while reducing model size.
Additionally, Table 1 shows that our method outperforms most fully supervised methods on the entire test set and in inshore scenes, whereas in offshore scenes it surpasses only RetinaNet and Faster-RCNN. In parameter scale, our method is comparable to the smallest fully supervised method. As an improvement over H2Rbox-v2, our method takes FCOS as its fully supervised counterpart; compared to FCOS, it improves mAP50 by 1.602% on the entire test set and by 5.506% in inshore scenes, but decreases by 1.803% in offshore scenes. The gains stem from the improved backbone and neck structures and from the high-quality pseudo-labels used during training; conversely, the lack of high-quality pseudo-labels for some targets leads to insufficient focus on them during training, causing the drop in offshore scenes. Compared to the single-stage detectors YOLOv12 and YOLOv8, our method has a higher inference cost, with the head and neck accounting for 71.049% and 23.857% of the computation, respectively. This is due to the shared detection head of FCOS, which requires high-dimensional features for multiscale target detection; YOLOv12 and YOLOv8 use multiple detection heads, reducing computational complexity and better exploiting fully supervised label information. Therefore, compared with YOLOv12 and YOLOv8 under full supervision, our method still shows gaps in both detection performance and inference efficiency.
Table 2 shows the detection performance of different methods in the various scenes of the SSDD dataset, with column meanings consistent with Table 1. Our method achieves the best detection performance among all compared weakly supervised methods, improving mAP50 by 3.059% on the entire SSDD test set, 6.006% in the inshore scene, and 1.288% in the offshore scene over the second-best weakly supervised method. This indicates that our method can enhance the detection performance of weakly supervised models in SAR ship detection tasks even at small data scales. The model configurations used in this table are the same as those in the HRSID experiments, and the optimization of parameter scale and inference cost likewise remains unchanged, so the details are not repeated here.
As shown in Table 2, our method surpasses only R-RetinaNet across all scenarios (entire, inshore, and offshore), a result that contrasts markedly with the findings in Table 1. This discrepancy is primarily attributed to the limited scale of the SSDD dataset. Our approach integrates self-supervised learning via random rotation and flipping, weakly supervised learning based on HBBs, and pseudo-label-guided learning for angle and scale estimation. The constrained dataset size restricts the acquisition of sufficient multi-angle samples and high-quality pseudo-labels, thereby impairing the model’s capacity to learn discriminative orientation and scale features. Furthermore, the inherent orientation sensitivity of SAR targets diminishes the efficacy of self-supervised augmentation. The limited training samples also increase overfitting risks, while the small test set may fail to reveal such overfitting, contributing to the apparently superior performance of fully supervised models in evaluation metrics.
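The rotation-based self-supervision mentioned above relies on knowing how an oriented box transforms under the augmentation: the consistency loss compares predictions on the rotated view against this transform of the original labels. A minimal sketch, assuming angles in radians measured counter-clockwise and rotation about the image center (the paper’s exact parameterization may differ):

```python
import math

def rotate_obb(cx, cy, angle, theta, img_w, img_h):
    """Transform an oriented box (center + angle) when the whole image is
    rotated by theta about its center."""
    ox, oy = img_w / 2.0, img_h / 2.0
    dx, dy = cx - ox, cy - oy
    new_cx = ox + dx * math.cos(theta) - dy * math.sin(theta)
    new_cy = oy + dx * math.sin(theta) + dy * math.cos(theta)
    # wrap the box angle back into [-pi/2, pi/2)
    new_angle = (angle + theta + math.pi / 2) % math.pi - math.pi / 2
    return new_cx, new_cy, new_angle

# rotating the view by 90 degrees moves the box center and shifts its angle
print(rotate_obb(100.0, 400.0, 0.0, math.pi / 2, 800, 800))
```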
To verify that the limited scale of the SSDD dataset is the main reason for the lower detection performance under weakly supervised conditions, we mix different proportions of the HRSID training set into the SSDD training set for model training, and evaluate the detection performance on the SSDD test set. The experiment uses SwinV2-CNN as the backbone, USP as the neck structure, and TTM as the training method. Detailed results are shown in Table 3, where an HRSID ratio of 0 indicates using only the SSDD training set, and bold values represent the best results in each column. As shown in Table 3, when the HRSID mixing ratio reaches 30% and 70%, the model’s detection performance improves, indicating that expanding the scale of SSDD helps enhance detection effectiveness and further confirming that dataset size is a critical factor limiting model performance. In addition, comparisons between Table 2 and Table 3 reveal that although incorporating HRSID training data improves performance to a level comparable with most fully supervised methods, a gap remains with the optimal fully supervised approach. This gap is mainly due to distribution differences between the two datasets: SSDD contains multi-polarization and multi-resolution data, while HRSID is single-polarization and single-resolution. Thus, partially integrating HRSID data is insufficient to fully augment the SSDD dataset, thereby limiting further improvements in model performance.
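The mixing experiment can be sketched as follows. The training-set sizes 812 and 3643 follow from the 7:3 and 65%/35% partitions stated earlier; the sampling seed is an assumption for illustration:

```python
import random

def mix_training_sets(ssdd_ids, hrsid_ids, hrsid_ratio, seed=0):
    """Augment the SSDD training set with a fraction of the HRSID training set.

    hrsid_ratio is the fraction of HRSID training images mixed in
    (0 -> SSDD only, as in the first row of Table 3)."""
    rng = random.Random(seed)
    k = round(len(hrsid_ids) * hrsid_ratio)
    extra = rng.sample(list(hrsid_ids), k)   # sample without replacement
    return list(ssdd_ids) + extra

# mix 30% of the HRSID training set into the SSDD training set
train = mix_training_sets(range(812), range(3643), hrsid_ratio=0.3)
```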
As shown in Table 1 and Table 2, on HRSID, the mAP50 gap between our method and the best fully supervised approach is 3.659% for inshore and 1.947% for offshore scenes; on SSDD, the corresponding gaps are 19.266% and 6.027%. The detection performance of our method in offshore scenarios is thus close to that of fully supervised methods. This strong performance in offshore scenarios is attributed to the two-stage training strategy we adopted, which reduces the dependency on rotated bounding box annotations through angle self-supervised learning and HBB weak supervision, while enhancing detection performance via pseudo-label-guided training. Furthermore, since generating pseudo-labels is inherently easier in offshore scenarios than in inshore ones, the guidance provided by pseudo-labels in our method is more effective in offshore settings, thereby resulting in detection performance that is closer to fully supervised methods in such scenarios.
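The pseudo-label guidance discussed above hinges on a quality filter between the two training stages. A minimal sketch using a confidence threshold (the threshold value and the prediction layout are illustrative assumptions, not the paper’s exact filtering criterion):

```python
def filter_pseudo_labels(predictions, score_thr=0.7):
    """Keep only confident first-stage predictions as pseudo-labels for the
    second stage; low-confidence boxes are discarded, which is why some
    targets lose supervision and the model's attention to them drops."""
    return [p for p in predictions if p["score"] >= score_thr]

# each prediction: oriented box (cx, cy, w, h, angle) plus a confidence score
preds = [
    {"box": (120, 80, 40, 12, 0.3), "score": 0.92},    # kept as pseudo-label
    {"box": (300, 410, 35, 10, -0.8), "score": 0.55},  # filtered out
]
pseudo = filter_pseudo_labels(preds)
```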
To more intuitively illustrate the performance of our method, Figure 11 and Figure 12 present the detection results of four different methods on the various scenarios of HRSID and SSDD. FCOS and YOLOv8, as strong fully supervised detectors, provide the structural basis for our approach, with our backbone design inspired by YOLOv8. UCR, a weakly supervised approach, shares the first-stage training loss with our method and thus serves as our direct baseline. As shown in Figure 11, our method achieves a lower miss rate in inshore and offshore scenes with small, dense ships, matching the precision and false alarm rates of fully supervised detectors while outperforming UCR. In Figure 12, rows 2 and 4 indicate that our method performs only slightly worse than fully supervised YOLOv8 in detecting dense large targets inshore and very small dense targets offshore. Rows 1 and 3 further show that our precision for inshore targets of varying sizes is comparable to YOLOv8 and superior to the other methods. These results confirm that our approach attains detection performance competitive with fully supervised methods.