4.3. Comparison with Other CNN-Based Methods
In the full training set, the ratio of labeled to unlabeled images was 1:10. To validate the performance of the proposed approach, we adopted several well-known detectors for comparison, covering both supervised and semi-supervised detection methods. The supervised detectors included Faster R-CNN [14], YOLOv8 [49], and Swin-Transformer [20], which were trained with labeled data only. For YOLOv8, we used the largest pretrained weights ("x") as the model setting. The semi-supervised detectors included Soft Teacher [32] and TSET [34], which were trained with both labeled and unlabeled data.
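As a simple illustration, such a 1:10 labeled-to-unlabeled split can be constructed as follows; the random seed and the use of SSDD's 1160 images are illustrative choices rather than the exact configuration of our experiments.

    import random

    def split_semi_supervised(image_ids, labeled_fraction=1 / 11, seed=0):
        # Shuffle once, then take roughly 1/11 of the images as the
        # labeled subset so that labeled:unlabeled is about 1:10.
        rng = random.Random(seed)
        ids = list(image_ids)
        rng.shuffle(ids)
        n_labeled = max(1, round(len(ids) * labeled_fraction))
        return ids[:n_labeled], ids[n_labeled:]

    # Example: partition the 1160 images of SSDD.
    labeled_ids, unlabeled_ids = split_semi_supervised(range(1160))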
The quantitative analysis results for each algorithm on SSDD were presented in Table 3, and several visual detection results were presented in Figure 3.
Faster R-CNN stood out as one of the most classical two-stage algorithms, recognized for its high accuracy in supervised detection tasks. As demonstrated in Table 3, it showcased commendable performance across various COCO metrics, with a notable AP50 score of 0.839. As shown in the first row of Figure 3, while it successfully identified several targets in the nearshore area, its performance suffered from missed detections of small targets located farther away. Moreover, it tended to detect partially adjacent targets as a single ship target, including instances where ships were adjacent to other ships or to clutter.
The quantitative results highlighted that the single-stage supervised detector YOLOv8 achieved the lowest performance on most indices among these methods. With an AP50 score of only 0.766 and an APl score of merely 0.013, YOLOv8 demonstrated inferior performance compared to the other detectors. Furthermore, the detection visualizations in Figure 3 revealed that YOLOv8's detection performance was not as good as that of Faster R-CNN. YOLOv8 exhibited significant performance degradation in near-shore scenarios, missing a number of vessel targets. This deficiency may be attributed to its lightweight architecture and rapid detection process.
Among the three supervised detectors, Swin-Transformer performed commendably, with an AP50 value of 0.878. Swin-Transformer was able to capture image detail and model global contextual information. Despite these advantages, it still suffered from a high missed-detection rate for small far-shore targets and an increased false alarm rate in nearshore scenarios.
Soft Teacher and TSET were semi-supervised detectors, both of which utilized Faster R-CNN as the baseline. The former leveraged a special loss function to learn from negative samples, addressing the issue of low recall rates, while the latter optimized pseudo-labels by employing multiple teacher models.
Soft Teacher achieved an AP50 accuracy of 0.868. In particular, prioritizing negative samples yielded favorable results in the detection of small far-shore targets. Nevertheless, due to its lack of emphasis on pseudo-label refinement, it was prone to missed detections when filtering targets based on confidence. Soft Teacher also showed degraded detection performance in complex near-shore scenarios (e.g., multiple ships docked at the same port, as illustrated in the second scene of Figure 3b). Given that the SSDD dataset primarily consists of small- to medium-sized targets, Soft Teacher nonetheless obtained superior detection performance on larger targets, with an APl of 0.392.
Despite leveraging semi-supervised target detection, TSET's performance remained subpar. As evident from the COCO metrics presented in Table 3, its AP50 score was a mere 0.769, falling behind the Swin-Transformer on several metrics. Moreover, as depicted in Figure 3d, TSET struggled with multiple far-shore targets, often ignoring or completely missing small targets. While there was an improvement in the accuracy of near-shore targets, TSET still exhibited more missed targets compared to the Swin-Transformer.
In contrast, our method outperformed all others in five COCO metrics, namely AP, AP50, AP75, APs, and APm, with respective values of 0.498, 0.905, 0.461, 0.504, and 0.483. Typically, the most attention is paid to AP50; in this regard, our method demonstrated a notable improvement of approximately 4% over Soft Teacher.
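For reference, COCO metrics of this kind (AP, AP50, AP75, APs, APm, APl) can be computed with the standard pycocotools evaluator, as sketched below; the file names are placeholders, not our actual paths.

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO("ssdd_test_annotations.json")   # placeholder ground-truth file
    coco_dt = coco_gt.loadRes("detections.json")   # placeholder detection results
    evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()  # prints AP, AP50, AP75, APs, APm, APl
    ap, ap50 = evaluator.stats[0], evaluator.stats[1]  # stats[1] is AP at IoU=0.50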
From Figure 3, it can also be seen that our approach excelled at achieving a high recall rate for far-shore targets. In multi-target far-shore scenes, our model succeeded in detecting the majority of ship targets, significantly enhancing the recall rate. Whereas all other methods failed to distinguish adjacent docked ships accurately, our model effectively discerned ship targets in complex near-shore backgrounds. Specifically, in Figure 3b, our model successfully distinguished targets docked at the same port. While our model may produce a small number of false positives, the overall advantage in terms of reduced missed detections was substantial. In summary, our method outperformed the other five detectors across the performance metrics.
The PR curves for each algorithm were depicted in Figure 4, with the AP50 value of each algorithm displayed alongside the plot. It was evident that our method achieved the maximum area under the curve (AUC), 0.90. This verified that our method exhibited the best performance among all six algorithms.
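For completeness, a PR curve and its AUC can be derived from ranked detections as sketched below; the matching criterion (IoU >= 0.5, following the AP50 convention) and the function interface are illustrative assumptions.

    import numpy as np

    def pr_curve(scores, is_tp, num_gt):
        # scores: detection confidences; is_tp: 1 if a detection matches a
        # ground-truth box at IoU >= 0.5, else 0; num_gt: total ground-truth boxes.
        order = np.argsort(-np.asarray(scores))
        tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
        fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
        precision = tp / (tp + fp)
        recall = tp / num_gt
        return precision, recall

    def pr_auc(precision, recall):
        # Trapezoidal area under the PR curve (the AUC quoted above).
        return float(np.trapz(precision, recall))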
Additionally, we conducted experiments on the more complex AIR-SARShip-1.0 dataset. Table 4 gave the quantitative analysis results on this dataset for the six algorithms, and the detection results on three representative scenes were illustrated in Figure 5. As in Figure 3, the green boxes denoted all targets detected by each algorithm, the red ellipses marked the false alarms produced by the algorithms, and the orange ellipses marked the instances that the algorithms failed to detect.
On this dataset, supervised methods exhibited a noticeable decrease in performance compared to semi-supervised methods. This was mainly attributed to the complex environment and low image quality.
In terms of AP, all supervised methods fell below 0.3, while semi-supervised methods reached at least 0.34; MTDSEFN obtained the highest AP value at 0.351. Regarding the crucial AP50 metric, our method exhibited the best performance at 0.793. Notably, semi-supervised methods demonstrated a remarkable improvement of 0.1 over supervised methods, and the proposed method achieved a nearly 2% improvement over the second-best method on this dataset. Due to the low resolution of images in the AIR-SARShip-1.0 dataset, which mainly comprises medium to large targets with very few small targets, all algorithms exhibited low APs values. In a nutshell, the proposed method achieved optimal performance on the AP, AP50, APs, APm, and APl metrics, with values of 0.351, 0.793, 0.097, 0.363, and 0.524, respectively.
As can be observed from Figure 5, Faster R-CNN produced considerable false alarms for near-shore targets, and even under far-shore conditions, significant missed detections occurred. YOLOv8 had more false alarms than Faster R-CNN, and its COCO metrics were correspondingly poorer. As for Swin-Transformer, it demonstrated outstanding detection performance, particularly in detecting far-shore targets, as can be observed from the results of the scene in Figure 5b.
Semi-supervised models exhibited superior performance in detecting far-shore targets. As can be seen from Figure 5, most far-shore targets were successfully detected by the three semi-supervised models. However, detecting near-shore ships remained a major challenge. Soft Teacher and TSET not only struggled to detect small near-shore targets, but also failed to correctly distinguish adjacent ship targets in the second scene of Figure 5b. Additionally, in scene (c) of Figure 5, both of them failed to detect the two small near-shore targets in the upper right corner. In contrast, our method clearly distinguished the adjacent ships in the second scene and successfully detected the two small near-shore targets in the third scene. Moreover, the proposed method did not exhibit a significant increase in false detections of docked ships. Briefly, the effectiveness of our method was demonstrated on both datasets.
Figure 6 displayed the PR curves on the AIR-SARShip-1.0 dataset for the six algorithms. In this plot, the superiority of the semi-supervised algorithms was more pronounced. Our approach performed better overall than all the others, maintaining higher precision under both low and high recall conditions; its area under the curve (AUC) reached 0.79, indicating its effectiveness and superiority.
4.4. Ablation Study
We conducted ablation experiments on the AIR-SARShip-1.0 dataset with a labeled-to-unlabeled data ratio of 1:10, so as to analyze the effect of the different modules of our method. The experimental parameters remained consistent with the comparative experiments, and the results were summarized in Table 5. The experimental results demonstrated that the joint utilization of the TG and AT modules led to a remarkable increase in detection performance. Within the TG, two parts were employed: multi-teacher and D-S evidence fusion. It was worth noting that D-S evidence fusion required multiple sources of data as input; thus, it was not applicable when multi-teacher was absent.
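To make this fusion step concrete, the following minimal sketch applies Dempster's combination rule to the evidence produced by individual teachers for a single candidate box. The three-element frame (ship, background, and an uncertain mass assigned to the whole frame) and the example mass values are simplified for illustration and do not reproduce the exact mass construction of our framework.

    def ds_combine(m1, m2):
        # Dempster's rule over the frame {ship, background}, with "unc"
        # denoting the mass assigned to the whole frame (uncertainty).
        conflict = m1["ship"] * m2["bg"] + m1["bg"] * m2["ship"]
        k = 1.0 - conflict  # normalization constant
        fused = {
            "ship": (m1["ship"] * m2["ship"] + m1["ship"] * m2["unc"]
                     + m1["unc"] * m2["ship"]) / k,
            "bg": (m1["bg"] * m2["bg"] + m1["bg"] * m2["unc"]
                   + m1["unc"] * m2["bg"]) / k,
        }
        fused["unc"] = 1.0 - fused["ship"] - fused["bg"]
        return fused

    # Evidence from several teachers is fused pairwise:
    masses = [{"ship": 0.7, "bg": 0.2, "unc": 0.1},
              {"ship": 0.6, "bg": 0.1, "unc": 0.3}]
    fused = masses[0]
    for m in masses[1:]:
        fused = ds_combine(fused, m)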
From Table 5, it was evident that our method achieved optimal performance in four of the six COCO indicators. Specifically, the AP, AP50, APm, and APl metrics reached the highest levels, with values of 0.351, 0.793, 0.363, and 0.524, respectively. Notably, AP50 exhibited a nearly 2% increase. However, on the AP75 and APs indicators, the TG alone exhibited superior performance. AP75 and APs, which concern detections with an IoU greater than 0.75 and smaller-sized targets, respectively, demand more precise bounding box predictions. The potential inconsistency between the pseudo-bboxes generated by the AT and those generated by the TG may introduce bias into the bboxes learned by the student model. Consequently, when only the TG was present without the AT, the model attained more accurate bbox predictions, demonstrating higher AP75 and APs performance than the full proposed method. The experimental setup in the fourth row did not employ D-S evidence fusion, despite utilizing both the TG and the AT; as a result, the reliability of the pseudo-labels could not be guaranteed, leading to suboptimal AP50 performance. This underscored the crucial role of the D-S fusion mechanism proposed in this paper, which significantly enhanced the quality of the pseudo-labels and the overall model performance.
In a nutshell, the experimental results indicated that the combination of the two proposed branches in our method effectively boosted the performance of the semi-supervised detector.
4.5. Hyperparameter Experiments
This section explores the impact of each hyperparameter of the model on its detection performance.
Firstly, we investigated the influence of the number of teachers in the TG. The experimental results, depicted in Figure 7, revealed that increasing the number of teachers enhanced the model performance. However, as the number of teachers grew, the computational load during model training increased remarkably. Notably, when the number of teachers increased from 4 to 5, the accuracy improvement was tiny. To mitigate the computational burden, we therefore selected four teachers for the TG in our framework.
Next, we analyzed the impact of the two weighting parameters in the loss function on model performance. Table 6 illustrated the effect of the first parameter, with AP50 serving as the performance metric; here, the other parameter was set to 1. The optimal performance was observed when the value was 0.05. An inadequately small value such as 0.01 impeded the model from assimilating the latest knowledge. Conversely, an excessively large value restricted the AT's ability to guide the student's learning, because the negative effect caused by incorrect pseudo-labels would be amplified, resulting in declined performance of the whole model.
Table 7 displayed the impact of the second parameter on model performance, with the first fixed at 0.05. The best performance occurred when its value was 1. The TG was designed to obtain high-quality pseudo-labels; when the associated parameter was set too low, the benefit of these high-quality labels was diminished. Conversely, if the parameter was too high, the model became overly reliant on the pseudo-labels generated by the TG, consequently neglecting the guiding information provided by the AT. This imbalance could lead to performance degradation.
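As a simplified illustration (not our exact formulation), the two parameters can be read as weights on the unsupervised loss terms guided by the AT and the TG, respectively:

    def total_loss(l_sup, l_at, l_tg, w_at=0.05, w_tg=1.0):
        # l_sup: supervised loss on labeled data; l_at / l_tg: losses against
        # the pseudo-labels from the AT and the TG; w_at and w_tg correspond
        # to the values tuned in Tables 6 and 7 (0.05 and 1 performed best).
        return l_sup + w_at * l_at + w_tg * l_tg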
Furthermore, we examined the significance of the threshold hyperparameter used to retain or discard samples after D-S evidence fusion, as shown in Figure 8. An excessively large threshold yielded a low recall rate, while an overly small threshold compromised pseudo-label quality. From the figure, it can be seen that the model performed best when the threshold was set to 0.6.
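A minimal sketch of this selection step is given below, assuming each candidate box carries the fused mass from the D-S combination step; the data layout is illustrative.

    def filter_pseudo_labels(candidates, tau=0.6):
        # candidates: list of (box, fused_mass) pairs; keep a box as a
        # pseudo-label only if its fused ship mass exceeds the threshold
        # (0.6 performed best in Figure 8).
        return [box for box, mass in candidates if mass["ship"] >= tau]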