Figure 1.
Typical annotated images under different environmental conditions. Sample images reveal the intrinsic complexity of our dataset: specular highlights eroding target contours, low signal-to-noise ratio in dim conditions and partial occlusion from waves, setting a rigorous benchmark for model generalization.
Figure 2.
The network structure of WA-YOLO.
Figure 3.
Threshold sensitivity analysis. The blue lines mark the default parameter position (confidence threshold = 0.25, NMS IoU threshold = 0.6, FPS (Embedded)) and delineate the region of stable performance. The intersection of the solid blue lines indicates the optimal default threshold combination identified through preliminary experiments. The performance surface remains stable under a ±0.1 perturbation around the default thresholds, with mAP fluctuation < 2%, which supports parametric robustness for embedded deployment.
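A minimal sketch of how the two default thresholds interact at inference time, assuming greedy NMS over axis-aligned boxes in (x1, y1, x2, y2) format. The function names are illustrative and are not taken from the paper's code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf_thres=0.25, iou_thres=0.6):
    """Apply the confidence threshold, then greedy NMS.

    dets is a list of (box, score) pairs; the result is sorted by score.
    """
    # Step 1: drop detections below the confidence threshold.
    candidates = sorted(
        (d for d in dets if d[1] >= conf_thres),
        key=lambda d: d[1],
        reverse=True,
    )
    # Step 2: keep a box only if it does not overlap a higher-scoring
    # kept box by more than the NMS IoU threshold.
    kept = []
    for box, score in candidates:
        if all(iou(box, k[0]) <= iou_thres for k in kept):
            kept.append((box, score))
    return kept
```

Sweeping `conf_thres` and `iou_thres` by ±0.1 around the defaults and re-evaluating mAP is one way to reproduce a sensitivity surface of this kind.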
Figure 4.
Baseline performance across different datasets. The comparison shows YOLOv8’s advantage in detection accuracy against YOLOv5’s competitiveness in embedded efficiency; YOLOv8’s precision gain on the difficult subset comes at the cost of a marked speed penalty. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
Figure 5.
Architectures of YOLOv5 (left) and YOLOv8 (right).
Figure 6.
Attention module performance across different datasets. ECA consistently enhances small-object detection (APS), while CBAM shows more pronounced improvements in overall accuracy (mAP) under difficult conditions. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
Figure 7.
Attention module performance matrix across models and datasets. The matrix illustrates the ECA module’s superior adaptability within the YOLOv8 architecture, whereas CBAM’s performance fluctuates across datasets, indicating instability. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
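For context on why ECA adds so little overhead, the sketch below computes its 1D convolution kernel size from the channel count. It follows the adaptive rule published with ECA-Net, using the commonly cited defaults gamma = 2 and b = 1. Whether WA-YOLO uses this rule or a fixed kernel size is not stated here, so treat the defaults as assumptions.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive ECA kernel size: nearest odd value of (log2(C) + b) / gamma.

    gamma and b are the defaults from the ECA-Net paper (assumed here).
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    # The kernel must be odd so the 1D convolution stays centered.
    return t if t % 2 else t + 1
```

With these defaults, a 256-channel feature map gets a kernel of size 5, so the module adds only a handful of parameters per insertion point.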
Figure 8.
Training convergence analysis. The curves indicate that SIoU loss accelerates model convergence and maintains better stability in later stages, while Focal-EIoU exhibits oscillations in some configurations. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
Figure 9.
Localization accuracy analysis. Localization accuracy analysis confirms that SIoU loss provides more precise bounding box regression in wave-disturbed scenarios, with its vector angle cost enhancing direction awareness. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
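The angle cost mentioned in the caption can be sketched as follows. This follows the published SIoU formulation, Λ = 1 − 2·sin²(arcsin(c_h/σ) − π/4), where σ is the distance between box centers and c_h is their vertical offset. It is an illustration under that assumption, not the paper's training code, and the function name is hypothetical.

```python
import math


def siou_angle_cost(pred_center, gt_center):
    """SIoU angle cost between a predicted and a ground-truth box center.

    Returns 0 when the centers are aligned on an axis and 1 at 45 degrees.
    """
    dx = gt_center[0] - pred_center[0]
    dy = gt_center[1] - pred_center[1]
    sigma = math.hypot(dx, dy)  # distance between the two centers
    if sigma == 0.0:
        return 0.0  # coincident centers: no angular misalignment
    sin_alpha = min(1.0, abs(dy) / sigma)  # clamp against rounding error
    return 1.0 - 2.0 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2
```

In the full SIoU loss this term modulates the distance cost, so the regression is encouraged to first move the predicted box toward the nearest axis of the target.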
Figure 10.
APS vs. FPS strategy comparison. The APS-FPS trade-off analysis distinguishes two clusters: high-resolution input (high accuracy, low FPS) and image tiling (balanced accuracy and FPS), providing intuitive guidance for engineering selection. (conf = 0.25, iou_nms = 0.6, FPS (Embedded). COCO-style mAP@[0.5:0.95] averages over IoU thresholds from 0.5 to 0.95 (step 0.05). APS/APM/APL are defined by object area thresholds (<32² pixels, 32² ≤ area ≤ 96², >96²) in input pixels at evaluation resolution. The left panel shows the high-resolution strategy; the right panel shows image tiling).
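The metric definitions in the caption can be made concrete with a short sketch, assuming the standard COCO area thresholds of 32² and 96² pixels and the ten IoU thresholds from 0.5 to 0.95. The names are illustrative.

```python
SMALL_MAX_AREA = 32 ** 2   # upper bound (exclusive) for APS objects
MEDIUM_MAX_AREA = 96 ** 2  # upper bound (inclusive) for APM objects

# IoU thresholds used for COCO-style mAP@[0.5:0.95].
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def size_bucket(width, height):
    """Assign a box, measured in input pixels at evaluation resolution."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area <= MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def map_50_95(ap_per_threshold):
    """Average the AP values computed at each of the ten IoU thresholds."""
    if len(ap_per_threshold) != len(IOU_THRESHOLDS):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

Because the buckets are defined at evaluation resolution, an object can move from the small to the medium bucket when the input size is raised, which matters when comparing the high-resolution and tiling strategies.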
Figure 11.
Impact of data augmentation strategies. Low-light augmentation primarily improves recall of targets in dark areas, while reflection augmentation effectively reduces false positives in highlight regions. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
Figure 12.
APS vs. P50 latency trade-off analysis. The APS vs. P50 latency scatter plot reveals two main clusters: an upper-left high-accuracy low-latency region (e.g., YOLOv8 + ECA + SIoU) and a lower-right traditional model region, offering intuitive guidance for efficiency-prioritized model selection. (Jetson Xavier NX, TensorRT FP16, batch = 1).
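P50 latency is the median per-frame inference time at batch = 1. The sketch below shows one way to collect it. The timing harness, the warm-up count, and the `infer` callable are assumptions for illustration, since the measurement procedure is not described here.

```python
import statistics
import time


def time_inference_ms(infer, frames, warmup=5):
    """Return per-frame latencies in milliseconds for a callable `infer`.

    On a GPU, `infer` must block until the result is ready (for example by
    synchronizing the device), otherwise the timings are too optimistic.
    """
    for frame in frames[:warmup]:
        infer(frame)  # warm-up runs are not timed
    latencies = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def p50_latency_ms(latencies_ms):
    """Median latency; robust to occasional slow frames."""
    return statistics.median(latencies_ms)


def fps_from_latency_ms(latency_ms):
    """Frame rate implied by a per-frame latency at batch size 1."""
    return 1000.0 / latency_ms
```

The median is less sensitive than the mean to the occasional slow frame caused by thermal throttling or background load on embedded hardware.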
Figure 13.
YOLOv5 representative detection results under different conditions. The qualitative results show that the +ECA + SI combination significantly improves the detection rate of small buoys, particularly in wave-disturbed scenes, and yields more complete bounding boxes. (SI denotes SIoU, FE denotes Focal-EIoU; conf = 0.25, iou_nms = 0.6, FPS (Embedded). COCO-style mAP@[0.5:0.95] with IoU thresholds from 0.5 to 0.95 (step 0.05). APS/APM/APL are defined by object area thresholds (<32² pixels, 32² ≤ area ≤ 96², >96²) in input pixels at evaluation resolution).
Figure 14.
YOLOv8 representative detection results under different conditions. The qualitative results show that the +CBAM + SIoU configuration maintains optimal localization accuracy under strong reflection, effectively preventing false fusion of adjacent targets. (SI denotes SIoU, FE denotes Focal-EIoU; conf = 0.25, iou_nms = 0.6, FPS (Embedded). COCO-style mAP@[0.5:0.95] with IoU thresholds from 0.5 to 0.95 (step 0.05). APS/APM/APL are defined by object area thresholds (<32² pixels, 32² ≤ area ≤ 96², >96²) in input pixels at evaluation resolution).
Table 1.
Quantitative performance comparison between WA-YOLO and related methods on the maritime difficult subset. This table presents the performance of different methods on the difficult subset under a unified experimental framework. All experiments use the same input resolution (960 × 960), training configuration and embedded hardware platform (Jetson Xavier NX, TensorRT FP16). WA-YOLO employs the YOLOv8 + ECA + SIoU configuration.
| Method | mAP@0.5 | APS | FPS (Embedded) |
|---|---|---|---|
| WA-YOLO (Ours) | 0.8616 ± 0.0026 | 0.5347 ± 0.0025 | 24.86 ± 0.8 |
| CBAM + CIoU Refinement [15] | 0.8302 ± 0.0030 | 0.5374 ± 0.0027 | 18.15 ± 1.4 |
| Small-Object Context Aggregation [13] | 0.8172 ± 0.0031 | 0.5095 ± 0.0030 | 13.45 ± 1.6 |
| Image-Enhancement-Based Reflection Suppression [11] | 0.7924 ± 0.0038 | 0.4831 ± 0.0035 | 10.87 ± 1.9 |
Table 2.
Consolidated category distribution after deduplication. The dataset spans 15 key maritime categories. ‘Vessel’ and ‘Ball’ dominate, comprising over 50%, reflecting real-world obstacle distribution, while categories like ‘Animal’ are scarce, indicating a long-tail data characteristic. (Note: the original 19 categories have been consolidated into 15 to eliminate label redundancy. The ‘vessel’ category encompasses all watercraft, while ‘person’ includes all human instances. The evaluation uses these consolidated categories to prevent AP dilution. CD is category ID and CN is category name; Train, Val, and Test denote the training, validation, and test sets).
| CD | CN | Train | Val | Test | Total | Consolidation Notes |
|---|---|---|---|---|---|---|
| 0 | animal | 66 | 17 | 11 | 94 | — |
| 1 | ball | 1760 | 558 | 291 | 2609 | — |
| 2 | vessel | 8299 | 2264 | 1144 | 11,707 | boat + ship + vessel + kayak |
| 3 | bridge | 1394 | 422 | 198 | 2014 | — |
| 4 | buoy | 160 | 29 | 25 | 214 | — |
| 5 | grass | 78 | 20 | 12 | 110 | — |
| 6 | harbor | 859 | 242 | 123 | 1224 | — |
| 7 | mast | 273 | 36 | 45 | 354 | — |
| 8 | person | 774 | 131 | 66 | 971 | person + sailor |
| 9 | platform | 418 | 125 | 71 | 614 | — |
| 10 | rock | 1101 | 315 | 124 | 1540 | — |
| 11 | rubbish | 473 | 125 | 71 | 669 | — |
| 12 | tree | 127 | 50 | 42 | 219 | — |
| 13 | pier | 100 | 23 | 10 | 133 | — |
| 14 | bottle | 2647 | 1067 | 519 | 4233 | — |
| Total | 15 categories | 18,329 | 5424 | 2752 | 26,505 | Original: 19 categories |
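The long-tail characteristic can be quantified directly from the per-category totals in the table. The sketch below is illustrative only. Shares are computed against the sum of the listed per-category totals, and the helper names are hypothetical.

```python
# Per-category instance totals copied from the "Total" column above.
CATEGORY_TOTALS = {
    "animal": 94, "ball": 2609, "vessel": 11707, "bridge": 2014,
    "buoy": 214, "grass": 110, "harbor": 1224, "mast": 354,
    "person": 971, "platform": 614, "rock": 1540, "rubbish": 669,
    "tree": 219, "pier": 133, "bottle": 4233,
}


def share(*names):
    """Fraction of all instances that belong to the named categories."""
    total = sum(CATEGORY_TOTALS.values())
    return sum(CATEGORY_TOTALS[n] for n in names) / total


def imbalance_ratio():
    """Ratio between the most and least frequent categories."""
    counts = CATEGORY_TOTALS.values()
    return max(counts) / min(counts)
```

An imbalance ratio above 100 between the most and least frequent categories is the kind of skew that typically motivates class-aware sampling or per-class AP reporting.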
Table 3.
Model weights information. YOLOv8 weight files are ~19.3% larger than YOLOv5’s (5.87 MB vs. 4.92 MB) in all but one configuration. The exception, YOLOv8 + CBAM + SIoU, is the largest (17.62 MB), suggesting more complex parametric interactions.
| Model Configuration | File Size | MD5 Checksum |
|---|---|---|
| YOLOv5 + CBAM + Focal-EIoU | 4.92 MB | 8b68aff5bbcbbb024c5b5dd8b0cfbbe6 |
| YOLOv5 + CBAM + SIoU | 4.92 MB | bdea8ef83bbe1dd9e0130b2cd843dbed |
| YOLOv5 + CBAM | 4.92 MB | b56ff6c65ef7fb0652894ea42181f993 |
| YOLOv5 + ECA + Focal-EIoU | 4.92 MB | c15b9a0936f5897b34a489cfbf99d1e4 |
| YOLOv5 + ECA + SIoU | 4.92 MB | c9aeca9782af22640f08e8ea205a9b1f |
| YOLOv5 + ECA | 4.92 MB | 089826e129f6b67a85eca4a699c8d4fb |
| YOLOv5 | 4.92 MB | 9ec9f88c0b18b307d32b94d2a7439289 |
| YOLOv8 + CBAM + Focal-EIoU | 5.87 MB | 7b7482b83ca5245b68c36ec582616893 |
| YOLOv8 + CBAM + SIoU | 17.62 MB | edf486e51eafc8158e6fbdf06da6447d |
| YOLOv8 + CBAM | 5.87 MB | 1e8643b9d1a2e50df6f96feae85e6485 |
| YOLOv8 + ECA + Focal-EIoU | 5.87 MB | d17af5c94c8432fefb1d3d424f821a49 |
| YOLOv8 + ECA + SIoU | 5.87 MB | 6929f3188221f37be30804421e321f1d |
| YOLOv8 + ECA | 5.87 MB | 269ac4311c76d2632bff3fe114eb2e94 |
| YOLOv8 | 5.87 MB | cfa8d4d677574aa51e0b64e16cde2f17 |
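A downloaded weight file can be checked against the MD5 values in the table with the Python standard library. This is a minimal sketch; the file path in the usage comment is hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected_md5):
    """Return True when the file's checksum matches the published value."""
    return md5_of_file(path) == expected_md5.lower()


# Usage (hypothetical file name, checksum from the YOLOv8 + ECA + SIoU row):
# verify_weights("yolov8_eca_siou.pt", "6929f3188221f37be30804421e321f1d")
```

MD5 is adequate here for detecting corrupted or mismatched downloads, though it is not a safeguard against deliberate tampering.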
Table 4.
Baseline performance of detection models on the overall dataset (W is Workstation, E is Embedded). YOLOv8 shows a marked improvement in detection accuracy (mAP@0.5) over YOLOv5 (+7.6%), with a slight reduction in embedded inference speed (−3.2%). YOLOv11 raises mAP@0.5 further (0.7208) at a comparable embedded frame rate (23.05 FPS). RT-DETR attains the highest mAP@0.5 (0.7286); however, its embedded frame rate (19.48 FPS) drops substantially compared to YOLOv8 (−16.4%).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| YOLOv5 | 0.6617 ± 0.0032 | 0.3443 ± 0.0021 | 0.6711 ± 0.0028 | 0.6034 ± 0.0035 | 0.4164 ± 0.0029 | 83.30 ± 1.2 | 24.08 ± 0.8 |
| YOLOv8 | 0.7119 ± 0.0028 | 0.3795 ± 0.0024 | 0.7114 ± 0.0025 | 0.6598 ± 0.0031 | 0.4523 ± 0.0026 | 83.37 ± 1.1 | 23.30 ± 0.7 |
| YOLOv11 | 0.7208 ± 0.0029 | 0.3821 ± 0.0025 | 0.7149 ± 0.0026 | 0.6653 ± 0.0032 | 0.4591 ± 0.0027 | 84.50 ± 1.2 | 23.05 ± 0.8 |
| RT-DETR | 0.7286 ± 0.0027 | 0.3897 ± 0.0023 | 0.7432 ± 0.0024 | 0.6630 ± 0.0031 | 0.4487 ± 0.0026 | 78.25 ± 1.5 | 19.48 ± 1.1 |
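The percentages quoted in the caption are relative changes, not percentage-point differences. A short sketch, using the mean values from the table and an illustrative function name, reproduces them:

```python
def relative_change_pct(new, baseline):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# mAP@0.5, YOLOv8 vs. YOLOv5
map_gain = round(relative_change_pct(0.7119, 0.6617), 1)
# FPS (E), YOLOv8 vs. YOLOv5
fps_change = round(relative_change_pct(23.30, 24.08), 1)
# FPS (E), RT-DETR vs. YOLOv8
rtdetr_fps_change = round(relative_change_pct(19.48, 23.30), 1)
```

These evaluate to 7.6, -3.2, and -16.4, matching the caption. The same YOLOv8 accuracy gain expressed in percentage points would be about 5.0.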
Table 5.
Baseline performance of detection models on the surface float dataset (W is Workstation, E is Embedded). All models perform excellently and comparably (mAP@0.5 > 0.839), indicating low sensitivity to different model architectures for this category of well-defined targets. RT-DETR shows a marginal lead in precision (0.8695), yet its embedded inference speed (21.36 FPS) is significantly lower than the YOLO series models.
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| YOLOv5 | 0.8393 ± 0.0025 | 0.4198 ± 0.0021 | 0.8512 ± 0.0022 | 0.7803 ± 0.0028 | 0.6897 ± 0.0023 | 110.45 ± 1.5 | 25.62 ± 0.6 |
| YOLOv8 | 0.8419 ± 0.0022 | 0.4210 ± 0.0018 | 0.8507 ± 0.0020 | 0.8015 ± 0.0025 | 0.6973 ± 0.0020 | 133.11 ± 1.2 | 25.94 ± 0.5 |
| YOLOv11 | 0.8462 ± 0.0021 | 0.4248 ± 0.0018 | 0.8541 ± 0.0019 | 0.8072 ± 0.0024 | 0.7041 ± 0.0020 | 136.50 ± 1.1 | 26.18 ± 0.5 |
| RT-DETR | 0.8490 ± 0.0019 | 0.4306 ± 0.0016 | 0.8695 ± 0.0017 | 0.8088 ± 0.0022 | 0.6923 ± 0.0018 | 124.75 ± 1.3 | 21.36 ± 0.9 |
Table 6.
Baseline performance of detection models on the difficult subset (W is Workstation, E is Embedded). YOLOv8’s accuracy advantage expands here (+11.8% over YOLOv5), but its embedded speed decreases sharply (−28.7%, from 20.58 to 14.67 FPS). Leveraging its strong representational capacity, RT-DETR achieves the best mAP@0.5 (0.8320) and precision (0.8983); however, its embedded frame rate drops even further (11.83 FPS).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| YOLOv5 | 0.7319 ± 0.0042 | 0.3659 ± 0.0035 | 0.7743 ± 0.0038 | 0.6683 ± 0.0048 | 0.4787 ± 0.0039 | 110.88 ± 1.8 | 20.58 ± 1.2 |
| YOLOv8 | 0.8183 ± 0.0030 | 0.4092 ± 0.0025 | 0.8837 ± 0.0027 | 0.7198 ± 0.0036 | 0.5285 ± 0.0028 | 106.39 ± 1.9 | 14.67 ± 1.7 |
| YOLOv11 | 0.8235 ± 0.0028 | 0.4124 ± 0.0023 | 0.8890 ± 0.0025 | 0.7262 ± 0.0033 | 0.5341 ± 0.0026 | 108.25 ± 1.7 | 15.92 ± 1.4 |
| RT-DETR | 0.8320 ± 0.0026 | 0.4160 ± 0.0022 | 0.8983 ± 0.0024 | 0.7245 ± 0.0031 | 0.5223 ± 0.0025 | 95.50 ± 1.9 | 11.83 ± 1.6 |
Table 7.
Performance of detection models with added attention modules on the overall dataset. The ECA module boosts APS to 0.4783 in YOLOv8, while CBAM causes an mAP@0.5 drop in the same architecture (0.7119 → 0.6948), revealing compatibility issues between attention mechanisms and model design. (CB is CBAM, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| YOLOv5 | 0.6617 ± 0.0032 | 0.3443 ± 0.0021 | 0.6711 ± 0.0028 | 0.6034 ± 0.0035 | 0.4164 ± 0.0029 | 83.30 ± 1.2 | 24.08 ± 0.8 |
| YOLOv5 + CB | 0.6623 ± 0.0035 | 0.3285 ± 0.0023 | 0.6806 ± 0.0031 | 0.6251 ± 0.0038 | 0.4319 ± 0.0032 | 81.30 ± 1.3 | 23.40 ± 0.9 |
| YOLOv5 + ECA | 0.6662 ± 0.0034 | 0.3273 ± 0.0022 | 0.6697 ± 0.0030 | 0.6447 ± 0.0037 | 0.4338 ± 0.0031 | 82.28 ± 1.2 | 23.87 ± 0.8 |
| YOLOv8 | 0.7119 ± 0.0028 | 0.3795 ± 0.0024 | 0.7114 ± 0.0025 | 0.6598 ± 0.0031 | 0.4523 ± 0.0026 | 83.37 ± 1.1 | 23.30 ± 0.7 |
| YOLOv8 + CB | 0.6948 ± 0.0031 | 0.3683 ± 0.0026 | 0.7511 ± 0.0029 | 0.6126 ± 0.0034 | 0.4643 ± 0.0029 | 80.27 ± 1.4 | 22.63 ± 0.8 |
| YOLOv8 + ECA | 0.7214 ± 0.0030 | 0.3785 ± 0.0025 | 0.7717 ± 0.0027 | 0.6161 ± 0.0033 | 0.4783 ± 0.0028 | 81.58 ± 1.3 | 23.97 ± 0.7 |
Table 8.
Performance of detection models with added attention modules on the surface float dataset. For surface floats, both ECA and CBAM bring ~2% mAP@0.5 gains. ECA achieves the highest APS (0.7319) with YOLOv8, closely followed by CBAM (0.7278), indicating a clear benefit of attention for small, distinct targets. (CB is CBAM, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| YOLOv5 | 0.8393 ± 0.0025 | 0.4198 ± 0.0021 | 0.8512 ± 0.0022 | 0.7803 ± 0.0028 | 0.6897 ± 0.0023 | 110.45 ± 1.5 | 25.62 ± 0.6 |
| YOLOv5 + CB | 0.8548 ± 0.0023 | 0.4274 ± 0.0019 | 0.8450 ± 0.0021 | 0.8091 ± 0.0026 | 0.7200 ± 0.0021 | 104.69 ± 1.7 | 24.73 ± 0.7 |
| YOLOv5 + ECA | 0.8508 ± 0.0024 | 0.4254 ± 0.0020 | 0.8430 ± 0.0022 | 0.7968 ± 0.0027 | 0.7175 ± 0.0022 | 108.59 ± 1.6 | 25.42 ± 0.6 |
| YOLOv8 | 0.8419 ± 0.0022 | 0.4210 ± 0.0018 | 0.8507 ± 0.0020 | 0.8015 ± 0.0025 | 0.6973 ± 0.0020 | 133.11 ± 1.2 | 25.94 ± 0.5 |
| YOLOv8 + CB | 0.8603 ± 0.0020 | 0.4302 ± 0.0017 | 0.8799 ± 0.0019 | 0.7958 ± 0.0023 | 0.7278 ± 0.0018 | 128.45 ± 1.3 | 25.10 ± 0.6 |
| YOLOv8 + ECA | 0.8638 ± 0.0019 | 0.4319 ± 0.0016 | 0.8852 ± 0.0018 | 0.7874 ± 0.0022 | 0.7319 ± 0.0017 | 129.09 ± 1.2 | 24.44 ± 0.6 |
Table 9.
Performance of detection models with added attention modules on the difficult subset. On the difficult subset, the ECA module delivers a ~5.5 percentage-point mAP@0.5 gain for YOLOv5 (0.7319 → 0.7869), outperforming CBAM (0.7807) and demonstrating superior feature discrimination in complex scenes. (CB is CBAM, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| YOLOv5 | 0.7319 ± 0.0042 | 0.3659 ± 0.0035 | 0.7743 ± 0.0038 | 0.6683 ± 0.0048 | 0.4787 ± 0.0039 | 110.88 ± 1.8 | 20.58 ± 1.2 |
| YOLOv5 + CB | 0.7807 ± 0.0039 | 0.3904 ± 0.0032 | 0.8629 ± 0.0035 | 0.6501 ± 0.0045 | 0.5002 ± 0.0036 | 108.54 ± 1.9 | 19.37 ± 1.3 |
| YOLOv5 + ECA | 0.7869 ± 0.0038 | 0.3935 ± 0.0031 | 0.8291 ± 0.0034 | 0.6634 ± 0.0044 | 0.5139 ± 0.0035 | 112.98 ± 1.7 | 19.23 ± 1.4 |
| YOLOv8 | 0.8183 ± 0.0030 | 0.4092 ± 0.0025 | 0.8837 ± 0.0027 | 0.7198 ± 0.0036 | 0.5285 ± 0.0028 | 106.39 ± 1.9 | 14.67 ± 1.7 |
| YOLOv8 + CB | 0.8334 ± 0.0028 | 0.4167 ± 0.0023 | 0.8461 ± 0.0025 | 0.7427 ± 0.0034 | 0.5408 ± 0.0026 | 110.81 ± 1.7 | 17.88 ± 1.5 |
| YOLOv8 + ECA | 0.8224 ± 0.0029 | 0.4112 ± 0.0024 | 0.8617 ± 0.0026 | 0.7461 ± 0.0035 | 0.5346 ± 0.0027 | 113.98 ± 1.6 | 19.23 ± 1.4 |
Table 10.
Attention module placement ablation analysis. The ECA module achieves optimal APS improvement (+7.3%) when inserted at the P3 layer (small-object features), whereas CBAM suffers significant degradation at Neck-Concat, underscoring the criticality of placement. (Y8 is YOLOv8, Overall Dataset, NC is Neck-Concat, W is Workstation, E is Embedded).
| Model | Place | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|---|
| Y8 | None | 0.7119 ± 0.0028 | 0.3795 ± 0.0024 | 0.7114 ± 0.0025 | 0.6598 ± 0.0031 | 0.4523 ± 0.0026 | 83.37 ± 1.1 | 23.30 ± 0.7 |
| +ECA | P3 | 0.7289 ± 0.0027 | 0.3881 ± 0.0023 | 0.7750 ± 0.0026 | 0.6200 ± 0.0032 | 0.4853 ± 0.0025 | 81.45 ± 1.2 | 22.89 ± 0.7 |
| +ECA | P4 | 0.7256 ± 0.0029 | 0.3867 ± 0.0025 | 0.7730 ± 0.0028 | 0.6180 ± 0.0033 | 0.4798 ± 0.0027 | 81.62 ± 1.2 | 22.95 ± 0.7 |
| +ECA | NC | 0.7214 ± 0.0030 | 0.3785 ± 0.0025 | 0.7717 ± 0.0027 | 0.6161 ± 0.0033 | 0.4783 ± 0.0028 | 81.58 ± 1.3 | 23.97 ± 0.7 |
| +CBAM | P3 | 0.7198 ± 0.0031 | 0.3834 ± 0.0026 | 0.7600 ± 0.0029 | 0.6150 ± 0.0034 | 0.4726 ± 0.0029 | 79.34 ± 1.4 | 22.15 ± 0.8 |
| +CBAM | P4 | 0.7236 ± 0.0028 | 0.3859 ± 0.0024 | 0.7650 ± 0.0027 | 0.6170 ± 0.0032 | 0.4789 ± 0.0026 | 79.87 ± 1.3 | 22.41 ± 0.8 |
| +CBAM | NC | 0.6948 ± 0.0031 | 0.3683 ± 0.0026 | 0.7511 ± 0.0029 | 0.6126 ± 0.0034 | 0.4643 ± 0.0029 | 80.27 ± 1.4 | 22.63 ± 0.8 |
Table 11.
Loss-function weight sweep analysis. Reducing the box loss weight from 7.0 to 5.0, combined with a lower learning rate (0.001) and gradient clipping, yielded the best balance of mAP@0.5 (0.8412) and APS (0.5489) on the difficult subset, effectively mitigating training instability. (YOLOv8 + CBAM, Difficult Subset, BW is Box Weight, LR is Learning Rate, GC is Gradient Clip, W is Workstation, E is Embedded).
| BW | LR | GC | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|---|---|
| 7.0 | 0.01 | No | 0.8334 ± 0.0028 | 0.4167 ± 0.0023 | 0.8461 ± 0.0025 | 0.7427 ± 0.0034 | 0.5408 ± 0.0026 | 110.81 ± 1.7 | 17.88 ± 1.5 |
| 7.0 | 0.001 | Yes | 0.8357 ± 0.0025 | 0.4189 ± 0.0021 | 0.8480 ± 0.0023 | 0.7440 ± 0.0032 | 0.5431 ± 0.0023 | 110.50 ± 1.6 | 17.80 ± 1.4 |
| 5.0 | 0.001 | Yes | 0.8412 ± 0.0023 | 0.4215 ± 0.0019 | 0.8520 ± 0.0021 | 0.7460 ± 0.0030 | 0.5489 ± 0.0021 | 110.20 ± 1.5 | 17.75 ± 1.3 |
| 4.5 | 0.001 | Yes | 0.8389 ± 0.0024 | 0.4198 ± 0.0020 | 0.8500 ± 0.0022 | 0.7450 ± 0.0031 | 0.5457 ± 0.0022 | 110.30 ± 1.5 | 17.78 ± 1.4 |
| 4.0 | 0.001 | Yes | 0.8367 ± 0.0026 | 0.4182 ± 0.0022 | 0.8485 ± 0.0024 | 0.7445 ± 0.0033 | 0.5423 ± 0.0024 | 110.40 ± 1.6 | 17.82 ± 1.4 |
Table 12.
Performance of detection models with modified loss functions on the overall dataset. SIoU loss raises mAP@0.5 to 0.7286 in the YOLOv8 + ECA configuration, while Focal-EIoU performs best (0.7208) with YOLOv8 + CBAM, indicating that loss-function efficacy depends on both the model and the attention module. (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIoU, FE is Focal-EIoU, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.6617 ± 0.0032 | 0.3443 ± 0.0021 | 0.6711 ± 0.0028 | 0.6034 ± 0.0035 | 0.4164 ± 0.0029 | 83.30 ± 1.2 | 24.08 ± 0.8 |
| Y5 + CB | 0.6623 ± 0.0035 | 0.3285 ± 0.0023 | 0.6806 ± 0.0031 | 0.6251 ± 0.0038 | 0.4319 ± 0.0032 | 81.30 ± 1.3 | 23.40 ± 0.9 |
| Y5 + CB + SI | 0.6788 ± 0.0038 | 0.3623 ± 0.0025 | 0.6522 ± 0.0034 | 0.6698 ± 0.0041 | 0.4253 ± 0.0035 | 80.45 ± 1.4 | 23.17 ± 0.9 |
| Y5 + CB + FE | 0.6762 ± 0.0036 | 0.3593 ± 0.0024 | 0.6785 ± 0.0032 | 0.6097 ± 0.0039 | 0.4245 ± 0.0033 | 81.34 ± 1.3 | 21.30 ± 1.1 |
| Y5 + ECA | 0.6662 ± 0.0034 | 0.3273 ± 0.0022 | 0.6697 ± 0.0030 | 0.6447 ± 0.0037 | 0.4338 ± 0.0031 | 82.28 ± 1.2 | 23.87 ± 0.8 |
| Y5 + ECA + SI | 0.6827 ± 0.0039 | 0.3600 ± 0.0026 | 0.6789 ± 0.0035 | 0.6598 ± 0.0042 | 0.4288 ± 0.0036 | 85.50 ± 1.1 | 22.21 ± 1.0 |
| Y5 + ECA + FE | 0.6694 ± 0.0037 | 0.3565 ± 0.0025 | 0.6729 ± 0.0033 | 0.6431 ± 0.0040 | 0.4207 ± 0.0034 | 87.17 ± 1.0 | 23.05 ± 0.9 |
| Y8 | 0.7119 ± 0.0028 | 0.3795 ± 0.0024 | 0.7114 ± 0.0025 | 0.6598 ± 0.0031 | 0.4523 ± 0.0026 | 83.37 ± 1.1 | 23.30 ± 0.7 |
| Y8 + CB | 0.6948 ± 0.0031 | 0.3683 ± 0.0026 | 0.7511 ± 0.0029 | 0.6126 ± 0.0034 | 0.4643 ± 0.0029 | 80.27 ± 1.4 | 22.63 ± 0.8 |
| Y8 + CB + SI | 0.7173 ± 0.0033 | 0.3713 ± 0.0027 | 0.6880 ± 0.0031 | 0.6936 ± 0.0036 | 0.4621 ± 0.0031 | 80.08 ± 1.4 | 22.00 ± 0.9 |
| Y8 + CB + FE | 0.7208 ± 0.0034 | 0.3913 ± 0.0028 | 0.7401 ± 0.0032 | 0.6598 ± 0.0037 | 0.4618 ± 0.0032 | 89.63 ± 1.0 | 20.20 ± 1.2 |
| Y8 + ECA | 0.7214 ± 0.0030 | 0.3630 ± 0.0025 | 0.7717 ± 0.0027 | 0.6161 ± 0.0033 | 0.4783 ± 0.0028 | 81.58 ± 1.3 | 23.97 ± 0.7 |
| Y8 + ECA + SI | 0.7286 ± 0.0032 | 0.4018 ± 0.0029 | 0.7557 ± 0.0030 | 0.6575 ± 0.0035 | 0.4669 ± 0.0030 | 81.77 ± 1.3 | 22.16 ± 0.9 |
| Y8 + ECA + FE | 0.7131 ± 0.0031 | 0.3924 ± 0.0027 | 0.7140 ± 0.0028 | 0.6704 ± 0.0034 | 0.4583 ± 0.0029 | 82.15 ± 1.2 | 21.38 ± 1.0 |
Table 13.
Performance of detection models with modified loss functions on the surface float dataset. On the surface float dataset, introducing complex loss functions such as Focal-EIoU did not yield consistent gains, and some combinations degraded performance, suggesting unnecessary optimization complexity for these targets. (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIoU, FE is Focal-EIoU, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.8393 ± 0.0025 | 0.4198 ± 0.0021 | 0.8512 ± 0.0022 | 0.7803 ± 0.0028 | 0.6897 ± 0.0023 | 110.45 ± 1.5 | 25.62 ± 0.6 |
| Y5 + CB | 0.8548 ± 0.0023 | 0.4274 ± 0.0019 | 0.8450 ± 0.0021 | 0.8091 ± 0.0026 | 0.7200 ± 0.0021 | 104.69 ± 1.7 | 24.73 ± 0.7 |
| Y5 + CB + SI | 0.8486 ± 0.0026 | 0.4199 ± 0.0022 | 0.8605 ± 0.0023 | 0.7842 ± 0.0029 | 0.6904 ± 0.0024 | 113.16 ± 1.4 | 22.16 ± 0.9 |
| Y5 + CB + FE | 0.8421 ± 0.0027 | 0.4264 ± 0.0023 | 0.8824 ± 0.0024 | 0.7664 ± 0.0030 | 0.6778 ± 0.0025 | 110.63 ± 1.5 | 19.86 ± 1.1 |
| Y5 + ECA | 0.8508 ± 0.0024 | 0.4254 ± 0.0020 | 0.8430 ± 0.0022 | 0.7968 ± 0.0027 | 0.7175 ± 0.0022 | 108.59 ± 1.6 | 25.42 ± 0.6 |
| Y5 + ECA + SI | 0.8449 ± 0.0028 | 0.4282 ± 0.0024 | 0.8779 ± 0.0025 | 0.7755 ± 0.0031 | 0.7004 ± 0.0026 | 115.32 ± 1.3 | 25.10 ± 0.7 |
| Y5 + ECA + FE | 0.8548 ± 0.0025 | 0.4236 ± 0.0021 | 0.8961 ± 0.0023 | 0.7812 ± 0.0028 | 0.6914 ± 0.0023 | 112.92 ± 1.4 | 24.63 ± 0.8 |
| Y8 | 0.8419 ± 0.0022 | 0.4210 ± 0.0018 | 0.8507 ± 0.0020 | 0.8015 ± 0.0025 | 0.6973 ± 0.0020 | 133.11 ± 1.2 | 25.94 ± 0.5 |
| Y8 + CB | 0.8603 ± 0.0020 | 0.4302 ± 0.0017 | 0.8799 ± 0.0019 | 0.7958 ± 0.0023 | 0.7278 ± 0.0018 | 128.45 ± 1.3 | 25.10 ± 0.6 |
| Y8 + CB + SI | 0.8545 ± 0.0023 | 0.4306 ± 0.0020 | 0.8940 ± 0.0021 | 0.7797 ± 0.0026 | 0.7074 ± 0.0021 | 128.03 ± 1.3 | 22.80 ± 0.8 |
| Y8 + CB + FE | 0.8481 ± 0.0024 | 0.4353 ± 0.0021 | 0.8708 ± 0.0022 | 0.7924 ± 0.0027 | 0.6849 ± 0.0022 | 99.34 ± 1.8 | 17.61 ± 1.3 |
| Y8 + ECA | 0.8638 ± 0.0019 | 0.4319 ± 0.0016 | 0.8852 ± 0.0018 | 0.7874 ± 0.0022 | 0.7319 ± 0.0017 | 129.09 ± 1.2 | 24.44 ± 0.6 |
| Y8 + ECA + SI | 0.8526 ± 0.0022 | 0.4341 ± 0.0019 | 0.8964 ± 0.0020 | 0.7823 ± 0.0025 | 0.7032 ± 0.0019 | 128.28 ± 1.3 | 26.02 ± 0.5 |
| Y8 + ECA + FE | 0.8594 ± 0.0021 | 0.4382 ± 0.0018 | 0.8897 ± 0.0019 | 0.8082 ± 0.0024 | 0.7065 ± 0.0018 | 134.25 ± 1.1 | 23.99 ± 0.7 |
Table 14.
Performance of detection models with modified loss functions on the difficult subset. Among the YOLOv5 variants, combining SIoU loss with ECA attention achieved the highest mAP@0.5 (0.8255), about 9.4 percentage points above the YOLOv5 baseline (0.7319), indicating its effectiveness in challenging scenarios; the highest value overall was obtained by YOLOv8 + ECA + SIoU (0.8616). (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIoU, FE is Focal-EIoU, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.7319 ± 0.0042 | 0.3659 ± 0.0035 | 0.7743 ± 0.0038 | 0.6683 ± 0.0048 | 0.4787 ± 0.0039 | 110.88 ± 1.8 | 20.58 ± 1.2 |
| Y5 + CB | 0.7807 ± 0.0039 | 0.3904 ± 0.0032 | 0.8629 ± 0.0035 | 0.6501 ± 0.0045 | 0.5002 ± 0.0036 | 108.54 ± 1.9 | 19.37 ± 1.3 |
| Y5 + CB + SI | 0.7779 ± 0.0041 | 0.3889 ± 0.0034 | 0.8643 ± 0.0037 | 0.6983 ± 0.0047 | 0.4826 ± 0.0038 | 116.11 ± 1.6 | 24.02 ± 1.0 |
| Y5 + CB + FE | 0.7859 ± 0.0040 | 0.3921 ± 0.0033 | 0.8284 ± 0.0036 | 0.7202 ± 0.0046 | 0.4874 ± 0.0037 | 117.90 ± 1.5 | 24.80 ± 0.9 |
| Y5 + ECA | 0.7869 ± 0.0038 | 0.3935 ± 0.0031 | 0.8291 ± 0.0034 | 0.6634 ± 0.0044 | 0.5139 ± 0.0035 | 112.98 ± 1.7 | 19.23 ± 1.4 |
| Y5 + ECA + SI | 0.8255 ± 0.0035 | 0.4128 ± 0.0029 | 0.9004 ± 0.0032 | 0.7351 ± 0.0041 | 0.5105 ± 0.0033 | 81.59 ± 2.1 | 17.55 ± 1.6 |
| Y5 + ECA + FE | 0.7992 ± 0.0037 | 0.4006 ± 0.0030 | 0.8890 ± 0.0033 | 0.6827 ± 0.0043 | 0.4961 ± 0.0034 | 117.68 ± 1.5 | 25.30 ± 0.8 |
| Y8 | 0.8183 ± 0.0030 | 0.4092 ± 0.0025 | 0.8837 ± 0.0027 | 0.7198 ± 0.0036 | 0.5285 ± 0.0028 | 106.39 ± 1.9 | 14.67 ± 1.7 |
| Y8 + CB | 0.8334 ± 0.0028 | 0.4167 ± 0.0023 | 0.8461 ± 0.0025 | 0.7427 ± 0.0034 | 0.5408 ± 0.0026 | 110.81 ± 1.7 | 17.88 ± 1.5 |
| Y8 + CB + SI | 0.8120 ± 0.0032 | 0.4060 ± 0.0027 | 0.9189 ± 0.0029 | 0.7339 ± 0.0038 | 0.5082 ± 0.0031 | 115.96 ± 1.6 | 25.09 ± 0.9 |
| Y8 + CB + FE | 0.8052 ± 0.0033 | 0.4026 ± 0.0028 | 0.9253 ± 0.0030 | 0.7180 ± 0.0039 | 0.4991 ± 0.0032 | 117.17 ± 1.5 | 22.22 ± 1.1 |
| Y8 + ECA | 0.8224 ± 0.0029 | 0.4112 ± 0.0024 | 0.8617 ± 0.0026 | 0.7461 ± 0.0035 | 0.5346 ± 0.0027 | 113.98 ± 1.6 | 19.23 ± 1.4 |
| Y8 + ECA + SI | 0.8616 ± 0.0026 | 0.4308 ± 0.0022 | 0.9091 ± 0.0024 | 0.7921 ± 0.0032 | 0.5347 ± 0.0025 | 117.49 ± 1.5 | 24.86 ± 0.8 |
| Y8 + ECA + FE | 0.8010 ± 0.0031 | 0.4005 ± 0.0026 | 0.8678 ± 0.0028 | 0.7146 ± 0.0037 | 0.4903 ± 0.0030 | 115.51 ± 1.6 | 25.14 ± 0.8 |
Table 15.
Performance of detection models with high-resolution input on the surface float dataset. High-resolution training (1536) raised the APS of YOLOv8 + ECA to 0.7897, but the embedded frame rate dropped to 6.56 FPS, highlighting the substantial computational cost of the accuracy gain. (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIoU, FE is Focal-EIoU, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.8999 ± 0.0021 | 0.4069 ± 0.0018 | 0.8896 ± 0.0019 | 0.8387 ± 0.0024 | 0.7519 ± 0.0020 | 50.74 ± 2.1 | 5.45 ± 0.9 |
| Y5 + CB | 0.8996 ± 0.0022 | 0.4076 ± 0.0019 | 0.8927 ± 0.0020 | 0.8401 ± 0.0025 | 0.7542 ± 0.0021 | 71.55 ± 1.8 | 7.77 ± 0.7 |
| Y5 + CB + SI | 0.9186 ± 0.0019 | 0.4722 ± 0.0017 | 0.8881 ± 0.0018 | 0.8565 ± 0.0022 | 0.7553 ± 0.0019 | 73.94 ± 1.7 | 7.59 ± 0.7 |
| Y5 + CB + FE | 0.9043 ± 0.0020 | 0.4863 ± 0.0018 | 0.8659 ± 0.0019 | 0.8337 ± 0.0023 | 0.7456 ± 0.0020 | 73.49 ± 1.7 | 7.70 ± 0.7 |
| Y5 + ECA | 0.9138 ± 0.0018 | 0.4241 ± 0.0016 | 0.9160 ± 0.0017 | 0.8613 ± 0.0021 | 0.7867 ± 0.0018 | 73.75 ± 1.7 | 7.83 ± 0.6 |
| Y5 + ECA + SI | 0.8979 ± 0.0023 | 0.4615 ± 0.0020 | 0.8975 ± 0.0021 | 0.8263 ± 0.0026 | 0.7456 ± 0.0022 | 74.74 ± 1.6 | 7.73 ± 0.7 |
| Y5 + ECA + FE | 0.9190 ± 0.0017 | 0.4870 ± 0.0015 | 0.8749 ± 0.0016 | 0.8632 ± 0.0020 | 0.7642 ± 0.0017 | 72.11 ± 1.8 | 7.80 ± 0.7 |
| Y8 | 0.9159 ± 0.0016 | 0.4229 ± 0.0014 | 0.8496 ± 0.0015 | 0.9145 ± 0.0018 | 0.7716 ± 0.0016 | 58.51 ± 1.9 | 4.89 ± 1.0 |
| Y8 + CB | 0.9297 ± 0.0014 | 0.4269 ± 0.0012 | 0.8652 ± 0.0013 | 0.9114 ± 0.0016 | 0.7803 ± 0.0014 | 64.43 ± 1.7 | 6.66 ± 0.8 |
| Y8 + CB + SI | 0.9251 ± 0.0015 | 0.4839 ± 0.0013 | 0.8843 ± 0.0014 | 0.8844 ± 0.0017 | 0.7680 ± 0.0015 | 63.31 ± 1.7 | 6.52 ± 0.8 |
| Y8 + CB + FE | 0.9234 ± 0.0016 | 0.4836 ± 0.0014 | 0.9070 ± 0.0015 | 0.8690 ± 0.0018 | 0.7665 ± 0.0016 | 54.97 ± 1.9 | 4.83 ± 1.0 |
| Y8 + ECA | 0.9357 ± 0.0013 | 0.4270 ± 0.0011 | 0.8766 ± 0.0012 | 0.9037 ± 0.0015 | 0.7897 ± 0.0013 | 66.50 ± 1.6 | 6.56 ± 0.8 |
| Y8 + ECA + SI | 0.9276 ± 0.0015 | 0.4920 ± 0.0013 | 0.9055 ± 0.0014 | 0.8673 ± 0.0017 | 0.7620 ± 0.0015 | 55.69 ± 1.9 | 4.96 ± 1.0 |
| Y8 + ECA + FE | 0.9191 ± 0.0016 | 0.4883 ± 0.0014 | 0.8979 ± 0.0015 | 0.8642 ± 0.0018 | 0.7606 ± 0.0016 | 60.02 ± 1.8 | 4.94 ± 1.0 |
Table 16.
Performance of detection models with image slicing on the surface float dataset. With tiled inference, the YOLOv5 baseline reached an APS of 0.7836 while maintaining a practical embedded speed of 18.02 FPS, a favorable balance between accuracy and efficiency. (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIoU, FE is Focal-EIoU, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.8973 ± 0.0020 | 0.4249 ± 0.0017 | 0.8892 ± 0.0018 | 0.8543 ± 0.0022 | 0.7836 ± 0.0019 | 112.81 ± 1.4 | 18.02 ± 1.0 |
| Y5 + CB | 0.8975 ± 0.0021 | 0.4184 ± 0.0018 | 0.9046 ± 0.0019 | 0.8503 ± 0.0023 | 0.7794 ± 0.0020 | 114.92 ± 1.3 | 17.13 ± 1.1 |
| Y5 + CB + SI | 0.9084 ± 0.0019 | 0.4757 ± 0.0016 | 0.8949 ± 0.0017 | 0.8553 ± 0.0021 | 0.7640 ± 0.0018 | 97.22 ± 1.6 | 17.00 ± 1.1 |
| Y5 + CB + FE | 0.9002 ± 0.0020 | 0.4741 ± 0.0017 | 0.8980 ± 0.0018 | 0.8255 ± 0.0022 | 0.7600 ± 0.0019 | 91.48 ± 1.7 | 13.57 ± 1.4 |
| Y5 + ECA | 0.8855 ± 0.0023 | 0.4428 ± 0.0019 | 0.8825 ± 0.0021 | 0.8512 ± 0.0025 | 0.7662 ± 0.0022 | 65.86 ± 2.0 | 12.59 ± 1.5 |
| Y5 + ECA + SI | 0.9103 ± 0.0018 | 0.4797 ± 0.0015 | 0.9180 ± 0.0016 | 0.8337 ± 0.0020 | 0.7754 ± 0.0017 | 91.25 ± 1.7 | 11.95 ± 1.6 |
| Y5 + ECA + FE | 0.8952 ± 0.0021 | 0.4826 ± 0.0018 | 0.8693 ± 0.0019 | 0.8468 ± 0.0023 | 0.7453 ± 0.0020 | 114.35 ± 1.4 | 17.04 ± 1.1 |
| Y8 | 0.9055 ± 0.0017 | 0.4314 ± 0.0014 | 0.8562 ± 0.0015 | 0.8923 ± 0.0019 | 0.7845 ± 0.0016 | 118.53 ± 1.2 | 17.73 ± 0.9 |
| Y8 + CB | 0.9056 ± 0.0018 | 0.4403 ± 0.0015 | 0.8914 ± 0.0016 | 0.8606 ± 0.0020 | 0.7844 ± 0.0017 | 116.30 ± 1.3 | 16.54 ± 1.0 |
| Y8 + CB + SI | 0.9118 ± 0.0016 | 0.4915 ± 0.0013 | 0.8967 ± 0.0014 | 0.8647 ± 0.0018 | 0.7723 ± 0.0015 | 97.46 ± 1.6 | 12.32 ± 1.3 |
| Y8 + CB + FE | 0.9001 ± 0.0019 | 0.4910 ± 0.0016 | 0.9039 ± 0.0017 | 0.8475 ± 0.0021 | 0.7500 ± 0.0018 | 115.59 ± 1.3 | 16.52 ± 1.0 |
| Y8 + ECA | 0.9070 ± 0.0017 | 0.4359 ± 0.0014 | 0.9158 ± 0.0015 | 0.8538 ± 0.0019 | 0.7771 ± 0.0016 | 116.22 ± 1.3 | 16.95 ± 1.0 |
| Y8 + ECA + SI | 0.9146 ± 0.0015 | 0.4855 ± 0.0013 | 0.8968 ± 0.0014 | 0.8608 ± 0.0017 | 0.7721 ± 0.0015 | 115.92 ± 1.3 | 17.20 ± 0.9 |
| Y8 + ECA + FE | 0.9025 ± 0.0018 | 0.4860 ± 0.0015 | 0.9112 ± 0.0016 | 0.8433 ± 0.0020 | 0.7525 ± 0.0017 | 98.70 ± 1.6 | 13.61 ± 1.2 |
Table 17.
Performance of detection models under low-light simulation on the difficult subset. The YOLOv8 + ECA + SIoU combination achieved an mAP@0.5 of 0.8184, an improvement of ~0.07 over the YOLOv8 baseline (0.7499), indicating strong robustness to degraded lighting conditions. (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIoU, FE is Focal-EIoU, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.7059 ± 0.0048 | 0.4348 ± 0.0040 | 0.7857 ± 0.0043 | 0.6224 ± 0.0052 | 0.4383 ± 0.0044 | 118.36 ± 1.6 | 19.67 ± 1.3 |
| Y5 + CB | 0.7745 ± 0.0042 | 0.4727 ± 0.0035 | 0.8648 ± 0.0038 | 0.6953 ± 0.0047 | 0.4981 ± 0.0039 | 103.53 ± 1.9 | 14.01 ± 1.6 |
| Y5 + CB + SI | 0.7393 ± 0.0045 | 0.4629 ± 0.0038 | 0.8293 ± 0.0041 | 0.6298 ± 0.0050 | 0.4658 ± 0.0042 | 95.48 ± 2.1 | 12.63 ± 1.8 |
| Y5 + CB + FE | 0.7577 ± 0.0043 | 0.4973 ± 0.0036 | 0.8395 ± 0.0039 | 0.6544 ± 0.0048 | 0.4708 ± 0.0040 | 97.14 ± 2.0 | 12.94 ± 1.7 |
| Y5 + ECA | 0.7930 ± 0.0040 | 0.4855 ± 0.0033 | 0.8494 ± 0.0036 | 0.7351 ± 0.0045 | 0.5241 ± 0.0037 | 97.16 ± 2.0 | 12.83 ± 1.7 |
| Y5 + ECA + SI | 0.7603 ± 0.0044 | 0.5113 ± 0.0037 | 0.8385 ± 0.0040 | 0.6773 ± 0.0049 | 0.4779 ± 0.0041 | 96.59 ± 2.1 | 11.55 ± 1.9 |
| Y5 + ECA + FE | 0.7620 ± 0.0043 | 0.5058 ± 0.0036 | 0.8288 ± 0.0039 | 0.6522 ± 0.0048 | 0.4730 ± 0.0040 | 97.17 ± 2.0 | 14.07 ± 1.6 |
| Y8 | 0.7499 ± 0.0041 | 0.5250 ± 0.0034 | 0.8790 ± 0.0037 | 0.6505 ± 0.0046 | 0.4748 ± 0.0038 | 105.42 ± 1.8 | 13.88 ± 1.7 |
| Y8 + CB | 0.7990 ± 0.0037 | 0.5275 ± 0.0031 | 0.9074 ± 0.0034 | 0.7157 ± 0.0042 | 0.5106 ± 0.0035 | 117.61 ± 1.5 | 20.01 ± 1.2 |
| Y8 + CB + SI | 0.7793 ± 0.0040 | 0.5564 ± 0.0033 | 0.8699 ± 0.0036 | 0.6989 ± 0.0044 | 0.4910 ± 0.0037 | 102.68 ± 1.9 | 14.16 ± 1.6 |
| Y8 + CB + FE | 0.7670 ± 0.0042 | 0.5378 ± 0.0035 | 0.8653 ± 0.0038 | 0.6927 ± 0.0045 | 0.4857 ± 0.0039 | 101.45 ± 1.9 | 10.97 ± 2.0 |
| Y8 + ECA | 0.8046 ± 0.0036 | 0.5327 ± 0.0030 | 0.8367 ± 0.0033 | 0.7531 ± 0.0041 | 0.5092 ± 0.0034 | 100.03 ± 2.0 | 12.01 ± 1.8 |
| Y8 + ECA + SI | 0.8184 ± 0.0034 | 0.5595 ± 0.0029 | 0.8834 ± 0.0031 | 0.7139 ± 0.0039 | 0.5074 ± 0.0032 | 101.49 ± 1.9 | 13.86 ± 1.7 |
| Y8 + ECA + FE | 0.7817 ± 0.0039 | 0.5541 ± 0.0032 | 0.8812 ± 0.0035 | 0.6980 ± 0.0043 | 0.4991 ± 0.0036 | 101.80 ± 1.9 | 13.94 ± 1.7 |
Table 18.
Performance of detection models under strong-reflection simulation on the difficult subset. YOLOv8 + CBAM led with an mAP@0.5 of 0.8443 and an APS of 0.5456, suggesting that its spatial attention mechanism helps suppress specular reflection interference. (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIoU, FE is Focal-EIoU, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.7466 ± 0.0045 | 0.4488 ± 0.0038 | 0.8336 ± 0.0041 | 0.6747 ± 0.0049 | 0.4841 ± 0.0042 | 116.28 ± 1.7 | 18.45 ± 1.4 |
| Y5 + CB | 0.7765 ± 0.0041 | 0.4688 ± 0.0034 | 0.8086 ± 0.0037 | 0.6864 ± 0.0046 | 0.5048 ± 0.0038 | 105.67 ± 1.9 | 15.32 ± 1.6 |
| Y5 + CB + SI | 0.7185 ± 0.0047 | 0.4583 ± 0.0039 | 0.7757 ± 0.0043 | 0.6749 ± 0.0051 | 0.4480 ± 0.0044 | 96.84 ± 2.1 | 13.28 ± 1.8 |
| Y5 + CB + FE | 0.7157 ± 0.0046 | 0.4642 ± 0.0039 | 0.7769 ± 0.0042 | 0.6446 ± 0.0050 | 0.4353 ± 0.0043 | 98.51 ± 2.0 | 13.61 ± 1.7 |
| Y5 + ECA | 0.7793 ± 0.0040 | 0.4774 ± 0.0033 | 0.8176 ± 0.0036 | 0.6613 ± 0.0045 | 0.5214 ± 0.0037 | 98.73 ± 2.0 | 13.42 ± 1.7 |
| Y5 + ECA + SI | 0.7315 ± 0.0045 | 0.4576 ± 0.0038 | 0.8178 ± 0.0041 | 0.6512 ± 0.0049 | 0.4556 ± 0.0042 | 97.95 ± 2.1 | 12.18 ± 1.9 |
| Y5 + ECA + FE | 0.7171 ± 0.0046 | 0.4494 ± 0.0039 | 0.7845 ± 0.0042 | 0.6262 ± 0.0050 | 0.4513 ± 0.0043 | 98.82 ± 2.0 | 15.24 ± 1.6 |
| Y8 | 0.8175 ± 0.0035 | 0.5024 ± 0.0029 | 0.8436 ± 0.0032 | 0.7227 ± 0.0040 | 0.5221 ± 0.0033 | 104.18 ± 1.8 | 14.53 ± 1.7 |
| Y8 + CB | 0.8443 ± 0.0031 | 0.5200 ± 0.0026 | 0.8787 ± 0.0028 | 0.7453 ± 0.0037 | 0.5456 ± 0.0029 | 116.24 ± 1.5 | 19.67 ± 1.2 |
| Y8 + CB + SI | 0.7777 ± 0.0039 | 0.5406 ± 0.0033 | 0.8453 ± 0.0035 | 0.7132 ± 0.0043 | 0.4981 ± 0.0036 | 101.32 ± 1.9 | 14.89 ± 1.6 |
| Y8 + CB + FE | 0.7536 ± 0.0041 | 0.5195 ± 0.0034 | 0.8427 ± 0.0037 | 0.6666 ± 0.0045 | 0.4623 ± 0.0038 | 100.18 ± 1.9 | 11.64 ± 1.9 |
| Y8 + ECA | 0.8098 ± 0.0034 | 0.5058 ± 0.0028 | 0.8203 ± 0.0031 | 0.7508 ± 0.0039 | 0.5266 ± 0.0032 | 99.67 ± 2.0 | 12.78 ± 1.8 |
| Y8 + ECA + SI | 0.7913 ± 0.0037 | 0.5566 ± 0.0031 | 0.8569 ± 0.0034 | 0.7048 ± 0.0042 | 0.5036 ± 0.0035 | 100.15 ± 1.9 | 14.53 ± 1.7 |
| Y8 + ECA + FE | 0.7887 ± 0.0038 | 0.5600 ± 0.0032 | 0.8661 ± 0.0035 | 0.6978 ± 0.0043 | 0.4856 ± 0.0036 | 100.47 ± 1.9 | 14.61 ± 1.7 |
Table 19.
Cross-domain generalization performance (mAP@0.5). Leave-one-domain-out cross-validation shows stable performance on unseen scenes (e.g., inland rivers, mAP@0.5 = 0.724) and a 15.8% relative improvement over the YOLOv8 baseline in the strong-reflection domain (0.682 vs. 0.589), supporting the model's cross-domain generalization.
| Training Domains | Test Domain | YOLOv8 + ECA + SIoU | YOLOv8 |
|---|---|---|---|
| Harbor Areas + Nearshore Waters + Strong Reflection | Inland Rivers | 0.724 | 0.658 |
| Inland Rivers + Nearshore Waters + Strong Reflection | Harbor Areas | 0.706 | 0.642 |
| Inland Rivers + Harbor Areas + Strong Reflection | Nearshore Waters | 0.718 | 0.651 |
| Inland Rivers + Harbor Areas + Nearshore Waters | Strong Reflection | 0.682 | 0.589 |
| All Domains | Overall | 0.745 | 0.712 |
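The relative improvement quoted in the caption can be reproduced from the table values. The following is a minimal Python sketch; the dictionary layout and the helper name `relative_gain` are illustrative assumptions and are not taken from the original evaluation code.

```python
# mAP@0.5 values copied from Table 19, keyed by held-out test domain:
# (YOLOv8 + ECA + SIoU, YOLOv8 baseline).
TABLE_19 = {
    "Inland Rivers": (0.724, 0.658),
    "Harbor Areas": (0.706, 0.642),
    "Nearshore Waters": (0.718, 0.651),
    "Strong Reflection": (0.682, 0.589),
}


def relative_gain(improved: float, baseline: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    for domain, (improved, baseline) in TABLE_19.items():
        print(f"{domain}: {relative_gain(improved, baseline):.1f}%")
```

With these values, the held-out strong-reflection domain shows the largest relative gain (about 15.8%), while the other three domains each gain roughly 10%.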
Table 20.
Model deployment performance analysis. Deployment profiling shows that the YOLOv8 + CBAM variants have the highest P95 latency (71.5–73.2 ms) and mild thermal throttling, while all models draw similar power (9.8–10.9 W), indicating that attention modules primarily affect latency rather than energy draw. (Jetson Xavier NX, TensorRT FP16, Overall Dataset).
| Model | APS | P50 Latency (ms) | P95 Latency (ms) | Power (W) | Thermal Throttling |
|---|---|---|---|---|---|
| YOLOv5 | 0.4164 | 41.5 | 63.2 | 9.8 | No |
| YOLOv5 + CBAM | 0.4319 | 42.7 | 67.3 | 10.3 | No |
| YOLOv5 + ECA | 0.4338 | 41.9 | 64.1 | 10.1 | No |
| YOLOv5 + CBAM + SIoU | 0.4253 | 42.9 | 68.5 | 10.5 | No |
| YOLOv5 + CBAM + Focal-EIoU | 0.4245 | 42.5 | 67.8 | 10.4 | No |
| YOLOv5 + ECA + SIoU | 0.4288 | 42.2 | 65.3 | 10.3 | No |
| YOLOv5 + ECA + Focal-EIoU | 0.4207 | 41.8 | 64.5 | 10.2 | No |
| YOLOv8 | 0.4523 | 42.9 | 68.5 | 10.5 | No |
| YOLOv8 + CBAM | 0.4643 | 44.2 | 72.1 | 10.8 | Mild |
| YOLOv8 + ECA | 0.4783 | 41.7 | 65.8 | 10.2 | No |
| YOLOv8 + CBAM + SIoU | 0.4621 | 44.5 | 73.2 | 10.9 | Mild |
| YOLOv8 + CBAM + Focal-EIoU | 0.4618 | 44.0 | 71.5 | 10.7 | Mild |
| YOLOv8 + ECA + SIoU | 0.4669 | 41.5 | 64.2 | 10.1 | No |
| YOLOv8 + ECA + Focal-EIoU | 0.4583 | 41.3 | 63.8 | 10.0 | No |
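To read the latency columns alongside the FPS figures reported in the other tables, a per-frame latency can be converted to an equivalent frame rate. The sketch below assumes single-stream inference (one frame processed at a time, no batching or pipelining), which the source does not state explicitly; the function name `latency_ms_to_fps` is hypothetical.

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second.

    Assumes frames are processed one at a time with no batching or pipelining.
    """
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# P50 and P95 latencies for YOLOv8 + ECA + SIoU, copied from Table 20.
p50_fps = latency_ms_to_fps(41.5)
p95_fps = latency_ms_to_fps(64.2)
print(f"median: {p50_fps:.1f} FPS, 95th percentile: {p95_fps:.1f} FPS")
```

Under this assumption, the tail latency (P95) rather than the median bounds the frame rate that can be sustained for nearly all frames.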