Article

WA-YOLO: Water-Aware Improvements for Maritime Small-Object Detection Under Glare and Low-Light

1 Merchant Marine College, Shanghai Maritime University, Shanghai 201306, China
2 School of Navigation, Wuhan University of Technology, Wuhan 430070, China
3 Logistics Engineering College, Shanghai Maritime University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(1), 37; https://doi.org/10.3390/jmse14010037
Submission received: 24 November 2025 / Revised: 15 December 2025 / Accepted: 21 December 2025 / Published: 24 December 2025
(This article belongs to the Section Ocean Engineering)

Abstract

Maritime vision systems for unmanned surface vehicles confront persistent challenges from small objects, specular reflections and low-light conditions. This paper introduces WA-YOLO, a water-aware training framework that (i) incorporates lightweight attention modules (ECA/CBAM) to enhance the model’s discriminative capacity for small objects and critical features, particularly against cluttered water ripples and glare backgrounds; (ii) employs advanced bounding box regression losses (e.g., SIoU) to improve localization stability and convergence efficiency under wave disturbances; (iii) systematically explores the efficacy trade-off between high-resolution input and tiled inference strategies for small-object detection, significantly boosting small-object recall (APS) while carefully evaluating the impact on real-time performance on embedded devices; and (iv) introduces physically inspired data augmentation techniques for low-light and strong-reflection scenarios, compelling the model to learn more robust feature representations under extreme optical variations. WA-YOLO achieves a +2.1% improvement in mAP@0.5 and a +6.3% gain in APS over YOLOv8 across three test sets. When benchmarked against the advanced RT-DETR model (0.7286 mAP@0.5), WA-YOLO not only surpasses its detection accuracy but crucially maintains real-time performance at 118 FPS on workstations and 17 FPS on embedded devices, achieving a superior balance between precision and efficiency. Our approach offers a simple, reproducible and readily deployable solution, with full code and pre-trained models publicly released.

1. Introduction

Unmanned Surface Vessels (USVs) are becoming increasingly important in missions such as marine surveillance, nearshore patrol and emergency rescue, and their autonomous navigation capabilities depend heavily on stable and reliable visual perception systems [1,2,3]. However, the maritime environment poses unique challenges: strong water surface reflections, frequent illumination changes, difficulty in identifying small targets (e.g., buoys, floats) and targets that are often semi-submerged or occluded by waves, all of which seriously degrade the accuracy and robustness of detection models [4,5].
Research on object detection in maritime environments has recently evolved along several key trajectories, which can be categorized into four main strands: general-purpose model adaptation, specialized small-object detection improvements, robustness enhancement against extreme optical conditions and dedicated frameworks for maritime scenarios. The first strand involves applying established general-purpose detectors (e.g., YOLO series, Faster R-CNN) directly to maritime imagery, adapting them via fine-tuning or parameter adjustments [6,7]. While straightforward, these approaches often fail to address the inherent conflicts specific to aquatic settings. The second strand focuses on the perennial challenge of small-object detection, employing strategies such as feature pyramid networks for multi-scale representation, context integration and high-resolution input or image tiling to preserve fine details [8,9,10]. However, the real-time performance of these strategies on embedded devices is frequently overlooked. The third strand aims to improve model stability under harsh conditions like low light and strong reflections, for instance, through preprocessing input with image-enhancement techniques or data augmentation that simulates optical disturbances [11,12]. The fourth strand comprises specialized perception systems designed explicitly for USVs [6], which often integrate sensor fusion or task-specific modules but tend to be complex and may not optimally balance small-object detection accuracy with computational efficiency.
The core contribution of our proposed WA-YOLO framework lies in the systematic integration and synergistic optimization of these key directions. Beyond incorporating lightweight attention and advanced regression losses to tackle feature confusion and localization instability, respectively, our primary advancement is the unified training protocol that deeply couples small-object strategies (high-resolution/tiling) and physics-inspired augmentations (for low light/reflection) with the base model architecture and its optimization goals. Through rigorous modular ablation and performance trade-off analysis within a consistent experimental framework, we aim to deliver a comprehensive solution that is both high-performing and readily deployable, thereby bridging the gap in the current research regarding systematic integration and engineering practicality. The main contributions are as follows:
  • We design a training protocol for water-environment perception that incorporates a lightweight attention module, a stable regression loss, a 960-pixel baseline input size with high-resolution/image-slicing strategies and low-light/reflective data augmentation methods to improve the model’s adaptability to extreme environments [6].
  • We performed systematic modular ablations under a unified experimental framework using YOLOv5/v8 with single-variable control and repeated runs across multiple maritime datasets (overall, surface floats, difficult subset), precisely quantifying the individual impact of each enhancement on detection accuracy, APS and inference speed on embedded devices [13].
  • We provide an in-depth analysis of the trade-off between APS and FPS to evaluate the actual deployment performance of the model on embedded devices.
  • We fully open-source our code and experimental configurations to promote reproducible research and engineering applications of maritime visual detection [14].
Table 1 shows that WA-YOLO achieves the best overall detection accuracy (0.8616 mAP@0.5) and inference speed (24.86 FPS). Compared to the CBAM-based refinement approach [15], which also employs attention mechanisms, WA-YOLO exhibits a 3.1-percentage-point advantage in mAP@0.5 while delivering 37.0% higher FPS. When compared to the small-object context aggregation method [13], WA-YOLO’s accuracy advantage (+4.4% mAP) demonstrates the value of holistic co-design. The image-enhancement-based reflection suppression method [11] suffers from significant system-level latency (10.87 FPS, less than half of WA-YOLO’s speed) due to its serial preprocessing pipeline, while offering limited accuracy improvements. This comparative analysis quantitatively validates WA-YOLO’s systematic co-design approach, which balances computational efficiency with detection accuracy through carefully integrated lightweight modules.

2. Materials and Methods

2.1. Data Collection

This study integrates multiple publicly available maritime vision datasets [16,17], forming a benchmark dataset suitable for water surface object detection through systematic data cleansing and subset construction. The selected datasets encompass diverse aquatic environments, including inland rivers, lakes, coastal waters and harbor areas, covering various lighting conditions (midday glare, sidelight, dusk/low light) and sea state variations (calm to moderate waves). This ensures the samples adequately reflect real-world complexity and engineering application scenarios [17,18,19].
To ensure the model’s generalization capability in practical maritime environments, this study employs a cross-domain validation strategy for systematic dataset partitioning. Based on scene characteristics and optical conditions, samples are categorized into four representative domains: inland rivers, harbor areas, nearshore waters and strong-reflection scenarios. Each domain encompasses unique environmental features and detection challenges, such as dense target distribution in harbor areas and specular reflection interference in strong-reflection scenarios.
We systematically planned sample collection to encompass a broad spectrum of experimental conditions. The environmental parameters covered by the dataset are specified as follows: illumination conditions span from high-intensity lighting under clear noon skies (>80,000 lux) to low-light scenarios at dusk, dawn and on overcast or rainy days; visibility ranges from clear conditions (>5 km) to hazy or misty conditions, reflecting the impact of moderately reduced visibility (1–5 km) on target contrast. The spatial scales of the scenes traverse the full spectrum from open water (where observation distances often reach hundreds of meters to several kilometers) to nearshore and harbor areas (medium distance, 50–500 m, with complex backgrounds) and to inland rivers and lakes (close range, 10–200 m, often accompanied by dense shoreline vegetation).
The preference for static images over video sequences is based on several considerations:
  • Information Density and Redundancy Handling: Consecutive video frames contain substantial redundant or near-identical content, increasing storage and management overhead while offering limited marginal benefits for single-frame detection model training.
  • Annotation Consistency and Quality Control: Video annotation requires maintaining temporal consistency [16,18] (e.g., target ID tracking, cross-frame occlusion handling), significantly escalating manual annotation cost and complexity. In contrast, static image annotation facilitates higher quality control, conducive to building open and reproducible benchmark datasets.
  • Avoiding Temporal Bias and Data Leakage: Time-based splitting of training/test sets may introduce temporal correlations or sample leakage, compromising fair assessment of model generalizability. Independent static images strictly ensure inter-sample independence.
  • Sharing and Reproducibility Convenience [19,20]: Image datasets offer superior advantages in transmission, storage and public release, enabling peers to easily replicate experiments and conduct direct comparisons across different models/baselines.
Consequently, this study employs static high-quality images [16] as the primary data source, focusing on evaluating the robustness of single-frame vision algorithms while providing a clear baseline reference for potential future temporal studies.

2.2. Annotation and Categories

Leveraging original annotations from public datasets, we performed unified secondary cleansing and hard case selection [21,22,23] tailored to water surface perception tasks. The annotation format utilizes bounding boxes, with polygon masks provided for subsets intended for subsequent semantic or instance segmentation research.
During the integration process, we identified instances of “different names for the same object” in the original category definitions, such as boat, ship and vessel (all referring to watercraft) and person and sailor (both referring to humans). This label redundancy can dilute the Average Precision (AP) metric during model evaluation. To address this, we established clear category consolidation rules: all vessel-related annotations (including boat, ship, vessel, kayak) were unified into the vessel category; all person-related annotations (including person, sailor) were unified into the person category. After deduplication and consolidation, the final number of categories used for model training and evaluation was 15. Their specific distribution is detailed in Table 2.
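To make the consolidation rule concrete, the following minimal Python sketch illustrates the synonym-to-category mapping described above (only the vessel and person rules stated in the text are shown; treating all other labels as pass-through is our simplifying assumption, and this is not the released cleansing script):

SYNONYM_MAP = {
    "boat": "vessel", "ship": "vessel", "vessel": "vessel", "kayak": "vessel",
    "person": "person", "sailor": "person",
}

def consolidate(label: str) -> str:
    # Map a raw dataset label onto the unified 15-class taxonomy;
    # labels without a synonym rule are kept unchanged.
    return SYNONYM_MAP.get(label, label)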
Addressing sea-surface-specific challenges [24,25] like occlusion, partial submergence and specular reflection, our annotation protocol explicitly requires bounding boxes to tightly fit the visible target contours. For partially submerged or wave-occluded targets, additional “Visible Proportion” and “Occlusion Degree” attribute labels are added. Areas deemed unreliable due to strong reflection or overexposure [24,26] are explicitly marked as “Ignore Regions”. Our statistical analysis across the entire dataset revealed that approximately 3.5% of the image area is designated as ignore regions. These regions are excluded from loss calculation during training and from positive/negative sample matching during evaluation, thereby effectively preventing the model from overfitting to ambiguous areas or generating misleading results during assessment. Mature platforms (e.g., LabelImg [21], CVAT, or Roboflow) were used for annotation, exporting data in common COCO [27]/YOLO formats while preserving timestamps, sensor poses and acquisition metadata to facilitate subsequent retrospective analysis and subset extraction based on scene or sensor configuration.
Quality control implements a three-tier mechanism: initial drafting according to annotation rules; assessment of annotation consistency via double-blind review with random sampling, computing the IoU distribution between independent annotations as a consistency metric [21], followed by re-annotation of low-consistency categories if necessary; and finally, scripted consistency checks across the entire dataset to detect issues like overlapping labels, abnormal sizes and empty categories. This annotation system balances academic comparability and engineering practicality, meeting the needs of conventional object detection training and evaluation.
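As an illustration of the IoU-based consistency check, the sketch below (a simplified stand-in for the project’s QA tooling, not the actual scripts) computes, for each box from one reviewer, its best-match IoU against the second reviewer’s boxes; the distribution of these values is what the double-blind review inspects:

# Illustrative double-annotation agreement check; boxes are (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def annotator_agreement(boxes_a, boxes_b):
    # Greedy best-match IoU per box from reviewer A against reviewer B;
    # low values flag categories needing re-annotation.
    return [max((iou(a, b) for b in boxes_b), default=0.0) for a in boxes_a]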

2.3. Dataset Statistics

After data cleansing and subset construction [17,28], nearly 10,000 high-quality static images were obtained, containing a total of 26,505 meticulously annotated object instances. A systematic preprocessing pipeline was implemented to ensure quality and consistency: firstly, to address the issue of “different names for the same object” in the original annotations (e.g., boat, ship, vessel), we established clear category consolidation rules, unifying the original 19 categories into 15 to prevent metric dilution during evaluation. Secondly, unified secondary annotation cleansing and hard case selection were performed, where bounding boxes were rigorously adjusted according to the characteristics of water surface targets and areas with strong reflections or overexposure were marked as “ignore regions”. Furthermore, a three-tier quality control mechanism (initial annotation, double-blind random review and scripted consistency checks across the datasets) was employed to guarantee annotation accuracy and consistency.
We conducted a comprehensive analysis of the typical physical and perceptual characteristics of key maritime object categories. For instance, the “vessel” category encompasses a broad size range from approximately 5 m small boats to large commercial ships exceeding 50 m, with corresponding typical observation distances spanning 50 to 1500 m. This multi-scale nature necessitates that the model possess robust recognition capabilities for both nearby and distant vessel silhouettes. As a typical small surface marker, the “buoy” has a physical size concentrated between 0.3 and 2.0 m. Within the common observation range of 10–300 m, while its regular geometric shape offers a degree of distinctiveness, it is often compromised by interference from its own specular reflections. The “person” target, with a physical height of about 1.6–1.9 m, presents, within the typical distance of 5–100 m, a silhouette that easily blends into the background. “Floating debris” represents a challenging category whose physical dimensions range from 0.1 to 2.0 m but whose shapes are indeterminate, lacking stable features. “Platform” targets (e.g., dock structures) often have dimensions on the order of 10–100 m, presenting large-area silhouettes at distances of 30–500 m, with a primary challenge arising from mutual occlusion with vessels. By integrating physical parameters such as size magnitude, observation distance and geometric form with environmental conditions, this study ensures a close alignment with the practical requirements of maritime visual perception.
  • Category Coverage and Representativeness: The original annotations span 19 representative categories (consolidated to 15 for training and evaluation, as described in Section 2.2). Empirically, samples are dominated by “vessels (of varying sizes and types)” and “shoreline/pier structures,” while several hazardous but rare categories (e.g., small floats, semi-submerged objects) are significantly underrepresented in the overall sample. To ensure rigorous evaluation, we consciously retained and moderately increased the sample weights of these rare/high-risk categories [29] within the dataset, enabling specific examination of the model’s capability [30] to identify hazardous and rare samples during result analysis.
  • Target Scale and Quantity Distribution: Targets are categorized into small, medium and large based on pixel area. The overall trend indicates that most images contain only a few targets (suitable for single-frame obstacle avoidance evaluation), but a certain proportion of dense scenes (near docks or navigable channels) place higher demands on the model’s multi-target discrimination capability. Small targets [29,31] (posing greater risks to obstacle avoidance systems yet more challenging to detect) do not dominate the overall imagery.
  • Environmental Factor Coverage: To ensure statistical utility, images were labeled according to key environmental dimensions (lighting: daytime/low light/strong reflections; scene: open water/nearshore and harbor/inland rivers and lakes; sea state: calm/moderate waves, etc.). All scene types are represented, though sample counts vary across different environments [11,32]—for instance, strong reflection samples are relatively scarce compared to dusk low-light samples.
Overall, the distribution of reported categories is summarized in Table 2 and Figure 1 illustrates typical annotated images under different environmental conditions.

2.4. Innovative Features of the Model

The design of the WA-YOLO framework is shown in Figure 2. The primary strengths of this framework lie in its modular co-design and deployment-oriented trade-off analysis: it not only systematically couples improvements across the three dimensions of perception, optimization and efficiency but also provides clear configuration [33,34,35] guidelines for practical applications through exhaustive ablation studies and speed-accuracy evaluations. Its limitations are primarily reflected in its dependence on the quality and coverage of training data, the potential additional computational overhead from certain complex module [36,37] combinations and the fact that, as a perception-layer framework, seamless integration with downstream decision and control systems remains an area for further exploration.
The design philosophy of the WA-YOLO framework transcends mere module assembly; its core innovation lies in “Water-Aware Synergistic Optimization.” This concept emphasizes the deep customization and systematic coupling of existing techniques tailored to the specificities of the maritime optical and physical environment. The framework aims to construct a synergistic system jointly tuned across three dimensions—perception, optimization and efficiency—to address the complex challenges of aquatic settings. Specifically, the innovativeness of WA-YOLO manifests in the following interconnected aspects:
  • Introducing Attention Modules (CBAM [15]/ECA): Within the context where complex optical interference on water surfaces, such as high-frequency ripples and specular highlights, intertwines with the features of objects to be detected, we introduce attention mechanisms as adaptive feature filters. Their core purpose extends beyond merely enhancing feature discriminability; it is to proactively re-establish the feature priority between “target” and “interference” within the characteristically cluttered signal environment of maritime scenes. The ECA module adheres to a minimalist design philosophy. It efficiently captures dependencies between channels through local cross-channel interactions implemented via one-dimensional convolution, all without performing dimensionality reduction. This approach strengthens key feature channels while suppressing irrelevant responses caused by water waves or scattered light, achieving this with minimal computational overhead. In contrast, the CBAM module provides a more global perspective on feature re-calibration through its sequential channel-spatial dual-path attention structure. It compels the model to simultaneously consider which feature channels are more important and where to focus within the spatial dimensions of the feature map. This dual attention mechanism proves particularly crucial for locating small or partially occluded targets against dynamic, non-uniform aquatic backgrounds. However, our choice between them is not a simple judgment of superiority but a trade-off based on the interference spectrum characteristics inherent to maritime scenarios: ECA’s lightweight nature makes it more suitable for deployment on embedded platforms and carries a lower risk of overfitting to high-frequency water ripple noise, whereas CBAM’s comprehensiveness may hold greater potential for scenarios requiring spatial suppression, such as those involving large-area strong reflections. The WA-YOLO framework incorporates the evaluation of both into a unified ablation study to determine their optimal point of integration for specific maritime tasks and model architectures (a minimal sketch of the ECA block is provided after this list).
  • Attention Module Placement Sensitivity Experiment: We conduct embedding sensitivity investigations across different network hierarchies to determine optimal attention placement. At the shallow P3 layer responsible for high-resolution small-target perception, attention insertion aims to amplify faint target signals at their source. Integration at the intermediate P4 layer, handling medium-scale targets, serves to optimize contextual feature representation. Placement in the feature fusion neck guides effective integration of multi-scale information. This systematic exploration essentially seeks the optimal coupling points between attention mechanisms and the functional specialization of different network layers.
  • Loss Function Weight Optimization Experiment: We focus particularly on the bounding box regression loss weight due to its direct impact on the model’s learning tendency toward localization precision. Excessively high weights may introduce substantial gradients during early training stages, competing with classification objectives and affecting convergence stability. Conversely, insufficient weights might lead to inadequate attention to localization accuracy. Through systematic weight scanning combined with dynamic learning rates and gradient clipping, we aim to identify an optimal equilibrium point that ensures harmonious progress and stable convergence between classification and regression tasks in the enhanced model architecture.
  • Replacement of the loss function (SIoU/Focal-EIoU): In maritime environments where dynamic wave disturbances intersect with complex optical interference, bounding box regression [38] encounters distinct stability challenges. WA-YOLO’s enhancements to the loss functions extend beyond generic designs, incorporating deep customization tailored to the kinematics and optical characteristics of surface targets. The introduction of SIoU loss expands the bounding box matching problem from traditional considerations of scale and position to include vector geometry optimization with directional alignment. Its innovation lies in explicitly modeling the apparent directional shift in targets induced by waves through the computation of a vector angle cost, thereby guiding the model to learn more stable pose estimation on undulating water surfaces. This directly responds to the periodic deformation and displacement characteristics of targets under wave action in maritime settings. Concurrently, the adoption of Focal-EIoU loss specifically optimizes the extremely imbalanced distribution of sample difficulty prevalent in maritime data. Unlike generic scenarios, maritime imagery exhibits a coexistence of vast homogeneous water backgrounds (easy negative samples) and scarce yet critical challenging targets, such as semi-submerged buoys or low-contrast obstacles. The focal mechanism dynamically adjusts loss weights, compelling the model to concentrate its limited learning capacity on these high-value, high-difficulty edge cases. This effectively mitigates the issue where the model becomes dominated by abundant simple backgrounds while overlooking crucial risk objects. WA-YOLO does not apply these loss functions in isolation; rather, through systematic weight scanning experiments (Section 3.3), it determines their optimal synergistic configuration with attention modules and data augmentation strategies on a unified maritime dataset. This approach forms a joint optimization loop specifically designed to address the geometric and optical challenges inherent to aquatic environments.
  • High-resolution/slice-specific (for small targets): To address the challenge where the features of small targets are easily overwhelmed by complex optical interference in maritime scenes, WA-YOLO undertakes a profound transformation of generic high-resolution [39] and image tiling strategies, guided by the statistical properties of aquatic targets. While high-resolution input effectively preserves original pixel information, its computational cost is often prohibitive for embedded deployment in maritime applications. Therefore, we developed an adaptive tiled inference framework specifically optimized for the size distribution of small surface objects. Through systematic analysis of the physical dimensions of small targets (pixel area < 32 × 32 pixels) in our maritime dataset, we determined the median shortest side to be approximately 24 pixels. Informed by this statistic, we engineered a tiling scheme with a 256-pixel overlap region, guaranteeing that even the smallest target falls entirely within at least one analysis window. More critically, we introduced a dynamic IoU threshold merging mechanism [1], where the threshold adaptively adjusts between 0.4 and 0.6 based on the target’s relative area within the image. During the fusion of tile-wise detections, this mechanism applies more lenient merging criteria to small objects to enhance recall, while enforcing stricter standards for large objects to maintain localization precision. This strategy innovatively marries the general advantages of tiling technology with the multi-scale distribution characteristics of maritime targets, significantly boosting the detection rate of high-risk objects, such as minute buoys and semi-submerged obstacles, under wave interference, all while preserving real-time performance. WA-YOLO does not treat tiling as an isolated post-processing step; instead, it is jointly optimized with attention mechanisms and loss functions. This integration enables the model to focus more effectively on salient features within local windows, culminating in a holistic solution for multi-scale maritime detection.
  • Targeted data enhancements (low-light/strong-reflection simulations): To address the degradation of target features caused by low illumination and intense specular reflections in maritime environments, WA-YOLO develops data simulation strategies driven by physical processes that go beyond generic image augmentation methods. Unlike simple color jittering, our low-light augmentation [40] simulates the systematic degradation of signal-to-noise ratio and color temperature shifts that accompany illumination attenuation in real aquatic settings. By controlling noise injection and introducing nonlinear distortion to color channels, we force the model to rely on more invariant intrinsic features—such as shape contours and textural structures—under low signal-to-noise ratio conditions. On the other hand, the strong-reflection simulation [41] specifically targets the optical characteristics of water surface specular reflections. Based on the Bidirectional Reflectance Distribution Function (BRDF) principle, it synthesizes dynamic highlight regions, accurately simulating the “erosion effect” of highlights on target contours and the “halo adhesion phenomenon” between adjacent objects. This approach enables the model to distinguish between genuine physical edges and optical artifacts, significantly enhancing detection robustness under intense glare interference. We deeply integrate these augmentation strategies into the training optimization loop. This integration allows the attention mechanism to learn to reallocate feature weights in high-noise or strong-reflection regions, while the loss function guides the model to learn more stable geometric constraints from augmented samples. Consequently, the model can directly handle extreme optical variations during inference without any preprocessing, achieving true internalization of water-aware perceptual capabilities.
  • Generalization Validation: We adopt leave-one-domain-out cross-validation as the cornerstone of our evaluation paradigm. This extreme setup forces the model to confront completely unseen environmental categories (e.g., inland rivers, harbors, nearshore waters, or strong-reflection scenarios) during testing after being trained without them. This rigorously simulates zero-shot inference situations where models encounter entirely new aquatic environments during real-world deployment. The fundamental objective of this validation scheme is to ensure that our proposed improvements ultimately guide the model to learn essential, cross-domain invariant feature representations of maritime obstacles, rather than overfitting to specific training environments.
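For concreteness, a minimal PyTorch sketch of the ECA block referenced in the first bullet above is given below. It follows the commonly published ECA-Net design (global average pooling followed by a one-dimensional convolution over the channel descriptor, with no dimensionality reduction) and is illustrative rather than a verbatim copy of the module in our released code:

import torch
import torch.nn as nn

class ECA(nn.Module):
    # Efficient Channel Attention: local cross-channel interaction via 1-D
    # convolution on the pooled channel descriptor, no dimensionality reduction.
    def __init__(self, channels: int, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        y = self.avg_pool(x)                              # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2))    # (B, 1, C)
        y = y.transpose(-1, -2).unsqueeze(-1)             # (B, C, 1, 1)
        return x * self.sigmoid(y)                        # re-weight channels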

2.5. Training Settings

To systematically evaluate the performance of each improvement module under a unified baseline, we adopted a phased training strategy that balances detection accuracy and embedded deployment efficiency [40]. Model training and validation were completed on a mobile workstation equipped with an NVIDIA GeForce RTX 4070 Laptop GPU (8 GB VRAM). All key comparative experiments, including ablation studies of various improvement modules, were conducted on this platform [41] with a training epoch limit of 100. The final speed evaluation of the models was then performed on both a high-performance workstation (NVIDIA RTX 3080 GPU) and an embedded platform (NVIDIA Jetson Xavier NX).
To ensure training stability, we first employed a cosine annealing learning rate scheduler to dynamically adjust the learning rate, helping to avoid local minima. Second, gradient clipping (threshold set to 1.0) was introduced to prevent exploding gradients on complex maritime samples. Automatic Mixed Precision training was enabled to accelerate and stabilize numerical computations, while an early stopping strategy (patience = 10) monitored validation loss to halt training when performance plateaued, thereby completing training efficiently. The core loss function improvements primarily include the SIoU Loss and the Focal-EIoU Loss. SIoU incorporates a vector angle cost term, guiding bounding box regression to consider not only overlap and size but also directional alignment, thereby improving localization accuracy and convergence speed under wave disturbances. Focal-EIoU addresses the extreme imbalance between hard examples and abundant simple backgrounds in maritime data. By dynamically adjusting loss weights, it forces the model to focus on learning from hard cases, preventing dominance by simple negative samples.
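To make the SIoU formulation concrete, the sketch below re-implements its three components (angle, distance and shape costs) for corner-format boxes. It is an illustrative re-derivation of the published loss (with the shape exponent fixed at 4), not necessarily line-for-line identical to the training code:

import torch

def siou_loss(pred, target, eps=1e-7):
    # pred/target: (N, 4) boxes in (x1, y1, x2, y2) format.
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Angle cost: sin(2*alpha) of the centre-to-centre vector
    # (equivalent to 1 - 2*sin^2(arcsin(x) - pi/4) in the original formulation).
    s_cw = (target[:, 0] + target[:, 2] - pred[:, 0] - pred[:, 2]) / 2
    s_ch = (target[:, 1] + target[:, 3] - pred[:, 1] - pred[:, 3]) / 2
    sigma = torch.sqrt(s_cw ** 2 + s_ch ** 2) + eps
    angle = torch.sin(2 * torch.arcsin((torch.abs(s_ch) / sigma).clamp(max=1 - eps)))

    # Distance cost over the enclosing box, modulated by the angle cost.
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    gamma = 2 - angle
    dist = (1 - torch.exp(-gamma * (s_cw / (cw + eps)) ** 2)) \
         + (1 - torch.exp(-gamma * (s_ch / (ch + eps)) ** 2))

    # Shape cost: penalizes width/height mismatch between prediction and target.
    omega_w = torch.abs(w1 - w2) / torch.max(w1, w2).clamp(min=eps)
    omega_h = torch.abs(h1 - h2) / torch.max(h1, h2).clamp(min=eps)
    shape = (1 - torch.exp(-omega_w)) ** 4 + (1 - torch.exp(-omega_h)) ** 4

    return 1 - iou + (dist + shape) / 2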
We selected YOLOv5 and YOLOv8 as baseline models, setting the input resolution to 960 pixels to strike a balance between feature extraction capability and computational load. Due to VRAM constraints, the batch size was set to 4, with gradient accumulation (steps = 2) applied to emulate an effective batch size of 8. The SGD optimizer was used with an initial learning rate of 0.01, dynamically adjusted using a cosine annealing scheduler. Automatic Mixed Precision (AMP) [41] was enabled to enhance training speed and stability.
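The optimizer configuration above can be summarized in a short PyTorch sketch showing how gradient accumulation, AMP, gradient clipping and cosine annealing interact (a toy stand-in model and synthetic loader are used here for self-containment; the real detector is YOLOv5/v8 and this is not the released training loop):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 1))    # toy stand-in detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
ACCUM_STEPS = 2             # 4 images/step x 2 steps = effective batch of 8
loader = [(torch.randn(4, 3, 96, 96), torch.randn(4, 1)) for _ in range(4)]

for epoch in range(2):      # 100 epochs in the actual schedule
    for step, (images, targets) in enumerate(loader):
        with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
            loss = nn.functional.mse_loss(model(images), targets) / ACCUM_STEPS
        scaler.scale(loss).backward()
        if (step + 1) % ACCUM_STEPS == 0:
            scaler.unscale_(optimizer)              # clip in true gradient scale
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
    scheduler.step()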
For data augmentation, we set the mosaic intensity to 0.6 and the mix-up probability to 0.05, and disabled mosaic augmentation in the final 10 epochs to improve regression stability. To enhance model generalization under extreme optical conditions, localized low-light and strong-reflection simulations were incorporated, as sketched below.
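The two physically inspired augmentations can be sketched as follows. The functions below are simplified stand-ins for the released implementations: gamma darkening, a per-channel gain and additive Gaussian noise approximate low light, while a soft elliptical highlight approximates specular glare; all parameter values are illustrative assumptions:

import numpy as np

def simulate_low_light(img, gamma=2.2, gain=0.5, noise_sigma=8.0, rng=None):
    # img: uint8 HxWx3. Darken, shift color temperature, then add sensor noise
    # to mimic the SNR degradation that accompanies illumination attenuation.
    rng = rng or np.random.default_rng()
    x = gain * np.power(img.astype(np.float32) / 255.0, gamma)
    x *= rng.uniform(0.85, 1.15, size=3)                     # color-temperature shift
    x = x * 255.0 + rng.normal(0.0, noise_sigma, img.shape)  # Gaussian sensor noise
    return np.clip(x, 0, 255).astype(np.uint8)

def simulate_glare(img, radius=0.25, strength=0.8, rng=None):
    # Overlay a soft bright blob with Gaussian falloff, standing in for the
    # BRDF-based highlight synthesis; it "erodes" contours under the highlight.
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    cy, cx = rng.uniform(0, h), rng.uniform(0, w)
    yy, xx = np.mgrid[0:h, 0:w]
    d2 = ((yy - cy) / (radius * h)) ** 2 + ((xx - cx) / (radius * w)) ** 2
    mask = np.exp(-d2)[..., None]
    return np.clip(img.astype(np.float32) + strength * 255.0 * mask, 0, 255).astype(np.uint8)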
To validate the benefits of high-resolution and image tiling strategies for small-object detection, selected key experiments were fine-tuned at 1536 resolution [39,42] or employed a tiled inference approach. The tiling parameters were set as follows: tile size of 1024 × 1024 pixels with a stride of 768 pixels, ensuring a 25% overlap between adjacent tiles. Detections were performed independently on each tile, followed by a Weighted Non-Maximum Suppression (NMS) process to merge results from all tiles. Specifically, overlapping detection boxes with an IoU threshold above 0.5 were merged by computing a confidence-weighted average of their coordinates and class probabilities. This approach significantly improved small-object recall while maintaining localization accuracy.
Through statistical analysis of the small objects (pixel area < 32 × 32) in our dataset, we determined that the median length of the shortest side of small targets is approximately 24 pixels. Consequently, the 256-pixel overlap substantially exceeds 0.5 times this minimum object dimension (i.e., 12 pixels), which fundamentally ensures that even the smallest targets are entirely contained within at least one tile, thereby minimizing missed detections caused by truncation at tile boundaries.
We introduce a dynamic IoU threshold [1], allowing the threshold to adaptively adjust within the range of 0.4 to 0.6 based on target size—more lenient for small objects to improve recall and more stringent for large objects to maintain localization precision. Simultaneously, the final weight of each detection box is determined jointly by its confidence score and a Gaussian spatial weight. This spatial weight is computed based on the distance between the box center and the tile center, thereby assigning higher influence to predictions located near the tile center where feature representation is typically more complete.
τ = 0.6 − 0.2 × (object area / total image area)
τ represents the dynamic IoU threshold, adjusted based on the relative size of the detected object. The dynamic IoU threshold (τ) elevates the NMS merging strategy from a static decision to a target size-aware process. It inherently relaxes merging criteria for smaller objects, directly boosting small-object recall.
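These rules can be expressed compactly. In the sketch below (function names and the sigma fraction are ours; the [0.4, 0.6] clipping, confidence-weighted coordinate averaging and Gaussian centre weighting follow the description above), the per-box quantities used during tile merging are computed as:

import numpy as np

def dynamic_iou_threshold(box, img_area):
    # tau = 0.6 - 0.2 * (object area / total image area), clipped to [0.4, 0.6].
    rel = (box[2] - box[0]) * (box[3] - box[1]) / img_area
    return float(np.clip(0.6 - 0.2 * rel, 0.4, 0.6))

def gaussian_tile_weight(box, tile_center, tile_size, sigma_frac=0.5):
    # Detections near the tile centre (where features are most complete)
    # receive higher influence in the weighted merge.
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    d2 = (cx - tile_center[0]) ** 2 + (cy - tile_center[1]) ** 2
    return float(np.exp(-d2 / (2 * (sigma_frac * tile_size) ** 2)))

def merge_pair(box_a, conf_a, box_b, conf_b):
    # Confidence-weighted coordinate average for two detections of the same
    # object whose IoU exceeds the dynamic threshold.
    wa = conf_a / (conf_a + conf_b)
    merged = [wa * a + (1 - wa) * b for a, b in zip(box_a, box_b)]
    return merged, max(conf_a, conf_b)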
To comprehensively evaluate model performance in practical deployment environments, we augmented the measurement methods with deployment-specific metrics. All tests were conducted on the NVIDIA Jetson Xavier NX embedded platform, utilizing a TensorRT 8.5 acceleration engine with FP16 precision and a batch size of 1 to simulate real inference scenarios.
Inference speed was evaluated on both a high-performance workstation (RTX 3080) and an embedded platform (NVIDIA Jetson Xavier NX) to comprehensively assess real-time performance under practical deployment conditions. The core software environment deployed on the NVIDIA Jetson Xavier NX embedded platform for this study is as follows: the system foundation is JetPack SDK 5.1.2, which includes Linux for Tegra (L4T) version 35.3.1 (based on Ubuntu 20.04.6 LTS); the key deep learning computing stack versions are CUDA 11.4.19, TensorRT 8.5.3 and cuDNN 8.6.0. Model development and conversion were based on the PyTorch 2.2.0 framework, running in a Python 3.8.10 environment. All real-time inference experiments were conducted using the TensorRT 8.5.3 engine with FP16 precision acceleration enabled. It should be noted that all model training was conducted on a mobile workstation equipped with an NVIDIA GeForce RTX 4070 Laptop GPU (8 GB VRAM). To address the memory and thermal constraints inherent to mobile platforms, we implemented an integrated resource management strategy comprising automatic mixed precision training, gradient accumulation and optimized data loader configurations. All critical experiments (including baseline models and major improvement strategies) were executed through three independent runs (random seeds: 42, 43, 44) to evaluate performance variability. Results are reported as mean ± standard deviation to prevent conclusions from being skewed by random performance fluctuations. An example training command for YOLOv8 is as follows:
yolo detect train data=maritime.yaml model=yolov8s.pt imgsz=960 batch=4 \
  accumulate=2 amp=True mosaic=0.6 mixup=0.05 close_mosaic=10 workers=0 \
  pin_memory=False persistent_workers=False seed=42

2.6. Evaluation Metrics

This study adopts a multi-dimensional evaluation metrics system covering three categories of quantitative metrics, namely detection accuracy, scale sensitivity and inference performance. All metrics are computed on the held-out validation set, and measurements are repeated on pre-defined difficult subsets to quantify robustness differences where necessary.
Detection accuracy metrics primarily include the mean Average Precision (mAP). In addition to the conventional mAP@0.5, we further introduce mAP@[0.5:0.95], which averages AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05, providing a more comprehensive assessment of the model’s performance across varying localization strictness.
To accurately quantify the model’s capability in detecting objects across different scales, we strictly adhere to the COCO dataset standard by employing fixed pixel area thresholds for categorizing targets into three scales: APS (area < 32 × 32 pixels), APM (32 × 32 ≤ area ≤ 96 × 96 pixels) and APL (area > 96 × 96 pixels). These threshold definitions remain invariant to input image resolution, ensuring consistent evaluation metrics across diverse experimental configurations. Among these metrics, APS [43] is particularly emphasized as a critical indicator of small-object detection performance in this study.
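Expressed as a trivial helper, the fixed COCO-style bucketing is:

def coco_scale(box_area_px: float) -> str:
    # Pixel-area thresholds are fixed regardless of input resolution.
    if box_area_px < 32 * 32:
        return "small"     # counted in APS
    if box_area_px <= 96 * 96:
        return "medium"    # counted in APM
    return "large"         # counted in APL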
Precision and recall metrics are employed to reflect the confidence calibration and miss rate of the final model outputs. For threshold configuration, the evaluation adopts a default combination of confidence threshold 0.25 and NMS IoU threshold 0.6, which demonstrated an optimal balance between recall rate and false positive control in preliminary experiments. We conducted systematic sensitivity analysis on the baseline YOLOv8 model using the overall dataset, further validating the robustness of threshold selection across confidence thresholds [0.15, 0.35] and NMS IoU thresholds [0.5, 0.7]. As illustrated in Figure 3, model performance remains relatively stable within ±0.1 variation around the selected default parameters, with mAP@0.5 fluctuations below 2%, thereby confirming the appropriateness of our parameter choices.
Real-time performance and computational efficiency are quantified by inference speed (FPS), measured on both a high-performance workstation GPU and a representative embedded platform. To mitigate measurement volatility, inference speed is reported as the average of multiple inference sessions, thereby reflecting both overall throughput and extreme response performance.
We recorded P50 (median) and P95 (95th percentile) values for single-frame inference latency. P50 reflects typical response time, while P95 characterizes worst-case latency, which is critical for real-time obstacle avoidance systems. Measurements were calculated based on 1000 consecutive inference cycles.
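A measurement harness in the spirit of this protocol might look as follows (infer_fn stands for the deployed engine’s synchronous single-image call; the warm-up count is our assumption, not part of the stated protocol):

import time
import numpy as np

def measure_latency(infer_fn, frame, warmup=50, runs=1000):
    # Report P50/P95 single-frame latency in milliseconds over consecutive
    # inference cycles, after a warm-up phase to stabilize clocks and caches.
    for _ in range(warmup):
        infer_fn(frame)
    lat = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer_fn(frame)
        lat.append((time.perf_counter() - t0) * 1e3)
    return np.percentile(lat, 50), np.percentile(lat, 95)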
Average power consumption (Watts) was monitored via built-in platform sensors. Stress tests of 10 min duration were conducted at 25 °C ambient temperature, with thermal throttling recorded if latency increased by more than 20% for five consecutive frames.
Robustness evaluation is accomplished by repeating the above accuracy measurements on several difficult subsets. The selected subsets include low-light samples, strong specular reflection samples and partially occluded or semi-submerged target samples. Subset-specific mAP, APS, precision, recall and FPS are computed separately [43] and the performance degradation relative to the overall validation set serves as a quantitative measure of robustness decline, providing direct evidence for guiding future improvement strategies (e.g., attention modules, loss function refinements and targeted data augmentation).

2.7. Reproducibility and Code Availability

This study fully open-sources the WA-YOLO codebase, pre-trained models and experimental configurations. The project is hosted on GitHub repository: https://github.com/BaoyanC9/589AA.git, accessed on 8 November 2025 (version tag v1.0). The codebase includes complete environment dependency files (environment.yml), training scripts (train.sh) and evaluation scripts (eval.sh), supporting single-command reproduction of all major table results. All experiments were validated through three independent runs using random seeds {42, 43, 44}, with detailed logs and model checkpoints from each run archived in the releases/v1.0 directory, accompanied by corresponding MD5 checksums and file size information. To streamline deployment, a Dockerfile based on CUDA 12.4 and PyTorch 2.4 is provided, ensuring cross-platform experimental consistency.
The pre-trained model weights released by the project cover all key configurations in this study, with specific information detailed in Table 3. Each weight file has undergone rigorous validation to ensure consistency with the performance reported in the paper, facilitating direct application to practical maritime scenarios by the research community or serving as benchmark comparisons for subsequent studies.

3. Results

3.1. Baseline Performance

On the overall dataset (Table 4), YOLOv8 achieves a 7.6% relative improvement in mAP@0.5 over YOLOv5, demonstrating the effectiveness of its architectural refinements. However, this accuracy gain comes at the cost of a 3.2% reduction in embedded inference speed. YOLOv11, as the latest evolution of the series, exhibits a further slight increase in accuracy (mAP@0.5: 0.7208) while marginally improving workstation inference speed. Notably, the transformer-based RT-DETR attains an mAP@0.5 of 0.7286, significantly outperforming all YOLO series models, but its embedded inference speed drops to 19.48 FPS, a 16.4% decrease compared to YOLOv8. For surface float detection (Table 5), RT-DETR, leveraging its stronger feature extraction capabilities, achieves the best performance in both precision and recall, yet its embedded inference speed (21.36 FPS) remains noticeably lower than that of the YOLO series, reaffirming the accuracy–efficiency trade-off. On the most challenging difficult subset (Table 6), the mAP@0.5 advantage of YOLOv8 over YOLOv5 expands to 11.8%, but its embedded inference speed plummets to 14.67 FPS, a decrease of 29.7%. RT-DETR’s superiority is further amplified in this scenario, reaching an mAP@0.5 of 0.8320 and a precision of 0.8983, yet its embedded inference speed deteriorates further to 11.83 FPS, an additional 19.4% reduction compared to YOLOv8. As illustrated in Figure 4, these results clearly reveal that YOLOv8 establishes advantages in detection accuracy, while YOLOv5 maintains competitiveness in embedded deployment efficiency.
Regarding the backbone network (Figure 5), YOLOv8 employs a more advanced variant of CSPDarknet53, with a redesigned cross-stage partial connections structure that enhances gradient flow while reducing parameters. In contrast, while YOLOv5’s backbone also utilizes a CSP structure, its balance between feature reuse and computational efficiency differs. In the neck section, YOLOv8 introduces an improved Path Aggregation Network that strengthens the symmetry of bottom-up and top-down feature fusion, whereas YOLOv5’s PANet structure is comparatively more conventional. For the detection head and loss function, YOLOv8 adopts a decoupled head that separates classification and regression tasks, whereas YOLOv5 utilizes a coupled head and more traditional IoU losses.

3.2. Effect of Attention Modules (CBAM and ECA)

On the overall dataset (Table 7), the ECA module demonstrates better compatibility and stability. Adding the ECA module to the YOLOv8 baseline increased mAP@0.5 from 0.7119 to 0.7214 and APS from 0.4523 to 0.4783. On the surface float subset (Table 8), both attention modules brought approximately a 2% gain in mAP@0.5, with CBAM achieving the highest APS (0.7278) on YOLOv8. However, the differences are more pronounced on the difficult subset (Table 9). The ECA module provided YOLOv5 with an approximately 5.5% boost in mAP@0.5 (from 0.7319 to 0.7869) and increased APS from 0.4787 to 0.5139. For YOLOv8, the CBAM module raised mAP@0.5 on this subset from 0.8183 to 0.8334 and APS from 0.5285 to 0.5408, compared with an APS of 0.5346 for the ECA module. Regarding speed, incorporating attention modules generally incurred a modest cost of approximately 1–4 FPS, with the penalty associated with ECA typically smaller than that of CBAM; on the difficult subset, for instance, YOLOv8 + ECA ran at 19.23 embedded FPS versus 17.88 FPS for YOLOv8 + CBAM. These figures indicate that the ECA module offers a more balanced and robust accuracy–speed benefit within the YOLOv8 architecture, whereas the effectiveness of CBAM is more context-dependent and model-specific.
Meanwhile, Figure 6 shows that the APS bars for models augmented with the ECA module consistently and significantly surpass the baseline models’ bars on the difficult subset. This visually substantiates the effectiveness of ECA in enhancing small-object detection capability.
Figure 7 uses intuitive color contrast to corroborate the performance divergence of the CBAM module across different model architectures. In this heatmap, the tile representing “YOLOv5 + CBAM” exhibits a relatively darker hue, aligning with the quantitative finding “improving the overall mAP@0.5 by approximately 0.05”. In stark contrast, the corresponding tile for “YOLOv8 + CBAM” is noticeably lighter, providing a clear visualization of the reported “reduction of about 0.02 in overall mAP@0.5” within the YOLOv8 architecture. This distinct color gradient effectively communicates the existence of specific compatibility issues between the module and the architecture.
This performance discrepancy may be attributed to CBAM’s dual attention mechanism (combining both channel and spatial attention), whose complex structure tends to over-activate in response to high-frequency water ripples and complex lighting variations prevalent in maritime environments, potentially leading to misinterpretation of genuine targets. The ECA module, with its streamlined channel-only attention design, demonstrates superior robustness against such interference. To address these issues, corrective experiments can be designed by adjusting the bounding box loss weight from its default value of 7.0 down to 4.0, concurrently reducing the learning rate from 0.01 to 0.001 and incorporating gradient clipping strategies during retraining. These modifications are anticipated to enhance training convergence stability and improve detection performance, albeit with potential computational overhead.

3.3. Ablation Study on Attention Placement and Loss Weighting

Table 10 demonstrates that the optimal insertion point of an attention module is intrinsically linked to its design principle. The ECA module achieves its best performance when integrated at the P3 layer responsible for high-resolution details (boosting APS by 7.3% to 0.4853). In contrast, CBAM’s relative advantage at the P4 layer (mAP@0.5 of 0.7236) and its significant degradation at the Neck-Concat stage (mAP@0.5 dropping to 0.6948) highlight that its more complex dual-attention mechanism is more sensitive to the coherence of the feature fusion process. An improper insertion can disrupt the balanced aggregation of multi-scale features. This experiment illustrates that performance gains depend not merely on “whether to add” an attention module but crucially on “where to add” it, which fundamentally involves aligning a general-purpose module with the functional specificity of a particular network hierarchy.
Table 11 shows that for the relatively complex YOLOv8 + CBAM combination, moderately reducing the default bounding box loss weight from 7.0 to 5.0, coupled with a lower learning rate (0.001) and gradient clipping, yields the optimal balance of accuracy on the difficult subset (mAP@0.5 0.8412, APS 0.5489). Through systematic weight scanning, we can establish a more robust equilibrium between enhanced feature extraction capability and stable convergence.

3.4. Impact of Loss-Function Modifications (SIoU and Focal-EIoU)

Systematic analysis of Table 12, Table 13 and Table 14 reveals that the SIoU loss demonstrates remarkable architectural compatibility and environmental adaptability. In the YOLOv5 + ECA configuration, it elevates the mAP@0.5 on the difficult subset substantially from 0.7869 to 0.8255. Targets in aquatic settings (e.g., vessels, buoys) often exhibit periodic pose shifts due to wave action. SIoU’s design serendipitously aligns with this physical reality, thereby guiding the model’s convergence along a physically more coherent optimization trajectory, evidenced also by the stable mAP@0.5 increase from 0.7214 to 0.7286 in the YOLOv8 + ECA architecture. Conversely, Focal-EIoU improves the overall-dataset mAP@0.5 from 0.6948 to 0.7208 in the YOLOv8 + CBAM combination. However, on the most challenging difficult subset, the mAP@0.5 for the same configuration drops from 0.8334 to 0.8052. The additional weight Focal-EIoU assigns to such samples might introduce unstable or even misleading gradients during training, particularly when coupled with attention mechanisms like CBAM that are inherently sensitive to spatial noise. In contrast, the superior performance of the SIoU + ECA combination on the difficult subset (YOLOv8 + ECA + SIoU achieving 0.8616 mAP@0.5) validates the synergy between directional constraints and lightweight channel attention.
As shown in Figure 8, the training dynamics reveal that incorporating the SIoU loss significantly accelerates convergence, particularly during the initial epochs. In both YOLOv5 and YOLOv8 architectures, configurations such as CBAM + SIoU and ECA + SIoU exhibit a rapid rise in validation mAP@0.5 within the first 20 epochs, outperforming counterparts using Focal-EIoU or baseline settings. This suggests that SIoU provides more effective gradient guidance for bounding box regression, enabling faster acquisition of spatial target information. Notably, even in later stages where all curves show increased fluctuation, SIoU-based models maintain relatively smoother performance trajectories—especially ECA + SIoU—which avoids sharp drops and demonstrates superior stability. Figure 9 further supports this observation through IoU distribution analysis. Across both YOLOv5 and YOLOv8 frameworks, models employing SIoU (e.g., CBAM + SIoU, ECA + SIoU) consistently achieve higher median IoU values with fewer outliers, indicating tighter alignment between predicted and ground-truth bounding boxes. In contrast, Focal-EIoU-based variants, while still improving over the baseline, tend to produce more low-IoU anomalies, suggesting reduced robustness under challenging conditions.
To address the performance fluctuations of Focal-EIoU in certain configurations, potential corrective measures include adjusting the hard example weighting factor from its default value of 1.0 to 0.8, modifying the localization-to-classification loss ratio from 1:1 to 1.5:1 and incorporating learning rate warmup strategies. These adjustments are expected to balance the gradient contributions from samples of different difficulty levels and may enhance model generalization in complex scenarios, though possibly requiring increased training iterations.

3.5. Small-Object Specialization: High-Resolution and Image Slicing Strategy

The high-resolution input (1536 pixels) represents the extreme of the “information preservation” paradigm. As shown in Table 15, with the YOLOv8 + ECA configuration, this strategy elevates APS substantially from a baseline of 0.7319 to 0.7897, an increase of approximately 7.9%. This gain stems from maximizing the retention of pixel-level detail at the source of feature extraction. However, the cost of this fidelity is a super-linear increase in computational complexity, cratering the inference speed on embedded platforms to 6.56 FPS (a drop of 73%). In contrast, the image tiling strategy embodies a “divide-and-conquer” engineering pragmatism. As presented in Table 16, on the YOLOv5 baseline, the tiling strategy achieves an APS of 0.7836—a result nearly on par with the top accuracy (0.7897) of the high-resolution strategy on YOLOv8 + ECA—yet maintains an embedded FPS of 18.02, which is 2.75 times faster. Its core mechanism lies in partitioning the original large image into overlapping local patches (e.g., 1024 × 1024), thereby enabling more effective capture by standard detection heads.
The APS-FPS trade-off analysis shown in Figure 10 clearly illustrates the differences between the two strategies: the high-resolution approach enhances detection accuracy by preserving more fine-grained features, while the tiling strategy balances accuracy and speed through block-wise processing. Notably, the high-resolution strategy reaches an mAP@0.5 of 0.9251 in the YOLOv8 + CBAM + SIoU configuration, but its inference speed on Jetson platforms is only 6.52 FPS.
This performance degradation primarily stems from the quadratic increase in computational load due to high-resolution inputs, which exceeds the parallel processing capacity of embedded devices. Additionally, the angular cost calculation introduced in loss functions such as SIoU further exacerbates the computational burden. To address these issues, the parameter configuration of the image tiling strategy can be optimized: adjusting the tile size to 768 × 768, reducing the stride to 512 and increasing the overlap region to 33%. For the merging strategy, an adaptive NMS method with confidence weighting can be employed, dynamically adjusting the IoU threshold to the 0.4–0.6 range and performing Gaussian-weighted fusion on detection boxes in overlapping regions. These adjustments are expected to improve the recall rate of small objects while maintaining reasonable inference speed.

3.6. Targeted Data Enhancements (Low-Light/Strong-Reflection Simulations)

On the low-light difficult subset (Table 17), augmented training improved performance for most models. The YOLOv8 + ECA + SIoU configuration achieved the highest mAP@0.5 of 0.8184, approximately 0.07 higher than its baseline without targeted augmentation (0.7499) under the same architecture, and its APS reached 0.5074. In comparison, YOLOv5 + ECA attained an mAP@0.5 of 0.7930 on the same subset. This suggests that augmentation training coupled with a specific loss function (SIoU) is more effective for accuracy gains under low-light conditions. However, the top-performing YOLOv8 + ECA + SIoU achieved an embedded FPS of 13.86, while FPS for some configurations dropped to the 11–13 range. Similarly, the leading YOLOv8 + CBAM on the strong-reflection subset had an embedded FPS of 19.67. These figures indicate that introducing complex augmented data increases the computational load of the model, necessitating a trade-off between accuracy and real-time performance during deployment. In the difficult subset simulating strong reflections (Table 18), model performance exhibited different characteristics. YOLOv8 + CBAM achieved the best mAP@0.5 of 0.8443 and APS of 0.5456, significantly outperforming its architectural baseline (0.8175). Notably, in this scenario, the performance of YOLOv5 + ECA + SIoU (0.7315 mAP@0.5) was lower than its YOLOv5 + ECA baseline without this specific augmentation (0.7793). This indicates that strong reflection augmentation may have compatibility issues with certain model and loss combinations, not always yielding benefits.
As illustrated in Figure 11, the performance under low-light and strong-reflection conditions reveals distinct patterns across different models. In terms of mAP@0.5, most architectures show improved performance under low-light scenarios (orange bars) compared to the baseline (blue bars), with notable gains observed in YOLOv5-CBAM. For small-object detection (APS), low-light enhancement also contributes positively, particularly in YOLOv8-CBAM, where APS values surpass those of the baseline, indicating better recovery of faint signals in dim environments. In contrast, strong reflection leads to a marked drop in APS for most models—especially YOLOv5-ECA—implying that overexposure or artifacts in bright regions may obscure small objects.
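For reference, the two perturbation families discussed here can be approximated with a few lines of NumPy. The sketch below (gamma-curve darkening with additive noise, plus elongated near-saturated highlights imitating sun glitter) is a minimal illustration under assumed parameter ranges, not the exact augmentation used in training.

```python
import numpy as np

def simulate_low_light(img, gamma_range=(2.0, 3.5), noise_std=8.0, rng=None):
    """Darken an RGB uint8 image with a gamma curve and add sensor-like noise."""
    rng = rng or np.random.default_rng()
    gamma = rng.uniform(*gamma_range)                      # gamma > 1 darkens
    out = 255.0 * (img.astype(np.float32) / 255.0) ** gamma
    out += rng.normal(0.0, noise_std, img.shape)           # proxy for shot/read noise
    return np.clip(out, 0, 255).astype(np.uint8)

def simulate_glare(img, max_streaks=3, rng=None):
    """Overlay elongated Gaussian highlights that mimic specular sun glitter."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    out = img.astype(np.float32)
    for _ in range(rng.integers(1, max_streaks + 1)):
        cy, cx = rng.uniform(0, h), rng.uniform(0, w)      # streak center
        sy, sx = rng.uniform(2, 8), rng.uniform(30, 120)   # flat, wide streak
        streak = np.exp(-(((yy - cy) / sy) ** 2 + ((xx - cx) / sx) ** 2))
        out += rng.uniform(120, 220) * streak[..., None]   # near-saturating highlight
    return np.clip(out, 0, 255).astype(np.uint8)
```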
To address these issues, targeted corrective experiments can be designed: reducing the mosaic augmentation probability from 0.6 to 0.4 to lower data augmentation complexity, while raising the mix-up probability from 0.05 to 0.1 to maintain adequate regularization. Additionally, extending training to 120 epochs and correspondingly adjusting the early-stopping patience to 15 would ensure sufficient model convergence; a configuration sketch is given below.
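Assuming an Ultralytics-style training interface, the corrective configuration above would look roughly as follows; the dataset YAML name is hypothetical and the argument names follow recent Ultralytics releases.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="wa_yolo_maritime.yaml",  # hypothetical dataset config
    imgsz=960,
    epochs=120,    # extended schedule for sufficient convergence
    patience=15,   # early-stopping patience adjusted accordingly
    mosaic=0.4,    # reduced from 0.6 to lower augmentation complexity
    mixup=0.1,     # raised from 0.05 to retain regularization
)
```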

3.7. Cross-Domain Generalization Performance Analysis

The leave-one-domain-out cross-validation results in Table 19 provide direct, quantitative evidence of the improved model’s stable adaptability to novel environments. Across the four independent test scenarios, the YOLOv8 + ECA + SIoU configuration consistently and significantly outperforms the original YOLOv8 baseline in terms of mAP@0.5. Specifically, when the model is trained excluding the “Inland Rivers” domain and tested on it, the improved model achieves an mAP@0.5 of 0.724, compared to the baseline’s 0.658, representing a relative improvement of approximately 10%. This indicates that even without learning from riverine scenes, the feature representations acquired through our framework transfer effectively. It is noteworthy that the improved model also maintains clear advantages over the baseline in the “Nearshore Waters” and “Harbor Areas” test domains, achieving mAP@0.5 scores of 0.718 and 0.706, respectively. Finally, when trained on data from all domains, the improved model reaches an overall mAP@0.5 of 0.745, compared to 0.712 for the baseline, defining the performance ceiling of our framework under sufficient data.
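For clarity, the leave-one-domain-out folds behind Table 19 can be enumerated in a few lines; the domain labels are placeholders (only three of the four domains are named above, so the fourth is hypothetical).

```python
# Enumerate the leave-one-domain-out folds used in Table 19: each fold
# trains on three domains and tests on the held-out one. Domain labels are
# illustrative; "open_water" in particular is a hypothetical placeholder.
domains = ["inland_rivers", "nearshore_waters", "harbor_areas", "open_water"]

folds = [([d for d in domains if d != held], held) for held in domains]
for train_domains, test_domain in folds:
    print(f"train on {train_domains} -> test on {test_domain}")
```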

3.8. Deployment Performance Analysis

Regarding accuracy, Table 20 shows that YOLOv8 + ECA achieved the best APS (0.4783) among all tested models. However, when combined with a more effective loss function, YOLOv8 + ECA + SIoU presents a highly competitive balance between APS (0.4669) and latency (P50: 41.5 ms). Latency data reveals that introducing complex modules incurs additional computational overhead. For instance, YOLOv8 + CBAM and its variants (e.g., YOLOv8 + CBAM + SIoU) exhibit the highest P95 latencies (72.1–73.2 ms) and experience mild thermal throttling. In contrast, variants incorporating the ECA attention module (e.g., YOLOv8 + ECA and YOLOv8 + ECA + SIoU) maintain superior latency characteristics, with P50 latencies of 41.5–41.7 ms and P95 latencies of 64.2–65.8 ms, none of which triggered thermal throttling. This is visually confirmed in the scatter plot of Figure 12: the data point for YOLOv8 + ECA + SIoU resides in the upper-left area of the graph, representing the “high-accuracy, low-latency” cluster, while variants containing CBAM are distributed more towards the right side, indicating slightly higher accuracy but significantly increased latency.
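The latency percentiles in Table 20 can be reproduced with a simple measurement harness of the kind sketched below: warm-up iterations first so the device reaches steady clocks, then timed single-image runs. Here `infer_fn` is a stand-in for the TensorRT-engine call, and the iteration counts are assumptions rather than our exact protocol.

```python
import time
import numpy as np

def latency_percentiles(infer_fn, dummy_input, warmup=50, iters=500):
    """Measure P50/P95 single-image latency of infer_fn in milliseconds."""
    for _ in range(warmup):          # let clocks and caches stabilize
        infer_fn(dummy_input)
    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer_fn(dummy_input)
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return np.percentile(times_ms, 50), np.percentile(times_ms, 95)
```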

3.9. Qualitative Analysis

This section provides a qualitative analysis to visually demonstrate the actual detection effects of various improvement strategies. As shown in Figure 13, in YOLOv5 detection results, it is observable that the +ECA + SI combination (integrating ECA attention with SIoU loss) yields more complete detections for multiple small buoys, particularly in areas disturbed by waves, compared to other variants and the baseline model. The bounding boxes generated by this combination fit the target contours more snugly. This visually demonstrates its enhanced capability in capturing and localizing small-sized targets against dynamic aquatic backgrounds.
Figure 14 reveals that the YOLOv8 + CBAM + SIoU configuration [15,36] maintained optimal localization accuracy [44] under strong-reflection conditions, effectively reducing false fusion of adjacent targets. Notably, in detecting partially occluded and semi-submerged objects, model variants incorporating attention modules demonstrated enhanced feature discrimination capability with more precise bounding box localization [45,46]. The +CBAM + SIoU configuration (integrating CBAM attention with SIoU loss) not only successfully detects the primary targets but also effectively prevents the false fusion of bounding boxes for two spatially proximate objects, maintaining clear detection independence.

4. Discussion

4.1. Interpretation of Results

The experimental findings reveal distinct trade-offs between detection accuracy and computational efficiency. YOLOv8 consistently excels in overall mAP and APS yet incurs higher latency on embedded platforms than YOLOv5. The incorporation of lightweight attention mechanisms, particularly ECA, enhances discriminative feature representation, yielding measurable gains in small-target recognition [13] without significant speed compromise. Modified loss functions such as SIoU improve bounding box stability and convergence, whereas Focal-EIoU occasionally introduces recall inconsistencies. High-resolution fine-tuning [39] markedly elevates APS for surface floats, though the accompanying FPS decline constrains real-time deployment. Image slicing emerges as a more viable alternative, sustaining accuracy with manageable throughput loss. Low-light and reflection-augmented training effectively bolsters robustness under adverse conditions [47], underscoring the value of scenario-specific data enrichment. These insights collectively affirm that modular refinements and deployment-aware strategies strike an optimal balance for practical USV perception systems.

4.2. Limitations

Despite encouraging results, this study has several limitations:
  • While this study establishes a benchmark dataset encompassing diverse maritime conditions, its scale and diversity remain limited compared to large-scale generic detection benchmarks [48], potentially constraining model generalization across broader geographical regions and seasonal variations. Furthermore, the current system relies exclusively on monocular visual input [49], presenting inherent perceptual limitations under low-visibility conditions such as nighttime, dense fog, or heavy occlusion. This dependence on a single sensing modality [50] may compromise detection reliability in practically complex sea states.
  • From a systems integration perspective, this work focuses specifically on perception-level obstacle detection without closed-loop integration with downstream navigation modules [51] including path planning, target tracking and motion control. Embedding robust visual perception capabilities within a complete autonomous navigation pipeline, enabling end-to-end co-optimization from perception to action, represents a crucial next step toward practical unmanned surface vehicle systems [52].

4.3. Engineering Implications

From an engineering standpoint, priority should be given to deploying lightweight detectors (e.g., YOLOv5/YOLOv8) on USVs with limited computing power, as they achieve good real-time behavior and usable APS in offshore and inland waterway scenarios; when computing resources are sufficient, more complex networks can be run in the cloud or on large ships to obtain higher accuracy. The experiments also show that the lightweight attention module [15] and robust regression loss [38] bring considerable gains in small-target detection, while the high-resolution/slicing strategies and directional augmentation for low light and strong reflections are effective means of engineering robustness. The overall recommendation is to adopt a hybrid deployment path of “local lightweight model + cloud/edge boost when needed” [53] and to perform subset evaluations on the target hardware to validate the latency-accuracy trade-off [54,55].

5. Conclusions

This study systematically investigates and validates a series of practical enhancement strategies for vision-based obstacle detection in unmanned surface vehicles (USVs), addressing key challenges such as small-object recognition, localization accuracy and robustness under extreme maritime conditions. Through extensive experimentation, we demonstrate that integrating lightweight attention modules (e.g., ECA) effectively improves feature discrimination for small targets, while advanced bounding box losses (e.g., SIoU) enhance localization stability and convergence. High-resolution fine-tuning and image slicing strategies are shown to significantly boost small-object detection accuracy, with slicing offering a more favorable trade-off for real-time deployment. Furthermore, targeted data augmentations for low-light and strong-reflection scenarios substantially improve model robustness. Collectively, these improvements provide a balanced and deployable solution for USV visual perception systems, paving the way for more reliable autonomous navigation in complex aquatic environments. Future work will focus on multi-sensor fusion and closed-loop navigation integration to further enhance system capability and safety.

Author Contributions

Conceptualization, H.S. and J.Z.; data curation, H.Z.; experiments, H.S. and H.Z.; funding acquisition, J.Z. and Z.L.; investigation, H.S. and G.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the State Key Laboratory of Maritime Technology and Safety (Grant No. W24CG000038) and the Fund of Hubei Key Laboratory of Inland Shipping Technology (No. NHHY2024005).

Data Availability Statement

The data presented in this study are available at https://orca-tech.cn/datasets/FloW/FloW-Img (accessed on 15 September 2025), https://github.com/WaterScenes/WaterScenes (accessed on 15 September 2025), https://pan.baidu.com/s/1-xT6fwH3alW78uCsm9VjRA (accessed on 15 September 2025).

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

List of abbreviations used in the paper. Note: Some abbreviations (e.g., CB, SI, FE, W, E, Y5, Y8, NC, BW, LR, GC) are primarily used within table headers and footnotes for conciseness.
Abbreviation | Full Name
APS | Average Precision for Small Objects
APM | Average Precision for Medium Objects
APL | Average Precision for Large Objects
mAP | Mean Average Precision
mAP@0.5 | mAP at IoU threshold 0.5
mAP@[0.5:0.95] | mAP averaged over IoU thresholds from 0.5 to 0.95
IoU | Intersection over Union
NMS | Non-Maximum Suppression
FPS | Frames Per Second
P50 | 50th Percentile Latency
P95 | 95th Percentile Latency
ECA | Efficient Channel Attention
CBAM | Convolutional Block Attention Module
SIoU | Scale-invariant Intersection over Union Loss
Focal-EIoU | Focal Efficient Intersection over Union Loss
CB | (in tables) Abbreviation for CBAM
SI | (in tables) Abbreviation for SIoU
FE | (in tables) Abbreviation for Focal-EIoU
W | (in tables) Workstation (RTX 3080)
E | (in tables) Embedded (Jetson Xavier NX)
Y5 | (in tables) YOLOv5
Y8 | (in tables) YOLOv8
NC | (in tables) Neck-Concat
BW | (in tables) Box Loss Weight
LR | (in tables) Learning Rate
GC | (in tables) Gradient Clipping
AMP | Automatic Mixed Precision
SGD | Stochastic Gradient Descent
USV | Unmanned Surface Vehicle
GPU | Graphics Processing Unit
VRAM | Video Random Access Memory
COCO | Common Objects in Context

References

  1. Wu, Y.; Wang, T.; Liu, S. A Review of Path Planning Methods for Marine Autonomous Surface Vehicles. J. Mar. Sci. Eng. 2024, 12, 833. [Google Scholar] [CrossRef]
  2. Liu, Z.; Zhang, Y.; Yu, X.; Yuan, C. Unmanned surface vehicles: An overview of developments and challenges. Annu. Rev. Control. 2016, 41, 71–93. [Google Scholar] [CrossRef]
  3. Yan, R.J.; Pang, S.; Sun, H.B.; Pang, Y.J. Development and missions of unmanned surface vehicle. J. Mar. Sci. Appl. 2010, 9, 451–457. [Google Scholar] [CrossRef]
  4. Ahmed, A. Maritime Unmanned Surface Vehicles (USVs): A Comprehensive Review on Development, Missions and Challenges. In Proceedings of the Academics World International Conference, Cairo, Egypt, 12–13 March 2022; p. 54. [Google Scholar]
  5. Son, U.; Huh, J.H. A Survey of Cyber Security for Maritime Autonomous Surface Ships: Opportunities, Challenges, and Future Directions. IEEE Trans. Intell. Transp. Syst. 2025, 26, 7343–7361. [Google Scholar] [CrossRef]
  6. Liu, X.; Li, Y.; Zhang, J.; Zheng, J.; Yang, C. Self-adaptive dynamic obstacle avoidance and path planning for USV under complex maritime environment. IEEE Access 2019, 7, 114945–114954. [Google Scholar] [CrossRef]
  7. Qiao, Y.; Yin, J.; Wang, W.; Duarte, F.; Yang, J.; Ratti, C. Survey of deep learning for autonomous surface vehicles in marine environments. IEEE Trans. Intell. Transp. Syst. 2023, 24, 3678–3701. [Google Scholar] [CrossRef]
  8. Mou, X. Vision Based Obstacle Detection and Mapping for Unmanned Surface Vehicles. Ph.D. Thesis, Nanyang Technological University, Singapore, 2018. [Google Scholar]
  9. Zhao, H.; Bian, W.; Yuan, B.; Tao, D. Collaborative learning of depth estimation, visual odometry and camera relocalization from monocular videos. In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020; Volume 2021, pp. 488–494. [Google Scholar] [CrossRef]
  10. Bovcon, B.; Perš, J.; Kristan, M.; Mandeljc, R. Improving vision-based obstacle detection on USV using inertial sensor. In Proceedings of the 10th International Symposium on Image and Signal Processing and Analysis, Ljubljana, Slovenia, 18–20 September 2017; pp. 1–6. [Google Scholar]
  11. Dong, K.; Liu, T.; Shi, Z.; Zhang, Y. Accurate and real-time visual detection algorithm for environmental perception of USVs under all-weather conditions. J. Real Time Image Process. 2024, 21, 36. [Google Scholar] [CrossRef]
  12. Hu, S.; Duan, H.; Zhao, J.; Zhao, H. A Rust Extraction and Evaluation Method for Navigation Buoys Based on Improved U-Net and Hue, Saturation, and Value. Sensors 2023, 23, 8670. [Google Scholar] [CrossRef]
  13. Yao, G.; Zhu, S.; Zhang, L.; Qi, M. Hp-yolov8: High-precision small object detection algorithm for remote sensing images. Sensors 2024, 24, 4858. [Google Scholar] [CrossRef]
  14. Li, Y.; Fang, Y.; Zhou, S.; Long, T.; Zhang, Y.; Ribeiro, N.A.; Melgani, F. A Lightweight Normalization-free Architecture for Object Detection in High Spatial Resolution Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 24491–24508. [Google Scholar] [CrossRef]
  15. Cao, J.; Han, F.; Wang, Y.; Wang, M.; Zheng, X.; Gao, H. A novel YOLOv5-Based hybrid underwater target detection algorithm combining with CBAM and CIoU. In Proceedings of the 2023 China Automation Congress (CAC), Chongqing, China, 17–19 November 2023; pp. 8060–8065. [Google Scholar]
  16. Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022. [Google Scholar]
  17. Bovcon, B.; Muhovič, J.; Perš, J.; Kristan, M. The MaSTr1325 dataset for training deep USV obstacle detection models. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 3431–3438. [Google Scholar]
  18. Bovcon, B.; Muhovič, J.; Vranac, D.; Mozetič, D.; Perš, J.; Kristan, M. MODS—A USV-oriented object detection and obstacle segmentation benchmark. IEEE Trans. Intell. Transp. Syst. 2021, 23, 13403–13418. [Google Scholar] [CrossRef]
  19. Ðuraš, A.; Wolf, B.J.; Ilioudi, A.; Palunko, I.; De Schutter, B. A dataset for detection and segmentation of underwater marine debris in shallow waters. Sci. Data 2024, 11, 921. [Google Scholar] [CrossRef]
  20. Kikaki, K.; Kakogeorgiou, I.; Mikeli, P.; Raitsos, D.E.; Karantzalos, K. MARIDA: A benchmark for Marine Debris detection from Sentinel-2 remote sensing data. PLoS ONE 2022, 17, e0262247. [Google Scholar] [CrossRef] [PubMed]
  21. Tian, Z.; Shen, C.; Wang, X.; Chen, H. Boxinst: High-performance instance segmentation with box annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 5443–5452. [Google Scholar]
  22. Zhao, S.; Gong, M.; Zhao, H.; Zhang, J.; Tao, D. Deep Corner. Int. J. Comput. Vis. 2023, 131, 2908–2932. [Google Scholar] [CrossRef]
  23. Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 2008, 77, 157–173. [Google Scholar] [CrossRef]
  24. Bao, W.; Chen, S.; Zhao, J.; Lin, X. YOLO-LDFI: A Lightweight Deformable Feature-Integrated Detector for SAR Ship Detection. J. Mar. Sci. Eng. 2025, 13, 1724. [Google Scholar] [CrossRef]
  25. McClure, M.; Carin, L. Wave-based matching-pursuits detection of submerged elastic targets. J. Acoust. Soc. Am. 1998, 104, 937–946. [Google Scholar] [CrossRef]
  26. Guo, D.; Cheng, Y.; Zhuo, S.; Sim, T. Correcting over-exposure in photographs. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 515–521. [Google Scholar]
  27. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  28. Raval, R.; Gupta, S. SMART-OC: A Real-time Time-risk Optimal Replanning Algorithm for Dynamic Obstacles and Spatio-temporally Varying Currents. arXiv 2025, arXiv:2508.09508. [Google Scholar]
  29. Song, L.; Shi, X.; Sun, H.; Xu, K.; Huang, L. Collision avoidance algorithm for USV based on rolling obstacle classification and fuzzy rules. J. Mar. Sci. Eng. 2021, 9, 1321. [Google Scholar] [CrossRef]
  30. Li, W.; Zhang, X. Data-driven model predictive control for underactuated USV path tracking with unknown dynamics. Ocean. Eng. 2025, 333, 121457. [Google Scholar] [CrossRef]
  31. Lyu, H.; Shao, Z.; Cheng, T.; Yin, Y.; Gao, X. Sea-surface object detection based on electro-optical sensors: A review. IEEE Intell. Transp. Syst. Mag. 2022, 15, 190–216. [Google Scholar] [CrossRef]
  32. Yaakob, O.; Mohamed, Z.; Hanafiah, M.; Suprayogi, D.; Abdul Ghani, M.; Adnan, F.; Mukti, M.; Din, J. Development of unmanned surface vehicle (USV) for sea patrol and environmental monitoring. In Proceedings of the International Conference on Marine Technology, Kuala Terengganu, Malaysia, 20–22 October 2012; pp. 20–22. [Google Scholar]
  33. Hao, X.; Liu, G.; Zhao, Y.; Ji, Y.; Wei, M.; Zhao, H.; Kong, L.; Yin, R.; Liu, Y. Msc-bench: Benchmarking and analyzing multi-sensor corruption for driving perception. arXiv 2025, arXiv:2501.01037. [Google Scholar]
  34. Zhang, C.; Liu, J.; Xiao, J.; Xiong, J. Water surface target detection and recognition of USV based on YOLOv5. In Proceedings of the 2022 37th Youth Academic Annual Conference of Chinese Association of Automation (YAC), Beijing, China, 19–20 November 2022; pp. 1227–1231. [Google Scholar]
  35. Zhang, J.; Jin, J.; Ma, Y.; Ren, P. Lightweight object detection algorithm based on YOLOv5 for unmanned surface vehicles. Front. Mar. Sci. 2023, 9, 1058401. [Google Scholar] [CrossRef]
  36. Zhou, C.H.; Ku, H.C.; Lee, S.H. Ship Detection of Unmanned Surface Vehicle Based on YOLOv8. In Proceedings of the 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE), Kitakyushu, Japan, 29 October–1 November 2024; pp. 288–289. [Google Scholar]
  37. Haijoub, A.; Hatim, A.; Guerrero-Gonzalez, A.; Arioua, M.; Chougdali, K. Enhanced YOLOv8 Ship Detection Empower Unmanned Surface Vehicles for Advanced Maritime Surveillance. J. Imaging 2024, 10, 303. [Google Scholar] [CrossRef]
  38. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  39. Taipalmaa, J.; Passalis, N.; Zhang, H.; Gabbouj, M.; Raitoharju, J. High-resolution water segmentation for autonomous unmanned surface vehicles: A novel dataset and evaluation. In Proceedings of the 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), Manchester, UK, 26 October–2 November 2019; pp. 1–6. [Google Scholar]
  40. Li, P.; Luo, C.; Wu, F.; Zheng, J.; Ma, S. Deep Learning Based Low-light Enhancement and Noise Suppression in USV Imaging System. In Proceedings of the 6th International Conference on Robotics and Artificial Intelligence, Dubai, United Arab Emirates, 17–18 November 2020; pp. 91–96. [Google Scholar]
  41. Schwenger, F.; Repasi, E. Simulation of laser beam reflection at the sea surface. In Proceedings of the Infrared Imaging Systems: Design, Analysis, Modeling, and Testing XXII, Orlando, FL, USA, 26–28 April 2011; Volume 8014, pp. 245–256. [Google Scholar]
  42. Wei, Q.; Shan, J.; Cheng, H.; Yu, Z.; Lijuan, B.; Haimei, Z. A method of 3D human-motion capture and reconstruction based on depth information. In Proceedings of the 2016 IEEE International Conference on Mechatronics and Automation, Harbin, China, 7–10 August 2016; pp. 187–192. [Google Scholar] [CrossRef]
  43. Quach, L.D.; Quoc, K.N.; Quynh, A.N.; Ngoc, H.T. Evaluating the effectiveness of YOLO models in different sized object detection and feature-based classification of small objects. J. Adv. Inf. Technol. 2023, 14, 907–917. [Google Scholar] [CrossRef]
  44. Wang, D.; Chen, H.; Hasan, M.A.; Zhang, W.; Hu, Y. Real-Time Obstacle Detection and Localization for USV Using Stereo Vision and Edge Computing. In Proceedings of the 2024 International Conference on Artificial Intelligence of Things and Systems (AIoTSys), Hangzhou, China, 17–19 October 2024; pp. 1–8. [Google Scholar]
  45. Yang, H.; Tianyi Zhou, J.; Zhang, Y.; Gao, B.B.; Wu, J.; Cai, J. Exploit bounding box annotations for multi-label object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 280–288. [Google Scholar]
  46. Zhao, H.; Kong, Y.; Zhang, C.; Zhang, H.; Zhao, J. Learning Effective Geometry Representation from Videos for Self-Supervised Monocular Depth Estimation. ISPRS Int. J. Geo Inf. 2024, 13, 193. [Google Scholar] [CrossRef]
  47. Zhou, Y.; Gong, C.; Chen, K. Adaptive Control Scheme for USV Trajectory-Tracking under Complex Environmental Disturbances via Deep Reinforcement Learning. IEEE Internet Things J. 2025, 12, 15181–15196. [Google Scholar] [CrossRef]
  48. Ramaswamy, V.V.; Lin, S.Y.; Zhao, D.; Adcock, A.; van der Maaten, L.; Ghadiyaram, D.; Russakovsky, O. Geode: A geographically diverse evaluation dataset for object recognition. Adv. Neural Inf. Process. Syst. 2023, 36, 66127–66137. [Google Scholar]
  49. Martinez, M.A. USV Attitude Position Estimation by a Hovering UAV Using Monocular Images of Deck-Mounted Lights. Ph.D. Thesis, UC San Diego, La Jolla, CA, USA, 2022. [Google Scholar]
  50. Jesus, S.; Porter, M.; Stéphan, Y.; Coelho, E.; Rodriguez, O.; Démoulin, X. Single sensor source localization in a range-dependent environment. In Proceedings of the OCEANS 2000 MTS/IEEE Conference and Exhibition. Conference Proceedings (Cat. No. 00CH37158), Providence, RI, USA, 11–14 September 2000; Volume 2, pp. 865–868. [Google Scholar]
  51. Jin, J.; Zhou, Z.; Bo, Z.; Chen, Y.; Wei, X. Research on USV Navigation Simulation Key Technologies. J. Syst. Simul. 2022, 33, 2846–2853. [Google Scholar]
  52. Wang, Y.; Shen, J.; Liu, X. Dynamic obstacles trajectory prediction and collision avoidance of USV. In Proceedings of the 2017 36th Chinese Control Conference (CCC), Dalian, China, 26–28 July 2017; pp. 2910–2914. [Google Scholar]
  53. Yan, L.; Chen, H.; Tu, Y.; Zhou, X.; Drew, S. PPGC: A path planning system by grid caching based on cloud-edge collaboration for unmanned surface vehicle in IoT systems. In Proceedings of the 2022 IEEE 19th International Conference on Mobile Ad Hoc and Smart Systems (MASS), Denver, CO, USA, 19–23 October 2022; pp. 74–80. [Google Scholar]
  54. Zhao, J.; Peng, Y.; Han, B.; Mei, X.; Li, H.; Liu, Y. TSeq-GAN: A generalized and robust blind source separation framework for AIS signals of unmanned surface vehicles. Front. Mar. Sci. 2025, 12, 1635614. [Google Scholar] [CrossRef]
  55. Russel, I.; Wright, R.G. Navigation Sonar: More Than Underwater Radar—Realizing the full potential of navigation and obstacle avoidance sonar. Int. Hydrogr. Rev. 2017. Available online: https://journals.lib.unb.ca/index.php/ihr/article/view/26293 (accessed on 15 September 2025).
Figure 1. Typical annotated images under different environmental conditions. Sample images reveal the intrinsic complexity of our dataset: specular highlights eroding target contours, low signal-to-noise ratio in dim conditions and partial occlusion from waves, setting a rigorous benchmark for model generalization.
Figure 2. The network structure of WA-YOLO.
Figure 3. Threshold sensitivity analysis. The blue lines mark the default parameter position (confidence threshold = 0.25, NMS IoU threshold = 0.6, FPS (Embedded)) and delineate the region of stable performance; the intersection of the solid blue lines indicates the optimal default threshold combination identified through preliminary experiments. The performance surface remains stable under ±0.1 parameter perturbation around the default thresholds, with mAP fluctuation < 2%, ensuring parametric robustness for embedded deployment.
Figure 4. Baseline performance across different datasets. Performance comparison reveals YOLOv8’s advantage in detection accuracy versus YOLOv5’s competitiveness in embedded efficiency, with YOLOv8’s precision gain on the difficult subset starkly contrasted by its speed penalty. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
Figure 5. The architectures of YOLOv5 (left) and YOLOv8 (right).
Figure 6. Attention module performance across different datasets. Attention module performance trends show ECA consistently enhances small-object detection (APS), while CBAM shows more pronounced improvements in overall accuracy (mAP) under difficult conditions. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
Figure 7. Attention module performance matrix across models and datasets. The performance matrix clearly illustrates the ECA module’s superior adaptability within the YOLOv8 architecture, whereas CBAM’s performance fluctuates across datasets, indicating instability. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
Figure 8. Training convergence analysis. Training convergence curves indicate that SIoU loss accelerates model convergence and maintains better stability in later stages, while Focal-EIoU exhibits oscillations in some configurations. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
Figure 9. Localization accuracy analysis. Localization accuracy analysis confirms that SIoU loss provides more precise bounding box regression in wave-disturbed scenarios, with its vector angle cost enhancing direction awareness. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
Figure 10. APS vs. FPS strategy comparison. The APS-FPS trade-off analysis clearly distinguishes two clusters: high-resolution (high accuracy, low FPS) and image tiling (balanced accuracy and FPS), providing intuitive guidance for engineering selection. (conf = 0.25, iou_nms = 0.6, FPS (Embedded); COCO-style mAP@[0.5:0.95] with IoU thresholds from 0.5 to 0.95 (step 0.05); APS/APM/APL are defined by object area thresholds (area < 32², 32² ≤ area ≤ 96², area > 96²) in input pixels at evaluation resolution; left panel: high-resolution, right panel: image slicing.)
Figure 11. Impact of data enhancement strategies. Impact comparison of data enhancement strategies shows low-light augmentation primarily improves recall of targets in dark areas, while reflection enhancement effectively reduces false positives in highlight regions. (conf = 0.25, iou_nms = 0.6, FPS (Embedded)).
Figure 12. APS vs. P50 latency trade-off analysis. The APS vs. P50 latency scatter plot reveals two main clusters: an upper-left high-accuracy low-latency region (e.g., YOLOv8 + ECA + SIoU) and a lower-right traditional model region, offering intuitive guidance for efficiency-prioritized model selection. (Jetson Xavier NX, TensorRT FP16, batch = 1).
Figure 13. YOLOv5 representative detection results under different conditions. Qualitative results for YOLOv5 demonstrate that the +ECA + SI combination significantly improves the detection rate of small buoys, particularly in wave-disturbed scenes, yielding more complete bounding boxes. (SI is SIoU, FE is Focal-EIoU; conf = 0.25, iou_nms = 0.6, FPS (Embedded); COCO-style mAP@[0.5:0.95] with IoU thresholds from 0.5 to 0.95 (step 0.05); APS/APM/APL are defined by object area thresholds (area < 32², 32² ≤ area ≤ 96², area > 96²) in input pixels at evaluation resolution.)
Figure 14. YOLOv8 representative detection results under different conditions. Qualitative results for YOLOv8 highlight that the +CBAM + SIoU configuration maintains optimal localization accuracy under strong reflection, effectively preventing false fusion of adjacent targets. (SI is SIoU, FE is Focal-EIoU; conf = 0.25, iou_nms = 0.6, FPS (Embedded); COCO-style mAP@[0.5:0.95] with IoU thresholds from 0.5 to 0.95 (step 0.05); APS/APM/APL are defined by object area thresholds (area < 32², 32² ≤ area ≤ 96², area > 96²) in input pixels at evaluation resolution.)
Table 1. Quantitative performance comparison between WA-YOLO and related methods on the maritime difficult subset. This table presents the performance of different methods on the difficult subset under a unified experimental framework. All experiments use the same input resolution (960 × 960), training configuration and embedded hardware platform (Jetson Xavier NX, TensorRT FP16). WA-YOLO employs the YOLOv8 + ECA + SIoU configuration.
Method | mAP@0.5 | APS | FPS (Embedded)
WA-YOLO (Ours) | 0.8616 ± 0.0026 | 0.5347 ± 0.0025 | 24.86 ± 0.8
CBAM + CIOU Refinement [15] | 0.8302 ± 0.0030 | 0.5374 ± 0.0027 | 18.15 ± 1.4
Small-Object Context Aggregation [13] | 0.8172 ± 0.0031 | 0.5095 ± 0.0030 | 13.45 ± 1.6
Image-Enhancement-Based Reflection Suppression [11] | 0.7924 ± 0.0038 | 0.4831 ± 0.0035 | 10.87 ± 1.9
Table 2. Consolidated category distribution after deduplication. The dataset spans 15 key maritime categories. ‘Vessel’ and ‘Ball’ dominate, comprising over 50%, reflecting real-world obstacle distribution, while categories like ‘Animal’ are scarce, indicating a long-tail data characteristic. (Note: the original 19 categories have been consolidated to 15 categories to eliminate label redundancy. The “vessel” category encompasses all watercraft, while “person” includes all human instances. The evaluation uses these consolidated categories to prevent AP dilution. ID is category ID; Train/Val/Test are the training, validation and test sets.)
ID | Category | Train | Val | Test | Total | Consolidation Notes
0 | animal | 66 | 17 | 11 | 94 |
1 | ball | 1760 | 558 | 291 | 2609 |
2 | vessel | 8299 | 2264 | 1144 | 11,707 | boat + ship + vessel + kayak
3 | bridge | 1394 | 422 | 198 | 2014 |
4 | buoy | 160 | 29 | 25 | 214 |
5 | grass | 78 | 20 | 12 | 110 |
6 | harbor | 859 | 242 | 123 | 1224 |
7 | mast | 273 | 36 | 45 | 354 |
8 | person | 774 | 131 | 66 | 971 | person + sailor
9 | platform | 418 | 125 | 71 | 614 |
10 | rock | 1101 | 315 | 124 | 1540 |
11 | rubbish | 473 | 125 | 71 | 669 |
12 | tree | 127 | 50 | 42 | 219 |
13 | pier | 100 | 23 | 10 | 133 |
14 | bottle | 2647 | 1067 | 519 | 4233 |
Total | 15 categories | 18,329 | 5424 | 2752 | 26,505 | Original: 19 categories
Table 3. Model weights information. YOLOv8 model weights are consistently larger than YOLOv5’s by ~19.3% on average. The YOLOv8 + CBAM + SIoU combination is the largest (17.62 MB), suggesting more complex parametric interactions.
Model Configuration | File Size | MD5 Checksum
YOLOv5 + CBAM + Focal-EIoU | 4.92 MB | 8b68aff5bbcbbb024c5b5dd8b0cfbbe6
YOLOv5 + CBAM + SIoU | 4.92 MB | bdea8ef83bbe1dd9e0130b2cd843dbed
YOLOv5 + CBAM | 4.92 MB | b56ff6c65ef7fb0652894ea42181f993
YOLOv5 + ECA + Focal-EIoU | 4.92 MB | c15b9a0936f5897b34a489cfbf99d1e4
YOLOv5 + ECA + SIoU | 4.92 MB | c9aeca9782af22640f08e8ea205a9b1f
YOLOv5 + ECA | 4.92 MB | 089826e129f6b67a85eca4a699c8d4fb
YOLOv5 | 4.92 MB | 9ec9f88c0b18b307d32b94d2a7439289
YOLOv8 + CBAM + Focal-EIoU | 5.87 MB | 7b7482b83ca5245b68c36ec582616893
YOLOv8 + CBAM + SIoU | 17.62 MB | edf486e51eafc8158e6fbdf06da6447d
YOLOv8 + CBAM | 5.87 MB | 1e8643b9d1a2e50df6f96feae85e6485
YOLOv8 + ECA + Focal-EIoU | 5.87 MB | d17af5c94c8432fefb1d3d424f821a49
YOLOv8 + ECA + SIoU | 5.87 MB | 6929f3188221f37be30804421e321f1d
YOLOv8 + ECA | 5.87 MB | 269ac4311c76d2632bff3fe114eb2e94
YOLOv8 | 5.87 MB | cfa8d4d677574aa51e0b64e16cde2f17
Table 4. Baseline performance of detection models on the overall dataset (W is Workstation, E is Embedded). YOLOv8 demonstrates a marked improvement in detection accuracy (mAP@0.5) over YOLOv5 (+7.6%), albeit with a slight reduction in embedded inference speed (−3.2%). YOLOv11 achieves a further refined balance. RT-DETR attains the highest mAP@0.5 (0.7286); however, its embedded frame rate (19.48 FPS) drops substantially compared to YOLOv8 (−16.4%).
Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E)
YOLOv5 | 0.6617 ± 0.0032 | 0.3443 ± 0.0021 | 0.6711 ± 0.0028 | 0.6034 ± 0.0035 | 0.4164 ± 0.0029 | 83.30 ± 1.2 | 24.08 ± 0.8
YOLOv8 | 0.7119 ± 0.0028 | 0.3795 ± 0.0024 | 0.7114 ± 0.0025 | 0.6598 ± 0.0031 | 0.4523 ± 0.0026 | 83.37 ± 1.1 | 23.30 ± 0.7
YOLOv11 | 0.7208 ± 0.0029 | 0.3821 ± 0.0025 | 0.7149 ± 0.0026 | 0.6653 ± 0.0032 | 0.4591 ± 0.0027 | 84.50 ± 1.2 | 23.05 ± 0.8
RT-DETR | 0.7286 ± 0.0027 | 0.3897 ± 0.0023 | 0.7432 ± 0.0024 | 0.6630 ± 0.0031 | 0.4487 ± 0.0026 | 78.25 ± 1.5 | 19.48 ± 1.1
Table 5. Baseline performance of detection models on the surface float dataset (W is Workstation, E is Embedded). All models perform excellently and comparably (mAP@0.5 > 0.839), indicating low sensitivity to different model architectures for this category of well-defined targets. RT-DETR shows a marginal lead in precision (0.8695), yet its embedded inference speed (21.36 FPS) is significantly lower than the YOLO series models.
Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E)
YOLOv5 | 0.8393 ± 0.0025 | 0.4198 ± 0.0021 | 0.8512 ± 0.0022 | 0.7803 ± 0.0028 | 0.6897 ± 0.0023 | 110.45 ± 1.5 | 25.62 ± 0.6
YOLOv8 | 0.8419 ± 0.0022 | 0.4210 ± 0.0018 | 0.8507 ± 0.0020 | 0.8015 ± 0.0025 | 0.6973 ± 0.0020 | 133.11 ± 1.2 | 25.94 ± 0.5
YOLOv11 | 0.8462 ± 0.0021 | 0.4248 ± 0.0018 | 0.8541 ± 0.0019 | 0.8072 ± 0.0024 | 0.7041 ± 0.0020 | 136.50 ± 1.1 | 26.18 ± 0.5
RT-DETR | 0.8490 ± 0.0019 | 0.4306 ± 0.0016 | 0.8695 ± 0.0017 | 0.8088 ± 0.0022 | 0.6923 ± 0.0018 | 124.75 ± 1.3 | 21.36 ± 0.9
Table 6. Baseline performance of detection models on the difficult subset (W is Workstation, E is Embedded). YOLOv8’s accuracy advantage expands here (+11.8% over YOLOv5), but its embedded speed decreases sharply (−29.7%). Leveraging its strong representational capacity, RT-DETR achieves the best accuracy metrics (mAP@0.5: 0.8320, precision: 0.8983). However, its embedded frame rate (11.83 FPS) drops even further.
Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E)
YOLOv5 | 0.7319 ± 0.0042 | 0.3659 ± 0.0035 | 0.7743 ± 0.0038 | 0.6683 ± 0.0048 | 0.4787 ± 0.0039 | 110.88 ± 1.8 | 20.58 ± 1.2
YOLOv8 | 0.8183 ± 0.0030 | 0.4092 ± 0.0025 | 0.8837 ± 0.0027 | 0.7198 ± 0.0036 | 0.5285 ± 0.0028 | 106.39 ± 1.9 | 14.67 ± 1.7
YOLOv11 | 0.8235 ± 0.0028 | 0.4124 ± 0.0023 | 0.8890 ± 0.0025 | 0.7262 ± 0.0033 | 0.5341 ± 0.0026 | 108.25 ± 1.7 | 15.92 ± 1.4
RT-DETR | 0.8320 ± 0.0026 | 0.4160 ± 0.0022 | 0.8983 ± 0.0024 | 0.7245 ± 0.0031 | 0.5223 ± 0.0025 | 95.50 ± 1.9 | 11.83 ± 1.6
Table 7. Added attention module performance of detection models on the overall dataset. The ECA module boosts APS to 0.4783 in YOLOv8, while CBAM causes an mAP@0.5 drop in the same architecture, revealing compatibility issues between attention mechanisms and model design. (CB is CBAM, W is Workstation, E is Embedded).
Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E)
YOLOv5 | 0.6617 ± 0.0032 | 0.3443 ± 0.0021 | 0.6711 ± 0.0028 | 0.6034 ± 0.0035 | 0.4164 ± 0.0029 | 83.30 ± 1.2 | 24.08 ± 0.8
YOLOv5 + CB | 0.6623 ± 0.0035 | 0.3285 ± 0.0023 | 0.6806 ± 0.0031 | 0.6251 ± 0.0038 | 0.4319 ± 0.0032 | 81.30 ± 1.3 | 23.40 ± 0.9
YOLOv5 + ECA | 0.6662 ± 0.0034 | 0.3273 ± 0.0022 | 0.6697 ± 0.0030 | 0.6447 ± 0.0037 | 0.4338 ± 0.0031 | 82.28 ± 1.2 | 23.87 ± 0.8
YOLOv8 | 0.7119 ± 0.0028 | 0.3795 ± 0.0024 | 0.7114 ± 0.0025 | 0.6598 ± 0.0031 | 0.4523 ± 0.0026 | 83.37 ± 1.1 | 23.30 ± 0.7
YOLOv8 + CB | 0.6948 ± 0.0031 | 0.3683 ± 0.0026 | 0.7511 ± 0.0029 | 0.6126 ± 0.0034 | 0.4643 ± 0.0029 | 80.27 ± 1.4 | 22.63 ± 0.8
YOLOv8 + ECA | 0.7214 ± 0.0030 | 0.3785 ± 0.0025 | 0.7717 ± 0.0027 | 0.6161 ± 0.0033 | 0.4783 ± 0.0028 | 81.58 ± 1.3 | 23.97 ± 0.7
Table 8. Added attention module performance of detection models on the surface float dataset. For surface floats, both ECA and CBAM bring ~2% mAP@0.5 gains. CBAM achieves the highest APS (0.7278) with YOLOv8, proving the significant benefit of attention for small, distinct targets. (CB is CBAM, W is Workstation, E is Embedded).
Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E)
YOLOv5 | 0.8393 ± 0.0025 | 0.4198 ± 0.0021 | 0.8512 ± 0.0022 | 0.7803 ± 0.0028 | 0.6897 ± 0.0023 | 110.45 ± 1.5 | 25.62 ± 0.6
YOLOv5 + CB | 0.8548 ± 0.0023 | 0.4274 ± 0.0019 | 0.8450 ± 0.0021 | 0.8091 ± 0.0026 | 0.7200 ± 0.0021 | 104.69 ± 1.7 | 24.73 ± 0.7
YOLOv5 + ECA | 0.8508 ± 0.0024 | 0.4254 ± 0.0020 | 0.8430 ± 0.0022 | 0.7968 ± 0.0027 | 0.7175 ± 0.0022 | 108.59 ± 1.6 | 25.42 ± 0.6
YOLOv8 | 0.8419 ± 0.0022 | 0.4210 ± 0.0018 | 0.8507 ± 0.0020 | 0.8015 ± 0.0025 | 0.6973 ± 0.0020 | 133.11 ± 1.2 | 25.94 ± 0.5
YOLOv8 + CB | 0.8603 ± 0.0020 | 0.4302 ± 0.0017 | 0.8799 ± 0.0019 | 0.7958 ± 0.0023 | 0.7278 ± 0.0018 | 128.45 ± 1.3 | 25.10 ± 0.6
YOLOv8 + ECA | 0.8638 ± 0.0019 | 0.4319 ± 0.0016 | 0.8852 ± 0.0018 | 0.7874 ± 0.0022 | 0.7319 ± 0.0017 | 129.09 ± 1.2 | 24.44 ± 0.6
Table 9. Added attention module performance of detection models on the difficult subset. On the difficult subset, the ECA module delivers a ~5.5% mAP@0.5 gain for YOLOv5, outperforming CBAM and demonstrating superior feature discrimination in complex scenes. (CB is CBAM, W is Workstation, E is Embedded).
Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E)
YOLOv5 | 0.7319 ± 0.0042 | 0.3659 ± 0.0035 | 0.7743 ± 0.0038 | 0.6683 ± 0.0048 | 0.4787 ± 0.0039 | 110.88 ± 1.8 | 20.58 ± 1.2
YOLOv5 + CB | 0.7807 ± 0.0039 | 0.3904 ± 0.0032 | 0.8629 ± 0.0035 | 0.6501 ± 0.0045 | 0.5002 ± 0.0036 | 108.54 ± 1.9 | 19.37 ± 1.3
YOLOv5 + ECA | 0.7869 ± 0.0038 | 0.3935 ± 0.0031 | 0.8291 ± 0.0034 | 0.6634 ± 0.0044 | 0.5139 ± 0.0035 | 112.98 ± 1.7 | 19.23 ± 1.4
YOLOv8 | 0.8183 ± 0.0030 | 0.4092 ± 0.0025 | 0.8837 ± 0.0027 | 0.7198 ± 0.0036 | 0.5285 ± 0.0028 | 106.39 ± 1.9 | 14.67 ± 1.7
YOLOv8 + CB | 0.8334 ± 0.0028 | 0.4167 ± 0.0023 | 0.8461 ± 0.0025 | 0.7427 ± 0.0034 | 0.5408 ± 0.0026 | 110.81 ± 1.7 | 17.88 ± 1.5
YOLOv8 + ECA | 0.8224 ± 0.0029 | 0.4112 ± 0.0024 | 0.8617 ± 0.0026 | 0.7461 ± 0.0035 | 0.5346 ± 0.0027 | 113.98 ± 1.6 | 19.23 ± 1.4
Table 10. Attention module placement ablation analysis. The ECA module achieves optimal APS improvement (+7.3%) when inserted at the P3 layer (small-object features), whereas CBAM suffers significant degradation at Neck-Concat, underscoring the criticality of placement. (Y8 is YOLOv8, Overall Dataset, NC is Neck-Concat, W is Workstation, E is Embedded).
Model | Place | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E)
Y8 | None | 0.7119 ± 0.0028 | 0.3795 ± 0.0024 | 0.7114 ± 0.0025 | 0.6598 ± 0.0031 | 0.4523 ± 0.0026 | 83.37 ± 1.1 | 23.30 ± 0.7
+ECA | P3 | 0.7289 ± 0.0027 | 0.3881 ± 0.0023 | 0.7750 ± 0.0026 | 0.6200 ± 0.0032 | 0.4853 ± 0.0025 | 81.45 ± 1.2 | 22.89 ± 0.7
+ECA | P4 | 0.7256 ± 0.0029 | 0.3867 ± 0.0025 | 0.7730 ± 0.0028 | 0.6180 ± 0.0033 | 0.4798 ± 0.0027 | 81.62 ± 1.2 | 22.95 ± 0.7
+ECA | NC | 0.7214 ± 0.0030 | 0.3785 ± 0.0025 | 0.7717 ± 0.0027 | 0.6161 ± 0.0033 | 0.4783 ± 0.0028 | 81.58 ± 1.3 | 23.97 ± 0.7
+CBAM | P3 | 0.7198 ± 0.0031 | 0.3834 ± 0.0026 | 0.7600 ± 0.0029 | 0.6150 ± 0.0034 | 0.4726 ± 0.0029 | 79.34 ± 1.4 | 22.15 ± 0.8
+CBAM | P4 | 0.7236 ± 0.0028 | 0.3859 ± 0.0024 | 0.7650 ± 0.0027 | 0.6170 ± 0.0032 | 0.4789 ± 0.0026 | 79.87 ± 1.3 | 22.41 ± 0.8
+CBAM | NC | 0.6948 ± 0.0031 | 0.3683 ± 0.0026 | 0.7511 ± 0.0029 | 0.6126 ± 0.0034 | 0.4643 ± 0.0029 | 80.27 ± 1.4 | 22.63 ± 0.8
Table 11. Loss function weight scanning analysis. Reducing the box loss weight from 7.0 to 5.0, combined with a lower learning rate (0.001), yielded the best balance of mAP@0.5 (0.8412) and APS (0.5489) on the difficult subset, effectively mitigating training instability. (YOLOv8 + CBAM, Difficult Subset, BW is Box Weight, LR is Learning Rate, GC is Gradient Clip, W is Workstation, E is Embedded).
BW | LR | GC | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E)
7.0 | 0.01 | No | 0.8334 ± 0.0028 | 0.4167 ± 0.0023 | 0.8461 ± 0.0025 | 0.7427 ± 0.0034 | 0.5408 ± 0.0026 | 110.81 ± 1.7 | 17.88 ± 1.5
7.0 | 0.001 | Yes | 0.8357 ± 0.0025 | 0.4189 ± 0.0021 | 0.8480 ± 0.0023 | 0.7440 ± 0.0032 | 0.5431 ± 0.0023 | 110.50 ± 1.6 | 17.80 ± 1.4
5.0 | 0.001 | Yes | 0.8412 ± 0.0023 | 0.4215 ± 0.0019 | 0.8520 ± 0.0021 | 0.7460 ± 0.0030 | 0.5489 ± 0.0021 | 110.20 ± 1.5 | 17.75 ± 1.3
4.5 | 0.001 | Yes | 0.8389 ± 0.0024 | 0.4198 ± 0.0020 | 0.8500 ± 0.0022 | 0.7450 ± 0.0031 | 0.5457 ± 0.0022 | 110.30 ± 1.5 | 17.78 ± 1.4
4.0 | 0.001 | Yes | 0.8367 ± 0.0026 | 0.4182 ± 0.0022 | 0.8485 ± 0.0024 | 0.7445 ± 0.0033 | 0.5423 ± 0.0024 | 110.40 ± 1.6 | 17.82 ± 1.4
Table 12. Added loss-function modification performance of detection models on the overall dataset. SIoU loss elevates mAP@0.5 to 0.7286 in the YOLOv8 + ECA configuration, while Focal-EIoU performs best (0.7208) with YOLOv8 + CBAM, indicating loss-function efficacy is model and attention dependent. (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIoU, FE is Focal-EIoU, W is Workstation, E is Embedded).
Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E)
Y5 | 0.6617 ± 0.0032 | 0.3443 ± 0.0021 | 0.6711 ± 0.0028 | 0.6034 ± 0.0035 | 0.4164 ± 0.0029 | 83.30 ± 1.2 | 24.08 ± 0.8
Y5 + CB | 0.6623 ± 0.0035 | 0.3285 ± 0.0023 | 0.6806 ± 0.0031 | 0.6251 ± 0.0038 | 0.4319 ± 0.0032 | 81.30 ± 1.3 | 23.40 ± 0.9
Y5 + CB + SI | 0.6788 ± 0.0038 | 0.3623 ± 0.0025 | 0.6522 ± 0.0034 | 0.6698 ± 0.0041 | 0.4253 ± 0.0035 | 80.45 ± 1.4 | 23.17 ± 0.9
Y5 + CB + FE | 0.6762 ± 0.0036 | 0.3593 ± 0.0024 | 0.6785 ± 0.0032 | 0.6097 ± 0.0039 | 0.4245 ± 0.0033 | 81.34 ± 1.3 | 21.30 ± 1.1
Y5 + ECA | 0.6662 ± 0.0034 | 0.3273 ± 0.0022 | 0.6697 ± 0.0030 | 0.6447 ± 0.0037 | 0.4338 ± 0.0031 | 82.28 ± 1.2 | 23.87 ± 0.8
Y5 + ECA + SI | 0.6827 ± 0.0039 | 0.3600 ± 0.0026 | 0.6789 ± 0.0035 | 0.6598 ± 0.0042 | 0.4288 ± 0.0036 | 85.50 ± 1.1 | 22.21 ± 1.0
Y5 + ECA + FE | 0.6694 ± 0.0037 | 0.3565 ± 0.0025 | 0.6729 ± 0.0033 | 0.6431 ± 0.0040 | 0.4207 ± 0.0034 | 87.17 ± 1.0 | 23.05 ± 0.9
Y8 | 0.7119 ± 0.0028 | 0.3795 ± 0.0024 | 0.7114 ± 0.0025 | 0.6598 ± 0.0031 | 0.4523 ± 0.0026 | 83.37 ± 1.1 | 23.30 ± 0.7
Y8 + CB | 0.6948 ± 0.0031 | 0.3683 ± 0.0026 | 0.7511 ± 0.0029 | 0.6126 ± 0.0034 | 0.4643 ± 0.0029 | 80.27 ± 1.4 | 22.63 ± 0.8
Y8 + CB + SI | 0.7173 ± 0.0033 | 0.3713 ± 0.0027 | 0.6880 ± 0.0031 | 0.6936 ± 0.0036 | 0.4621 ± 0.0031 | 80.08 ± 1.4 | 22.00 ± 0.9
Y8 + CB + FE | 0.7208 ± 0.0034 | 0.3913 ± 0.0028 | 0.7401 ± 0.0032 | 0.6598 ± 0.0037 | 0.4618 ± 0.0032 | 89.63 ± 1.0 | 20.20 ± 1.2
Y8 + ECA | 0.7214 ± 0.0030 | 0.3630 ± 0.0025 | 0.7717 ± 0.0027 | 0.6161 ± 0.0033 | 0.4783 ± 0.0028 | 81.58 ± 1.3 | 23.97 ± 0.7
Y8 + ECA + SI | 0.7286 ± 0.0032 | 0.4018 ± 0.0029 | 0.7557 ± 0.0030 | 0.6575 ± 0.0035 | 0.4669 ± 0.0030 | 81.77 ± 1.3 | 22.16 ± 0.9
Y8 + ECA + FE | 0.7131 ± 0.0031 | 0.3924 ± 0.0027 | 0.7140 ± 0.0028 | 0.6704 ± 0.0034 | 0.4583 ± 0.0029 | 82.15 ± 1.2 | 21.38 ± 1.0
Table 13. Added loss-function modification performance of detection models on the surface float dataset. On the surface float dataset, introducing complex loss functions like Focal-EIoU did not yield consistent gains, with some combinations causing performance degradation, suggesting unnecessary optimization complexity for these targets. (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIoU, FE is Focal-EIoU, W is Workstation, E is Embedded).
Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E)
Y5 | 0.8393 ± 0.0025 | 0.4198 ± 0.0021 | 0.8512 ± 0.0022 | 0.7803 ± 0.0028 | 0.6897 ± 0.0023 | 110.45 ± 1.5 | 25.62 ± 0.6
Y5 + CB | 0.8548 ± 0.0023 | 0.4274 ± 0.0019 | 0.8450 ± 0.0021 | 0.8091 ± 0.0026 | 0.7200 ± 0.0021 | 104.69 ± 1.7 | 24.73 ± 0.7
Y5 + CB + SI | 0.8486 ± 0.0026 | 0.4199 ± 0.0022 | 0.8605 ± 0.0023 | 0.7842 ± 0.0029 | 0.6904 ± 0.0024 | 113.16 ± 1.4 | 22.16 ± 0.9
Y5 + CB + FE | 0.8421 ± 0.0027 | 0.4264 ± 0.0023 | 0.8824 ± 0.0024 | 0.7664 ± 0.0030 | 0.6778 ± 0.0025 | 110.63 ± 1.5 | 19.86 ± 1.1
Y5 + ECA | 0.8508 ± 0.0024 | 0.4254 ± 0.0020 | 0.8430 ± 0.0022 | 0.7968 ± 0.0027 | 0.7175 ± 0.0022 | 108.59 ± 1.6 | 25.42 ± 0.6
Y5 + ECA + SI | 0.8449 ± 0.0028 | 0.4282 ± 0.0024 | 0.8779 ± 0.0025 | 0.7755 ± 0.0031 | 0.7004 ± 0.0026 | 115.32 ± 1.3 | 25.10 ± 0.7
Y5 + ECA + FE | 0.8548 ± 0.0025 | 0.4236 ± 0.0021 | 0.8961 ± 0.0023 | 0.7812 ± 0.0028 | 0.6914 ± 0.0023 | 112.92 ± 1.4 | 24.63 ± 0.8
Y8 | 0.8419 ± 0.0022 | 0.4210 ± 0.0018 | 0.8507 ± 0.0020 | 0.8015 ± 0.0025 | 0.6973 ± 0.0020 | 133.11 ± 1.2 | 25.94 ± 0.5
Y8 + CB | 0.8603 ± 0.0020 | 0.4302 ± 0.0017 | 0.8799 ± 0.0019 | 0.7958 ± 0.0023 | 0.7278 ± 0.0018 | 128.45 ± 1.3 | 25.10 ± 0.6
Y8 + CB + SI | 0.8545 ± 0.0023 | 0.4306 ± 0.0020 | 0.8940 ± 0.0021 | 0.7797 ± 0.0026 | 0.7074 ± 0.0021 | 128.03 ± 1.3 | 22.80 ± 0.8
Y8 + CB + FE | 0.8481 ± 0.0024 | 0.4353 ± 0.0021 | 0.8708 ± 0.0022 | 0.7924 ± 0.0027 | 0.6849 ± 0.0022 | 99.34 ± 1.8 | 17.61 ± 1.3
Y8 + ECA | 0.8638 ± 0.0019 | 0.4319 ± 0.0016 | 0.8852 ± 0.0018 | 0.7874 ± 0.0022 | 0.7319 ± 0.0017 | 129.09 ± 1.2 | 24.44 ± 0.6
Y8 + ECA + SI | 0.8526 ± 0.0022 | 0.4341 ± 0.0019 | 0.8964 ± 0.0020 | 0.7823 ± 0.0025 | 0.7032 ± 0.0019 | 128.28 ± 1.3 | 26.02 ± 0.5
Y8 + ECA + FE | 0.8594 ± 0.0021 | 0.4382 ± 0.0018 | 0.8897 ± 0.0019 | 0.8082 ± 0.0024 | 0.7065 ± 0.0018 | 134.25 ± 1.1 | 23.99 ± 0.7
Table 14. Added loss-function modification performance of detection models on the difficult subset. The combination of SIoU loss and ECA attention in YOLOv5 achieved the highest mAP@0.5 (0.8255) on the difficult subset, a nearly 10% improvement over the baseline, proving its effectiveness in challenging scenarios. (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIOU, FE is Focal-EIoU, W is Workstation, E is Embedded).
Table 14. Added loss-function modification performance of detection models on the difficult subset. The combination of SIoU loss and ECA attention in YOLOv5 achieved the highest mAP@0.5 (0.8255) on the difficult subset, a nearly 10% improvement over the baseline, proving its effectiveness in challenging scenarios. (Y8 is YOLOv8, Y5 is YOLOv5, CB is CBAM, SI is SIOU, FE is Focal-EIoU, W is Workstation, E is Embedded).
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.7319 ± 0.0042 | 0.3659 ± 0.0035 | 0.7743 ± 0.0038 | 0.6683 ± 0.0048 | 0.4787 ± 0.0039 | 110.88 ± 1.8 | 20.58 ± 1.2 |
| Y5 + CB | 0.7807 ± 0.0039 | 0.3904 ± 0.0032 | 0.8629 ± 0.0035 | 0.6501 ± 0.0045 | 0.5002 ± 0.0036 | 108.54 ± 1.9 | 19.37 ± 1.3 |
| Y5 + CB + SI | 0.7779 ± 0.0041 | 0.3889 ± 0.0034 | 0.8643 ± 0.0037 | 0.6983 ± 0.0047 | 0.4826 ± 0.0038 | 116.11 ± 1.6 | 24.02 ± 1.0 |
| Y5 + CB + FE | 0.7859 ± 0.0040 | 0.3921 ± 0.0033 | 0.8284 ± 0.0036 | 0.7202 ± 0.0046 | 0.4874 ± 0.0037 | 117.90 ± 1.5 | 24.80 ± 0.9 |
| Y5 + ECA | 0.7869 ± 0.0038 | 0.3935 ± 0.0031 | 0.8291 ± 0.0034 | 0.6634 ± 0.0044 | 0.5139 ± 0.0035 | 112.98 ± 1.7 | 19.23 ± 1.4 |
| Y5 + ECA + SI | 0.8255 ± 0.0035 | 0.4128 ± 0.0029 | 0.9004 ± 0.0032 | 0.7351 ± 0.0041 | 0.5105 ± 0.0033 | 81.59 ± 2.1 | 17.55 ± 1.6 |
| Y5 + ECA + FE | 0.7992 ± 0.0037 | 0.4006 ± 0.0030 | 0.8890 ± 0.0033 | 0.6827 ± 0.0043 | 0.4961 ± 0.0034 | 117.68 ± 1.5 | 25.30 ± 0.8 |
| Y8 | 0.8183 ± 0.0030 | 0.4092 ± 0.0025 | 0.8837 ± 0.0027 | 0.7198 ± 0.0036 | 0.5285 ± 0.0028 | 106.39 ± 1.9 | 14.67 ± 1.7 |
| Y8 + CB | 0.8334 ± 0.0028 | 0.4167 ± 0.0023 | 0.8461 ± 0.0025 | 0.7427 ± 0.0034 | 0.5408 ± 0.0026 | 110.81 ± 1.7 | 17.88 ± 1.5 |
| Y8 + CB + SI | 0.8120 ± 0.0032 | 0.4060 ± 0.0027 | 0.9189 ± 0.0029 | 0.7339 ± 0.0038 | 0.5082 ± 0.0031 | 115.96 ± 1.6 | 25.09 ± 0.9 |
| Y8 + CB + FE | 0.8052 ± 0.0033 | 0.4026 ± 0.0028 | 0.9253 ± 0.0030 | 0.7180 ± 0.0039 | 0.4991 ± 0.0032 | 117.17 ± 1.5 | 22.22 ± 1.1 |
| Y8 + ECA | 0.8224 ± 0.0029 | 0.4112 ± 0.0024 | 0.8617 ± 0.0026 | 0.7461 ± 0.0035 | 0.5346 ± 0.0027 | 113.98 ± 1.6 | 19.23 ± 1.4 |
| Y8 + ECA + SI | 0.8616 ± 0.0026 | 0.4308 ± 0.0022 | 0.9091 ± 0.0024 | 0.7921 ± 0.0032 | 0.5347 ± 0.0025 | 117.49 ± 1.5 | 24.86 ± 0.8 |
| Y8 + ECA + FE | 0.8010 ± 0.0031 | 0.4005 ± 0.0026 | 0.8678 ± 0.0028 | 0.7146 ± 0.0037 | 0.4903 ± 0.0030 | 115.51 ± 1.6 | 25.14 ± 0.8 |
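For readers reproducing the SIoU results, the loss decomposes into an IoU term plus angle, distance, and shape costs. The following PyTorch sketch follows that published formulation; the θ = 4 shape exponent and the (cx, cy, w, h) box format are assumptions, and the training code may use a slightly different variant.

```python
import torch

def siou_loss(pred: torch.Tensor, target: torch.Tensor,
              theta: float = 4.0, eps: float = 1e-7) -> torch.Tensor:
    """Simplified SIoU loss (Gevorgyan, 2022) for (cx, cy, w, h) boxes.

    An illustrative sketch; production code should handle degenerate
    boxes and match the exact variant used during training.
    """
    px, py, pw, ph = pred.unbind(-1)
    tx, ty, tw, th = target.unbind(-1)

    # IoU term from corner coordinates
    iw = (torch.min(px + pw / 2, tx + tw / 2) - torch.max(px - pw / 2, tx - tw / 2)).clamp(min=0)
    ih = (torch.min(py + ph / 2, ty + th / 2) - torch.max(py - ph / 2, ty - th / 2)).clamp(min=0)
    inter = iw * ih
    iou = inter / (pw * ph + tw * th - inter + eps)

    # Smallest enclosing box and center offsets
    cw = torch.max(px + pw / 2, tx + tw / 2) - torch.min(px - pw / 2, tx - tw / 2) + eps
    ch = torch.max(py + ph / 2, ty + th / 2) - torch.min(py - ph / 2, ty - th / 2) + eps
    dx, dy = tx - px, ty - py
    sigma = torch.sqrt(dx ** 2 + dy ** 2) + eps

    # Angle cost Lambda = sin(2 * alpha), with alpha the smaller angle
    # between the center line and the coordinate axes
    sin_alpha = torch.minimum(dx.abs(), dy.abs()) / sigma
    lam = torch.sin(2 * torch.arcsin(sin_alpha.clamp(max=1 - eps)))

    # Distance cost, modulated by the angle cost via gamma = 2 - Lambda
    gamma = 2 - lam
    delta = (1 - torch.exp(-gamma * (dx / cw) ** 2)) + (1 - torch.exp(-gamma * (dy / ch) ** 2))

    # Shape cost with exponent theta (theta = 4 is the commonly used value)
    omega_w = (pw - tw).abs() / (torch.max(pw, tw) + eps)
    omega_h = (ph - th).abs() / (torch.max(ph, th) + eps)
    omega = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (delta + omega) / 2
```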
Table 15. High-resolution performance of detection models on the surface float dataset. High-resolution training (1536 px input) pushed the APS of YOLOv8 + ECA to 0.7897, but at the cost of a drastic drop in embedded frame rate to 6.56 FPS, highlighting the heavy computational overhead behind these precision gains. (Y8 = YOLOv8, Y5 = YOLOv5, CB = CBAM, SI = SIoU, FE = Focal-EIoU, W = workstation, E = embedded.)
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.8999 ± 0.0021 | 0.4069 ± 0.0018 | 0.8896 ± 0.0019 | 0.8387 ± 0.0024 | 0.7519 ± 0.0020 | 50.74 ± 2.1 | 5.45 ± 0.9 |
| Y5 + CB | 0.8996 ± 0.0022 | 0.4076 ± 0.0019 | 0.8927 ± 0.0020 | 0.8401 ± 0.0025 | 0.7542 ± 0.0021 | 71.55 ± 1.8 | 7.77 ± 0.7 |
| Y5 + CB + SI | 0.9186 ± 0.0019 | 0.4722 ± 0.0017 | 0.8881 ± 0.0018 | 0.8565 ± 0.0022 | 0.7553 ± 0.0019 | 73.94 ± 1.7 | 7.59 ± 0.7 |
| Y5 + CB + FE | 0.9043 ± 0.0020 | 0.4863 ± 0.0018 | 0.8659 ± 0.0019 | 0.8337 ± 0.0023 | 0.7456 ± 0.0020 | 73.49 ± 1.7 | 7.70 ± 0.7 |
| Y5 + ECA | 0.9138 ± 0.0018 | 0.4241 ± 0.0016 | 0.9160 ± 0.0017 | 0.8613 ± 0.0021 | 0.7867 ± 0.0018 | 73.75 ± 1.7 | 7.83 ± 0.6 |
| Y5 + ECA + SI | 0.8979 ± 0.0023 | 0.4615 ± 0.0020 | 0.8975 ± 0.0021 | 0.8263 ± 0.0026 | 0.7456 ± 0.0022 | 74.74 ± 1.6 | 7.73 ± 0.7 |
| Y5 + ECA + FE | 0.9190 ± 0.0017 | 0.4870 ± 0.0015 | 0.8749 ± 0.0016 | 0.8632 ± 0.0020 | 0.7642 ± 0.0017 | 72.11 ± 1.8 | 7.80 ± 0.7 |
| Y8 | 0.9159 ± 0.0016 | 0.4229 ± 0.0014 | 0.8496 ± 0.0015 | 0.9145 ± 0.0018 | 0.7716 ± 0.0016 | 58.51 ± 1.9 | 4.89 ± 1.0 |
| Y8 + CB | 0.9297 ± 0.0014 | 0.4269 ± 0.0012 | 0.8652 ± 0.0013 | 0.9114 ± 0.0016 | 0.7803 ± 0.0014 | 64.43 ± 1.7 | 6.66 ± 0.8 |
| Y8 + CB + SI | 0.9251 ± 0.0015 | 0.4839 ± 0.0013 | 0.8843 ± 0.0014 | 0.8844 ± 0.0017 | 0.7680 ± 0.0015 | 63.31 ± 1.7 | 6.52 ± 0.8 |
| Y8 + CB + FE | 0.9234 ± 0.0016 | 0.4836 ± 0.0014 | 0.9070 ± 0.0015 | 0.8690 ± 0.0018 | 0.7665 ± 0.0016 | 54.97 ± 1.9 | 4.83 ± 1.0 |
| Y8 + ECA | 0.9357 ± 0.0013 | 0.4270 ± 0.0011 | 0.8766 ± 0.0012 | 0.9037 ± 0.0015 | 0.7897 ± 0.0013 | 66.50 ± 1.6 | 6.56 ± 0.8 |
| Y8 + ECA + SI | 0.9276 ± 0.0015 | 0.4920 ± 0.0013 | 0.9055 ± 0.0014 | 0.8673 ± 0.0017 | 0.7620 ± 0.0015 | 55.69 ± 1.9 | 4.96 ± 1.0 |
| Y8 + ECA + FE | 0.9191 ± 0.0016 | 0.4883 ± 0.0014 | 0.8979 ± 0.0015 | 0.8642 ± 0.0018 | 0.7606 ± 0.0016 | 60.02 ± 1.8 | 4.94 ± 1.0 |
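The high-resolution runs only change the input size at train and inference time. With the Ultralytics API this is a one-argument change, as sketched below; the dataset YAML name and model scale are hypothetical placeholders, not the exact files used in these experiments.

```python
from ultralytics import YOLO

# A minimal sketch of the high-resolution setting, assuming the Ultralytics
# API; "surface_floats.yaml" and the "s" model scale are placeholders.
model = YOLO("yolov8s.pt")
model.train(data="surface_floats.yaml", imgsz=1536, epochs=100, batch=8)

# Inference must run at the same resolution to realize the APS gain; the
# cost grows roughly quadratically with side length, since a 1536-px input
# carries (1536 / 640)^2 ~= 5.8x the pixels of a standard 640-px input,
# consistent with the embedded FPS collapse in Table 15.
model.predict("scene.jpg", imgsz=1536)
```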
Table 16. Image-slicing (tiled inference) performance of detection models on the surface float dataset. Tiled inference enabled the YOLOv5 baseline to reach an APS of 0.7836 while maintaining a practical embedded speed of 18.02 FPS, a markedly better accuracy-efficiency balance than full-frame high-resolution input. (Y8 = YOLOv8, Y5 = YOLOv5, CB = CBAM, SI = SIoU, FE = Focal-EIoU, W = workstation, E = embedded.)
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.8973 ± 0.0020 | 0.4249 ± 0.0017 | 0.8892 ± 0.0018 | 0.8543 ± 0.0022 | 0.7836 ± 0.0019 | 112.81 ± 1.4 | 18.02 ± 1.0 |
| Y5 + CB | 0.8975 ± 0.0021 | 0.4184 ± 0.0018 | 0.9046 ± 0.0019 | 0.8503 ± 0.0023 | 0.7794 ± 0.0020 | 114.92 ± 1.3 | 17.13 ± 1.1 |
| Y5 + CB + SI | 0.9084 ± 0.0019 | 0.4757 ± 0.0016 | 0.8949 ± 0.0017 | 0.8553 ± 0.0021 | 0.7640 ± 0.0018 | 97.22 ± 1.6 | 17.00 ± 1.1 |
| Y5 + CB + FE | 0.9002 ± 0.0020 | 0.4741 ± 0.0017 | 0.8980 ± 0.0018 | 0.8255 ± 0.0022 | 0.7600 ± 0.0019 | 91.48 ± 1.7 | 13.57 ± 1.4 |
| Y5 + ECA | 0.8855 ± 0.0023 | 0.4428 ± 0.0019 | 0.8825 ± 0.0021 | 0.8512 ± 0.0025 | 0.7662 ± 0.0022 | 65.86 ± 2.0 | 12.59 ± 1.5 |
| Y5 + ECA + SI | 0.9103 ± 0.0018 | 0.4797 ± 0.0015 | 0.9180 ± 0.0016 | 0.8337 ± 0.0020 | 0.7754 ± 0.0017 | 91.25 ± 1.7 | 11.95 ± 1.6 |
| Y5 + ECA + FE | 0.8952 ± 0.0021 | 0.4826 ± 0.0018 | 0.8693 ± 0.0019 | 0.8468 ± 0.0023 | 0.7453 ± 0.0020 | 114.35 ± 1.4 | 17.04 ± 1.1 |
| Y8 | 0.9055 ± 0.0017 | 0.4314 ± 0.0014 | 0.8562 ± 0.0015 | 0.8923 ± 0.0019 | 0.7845 ± 0.0016 | 118.53 ± 1.2 | 17.73 ± 0.9 |
| Y8 + CB | 0.9056 ± 0.0018 | 0.4403 ± 0.0015 | 0.8914 ± 0.0016 | 0.8606 ± 0.0020 | 0.7844 ± 0.0017 | 116.30 ± 1.3 | 16.54 ± 1.0 |
| Y8 + CB + SI | 0.9118 ± 0.0016 | 0.4915 ± 0.0013 | 0.8967 ± 0.0014 | 0.8647 ± 0.0018 | 0.7723 ± 0.0015 | 97.46 ± 1.6 | 12.32 ± 1.3 |
| Y8 + CB + FE | 0.9001 ± 0.0019 | 0.4910 ± 0.0016 | 0.9039 ± 0.0017 | 0.8475 ± 0.0021 | 0.7500 ± 0.0018 | 115.59 ± 1.3 | 16.52 ± 1.0 |
| Y8 + ECA | 0.9070 ± 0.0017 | 0.4359 ± 0.0014 | 0.9158 ± 0.0015 | 0.8538 ± 0.0019 | 0.7771 ± 0.0016 | 116.22 ± 1.3 | 16.95 ± 1.0 |
| Y8 + ECA + SI | 0.9146 ± 0.0015 | 0.4855 ± 0.0013 | 0.8968 ± 0.0014 | 0.8608 ± 0.0017 | 0.7721 ± 0.0015 | 115.92 ± 1.3 | 17.20 ± 0.9 |
| Y8 + ECA + FE | 0.9025 ± 0.0018 | 0.4860 ± 0.0015 | 0.9112 ± 0.0016 | 0.8433 ± 0.0020 | 0.7525 ± 0.0017 | 98.70 ± 1.6 | 13.61 ± 1.2 |
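Tiled (sliced) inference recovers small-object detail without running a high-resolution pass over the whole frame: the image is cut into overlapping tiles, each tile is processed at the native training resolution, and detections are shifted back to full-image coordinates and merged with NMS. The sketch below is a minimal illustration of this scheme, assuming a generic detect(tile) callable that returns (x1, y1, x2, y2, score, cls) rows; the 640-px tile size and 20% overlap are illustrative, not the exact slicing schedule used here.

```python
import torch
from torchvision.ops import nms

def tiled_detect(image, detect, tile=640, overlap=0.2, iou_thr=0.5):
    """Run `detect` on overlapping tiles, shift boxes back, merge with NMS.

    `image`: (H, W, C) array. `detect(tile_img)` -> (N, 6) tensor of
    (x1, y1, x2, y2, score, cls). A sketch, not the exact schedule
    used in the experiments.
    """
    h, w = image.shape[:2]
    stride = max(1, int(tile * (1 - overlap)))
    ys = list(range(0, max(h - tile, 0) + 1, stride))
    xs = list(range(0, max(w - tile, 0) + 1, stride))
    if h > tile and ys[-1] != h - tile:
        ys.append(h - tile)  # cover the bottom edge
    if w > tile and xs[-1] != w - tile:
        xs.append(w - tile)  # cover the right edge

    dets = []
    for y0 in ys:
        for x0 in xs:
            d = detect(image[y0:y0 + tile, x0:x0 + tile])
            if len(d):
                d = d.clone()
                d[:, [0, 2]] += x0  # shift x back to full-image coordinates
                d[:, [1, 3]] += y0  # shift y back
                dets.append(d)
    if not dets:
        return torch.empty(0, 6)
    dets = torch.cat(dets)
    keep = nms(dets[:, :4], dets[:, 4], iou_thr)  # class-agnostic merge
    return dets[keep]
```

Because each tile is processed at the network's native resolution, the per-tile cost stays constant and the total cost scales with the number of tiles rather than quadratically with image side length, which explains the favorable embedded FPS in Table 16.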
Table 17. Low-light simulation performance of detection models on the difficult subset. Under simulated low light, the YOLOv8 + ECA + SIoU combination achieved an mAP@0.5 of 0.8184, roughly 0.07 above the YOLOv8 baseline (0.7499), underscoring its robustness to degraded lighting conditions. (Y8 = YOLOv8, Y5 = YOLOv5, CB = CBAM, SI = SIoU, FE = Focal-EIoU, W = workstation, E = embedded.)
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.7059 ± 0.0048 | 0.4348 ± 0.0040 | 0.7857 ± 0.0043 | 0.6224 ± 0.0052 | 0.4383 ± 0.0044 | 118.36 ± 1.6 | 19.67 ± 1.3 |
| Y5 + CB | 0.7745 ± 0.0042 | 0.4727 ± 0.0035 | 0.8648 ± 0.0038 | 0.6953 ± 0.0047 | 0.4981 ± 0.0039 | 103.53 ± 1.9 | 14.01 ± 1.6 |
| Y5 + CB + SI | 0.7393 ± 0.0045 | 0.4629 ± 0.0038 | 0.8293 ± 0.0041 | 0.6298 ± 0.0050 | 0.4658 ± 0.0042 | 95.48 ± 2.1 | 12.63 ± 1.8 |
| Y5 + CB + FE | 0.7577 ± 0.0043 | 0.4973 ± 0.0036 | 0.8395 ± 0.0039 | 0.6544 ± 0.0048 | 0.4708 ± 0.0040 | 97.14 ± 2.0 | 12.94 ± 1.7 |
| Y5 + ECA | 0.7930 ± 0.0040 | 0.4855 ± 0.0033 | 0.8494 ± 0.0036 | 0.7351 ± 0.0045 | 0.5241 ± 0.0037 | 97.16 ± 2.0 | 12.83 ± 1.7 |
| Y5 + ECA + SI | 0.7603 ± 0.0044 | 0.5113 ± 0.0037 | 0.8385 ± 0.0040 | 0.6773 ± 0.0049 | 0.4779 ± 0.0041 | 96.59 ± 2.1 | 11.55 ± 1.9 |
| Y5 + ECA + FE | 0.7620 ± 0.0043 | 0.5058 ± 0.0036 | 0.8288 ± 0.0039 | 0.6522 ± 0.0048 | 0.4730 ± 0.0040 | 97.17 ± 2.0 | 14.07 ± 1.6 |
| Y8 | 0.7499 ± 0.0041 | 0.5250 ± 0.0034 | 0.8790 ± 0.0037 | 0.6505 ± 0.0046 | 0.4748 ± 0.0038 | 105.42 ± 1.8 | 13.88 ± 1.7 |
| Y8 + CB | 0.7990 ± 0.0037 | 0.5275 ± 0.0031 | 0.9074 ± 0.0034 | 0.7157 ± 0.0042 | 0.5106 ± 0.0035 | 117.61 ± 1.5 | 20.01 ± 1.2 |
| Y8 + CB + SI | 0.7793 ± 0.0040 | 0.5564 ± 0.0033 | 0.8699 ± 0.0036 | 0.6989 ± 0.0044 | 0.4910 ± 0.0037 | 102.68 ± 1.9 | 14.16 ± 1.6 |
| Y8 + CB + FE | 0.7670 ± 0.0042 | 0.5378 ± 0.0035 | 0.8653 ± 0.0038 | 0.6927 ± 0.0045 | 0.4857 ± 0.0039 | 101.45 ± 1.9 | 10.97 ± 2.0 |
| Y8 + ECA | 0.8046 ± 0.0036 | 0.5327 ± 0.0030 | 0.8367 ± 0.0033 | 0.7531 ± 0.0041 | 0.5092 ± 0.0034 | 100.03 ± 2.0 | 12.01 ± 1.8 |
| Y8 + ECA + SI | 0.8184 ± 0.0034 | 0.5595 ± 0.0029 | 0.8834 ± 0.0031 | 0.7139 ± 0.0039 | 0.5074 ± 0.0032 | 101.49 ± 1.9 | 13.86 ± 1.7 |
| Y8 + ECA + FE | 0.7817 ± 0.0039 | 0.5541 ± 0.0032 | 0.8812 ± 0.0035 | 0.6980 ± 0.0043 | 0.4991 ± 0.0036 | 101.80 ± 1.9 | 13.94 ± 1.7 |
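The low-light simulation used for this evaluation darkens images with a nonlinear tone curve and injects sensor-like noise. Below is a minimal NumPy sketch of one such physically inspired transform, assuming a gamma-curve formulation with illustrative parameter ranges; the exact augmentation parameters of the experiments are not reproduced here.

```python
import numpy as np

def simulate_low_light(img, gamma_range=(2.0, 3.5), noise_sigma=8.0, rng=None):
    """Darken an image with a random gamma curve and add sensor-like noise.

    `img` is an (H, W, 3) uint8 array. The gamma range and noise level are
    illustrative assumptions, not the exact augmentation parameters.
    """
    rng = rng or np.random.default_rng()
    gamma = rng.uniform(*gamma_range)
    x = (img.astype(np.float32) / 255.0) ** gamma        # gamma > 1 darkens
    x += rng.normal(0.0, noise_sigma / 255.0, x.shape)   # shot/read-noise proxy
    return (np.clip(x, 0.0, 1.0) * 255).astype(np.uint8)
```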
Table 18. Strong-reflection simulation performance of detection models on the difficult subset. Under simulated strong reflections, YOLOv8 + CBAM led with an mAP@0.5 of 0.8443 and an APS of 0.5456, indicating that its spatial attention mechanism effectively suppresses specular reflection interference. (Y8 = YOLOv8, Y5 = YOLOv5, CB = CBAM, SI = SIoU, FE = Focal-EIoU, W = workstation, E = embedded.)
| Model | mAP@0.5 | mAP@[0.5:0.95] | Precision | Recall | APS | FPS (W) | FPS (E) |
|---|---|---|---|---|---|---|---|
| Y5 | 0.7466 ± 0.0045 | 0.4488 ± 0.0038 | 0.8336 ± 0.0041 | 0.6747 ± 0.0049 | 0.4841 ± 0.0042 | 116.28 ± 1.7 | 18.45 ± 1.4 |
| Y5 + CB | 0.7765 ± 0.0041 | 0.4688 ± 0.0034 | 0.8086 ± 0.0037 | 0.6864 ± 0.0046 | 0.5048 ± 0.0038 | 105.67 ± 1.9 | 15.32 ± 1.6 |
| Y5 + CB + SI | 0.7185 ± 0.0047 | 0.4583 ± 0.0039 | 0.7757 ± 0.0043 | 0.6749 ± 0.0051 | 0.4480 ± 0.0044 | 96.84 ± 2.1 | 13.28 ± 1.8 |
| Y5 + CB + FE | 0.7157 ± 0.0046 | 0.4642 ± 0.0039 | 0.7769 ± 0.0042 | 0.6446 ± 0.0050 | 0.4353 ± 0.0043 | 98.51 ± 2.0 | 13.61 ± 1.7 |
| Y5 + ECA | 0.7793 ± 0.0040 | 0.4774 ± 0.0033 | 0.8176 ± 0.0036 | 0.6613 ± 0.0045 | 0.5214 ± 0.0037 | 98.73 ± 2.0 | 13.42 ± 1.7 |
| Y5 + ECA + SI | 0.7315 ± 0.0045 | 0.4576 ± 0.0038 | 0.8178 ± 0.0041 | 0.6512 ± 0.0049 | 0.4556 ± 0.0042 | 97.95 ± 2.1 | 12.18 ± 1.9 |
| Y5 + ECA + FE | 0.7171 ± 0.0046 | 0.4494 ± 0.0039 | 0.7845 ± 0.0042 | 0.6262 ± 0.0050 | 0.4513 ± 0.0043 | 98.82 ± 2.0 | 15.24 ± 1.6 |
| Y8 | 0.8175 ± 0.0035 | 0.5024 ± 0.0029 | 0.8436 ± 0.0032 | 0.7227 ± 0.0040 | 0.5221 ± 0.0033 | 104.18 ± 1.8 | 14.53 ± 1.7 |
| Y8 + CB | 0.8443 ± 0.0031 | 0.5200 ± 0.0026 | 0.8787 ± 0.0028 | 0.7453 ± 0.0037 | 0.5456 ± 0.0029 | 116.24 ± 1.5 | 19.67 ± 1.2 |
| Y8 + CB + SI | 0.7777 ± 0.0039 | 0.5406 ± 0.0033 | 0.8453 ± 0.0035 | 0.7132 ± 0.0043 | 0.4981 ± 0.0036 | 101.32 ± 1.9 | 14.89 ± 1.6 |
| Y8 + CB + FE | 0.7536 ± 0.0041 | 0.5195 ± 0.0034 | 0.8427 ± 0.0037 | 0.6666 ± 0.0045 | 0.4623 ± 0.0038 | 100.18 ± 1.9 | 11.64 ± 1.9 |
| Y8 + ECA | 0.8098 ± 0.0034 | 0.5058 ± 0.0028 | 0.8203 ± 0.0031 | 0.7508 ± 0.0039 | 0.5266 ± 0.0032 | 99.67 ± 2.0 | 12.78 ± 1.8 |
| Y8 + ECA + SI | 0.7913 ± 0.0037 | 0.5566 ± 0.0031 | 0.8569 ± 0.0034 | 0.7048 ± 0.0042 | 0.5036 ± 0.0035 | 100.15 ± 1.9 | 14.53 ± 1.7 |
| Y8 + ECA + FE | 0.7887 ± 0.0038 | 0.5600 ± 0.0032 | 0.8661 ± 0.0035 | 0.6978 ± 0.0043 | 0.4856 ± 0.0036 | 100.47 ± 1.9 | 14.61 ± 1.7 |
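The strong-reflection simulation overlays bright, soft-edged highlights that mimic sun glitter on the water surface. The NumPy sketch below illustrates one way to realize such an augmentation; the spot counts, radii, horizontal elongation, and additive blending are illustrative assumptions rather than the exact parameters used in these experiments.

```python
import numpy as np

def simulate_glare(img, n_spots=(2, 6), max_radius=80, rng=None):
    """Overlay soft white specular highlights to mimic sun glitter on water.

    `img` is an (H, W, 3) uint8 array. Spot count, radius, and additive
    blending are illustrative assumptions about the augmentation.
    """
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    out = img.astype(np.float32)
    yy, xx = np.mgrid[0:h, 0:w]
    for _ in range(rng.integers(*n_spots)):
        cx, cy = rng.integers(0, w), rng.integers(0, h)
        r = rng.integers(10, max_radius)
        # Gaussian-shaped highlight, elongated horizontally like glitter streaks
        blob = np.exp(-(((xx - cx) / (2.5 * r)) ** 2 + ((yy - cy) / r) ** 2))
        out += 255.0 * rng.uniform(0.4, 0.9) * blob[..., None]
    return np.clip(out, 0, 255).astype(np.uint8)
```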
Table 19. Cross-domain generalization performance (mAP@0.5). Leave-one-domain-out cross-validation shows that the model remains stable on unseen scenes (e.g., inland rivers, mAP@0.5 = 0.724) and delivers a 15.8% relative improvement over the YOLOv8 baseline in the held-out strong-reflection scenario (0.682 vs. 0.589), supporting strong cross-domain generalization.
| Training Domains | Test Domain | YOLOv8 + ECA + SIoU | YOLOv8 |
|---|---|---|---|
| Harbor Areas + Nearshore Waters + Strong Reflection | Inland Rivers | 0.724 | 0.658 |
| Inland Rivers + Nearshore Waters + Strong Reflection | Harbor Areas | 0.706 | 0.642 |
| Inland Rivers + Harbor Areas + Strong Reflection | Nearshore Waters | 0.718 | 0.651 |
| Inland Rivers + Harbor Areas + Nearshore Waters | Strong Reflection | 0.682 | 0.589 |
| All Domains | Overall | 0.745 | 0.712 |
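The protocol behind Table 19 trains on the union of all but one domain and evaluates on the held-out domain. A minimal sketch of that loop, assuming a hypothetical train_and_eval(train, test) helper that trains a model and returns its mAP@0.5 on the held-out split:

```python
# Leave-one-domain-out sketch; `train_and_eval` is a hypothetical stand-in
# for the actual training-and-evaluation pipeline.
DOMAINS = ["Inland Rivers", "Harbor Areas", "Nearshore Waters", "Strong Reflection"]

def leave_one_domain_out(train_and_eval):
    scores = {}
    for held_out in DOMAINS:
        train_domains = [d for d in DOMAINS if d != held_out]
        scores[held_out] = train_and_eval(train=train_domains, test=held_out)
    return scores  # one mAP@0.5 per held-out domain, as in Table 19
```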
Table 20. Model deployment performance analysis (Jetson Xavier NX, TensorRT FP16, overall dataset). Deployment profiling shows that the YOLOv8 + CBAM variants exhibit the highest P95 latencies (up to 73.2 ms) and mild thermal throttling, while power consumption is similar across all models (9.8–10.9 W), indicating that the attention modules primarily affect latency rather than energy draw.
| Model | APS | P50 Latency (ms) | P95 Latency (ms) | Power (W) | Thermal Throttling |
|---|---|---|---|---|---|
| YOLOv5 | 0.4164 | 41.5 | 63.2 | 9.8 | No |
| YOLOv5 + CBAM | 0.4319 | 42.7 | 67.3 | 10.3 | No |
| YOLOv5 + ECA | 0.4338 | 41.9 | 64.1 | 10.1 | No |
| YOLOv5 + CBAM + SIoU | 0.4253 | 42.9 | 68.5 | 10.5 | No |
| YOLOv5 + CBAM + Focal-EIoU | 0.4245 | 42.5 | 67.8 | 10.4 | No |
| YOLOv5 + ECA + SIoU | 0.4288 | 42.2 | 65.3 | 10.3 | No |
| YOLOv5 + ECA + Focal-EIoU | 0.4207 | 41.8 | 64.5 | 10.2 | No |
| YOLOv8 | 0.4523 | 42.9 | 68.5 | 10.5 | No |
| YOLOv8 + CBAM | 0.4643 | 44.2 | 72.1 | 10.8 | Mild |
| YOLOv8 + ECA | 0.4783 | 41.7 | 65.8 | 10.2 | No |
| YOLOv8 + CBAM + SIoU | 0.4621 | 44.5 | 73.2 | 10.9 | Mild |
| YOLOv8 + CBAM + Focal-EIoU | 0.4618 | 44.0 | 71.5 | 10.7 | Mild |
| YOLOv8 + ECA + SIoU | 0.4669 | 41.5 | 64.2 | 10.1 | No |
| YOLOv8 + ECA + Focal-EIoU | 0.4583 | 41.3 | 63.8 | 10.0 | No |
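The P50/P95 latencies above are percentile statistics over repeated timed single-image inferences; reporting P95 alongside the median exposes tail latency, which matters more than mean throughput for real-time navigation. A minimal measurement sketch, assuming a synchronous infer(sample) wrapper around the compiled engine (e.g., a TensorRT FP16 engine) and illustrative warmup and run counts:

```python
import time
import numpy as np

def profile_latency(infer, sample, warmup=50, runs=500):
    """Measure P50/P95 single-image latency for a compiled engine.

    `infer(sample)` is assumed to run one synchronous forward pass;
    the warmup and run counts are illustrative choices.
    """
    for _ in range(warmup):          # let clocks and caches settle
        infer(sample)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer(sample)
        times.append((time.perf_counter() - t0) * 1e3)  # milliseconds
    return np.percentile(times, 50), np.percentile(times, 95)
```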