1. Introduction
Small-object detection in water-surface scenes remains an under-explored yet practically critical problem, especially for battery-powered electric unmanned surface vehicles (e-USVs) and electrified surface-cleaning platforms used for autonomous inspection and debris removal [
1]. Compared with terrestrial settings, water-surface perception on such platforms must operate under stricter onboard constraints: embedded processors are limited by compute budgets and power budgets, while perception failures can directly translate into wasted energy, reduced mission endurance, and incomplete cleanup [
2]. Therefore, developing a robust and lightweight framework for detecting tiny floating debris is not only valuable for environmental intelligence but also essential for reliable onboard perception and energy-aware autonomous operation in electrified missions.
Real-time detection of small floating objects in reflective water environments remains challenging. Many existing vision models exhibit limited accuracy, unstable real-time performance, and heavy computational cost when deployed on embedded platforms [
2,
3,
4]. The main difficulties arise from three aspects. First, dynamic reflections, surface ripples, and shadows frequently obscure visual cues, leading to incomplete texture extraction and contour ambiguity [
5]. Second, cameras mounted on unmanned surface vehicles are often affected by motion blur, viewpoint variation, and platform vibration, which reduce feature consistency during real-time inference. Third, floating debris items such as plastic bags, leaves, and fruit peels usually exhibit weak texture, low contrast, irregular shapes, and large appearance variation, making them easily confused with background clutter and difficult to localize precisely. These characteristics make water-surface small-object detection substantially more difficult than conventional object detection in relatively structured land scenes.
In recent years, deep learning-based object detection has developed rapidly. Two-stage detectors such as Faster R-CNN [
6], Mask R-CNN [
7], and Cascade R-CNN [
8] generally provide strong localization capability in structured scenes with sufficient resolution. Single-stage detectors such as SSD [
9] and the YOLO family [
10,
11,
12,
13] emphasize real-time inference and have been widely adopted in edge applications. In parallel, anchor-free detectors, such as FCOS, CenterNet, and RepPoints, as well as transformer-based architectures such as Deformable DETR, further expand the design space for accurate and flexible detection. Industrial-grade variants, including PP-YOLOE, and edge-oriented frameworks such as EdgeYOLO, explore lightweight design and deployment-aware optimization [
14,
15]. However, relatively few studies explicitly formulate waterborne tiny-object detection as an onboard perception task under deployment-driven constraints, where low object saliency, dynamic backgrounds, and reflective-scene label unreliability jointly affect both training stability and runtime robustness [
2].
Although the above detectors—including the latest YOLO variants—continue to evolve toward lighter architectures and stronger feature modeling ability, their performance in detecting small objects under complex environments remains suboptimal. Recent evaluations [
16,
17] indicate that models such as YOLOv5 and YOLOv7 still suffer from low recall and weak localization accuracy on small-object-focused datasets such as VisDrone, TinyPerson, and UAVDT. Similar limitations are observed when applying CenterNet, FCOS, RepPoints, and transformer-based detectors to tiny floating objects, where limited shallow feature retention, insufficient receptive-field adaptation, and inadequate multi-scale semantic fusion often lead to missed detections or inaccurate bounding boxes [
18,
19,
20]. Furthermore, edge hardware platforms such as Jetson Nano and Raspberry Pi impose strict constraints on model size and computational cost [
21,
22]. For battery-powered electric platforms, this issue is further coupled with energy consumption and mission endurance. As a result, balancing detection accuracy, robustness, and real-time efficiency remains a key challenge for long-duration autonomous operation [
2].
To address the above issues, this paper proposes a task-driven improvement of the YOLOv11 framework for reflective water-surface small-object detection under embedded deployment constraints on battery-powered electric platforms. Rather than pursuing accuracy gains alone, the proposed design aims to improve robustness in reflective scenes while preserving real-time onboard inference practicality. At the methodological level, the framework enhances the baseline detector from four complementary aspects, namely discriminative feature representation, fine-grained edge perception, train-time noisy-label mitigation, and small-object regression optimization. In this way, the detector is expected to better adapt to reflective interference, weak target boundaries, and resource-constrained onboard deployment.
The main novelty of this study lies in a lightweight yet task-oriented detection design for reflective water-surface environments. Compared with recent lightweight YOLO variants that mainly focus on generic efficiency improvement, the proposed method explicitly considers the combined effects of reflective background interference, tiny-object ambiguity, label unreliability during training, and embedded deployment requirements. Specifically, the proposed framework integrates a Global Channel–Spatial Attention mechanism to enhance discriminative representation, a median-enhanced channel–spatial module to strengthen weak-boundary perception, a train-time noisy-label screening strategy to stabilize optimization, and a weighted IoU-based regression design to improve localization quality for small floating targets. Therefore, the difference of this work lies not only in improving detection accuracy but also in emphasizing robustness and deployment feasibility for energy-constrained electric platforms.
Comprehensive experiments are conducted on the public CAS floating-object dataset [
23] and a self-collected multi-category water-surface debris dataset, covering 11 classes across both static and dynamic scenes. Results show that YOLOv11-IMP improves mAP@0.75 by 3.3% and small-object accuracy by 6.5% over the YOLOv11 baseline. It achieves superior performance in key categories such as can, plastic_bag, and floater, while maintaining real-time embedded inference on Jetson Nano. To further support deployment on electric platforms, we additionally report an energy–latency profiling protocol on embedded hardware (Section Energy–Latency Profile on Embedded Hardware), following common edge-profiling practices where latency/throughput are measured at batch size 1 and energy-per-frame can be computed from onboard power monitoring when available [
2]. These results validate the model’s robustness and deployment feasibility, highlighting its practical potential for energy-constrained onboard perception in electric autonomous surveillance and cleanup missions.
The remainder of this paper is organized as follows.
Section 2 reviews related studies on water-surface debris detection, lightweight object detection, and task-relevant optimization strategies.
Section 3 presents the proposed YOLOv11-IMP framework and its main design components.
Section 4 describes the datasets, implementation details, and experimental settings.
Section 5 reports the comparative experiments, ablation studies, and deployment-oriented analyses. Finally,
Section 6 concludes the paper.
2. Related Work
Research related to this paper mainly involves three aspects: water-surface debris detection, lightweight object detection for embedded deployment, and small-object-oriented enhancement and robust optimization under complex environments. Since the task considered in this work is not a generic object detection problem but a deployment-oriented perception problem for tiny floating objects in reflective water scenes, these directions need to be reviewed in an integrated manner.
2.1. Water-Surface Debris Detection
Water-surface debris detection has gradually attracted attention in recent years because of its practical importance in environmental monitoring, autonomous cleaning, and intelligent inspection. Compared with conventional object detection tasks in urban or industrial scenes, water-surface perception usually faces more severe optical interference, including dynamic reflections, ripples, shadows, illumination fluctuations, and partial occlusion. These factors often reduce the visibility and structural integrity of floating targets, especially when the objects are small, deformable, or weak in texture. As a result, directly applying generic detectors to water-surface scenes often leads to unstable recall and degraded localization quality.
Existing studies on water-surface object detection mainly focus on floating-object identification, river or lake debris monitoring, and perception support for unmanned surface vehicles [
1,
23]. Some works emphasize the construction of annotated datasets and benchmark protocols for floating targets, while others explore scene-specific network adaptation for reflective backgrounds and weak target saliency. These efforts have provided useful foundations for water-surface perception. However, most existing studies still place greater emphasis on detection feasibility itself, while relatively limited attention has been paid to the joint requirements of tiny-object sensitivity, robustness under reflective interference, and embedded onboard deployment. In particular, for battery-powered electric platforms, the perception model must not only detect small floating objects reliably but also maintain practical throughput and energy efficiency during long-duration autonomous operation [
2]. This indicates that water-surface debris detection still requires more task-oriented designs that explicitly consider both scene complexity and deployment constraints.
2.2. Lightweight Object Detection for Embedded Deployment
To satisfy real-time requirements on resource-constrained platforms, lightweight object detection has become an important research direction. Two-stage detectors, such as Faster R-CNN, Mask R-CNN, and Cascade R-CNN, usually provide strong localization accuracy, but their computational burden often limits direct deployment on low-power edge hardware [
6,
7,
8]. In contrast, single-stage detectors represented by SSD and the YOLO family are more suitable for real-time applications because of their simpler inference pipelines and better efficiency–accuracy trade-off [
9,
10,
11,
12,
13].
Recent lightweight detection studies have further promoted deployment on embedded devices through compact backbone design, feature re-parameterization, efficient attention modules, and deployment-aware optimization. Representative examples include PP-YOLOE and EdgeYOLO, which aim to reduce model complexity while preserving competitive accuracy [
14,
15]. In addition, anchor-free detectors, such as FCOS, CenterNet, and RepPoints, together with transformer-based detectors such as Deformable DETR, have broadened the design space of modern detection frameworks from different perspectives of label assignment, center-based localization, point representation, and global context modeling [
18,
19,
20]. Nevertheless, when the deployment scenario shifts from general edge vision to reflective water-surface small-object perception, the challenge is no longer limited to model compactness alone. Instead, the detector must simultaneously consider computational efficiency, weak-boundary perception, and robustness to scene-induced label unreliability. Therefore, generic lightweight design alone is often insufficient for this task, and more task-oriented improvement strategies are still needed.
2.3. Small-Object Enhancement and Robust Optimization in Complex Scenes
Small-object detection remains one of the most difficult subproblems in object detection. Recent evaluations show that even strong mainstream detectors still suffer from insufficient recall and inaccurate localization on datasets dominated by tiny targets, such as VisDrone, TinyPerson, and UAVDT [
16,
17]. The main reasons include inadequate shallow-feature preservation, limited effective receptive-field adaptation, insufficient cross-scale semantic interaction, and the weak structural distinguishability of tiny targets. These issues become more severe in reflective water environments, where target boundaries are often blurred by optical disturbance and background clutter.
To address such problems, recent studies have introduced various enhancement strategies, including multi-scale feature fusion, channel–spatial attention, boundary-aware refinement, and improved localization losses. Attention mechanisms are widely used to strengthen discriminative representation and suppress irrelevant background responses, while edge-aware or detail-enhanced designs are beneficial for weak-boundary object perception. In addition, improved box-regression losses have been proposed to alleviate the mismatch between predicted and ground-truth boxes, especially for small targets whose localization errors are more sensitive to slight coordinate deviations [
24]. Another practical issue is noisy or unreliable supervision in complex scenes. For reflective water-surface data, ambiguity caused by ripples, specular highlights, and tiny object size may increase annotation difficulty and label inconsistency, which in turn affects training stability. Therefore, robust optimization strategies that mitigate unreliable supervision during training are also relevant to deployment-oriented water-surface perception. However, these strategies are often investigated separately, and fewer studies integrate feature enhancement, robust training, and lightweight deployment into a unified framework for reflective water-surface small-object detection.
2.4. Research Gap and Positioning of This Work
The above review suggests that current studies still leave three main gaps. First, water-surface debris detection remains less studied than conventional land-scene detection, especially when the targets are extremely small and embedded deployment is required. Second, existing lightweight detectors mainly improve generic inference efficiency, but often do not explicitly address the combined influence of reflective background interference, weak target boundaries, and train-time label unreliability. Third, although many enhancement strategies have been explored separately, fewer studies integrate lightweight deployment, small-object enhancement, and robust optimization into a unified detection framework for water-surface onboard perception.
Therefore, the present work is positioned at the intersection of water-surface debris detection, lightweight embedded vision, and robust small-object optimization. The proposed YOLOv11-IMP framework is designed to improve discriminative representation, weak-boundary perception, and localization quality while maintaining practical embedded deployment capability. In this sense, the contribution of this paper lies not only in improving detection accuracy but also in providing a task-oriented design for robust and energy-aware perception on battery-powered electric platforms.
3. Materials and Methods
Given the challenges of detecting small and weakly structured objects in reflective water-surface environments—including low visual saliency, dynamic backgrounds, and platform-induced motion noise—this work develops a task-specific enhancement of the YOLOv11 framework for onboard perception on electric platforms. As shown in
Figure 1, the proposed architecture preserves the lightweight and high-throughput characteristics of YOLOv11, while introducing deployment-oriented improvements for reflective water-surface small-object detection. The overall design aims to strengthen multi-scale contextual representation, improve the perception of weak-boundary targets under background interference, and enhance bounding-box regression quality under the strict compute and power constraints of embedded devices [
2,
25].
As illustrated in
Figure 1, YOLOv11-IMP follows a backbone–neck–head detection pipeline and introduces targeted enhancements at different stages. At the backbone end, a context-reinforced stage composed of SPPF and C2PSA is employed to improve high-level semantic aggregation and suppress redundant background responses. During multi-scale feature fusion in the neck, the Global Channel–Spatial Attention (GCSA) module and the median-enhanced channel–spatial (MECS) module are introduced to strengthen cross-scale interaction and weak-boundary feature refinement [
26]. Finally, the regression branch adopts the WIoU-v3 loss to improve bounding-box alignment and localization stability for small floating objects. Based on this overall architecture, the following subsections describe the backbone design, feature-fusion enhancement strategy, and regression optimization in detail.
3.1. Multi-Scale Feature Extraction Backbone
The multi-scale feature extraction backbone in YOLOv11-IMP is designed to capture rich semantic representations across different spatial resolutions while maintaining an efficient computational footprint for onboard deployment. It combines residual-style feature reuse units with attention-enhanced feature modeling so that shallow visual details and high-level semantic context can be retained simultaneously. This design is particularly important for reflective water-surface scenes, where small floating targets often present weak textures, blurred boundaries, and unstable visual appearance.
As shown in
Figure 1, the backbone is composed of repeated convolutional blocks and C3K2 modules, which support shallow texture extraction and efficient feature reuse [
27]. The C3K2 module adopts multi-branch convolutional pathways to improve intermediate feature flow and enhance the network’s ability to represent small-scale regions. On this basis, the SPPF module (Spatial Pyramid Pooling Fast) employs parallel multi-scale pooling operations to enlarge the effective receptive field, thereby improving the representation of complex target shapes and surrounding contextual information. Following SPPF, the C2PSA module (Channel and Spatial Pyramid Attention) performs channel–spatial feature recalibration through pyramid-style aggregation and attention weighting, enabling the network to emphasize informative regions while suppressing interference on high-level feature maps.
Through the above design, the backbone not only constructs richer multi-scale semantics but also improves the joint modeling of shallow textures and deep structures. This capability is essential for enhancing target separability and feature discriminability in cluttered aquatic scenes. Building on these backbone features, the subsequent neck-stage modules further strengthen cross-scale channel–spatial interaction and edge-aware refinement for tiny-object recognition under complex water-surface interference [
28].
3.2. Global Channel–Spatial Attention Module
To improve YOLOv11’s ability to perceive and represent low-saliency tiny objects in complex backgrounds, this paper introduces the Global Channel–Spatial Attention (GCSA) module into the neck. The purpose of this module is to alleviate the limitations of conventional convolutional layers and plain feature pyramids in modeling channel-wise importance and cross-scale spatial context. By combining channel reweighting and spatial recalibration in a lightweight manner, GCSA enhances responses to informative regions while suppressing redundant background interference, thereby improving feature discriminability and robustness under embedded inference constraints.
In reflective water-surface environments, foreign objects are often affected by specular highlights, ripple-induced distortions, texture sparsity, and weak structural cues. These factors make conventional convolutional aggregation prone to emphasizing saturated regions or unstable background patterns. To address this issue, the proposed GCSA module performs joint channel and spatial refinement on neck features. The channel branch is designed to suppress dominant but uninformative activations and redistribute attention across semantically correlated feature groups, whereas the spatial branch further emphasizes geometrically meaningful local structures and boundary-related responses. Through this joint enhancement process, GCSA provides cleaner and more discriminative small-object-aware features for subsequent multi-scale fusion and prediction.
As shown in
Figure 2, the GCSA module independently models attention weights along the channel and spatial dimensions of the input feature map and then fuses them to enhance feature representation. Let the input feature map be denoted as $X \in \mathbb{R}^{C \times H \times W}$.
First, global average pooling is applied to each channel to extract channel-wise statistical descriptors as follows:
$$z_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j), \qquad c = 1, \dots, C.$$
The descriptor vector $z \in \mathbb{R}^{C}$ is then passed through a two-layer multi-layer perceptron (MLP) to obtain the channel attention vector $a \in \mathbb{R}^{C}$:
$$a = \sigma\!\left( W_2\, \delta\!\left( W_1 z \right) \right),$$
where $W_1 \in \mathbb{R}^{(C/r) \times C}$ and $W_2 \in \mathbb{R}^{C \times (C/r)}$ are learnable weights, $\delta$ and $\sigma$ denote the ReLU and Sigmoid activations, and $r$ denotes the channel reduction ratio.
The obtained attention vector $a$ is projected back onto the original feature map through element-wise multiplication:
$$X' = a \otimes X.$$
To further strengthen inter-channel interaction, a channel shuffling mechanism is introduced. The feature map is first divided into $g$ groups, each containing $C/g$ channels. Reshaping and dimension permutation are then performed to reorganize the internal ordering of channel groups, thereby promoting information exchange across groups. The resulting feature map is denoted by $X_s$.
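The reshape-and-permute shuffle can be expressed in a few lines of NumPy; the example below tags each channel with its index so the reordering is visible (an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def channel_shuffle(x, g):
    """Channel shuffle: split C channels into g groups, then interleave them.

    x : feature map of shape (C, H, W); C must be divisible by g.
    """
    c, h, w = x.shape
    # (g, C//g, H, W) -> swap the two group axes -> flatten back to (C, H, W)
    return x.reshape(g, c // g, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

x = np.arange(6).reshape(6, 1, 1).astype(float)  # channels tagged 0..5
y = channel_shuffle(x, g=2)
print(y.ravel().tolist())  # [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
```

Note that shuffling with $g$ groups is inverted by shuffling again with $C/g$ groups, so the operation loses no information while mixing channels across groups.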
Based on the shuffled feature map $X_s$, the spatial branch is constructed to emphasize informative local regions. First, a convolution is applied, followed by batch normalization and nonlinear activation:
$$U = \delta\!\left( \mathrm{BN}\!\left( \mathrm{Conv}(X_s) \right) \right).$$
Then, instance-wise normalization and affine transformation are performed to obtain the intermediate refined feature map:
$$\tilde{U} = \gamma \cdot \mathrm{IN}(U) + \beta,$$
where $\gamma$ and $\beta$ denote the learnable scaling and shifting parameters, respectively.
Next, another convolution is applied to $\tilde{U}$ to generate the spatial attention logits, followed by the Sigmoid function to obtain the spatial attention weights:
$$M = \sigma\!\left( \mathrm{Conv}(\tilde{U}) \right).$$
Finally, the spatial attention map is applied to the channel-refined feature map through element-wise multiplication, yielding the final enhanced feature map $Y$:
$$Y = M \otimes X_s,$$
where $M$ denotes the spatial attention tensor composed of the weights $M(i, j)$.
Through the above process, the GCSA module enhances feature representation from both channel and spatial perspectives while maintaining a lightweight computational structure. This design is particularly suitable for detecting targets with small size, weak boundaries, and strong background interference in reflective water-surface environments.
3.3. Median-Enhanced Channel–Spatial Attention Module (MECS)
To further improve feature concentration and robustness for tiny-object detection in reflective water-surface environments, this paper introduces a Median-Enhanced Channel–Spatial Attention module (MECS) into the feature-fusion pathway. Although the baseline detector already provides strong spatial–semantic representation capability, tiny floating objects in complex aquatic scenes may still suffer from unstable responses under reflection noise, surface disturbance, localized blurring, and weak boundary cues. These difficulties become more pronounced in onboard deployment scenarios, where platform motion and exposure fluctuation introduce additional appearance variation under strict compute and power budgets [
2]. The MECS module is therefore designed to enhance feature selectivity in a lightweight manner by jointly modeling channel-wise importance and multi-scale spatial saliency.
As illustrated in
Figure 3, the MECS module consists of two components: a median-enhanced channel attention mechanism and a multi-scale spatial attention mechanism. The former strengthens robust channel reweighting by combining multiple global statistical descriptors, while the latter captures spatially informative regions through multi-branch receptive fields. Through this joint channel–spatial refinement process, the module enables the network to focus more effectively on task-relevant responses and suppress reflection- and ripple-induced interference.
3.3.1. Median-Enhanced Channel Attention Mechanism
Let the input feature map be denoted as $F \in \mathbb{R}^{C \times H \times W}$. To capture complementary statistical properties of each channel, three global pooling operators are employed to construct channel descriptors, namely Global Average Pooling (AvgPool), Global Median Pooling (MedianPool), and Global Max Pooling (MaxPool). These operators characterize the mean response, robust central tendency, and strongest activation of each channel, respectively. In particular, median pooling is less sensitive to local outliers or overexposed responses, which is beneficial for reflective water-surface scenes. The descriptors are defined as follows:
$$z^{\mathrm{avg}}_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j), \qquad z^{\mathrm{med}}_c = \underset{i,\, j}{\mathrm{median}}\; F_c(i, j), \qquad z^{\mathrm{max}}_c = \max_{i,\, j} F_c(i, j).$$
Each descriptor vector $z^{p}$, $p \in \{\mathrm{avg}, \mathrm{med}, \mathrm{max}\}$, is then passed through a shared two-layer multi-layer perceptron (MLP) with nonlinear activation to produce three channel attention vectors:
$$a^{p} = \sigma\!\left( W_2\, \delta\!\left( W_1 z^{p} \right) \right), \qquad p \in \{\mathrm{avg}, \mathrm{med}, \mathrm{max}\},$$
where $\delta$ denotes the ReLU function, $\sigma$ denotes the Sigmoid activation, and $W_1$ and $W_2$ are shared MLP weight matrices.
The three attention vectors derived from average, median, and max pooling are combined by element-wise summation to obtain the final channel attention weight vector:
$$A_c = a^{\mathrm{avg}} + a^{\mathrm{med}} + a^{\mathrm{max}}.$$
The obtained channel attention is then broadcast and applied to the input feature map $F$ through element-wise multiplication, yielding the channel-refined feature map $F'$:
$$F' = A_c \otimes F.$$
Through the above process, the channel-attention branch suppresses irrelevant or unstable channel activation and improves the robustness of feature weighting under reflections, brightness fluctuation, and local disturbance.
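The robustness argument for the median descriptor can be checked directly: in the sketch below (an illustration, not the paper's implementation), a single simulated specular highlight drives the max and mean descriptors up while the median barely moves.

```python
import numpy as np

def mecs_channel_descriptors(f):
    """Avg / median / max pooled channel descriptors used by the MECS channel branch.

    f : feature map of shape (C, H, W); returns three vectors of shape (C,).
    """
    flat = f.reshape(f.shape[0], -1)
    return flat.mean(axis=1), np.median(flat, axis=1), flat.max(axis=1)

# A flat channel corrupted by one overexposed pixel (a simulated reflection spike).
f = np.ones((1, 4, 4))
f[0, 0, 0] = 100.0
z_avg, z_med, z_max = mecs_channel_descriptors(f)
print(z_avg[0], z_med[0], z_max[0])  # 7.1875 1.0 100.0
```

The average descriptor is pulled from 1.0 to about 7.2 and the max saturates at 100, whereas the median stays at 1.0, which is why it is a robust central-tendency statistic for reflective scenes.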
3.3.2. Multi-Scale Spatial Attention Mechanism
After channel refinement, the MECS module further introduces a multi-scale spatial attention mechanism to enhance sensitivity to locally informative regions. Let $F'$ denote the channel-weighted feature map. A set of $n$ depthwise separable convolution operators $\{\Phi_k\}_{k=1}^{n}$ with different receptive fields is applied to extract spatial features at multiple scales:
$$F'' = \sum_{k=1}^{n} \Phi_k(F').$$
The aggregated feature map $F''$ is then compressed by a $1 \times 1$ convolution to generate a single-channel spatial attention map:
$$M_s = \sigma\!\left( \mathrm{Conv}_{1 \times 1}(F'') \right).$$
Finally, $M_s$ is broadcast along the channel dimension and multiplied with $F'$ to obtain the spatially enhanced output:
$$F_{\mathrm{out}} = M_s \otimes F'.$$
By using convolution branches with diverse receptive fields, the spatial-attention mechanism improves the perception of local structures at different scales, which is particularly beneficial for small floating objects with weak boundaries and varying apparent sizes. Together, the median-enhanced channel branch and the multi-scale spatial branch provide a lightweight yet effective refinement mechanism for neck-stage feature fusion in YOLOv11-IMP.
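The multi-scale gating idea can be sketched with plain NumPy. Here, box (mean) filters stand in for the learned depthwise separable convolutions, and a channel mean stands in for the $1 \times 1$ squeeze; both substitutions are simplifying assumptions, so this shows the data flow rather than the trained module.

```python
import numpy as np

def box_filter(x, k):
    """Depthwise k x k mean filter with zero padding (stand-in for a learned depthwise conv)."""
    c, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].mean(axis=(1, 2))
    return out

def multiscale_spatial_attention(f, kernels=(3, 5, 7)):
    """Sum multi-scale responses, squeeze channels to one map, gate with Sigmoid."""
    agg = sum(box_filter(f, k) for k in kernels)   # multi-branch aggregation F''
    logits = agg.mean(axis=0, keepdims=True)       # channel squeeze (1x1-conv stand-in)
    m = 1.0 / (1.0 + np.exp(-logits))              # spatial attention map M_s, shape (1, H, W)
    return m * f                                   # broadcast gating along channels

rng = np.random.default_rng(1)
f = rng.standard_normal((4, 8, 8))
out = multiscale_spatial_attention(f)
print(out.shape)  # (4, 8, 8)
```

Because the gate lies in $(0, 1)$ and is shared across channels, spatial positions are attenuated consistently for all channels, which matches the broadcast multiplication in the final equation.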
3.4. Noisy-Label Audit and Train-Time Strategy
Water-surface frames inevitably contain specular highlights, reflection bands, and ripple-induced distortions, which may introduce imperfect supervision. Typical cases include missing boxes, class ambiguity (e.g., can vs. bottle), and box jitter, as illustrated in
Figure 4, where solid boxes denote clean ground truth and dashed boxes denote noisy labels caused by reflection bands, ripple-blurred edges, class-flip risk, and misaligned annotations. When directly used for optimization, such noisy labels can destabilize early training and degrade localization consistency for small floating debris, a phenomenon widely discussed in recent noisy-label learning studies [
29,
30]. In onboard scenarios, additional motion blur and exposure fluctuation may further increase annotation uncertainty, which motivates the use of a lightweight training-time robustness strategy without modifying the inference path [
2].
To reduce the influence of these artifacts while keeping the detector architecture unchanged, this paper adopts a lightweight train-time noisy-label screening strategy, as shown in
Figure 5. At epoch $t$, after a short warm-up stage of $T_w$ epochs, a mini-batch $\mathcal{B}_t$ produces per-sample detection losses $\{\ell_i\}_{i \in \mathcal{B}_t}$. The corresponding noise-aware batch loss is defined as
$$\mathcal{L}_t = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} w_i\, \ell_i,$$
where $w_i$ denotes a rank-based sample weight.
Each sample is first ranked by the small-loss criterion and, when available, can be further refined using a prediction-agreement score, such as the consistency between the main prediction head and an EMA or auxiliary head. This yields an ordering $\pi$ from relatively clean to relatively suspicious instances. Let $K_t = \lceil \rho_t\, |\mathcal{B}_t| \rceil$, where $\rho_t$ is the keep ratio at epoch $t$, and let $\lambda \in [0, 1)$ denote the soft down-weighting factor (a small value close to zero). The sample weights are then assigned as
$$w_i = \begin{cases} 1, & \operatorname{rank}_\pi(i) \le K_t, \\ \lambda, & \text{otherwise}, \end{cases}$$
so that only the top-$\rho_t$ fraction contributes with full weight, while the remaining samples are softly suppressed.
The keep ratio $\rho_t$ follows a simple linear schedule from $\rho_{\mathrm{start}}$ to $\rho_{\mathrm{end}}$ over the training process. Denoting the total number of epochs by $T$ and the warm-up length by $T_w$, the schedule is written as
$$\rho_t = \rho_{\mathrm{start}} + \left( \rho_{\mathrm{end}} - \rho_{\mathrm{start}} \right) \cdot \frac{t - T_w}{T - T_w}, \qquad T_w < t \le T.$$
Through this design, earlier iterations place greater emphasis on cleaner supervision, while later iterations gradually retain more informative but slightly harder samples. Since the strategy only affects sample weighting during training, no architectural change is introduced and the inference cost remains unchanged. The resulting cleaner supervision signal also provides a more stable basis for the subsequent regression optimization introduced in the next subsection.
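The schedule and rank-based weighting above amount to a few lines of code. The sketch below uses illustrative values for the keep ratios and down-weighting factor, and assumes all samples are kept during warm-up (the paper does not state the warm-up weighting explicitly):

```python
import math

def keep_ratio(t, T, T_w, rho_start, rho_end):
    """Linear keep-ratio schedule: rho_start right after warm-up, rho_end at epoch T."""
    if t <= T_w:
        return 1.0  # warm-up: keep everything (assumption)
    return rho_start + (rho_end - rho_start) * (t - T_w) / (T - T_w)

def rank_weights(losses, rho, lam):
    """Small-loss criterion: full weight for the lowest-loss rho fraction, lam for the rest."""
    k = math.ceil(rho * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    w = [lam] * len(losses)
    for i in order[:k]:     # the K_t cleanest samples get full weight
        w[i] = 1.0
    return w

losses = [0.2, 1.5, 0.4, 3.0]            # per-sample detection losses in one mini-batch
w = rank_weights(losses, rho=0.5, lam=0.1)
print(w)  # [1.0, 0.1, 1.0, 0.1]
```

With `rho=0.5`, the two smallest-loss samples receive weight 1.0 while the two suspicious high-loss samples are softly suppressed to `lam`, exactly the piecewise weighting in the equation above.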
3.5. WIoU-V3 Loss Function Design
In water-surface small-object detection, targets typically exhibit small size, blurred boundaries, and irregular shapes. These characteristics are further aggravated by ripples, specular reflections, and background clutter, which can induce center deviations and shape mismatch between predicted boxes and ground-truth boxes. Under such conditions, relying solely on the overlap ratio in conventional IoU-based objectives may provide insufficient sensitivity to spatial misalignment, thereby limiting gradient quality for precise localization.
To address this issue, this paper adopts the improved weighted IoU loss WIoU-v3, which introduces two mechanisms on top of the standard IoU framework [
31]: a center-deviation penalty that explicitly measures the displacement between the predicted box center and the ground-truth center, and a dynamic focusing mechanism that suppresses the influence of abnormal samples during optimization. These designs aim to improve gradient stability and localization quality for tiny floating objects under blurred edges and structural distortion.
As illustrated in
Figure 6, floating debris instances may present varied shapes and orientations, and the predicted box can deviate from the ground truth in center location even when the IoU remains non-zero. This issue is particularly critical for small objects, because a visually acceptable overlap may still correspond to a noticeable localization error. To account for this effect, WIoU-v3 first introduces a center-deviation penalty factor
$$R_{\mathrm{WIoU}} = \exp\!\left( \frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left( W_g^2 + H_g^2 \right)^{*}} \right),$$
where $(x, y)$ denotes the center of the predicted bounding box, $(x_{gt}, y_{gt})$ denotes the center of the ground-truth box, and $W_g$ and $H_g$ are the width and height of the smallest enclosing box, respectively (the superscript $*$ indicates that the term is detached from gradient computation).
Based on this factor, the weighted IoU loss is written as
$$\mathcal{L}_{\mathrm{WIoUv1}} = R_{\mathrm{WIoU}} \cdot \mathcal{L}_{\mathrm{IoU}}, \qquad \mathcal{L}_{\mathrm{IoU}} = 1 - \mathrm{IoU}.$$
In addition, water-surface datasets may contain low-quality samples caused by reflection interference, ripple-blurred boundaries, or imperfect annotations. These samples may lead to unstable gradients when extreme IoU values dominate the optimization process. To alleviate this issue, WIoU-v3 further introduces an abnormality-suppression mechanism by defining the ratio between the current IoU loss and a reference IoU loss level:
$$\beta = \frac{\mathcal{L}^{*}_{\mathrm{IoU}}}{\overline{\mathcal{L}}_{\mathrm{IoU}}} \in [0, +\infty);$$
a modulation factor $\alpha$ and a compression coefficient $\delta$ are then used to construct a focused scaling factor
$$r = \frac{\beta}{\delta\, \alpha^{\beta - \delta}},$$
where $\alpha$ and $\delta$ are predefined hyperparameters for dynamic reweighting. The final WIoU-v3 loss is thus given by
$$\mathcal{L}_{\mathrm{WIoUv3}} = r \cdot \mathcal{L}_{\mathrm{WIoUv1}}.$$
Through the above formulation, WIoU-v3 preserves the geometric alignment advantages of IoU-based regression while reducing the disruptive influence of abnormal samples and center misalignment. This design is particularly suitable for small-object localization in reflective water-surface scenes, where slight positional errors may lead to noticeable detection bias. Combined with the preceding train-time noisy-label screening strategy, it provides a more stable optimization basis for accurate bounding-box regression in YOLOv11-IMP.
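The full loss can be sketched for a single box pair as follows. This is an illustrative scalar implementation (no gradient detachment, since there is no autograd here); the values `alpha=1.9` and `delta=3.0` follow the defaults suggested in the WIoU paper, and the running-mean reference level is passed in as a plain number.

```python
import math

def iou(b1, b2):
    """IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def wiou_v3_loss(pred, gt, iou_loss_mean, alpha=1.9, delta=3.0):
    """WIoU-v3 sketch: center-deviation penalty R times dynamic focusing factor r."""
    l_iou = 1.0 - iou(pred, gt)
    # centers and smallest-enclosing-box size (treated as detached constants)
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_g, cy_g = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])
    r_wiou = math.exp(((cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2) / (wg ** 2 + hg ** 2))
    beta = l_iou / iou_loss_mean                   # outlier degree vs. reference level
    r = beta / (delta * alpha ** (beta - delta))   # dynamic focusing factor
    return r * r_wiou * l_iou

# A centered-but-smaller prediction is penalized less than a center-shifted one.
gt = (0.0, 0.0, 10.0, 10.0)
centered = (1.0, 1.0, 9.0, 9.0)
shifted = (2.0, 2.0, 11.0, 11.0)
print(wiou_v3_loss(centered, gt, 0.5) < wiou_v3_loss(shifted, gt, 0.5))  # True
```

The comparison at the end illustrates the center-deviation penalty: the shifted box incurs both a lower IoU and a nonzero $R_{\mathrm{WIoU}}$ exponent, so its loss exceeds that of the concentric box.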
4. Model Training and Experimental Deployment
4.1. Software Environment and Training Configuration
All training and evaluation experiments were conducted on a workstation platform to ensure stability and reproducibility. The implementation was developed in Python 3.8 with PyTorch 2.4.1 (cu121).
Table 1 summarizes the hardware and software configuration used for model training and offline evaluation. In addition to workstation-based experiments, embedded deployment on Jetson Nano was further profiled in terms of latency, throughput, and energy-per-frame when power monitoring was available (Section Energy–Latency Profile on Embedded Hardware).
During training, the Adam optimizer was adopted together with a cosine annealing learning-rate scheduler. The initial learning rate was set to 0.001, the batch size was set to 32, and the total number of training epochs was 300. The input images were resized to a unified resolution before being fed into the network, and the bounding-box regression branch employed the WIoU-v3 loss to improve localization quality for tiny objects under noisy supervision. In order to enhance robustness under complex water-surface backgrounds and alleviate category imbalance in the custom dataset, Mosaic augmentation and category-balanced sampling were further introduced during training.
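Under the stated settings, the optimizer and scheduler wiring can be sketched as follows; the placeholder module stands in for the detector, since only the optimizer type, initial learning rate, schedule, and epoch count are specified above.

```python
import torch

# Placeholder module standing in for the YOLOv11-IMP network.
model = torch.nn.Conv2d(3, 16, kernel_size=3)

# Adam with an initial learning rate of 0.001, as stated in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Cosine annealing across the full 300-epoch schedule.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... forward/backward passes over batches of 32 would run here ...
    optimizer.step()   # no-op here (no gradients), kept for call ordering
    scheduler.step()   # anneal the learning rate once per epoch
```

By the end of the schedule the learning rate has annealed to (near) zero; `scheduler.get_last_lr()` can be used to log it per epoch.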
For fair comparison, the proposed method and the compared baseline detectors were trained under a unified experimental protocol as much as possible, including consistent dataset partitions, image preprocessing, and evaluation criteria. This setting helps ensure that the performance differences mainly arise from architectural design and optimization strategy rather than inconsistencies in training conditions.
4.2. Datasets
To comprehensively evaluate the proposed method, two datasets were employed in this study, including a public benchmark dataset and a custom real-scene dataset. The public dataset was used to provide a standardized evaluation basis for water-surface floating-object localization, while the custom dataset was introduced to assess category diversity, robustness under complex interference, and practical generalization ability. The overall data selection criteria considered scene diversity, target representativeness, annotation validity, and the preservation of realistic water-surface disturbances, such as reflections, ripples, clutter, and partial occlusion.
4.2.1. CAS Water-Surface Floating Object Dataset
In the initial stage of experimentation, we adopted a publicly available water-surface floating-object dataset released by the Chinese Academy of Sciences (CAS), referred to as IWHR_AI_Lable_Floater_V1 [
23]. The dataset contains 3000 high-resolution images, each annotated with a single category label, floater. It was selected because it provides a representative public benchmark for floating-object detection and contains a large proportion of small targets under realistic water-surface interference.
Statistical analysis indicates that 94.5% of annotated targets occupy less than 10% of the image area, reflecting typical characteristics of small-object detection. The dataset also covers diverse target poses and scales and includes annotation uncertainty caused by real water-surface disturbances such as reflections and ripples, which makes precise localization non-trivial. These properties make the CAS dataset suitable for evaluating the sensitivity of a detector to tiny floating objects and its localization stability under reflective interference.
Figure 7 visualizes two key distribution patterns of the annotations: (a) shows the spatial density distribution of object centers, and (b) shows the distribution of object width–height ratios. These statistics confirm that the dataset is both representative and challenging for floating-debris detection, thus providing a reliable basis for public-benchmark evaluation. The CAS dataset was divided into training, validation, and test subsets at a ratio of 8:1:1, and all compared methods were evaluated under the same partition protocol.
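The 8:1:1 partition can be reproduced with a simple shuffled split. The helper name and fixed seed below are illustrative choices, not details from the paper:

```python
import random

def split_811(items, seed=0):
    """Shuffle and partition a list of samples into train/val/test
    subsets at an 8:1:1 ratio (illustrative helper)."""
    rng = random.Random(seed)   # fixed seed for a reproducible partition
    items = list(items)
    rng.shuffle(items)
    n_train = int(len(items) * 0.8)
    n_val = int(len(items) * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

For the 3000-image CAS dataset this yields 2400, 300, and 300 images for training, validation, and testing, respectively.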
4.2.2. Custom Water-Surface Debris Dataset
To address the limitations of existing public datasets in terms of category diversity and multi-scenario generalization, we constructed a custom dataset for water-surface debris detection. This dataset consists of 3000 real-world images covering 11 representative classes, including can, leaf, branch, plastic_bottle, plastic_box, and glass_bottle. It was introduced as a complement to the public benchmark so that the proposed method could be further evaluated in more diverse and practically relevant conditions.
Images were collected from multiple water environments, including still water, flowing water, strong reflective surfaces, and partially occluded regions, so as to preserve realistic variations in illumination, texture, clutter, and background dynamics. During dataset construction, samples were retained according to four main principles: representative debris categories frequently observed in real scenes, sufficient diversity of target scale and pose, valid manual annotation quality, and preservation of challenging interference factors such as reflections, ripples, and boundary ambiguity. Instead of deliberately removing difficult samples, these disturbances were retained as much as possible to better reflect practical detection conditions.
As shown in Figure 8, the top-left region indicates a naturally imbalanced category distribution, where classes such as can occur more frequently, while rare classes such as ball and bad_fruit are underrepresented. This long-tail distribution is consistent with the frequency characteristics of real floating debris. The right region provides representative annotation samples for 11 classes across different materials, sizes, and water-surface backgrounds, further demonstrating the diversity and complexity of the collected data.
Compared with common detection benchmarks and baseline evaluation settings such as Faster R-CNN [
6], YOLOv7 [
27], and RetinaNet [
32], the proposed custom dataset emphasizes broader category coverage, stronger water-surface interference, and more realistic scene variability. Therefore, it serves as a complementary benchmark for evaluating robustness and generalization in reflective water-surface environments.
Similar to the CAS dataset, the custom dataset was also divided into training, validation, and test subsets at a ratio of 8:1:1, and all compared models were evaluated under the same partition protocol.
4.3. Experimental Deployment Configuration
To further validate real-world detection performance and deployment feasibility, an embedded inference platform based on the NVIDIA Jetson Nano was implemented and tested in natural river environments. The deployment system integrates the Jetson Nano as the onboard inference unit together with auxiliary components such as a GPS positioning module and representative floating-object samples for controlled field evaluation.
As shown in Figure 9, the system operates without external GPU support, reflecting a realistic resource-constrained deployment setting. The purpose of this deployment experiment is not only to verify detection feasibility in real scenes but also to assess whether the proposed model can maintain acceptable runtime characteristics under practical embedded constraints. Accordingly, in addition to accuracy-related results, latency, throughput, and energy-per-frame (when power monitoring is available) are reported in Section Energy–Latency Profile on Embedded Hardware. This configuration provides a deployment-oriented basis for evaluating the practical applicability of YOLOv11-IMP in real-world water-surface monitoring and cleanup scenarios.
5. Results
To evaluate the proposed YOLOv11-IMP for small-object detection in complex water-surface environments, experiments were conducted on both the public CAS floating-object dataset and the custom multi-category water-surface debris dataset. For a comprehensive comparison, YOLOv11-IMP was evaluated against multiple representative detectors, including mainstream YOLO variants, a two-stage detector, an anchor-free detector family, and a transformer-based detector. This comparison setting was designed to cover different detection paradigms and evaluation perspectives, including detection accuracy, inference efficiency, model complexity, and deployment feasibility on embedded hardware.
5.1. Detection Accuracy Comparison
Figure 10 presents representative qualitative detection results of YOLOv11-IMP and several mainstream detectors on eight typical categories of water-surface debris, namely branch, can, grass, fruit, bottle, milk_box, plastic_bag, and floater (CAS). Under challenging conditions such as specular reflections, aquatic-plant occlusion, blurred boundaries, and low resolution, YOLOv11-IMP generally produces tighter bounding boxes, fewer missed detections, and fewer fragmented predictions. These visual results indicate that the proposed design improves robustness in reflective and cluttered water-surface scenes.
Quantitative results in Table 2 further confirm the above trend. YOLOv11-IMP achieves the highest class-wise precision in 6 out of 13 categories and obtains the best overall AP75 of 81.5%. Compared with Faster R-CNN, YOLOv8, and the original YOLOv11 baseline, the proposed method provides superior detection accuracy while maintaining a lightweight computational footprint of 7.3 GFLOPs. This result suggests that the proposed improvements effectively enhance small-object discriminability without introducing excessive deployment burden.
Figure 11 compares the normalized confusion matrices of YOLOv11 and YOLOv11-IMP on the custom dataset. Compared with the baseline model, YOLOv11-IMP exhibits reduced confusion among visually similar categories. In particular, misclassifications between plastic_box and milk_box, grass and background, and leaf and can are alleviated, resulting in stronger diagonal dominance and more compact off-diagonal distributions. These patterns indicate improved category discriminability for small and weak-boundary objects.
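For reference, the row normalization behind such matrices can be sketched as follows (a generic helper, not the paper's evaluation code): each row is divided by its sum, so entry (i, j) gives the fraction of true-class-i samples predicted as class j, and the diagonal reads as per-class recall.

```python
import numpy as np

def row_normalize(cm):
    """Row-normalize a confusion matrix so each row sums to 1."""
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)
```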
The observed gains can be attributed to the complementary effects of the proposed components. GCSA strengthens global channel–spatial interaction during multi-scale fusion, which improves context awareness and suppresses background-induced false responses. This effect is particularly beneficial in cluttered regions, where reflective patterns or vegetation-like textures may otherwise activate the detector incorrectly. MECS further enhances local detail retention and weak-boundary perception by combining robust channel statistics with multi-scale spatial refinement, which is beneficial for categories with strong local interference and blurred contours. In addition, the WIoU-v3 regression objective improves box alignment quality by reducing center deviation and shape mismatch, thereby improving localization stability for tiny reflective targets, such as can and floater.
Notably, these components exhibit a complementary effect on plastic_bag, a particularly challenging category due to soft contours, frequent occlusion, and partial transparency. In this case, GCSA provides stronger global contextual guidance, MECS improves local response concentration, and WIoU-v3 stabilizes bounding-box regression. Their joint contribution helps reduce false positives in confusing regions and improve class-wise performance on difficult categories.
Overall, YOLOv11-IMP achieves balanced improvement in detection accuracy, category discriminability, and computational efficiency. These results support the effectiveness of the proposed design for robust small-object detection in complex water-surface environments, with further deployment-oriented profiling reported in Section Energy–Latency Profile on Embedded Hardware.
5.2. Module Ablation and Efficiency Analysis
To quantify the contribution of each proposed component, controlled ablation experiments were conducted starting from the original YOLOv11 baseline. The GCSA module, the MECS module, the WIoU-v3 regression objective, and the train-time noisy-label screening strategy were progressively introduced to form a series of enhanced variants. Detection accuracy, model complexity, and inference efficiency were then jointly evaluated. Unless otherwise stated, all ablation results were obtained under the same training protocol and on the same split of the custom water-surface debris dataset.
As shown in Table 3 and Figure 12, the baseline YOLOv11 achieves 86.1% AP75 with 6.9 GFLOPs and 3.3 ms measured latency. After introducing GCSA, the model size increases from 30 M to 32 M parameters with only a slight computational increase, while AP75 rises to 86.9%. This indicates that global channel–spatial interaction improves robustness against background interference during feature fusion.
When MECS is introduced, AP75 further increases to 87.4% with nearly unchanged FLOPs, indicating that improved weak-boundary perception and local-detail retention can benefit reflective water-surface detection. When GCSA and MECS are jointly used, AP75 rises to 88.3% with only modest overhead, demonstrating that global contextual modeling and local detail enhancement are complementary.
After enabling WIoU-v3, AP75 further increases to 89.5% while FLOPs remain at 7.3 G, showing that improved regression guidance can substantially benefit tiny-object localization. Finally, the train-time noisy-label screening strategy yields the best AP75 of 89.9% without changing the inference path. Therefore, model complexity and runtime characteristics remain unchanged compared with the previous variant, while the optimization process becomes more robust to reflection- and ripple-induced label uncertainty.
Overall, GCSA, MECS, and WIoU-v3 provide progressive and consistent accuracy gains with controllable computational overhead, and the train-time noisy-label screening strategy further improves accuracy without increasing inference cost. These results support the effectiveness of the proposed design for practical deployment under resource-constrained conditions.
Energy–Latency Profile on Embedded Hardware
To support deployment on battery-powered electric platforms, the inference efficiency of representative models was additionally profiled on an NVIDIA Jetson Nano. Inference latency and throughput were measured under the same input resolution and batch setting as used in the deployment tests. When power monitoring was available, the average module power during steady-state inference was recorded and the energy consumption per frame was computed as
\[ E = \bar{P} \cdot t, \]
where $\bar{P}$ denotes the average power in watts and $t$ denotes the measured latency in milliseconds, so that $E$ is expressed in millijoules per frame. If direct power measurements were unavailable for some configurations, only latency and throughput were reported.
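Since watts multiplied by milliseconds gives millijoules directly, the per-frame energy metric reduces to a single product, as in this minimal helper (the function name is chosen for illustration):

```python
def energy_per_frame_mj(avg_power_w, latency_ms):
    """Energy per frame E = P_avg * t; watts x milliseconds = millijoules."""
    return avg_power_w * latency_ms
```

For example, a module drawing 5 W on average at 10 ms per frame spends 50 mJ per detection.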
As shown in Table 4, YOLOv11-IMP maintains a favorable accuracy–efficiency balance for embedded inference. The added attention and regression components introduce only modest overhead in latency and energy per frame, which is important for long-duration missions on electric platforms. This analysis complements the ablation results in Table 3 by providing a deployment-oriented efficiency view beyond FLOPs, since practical embedded performance can also be influenced by memory access patterns, kernel efficiency, and device power states.
6. Conclusions
This paper addresses small-object detection in reflective water-surface environments, where weak boundaries, dynamic reflections, ripples, and embedded deployment constraints jointly degrade detection accuracy and runtime stability for onboard perception. To address these challenges, a lightweight detector, YOLOv11-IMP, is developed based on YOLOv11. The proposed framework enhances the baseline model through complementary designs, including an attention-enhanced high-level backbone stage based on SPPF and C2PSA, a Global Channel–Spatial Attention (GCSA) module for cross-scale feature interaction, a Median-Enhanced Channel–Spatial Attention (MECS) module for weak-boundary refinement, a WIoU-v3 regression objective for improved localization, and a train-time noisy-label screening strategy for more robust optimization without changing the inference path.
Experiments on the public CAS floating-object dataset and a custom multi-category water-surface debris dataset demonstrate that YOLOv11-IMP achieves clear advantages in small-object detection under complex water-surface interference. Compared with the YOLOv11 baseline, the proposed method improves AP75 by 3.3% while maintaining lightweight complexity at 7.3 GFLOPs. In addition, the model reduces confusion among visually similar categories and produces tighter localization for small, weak-boundary, and reflection-disturbed targets. The embedded profiling results on NVIDIA Jetson Nano further indicate that the proposed detector maintains favorable runtime characteristics under the tested setting, supporting its practical applicability to onboard perception in water-quality monitoring and electric surface-cleaning scenarios.
Overall, YOLOv11-IMP provides a favorable balance among detection accuracy, computational efficiency, and deployment feasibility, making it a practical solution for energy-constrained onboard perception in battery-powered electric unmanned surface vehicles and electrified surface-cleaning platforms. This balance is particularly important for long-duration autonomous inspection and cleanup missions, where perception reliability and embedded efficiency directly affect mission endurance and operational effectiveness. In future work, semi-supervised learning, domain adaptation, and broader real-scene validation will be explored to reduce annotation cost and further improve robustness under unseen water conditions.