1. Introduction
Small-object detection in water-surface scenes remains an under-explored yet practically critical problem, especially for battery-powered electric unmanned surface vehicles (e-USVs) and electrified surface-cleaning platforms used for autonomous inspection and debris removal [
1]. Compared with terrestrial settings, water-surface perception on such platforms must operate under stricter onboard constraints: embedded processors are limited by compute budgets and power budgets, while perception failures can directly translate into wasted energy, reduced mission endurance, and incomplete cleanup [
2]. Therefore, developing a robust and lightweight framework for detecting tiny floating debris is not only valuable for environmental intelligence but also essential for reliable onboard perception and energy-aware autonomous operation in electrified missions.
Real-time detection of small floating objects in reflective water environments remains challenging. Many existing vision models exhibit limited accuracy, unstable real-time performance, and heavy computational cost when deployed on embedded platforms [
2,
3,
4]. The main difficulties arise from three aspects. First, dynamic reflections, surface ripples, and shadows frequently obscure visual cues, leading to incomplete texture extraction and contour ambiguity [
5]. Second, cameras mounted on unmanned surface vehicles are often affected by motion blur, viewpoint variation, and platform vibration, which reduce feature consistency during real-time inference. Third, floating debris items such as plastic bags, leaves, and fruit peels usually exhibit weak texture, low contrast, irregular shapes, and large appearance variation, making them easily confused with background clutter and difficult to localize precisely. These characteristics make water-surface small-object detection substantially more difficult than conventional object detection in relatively structured land scenes.
In recent years, deep learning-based object detection has developed rapidly. Two-stage detectors such as Faster R-CNN [
6], Mask R-CNN [
7], and Cascade R-CNN [
8] generally provide strong localization capability in structured scenes with sufficient resolution. Single-stage detectors such as SSD [
9] and the YOLO family [
10,
11,
12,
13] emphasize real-time inference and have been widely adopted in edge applications. In parallel, anchor-free detectors, such as FCOS, CenterNet, and RepPoints, as well as transformer-based architectures such as Deformable DETR, further expand the design space for accurate and flexible detection. Industrial-grade variants, including PP-YOLOE, and edge-oriented frameworks such as EdgeYOLO, explore lightweight design and deployment-aware optimization [
14,
15]. However, relatively few studies explicitly formulate waterborne tiny-object detection as an onboard perception task under deployment-driven constraints, where low object saliency, dynamic backgrounds, and reflective-scene label unreliability jointly affect both training stability and runtime robustness [
2].
Although the above detectors—including the latest YOLO variants—continue to evolve toward lighter architectures and stronger feature modeling ability, their performance in detecting small objects under complex environments remains suboptimal. Recent evaluations [
16,
17] indicate that models such as YOLOv5 and YOLOv7 still suffer from low recall and weak localization accuracy on small-object-focused datasets such as VisDrone, TinyPerson, and UAVDT. Similar limitations are observed when applying CenterNet, FCOS, RepPoints, and transformer-based detectors to tiny floating objects, where limited shallow feature retention, insufficient receptive-field adaptation, and inadequate multi-scale semantic fusion often lead to missed detections or inaccurate bounding boxes [
18,
19,
20]. Furthermore, edge hardware platforms such as Jetson Nano and Raspberry Pi impose strict constraints on model size and computational cost [
21,
22]. For battery-powered electric platforms, this issue is further coupled with energy consumption and mission endurance. As a result, balancing detection accuracy, robustness, and real-time efficiency remains a key challenge for long-duration autonomous operation [
2].
To address the above issues, this paper proposes a task-driven improvement of the YOLOv11 framework for reflective water-surface small-object detection under embedded deployment constraints on battery-powered electric platforms. Rather than pursuing accuracy gains alone, the proposed design aims to improve robustness in reflective scenes while preserving real-time onboard inference practicality. At the methodological level, the framework enhances the baseline detector from four complementary aspects, namely discriminative feature representation, fine-grained edge perception, train-time noisy-label mitigation, and small-object regression optimization. In this way, the detector is expected to better adapt to reflective interference, weak target boundaries, and resource-constrained onboard deployment.
The main novelty of this study lies in a lightweight yet task-oriented detection design for reflective water-surface environments. Compared with recent lightweight YOLO variants that mainly focus on generic efficiency improvement, the proposed method explicitly considers the combined effects of reflective background interference, tiny-object ambiguity, label unreliability during training, and embedded deployment requirements. Specifically, the proposed framework integrates a Global Channel–Spatial Attention mechanism to enhance discriminative representation, a median-enhanced channel–spatial module to strengthen weak-boundary perception, a train-time noisy-label screening strategy to stabilize optimization, and a weighted IoU-based regression design to improve localization quality for small floating targets. Therefore, the difference of this work lies not only in improving detection accuracy but also in emphasizing robustness and deployment feasibility for energy-constrained electric platforms.
Comprehensive experiments are conducted on the public CAS floating-object dataset [
23] and a self-collected multi-category water-surface debris dataset, covering 11 classes across both static and dynamic scenes. Results show that YOLOv11-IMP improves mAP@0.75 by 3.3% and small-object accuracy by 6.5% over the YOLOv11 baseline. It achieves superior performance in key categories such as can, plastic_bag, and floater, while maintaining real-time embedded inference on Jetson Nano. To further support deployment on electric platforms, we additionally report an energy–latency profiling protocol on embedded hardware (Section Energy–Latency Profile on Embedded Hardware), following common edge-profiling practices where latency/throughput are measured at batch size 1 and energy-per-frame can be computed from onboard power monitoring when available [
2]. These results validate the model’s robustness and deployment feasibility, highlighting its practical potential for energy-constrained onboard perception in electric autonomous surveillance and cleanup missions.
The remainder of this paper is organized as follows.
Section 2 reviews related studies on water-surface debris detection, lightweight object detection, and task-relevant optimization strategies.
Section 3 presents the proposed YOLOv11-IMP framework and its main design components.
Section 4 describes the datasets, implementation details, and experimental settings.
Section 5 reports the comparative experiments, ablation studies, and deployment-oriented analyses. Finally,
Section 6 concludes the paper.
2. Related Work
Research related to this paper mainly involves three aspects: water-surface debris detection, lightweight object detection for embedded deployment, and small-object-oriented enhancement and robust optimization under complex environments. Since the task considered in this work is not a generic object detection problem but a deployment-oriented perception problem for tiny floating objects in reflective water scenes, these directions need to be reviewed in an integrated manner.
2.1. Water-Surface Debris Detection
Water-surface debris detection has gradually attracted attention in recent years because of its practical importance in environmental monitoring, autonomous cleaning, and intelligent inspection. Compared with conventional object detection tasks in urban or industrial scenes, water-surface perception usually faces more severe optical interference, including dynamic reflections, ripples, shadows, illumination fluctuations, and partial occlusion. These factors often reduce the visibility and structural integrity of floating targets, especially when the objects are small, deformable, or weak in texture. As a result, directly applying generic detectors to water-surface scenes often leads to unstable recall and degraded localization quality.
Existing studies on water-surface object detection mainly focus on floating-object identification, river or lake debris monitoring, and perception support for unmanned surface vehicles [
1,
23]. Some works emphasize the construction of annotated datasets and benchmark protocols for floating targets, while others explore scene-specific network adaptation for reflective backgrounds and weak target saliency. These efforts have provided useful foundations for water-surface perception. However, most existing studies still place greater emphasis on detection feasibility itself, while relatively limited attention has been paid to the joint requirements of tiny-object sensitivity, robustness under reflective interference, and embedded onboard deployment. In particular, for battery-powered electric platforms, the perception model must not only detect small floating objects reliably but also maintain practical throughput and energy efficiency during long-duration autonomous operation [
2]. This indicates that water-surface debris detection still requires more task-oriented designs that explicitly consider both scene complexity and deployment constraints.
2.2. Lightweight Object Detection for Embedded Deployment
To satisfy real-time requirements on resource-constrained platforms, lightweight object detection has become an important research direction. Two-stage detectors, such as Faster R-CNN, Mask R-CNN, and Cascade R-CNN, usually provide strong localization accuracy, but their computational burden often limits direct deployment on low-power edge hardware [
6,
7,
8]. In contrast, single-stage detectors represented by SSD and the YOLO family are more suitable for real-time applications because of their simpler inference pipelines and better efficiency–accuracy trade-off [
9,
10,
11,
12,
13].
Recent lightweight detection studies have further promoted deployment on embedded devices through compact backbone design, feature re-parameterization, efficient attention modules, and deployment-aware optimization. Representative examples include PP-YOLOE and EdgeYOLO, which aim to reduce model complexity while preserving competitive accuracy [
14,
15]. In addition, anchor-free detectors, such as FCOS, CenterNet, and RepPoints, together with transformer-based detectors such as Deformable DETR, have broadened the design space of modern detection frameworks from different perspectives of label assignment, center-based localization, point representation, and global context modeling [
18,
19,
20]. Nevertheless, when the deployment scenario shifts from general edge vision to reflective water-surface small-object perception, the challenge is no longer limited to model compactness alone. Instead, the detector must simultaneously consider computational efficiency, weak-boundary perception, and robustness to scene-induced label unreliability. Therefore, generic lightweight design alone is often insufficient for this task, and more task-oriented improvement strategies are still needed.
2.3. Small-Object Enhancement and Robust Optimization in Complex Scenes
Small-object detection remains one of the most difficult subproblems in object detection. Recent evaluations show that even strong mainstream detectors still suffer from insufficient recall and inaccurate localization on datasets dominated by tiny targets, such as VisDrone, TinyPerson, and UAVDT [
16,
17]. The main reasons include inadequate shallow-feature preservation, limited effective receptive-field adaptation, insufficient cross-scale semantic interaction, and the weak structural distinguishability of tiny targets. These issues become more severe in reflective water environments, where target boundaries are often blurred by optical disturbance and background clutter.
To address such problems, recent studies have introduced various enhancement strategies, including multi-scale feature fusion, channel–spatial attention, boundary-aware refinement, and improved localization losses. Attention mechanisms are widely used to strengthen discriminative representation and suppress irrelevant background responses, while edge-aware or detail-enhanced designs are beneficial for weak-boundary object perception. In addition, improved box-regression losses have been proposed to alleviate the mismatch between predicted and ground-truth boxes, especially for small targets whose localization errors are more sensitive to slight coordinate deviations [
24]. Another practical issue is noisy or unreliable supervision in complex scenes. For reflective water-surface data, ambiguity caused by ripples, specular highlights, and tiny object size may increase annotation difficulty and label inconsistency, which in turn affects training stability. Therefore, robust optimization strategies that mitigate unreliable supervision during training are also relevant to deployment-oriented water-surface perception. However, these strategies are often investigated separately, and fewer studies integrate feature enhancement, robust training, and lightweight deployment into a unified framework for reflective water-surface small-object detection.
2.4. Research Gap and Positioning of This Work
The above review suggests that current studies still leave three main gaps. First, water-surface debris detection remains less studied than conventional land-scene detection, especially when the targets are extremely small and embedded deployment is required. Second, existing lightweight detectors mainly improve generic inference efficiency, but often do not explicitly address the combined influence of reflective background interference, weak target boundaries, and train-time label unreliability. Third, although many enhancement strategies have been explored separately, fewer studies integrate lightweight deployment, small-object enhancement, and robust optimization into a unified detection framework for water-surface onboard perception.
Therefore, the present work is positioned at the intersection of water-surface debris detection, lightweight embedded vision, and robust small-object optimization. The proposed YOLOv11-IMP framework is designed to improve discriminative representation, weak-boundary perception, and localization quality while maintaining practical embedded deployment capability. In this sense, the contribution of this paper lies not only in improving detection accuracy but also in providing a task-oriented design for robust and energy-aware perception on battery-powered electric platforms.
3. Materials and Methods
Given the challenges of detecting small and weakly structured objects in reflective water-surface environments—including low visual saliency, dynamic backgrounds, and platform-induced motion noise—this work develops a task-specific enhancement of the YOLOv11 framework for onboard perception on electric platforms. As shown in
Figure 1, the proposed architecture preserves the lightweight and high-throughput characteristics of YOLOv11, while introducing deployment-oriented improvements for reflective water-surface small-object detection. The overall design aims to strengthen multi-scale contextual representation, improve the perception of weak-boundary targets under background interference, and enhance bounding-box regression quality under the strict compute and power constraints of embedded devices [
2,
25].
As illustrated in
Figure 1, YOLOv11-IMP follows a backbone–neck–head detection pipeline and introduces targeted enhancements at different stages. At the backbone end, a context-reinforced stage composed of SPPF and C2PSA is employed to improve high-level semantic aggregation and suppress redundant background responses. During multi-scale feature fusion in the neck, the Global Channel–Spatial Attention (GCSA) module and the median-enhanced channel–spatial (MECS) module are introduced to strengthen cross-scale interaction and weak-boundary feature refinement [
26]. Finally, the regression branch adopts the WIoU-v3 loss to improve bounding-box alignment and localization stability for small floating objects. Based on this overall architecture, the following subsections describe the backbone design, feature-fusion enhancement strategy, and regression optimization in detail.
3.1. Multi-Scale Feature Extraction Backbone
The multi-scale feature extraction backbone in YOLOv11-IMP is designed to capture rich semantic representations across different spatial resolutions while maintaining an efficient computational footprint for onboard deployment. It combines residual-style feature reuse units with attention-enhanced feature modeling so that shallow visual details and high-level semantic context can be retained simultaneously. This design is particularly important for reflective water-surface scenes, where small floating targets often present weak textures, blurred boundaries, and unstable visual appearance.
As shown in
Figure 1, the backbone is composed of repeated convolutional blocks and C3K2 modules, which support shallow texture extraction and efficient feature reuse [
27]. The C3K2 module adopts multi-branch convolutional pathways to improve intermediate feature flow and enhance the network’s ability to represent small-scale regions. On this basis, the SPPF module (Spatial Pyramid Pooling Fast) employs parallel multi-scale pooling operations to enlarge the effective receptive field, thereby improving the representation of complex target shapes and surrounding contextual information. Following SPPF, the C2PSA module (Channel and Spatial Pyramid Attention) performs channel–spatial feature recalibration through pyramid-style aggregation and attention weighting, enabling the network to emphasize informative regions while suppressing interference on high-level feature maps.
Through the above design, the backbone not only constructs richer multi-scale semantics but also improves the joint modeling of shallow textures and deep structures. This capability is essential for enhancing target separability and feature discriminability in cluttered aquatic scenes. Building on these backbone features, the subsequent neck-stage modules further strengthen cross-scale channel–spatial interaction and edge-aware refinement for tiny-object recognition under complex water-surface interference [
28].
3.2. Global Channel–Spatial Attention Module
To improve YOLOv11’s ability to perceive and represent low-saliency tiny objects in complex backgrounds, this paper introduces the Global Channel–Spatial Attention (GCSA) module into the neck. The purpose of this module is to alleviate the limitations of conventional convolutional layers and plain feature pyramids in modeling channel-wise importance and cross-scale spatial context. By combining channel reweighting and spatial recalibration in a lightweight manner, GCSA enhances responses to informative regions while suppressing redundant background interference, thereby improving feature discriminability and robustness under embedded inference constraints.
In reflective water-surface environments, foreign objects are often affected by specular highlights, ripple-induced distortions, texture sparsity, and weak structural cues. These factors make conventional convolutional aggregation prone to emphasizing saturated regions or unstable background patterns. To address this issue, the proposed GCSA module performs joint channel and spatial refinement on neck features. The channel branch is designed to suppress dominant but uninformative activations and redistribute attention across semantically correlated feature groups, whereas the spatial branch further emphasizes geometrically meaningful local structures and boundary-related responses. Through this joint enhancement process, GCSA provides cleaner and more discriminative small-object-aware features for subsequent multi-scale fusion and prediction.
As shown in
Figure 2, the GCSA module independently models attention weights along the channel and spatial dimensions of the input feature map and then fuses them to enhance feature representation. Let the input feature map be denoted as $X \in \mathbb{R}^{C \times H \times W}$.
First, global average pooling is applied to each channel to extract channel-wise statistical descriptors as follows:
$$z_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j), \qquad c = 1, \dots, C.$$
The descriptor vector $z \in \mathbb{R}^{C}$ is then passed through a two-layer multi-layer perceptron (MLP) to obtain the channel attention vector $a \in \mathbb{R}^{C}$:
$$a = \sigma\!\left( W_2\, \delta\!\left( W_1 z \right) \right),$$
where $W_1 \in \mathbb{R}^{(C/r) \times C}$ and $W_2 \in \mathbb{R}^{C \times (C/r)}$ are learnable weights, $\delta$ and $\sigma$ denote the ReLU and Sigmoid activations, and $r$ denotes the channel reduction ratio.
The obtained attention vector $a$ is projected back onto the original feature map through element-wise multiplication:
$$X' = a \otimes X.$$
To further strengthen inter-channel interaction, a channel shuffling mechanism is introduced. The feature map is first divided into $g$ groups, each containing $C/g$ channels. Reshaping and dimension permutation are then performed to reorganize the internal ordering of channel groups, thereby promoting information exchange across groups. The resulting feature map is denoted by $X_s$.
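The reshape-and-permute shuffle can be expressed in a few lines of NumPy; the example below tags each channel with its index so the reordering is visible (an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def channel_shuffle(x, g):
    """Channel shuffle: split C channels into g groups, then interleave them.

    x : feature map of shape (C, H, W); C must be divisible by g.
    """
    c, h, w = x.shape
    # (g, C//g, H, W) -> swap the two group axes -> flatten back to (C, H, W)
    return x.reshape(g, c // g, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

x = np.arange(6).reshape(6, 1, 1).astype(float)  # channels tagged 0..5
y = channel_shuffle(x, g=2)
print(y.ravel().tolist())  # [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
```

Note that shuffling with $g$ groups is inverted by shuffling again with $C/g$ groups, so the operation loses no information while mixing channels across groups.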
Based on the shuffled feature map $X_s$, the spatial branch is constructed to emphasize informative local regions. First, a convolution is applied, followed by batch normalization and nonlinear activation:
$$U = \delta\!\left( \mathrm{BN}\!\left( \mathrm{Conv}(X_s) \right) \right).$$
Then, instance-wise normalization and affine transformation are performed to obtain the intermediate refined feature map:
$$\tilde{U} = \gamma \cdot \mathrm{IN}(U) + \beta,$$
where $\gamma$ and $\beta$ denote the learnable scaling and shifting parameters, respectively.
Next, another convolution is applied to $\tilde{U}$ to generate the spatial attention logits, followed by the Sigmoid function to obtain the spatial attention weights:
$$M = \sigma\!\left( \mathrm{Conv}(\tilde{U}) \right).$$
Finally, the spatial attention map is applied to the channel-refined feature map through element-wise multiplication, yielding the final enhanced feature map $Y$:
$$Y = M \otimes X_s,$$
where $M$ denotes the spatial attention tensor composed of the weights $M(i, j)$.
Through the above process, the GCSA module enhances feature representation from both channel and spatial perspectives while maintaining a lightweight computational structure. This design is particularly suitable for detecting targets with small size, weak boundaries, and strong background interference in reflective water-surface environments.
3.3. Median-Enhanced Channel–Spatial Attention Module (MECS)
To further improve feature concentration and robustness for tiny-object detection in reflective water-surface environments, this paper introduces a Median-Enhanced Channel–Spatial Attention module (MECS) into the feature-fusion pathway. Although the baseline detector already provides strong spatial–semantic representation capability, tiny floating objects in complex aquatic scenes may still suffer from unstable responses under reflection noise, surface disturbance, localized blurring, and weak boundary cues. These difficulties become more pronounced in onboard deployment scenarios, where platform motion and exposure fluctuation introduce additional appearance variation under strict compute and power budgets [
2]. The MECS module is therefore designed to enhance feature selectivity in a lightweight manner by jointly modeling channel-wise importance and multi-scale spatial saliency.
As illustrated in
Figure 3, the MECS module consists of two components: a median-enhanced channel attention mechanism and a multi-scale spatial attention mechanism. The former strengthens robust channel reweighting by combining multiple global statistical descriptors, while the latter captures spatially informative regions through multi-branch receptive fields. Through this joint channel–spatial refinement process, the module enables the network to focus more effectively on task-relevant responses and suppress reflection- and ripple-induced interference.
3.3.1. Median-Enhanced Channel Attention Mechanism
Let the input feature map be denoted as $F \in \mathbb{R}^{C \times H \times W}$. To capture complementary statistical properties of each channel, three global pooling operators are employed to construct channel descriptors, namely Global Average Pooling (AvgPool), Global Median Pooling (MedianPool), and Global Max Pooling (MaxPool). These operators characterize the mean response, robust central tendency, and strongest activation of each channel, respectively. In particular, median pooling is less sensitive to local outliers or overexposed responses, which is beneficial for reflective water-surface scenes. The descriptors are defined as follows:
$$z^{\mathrm{avg}}_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j), \qquad z^{\mathrm{med}}_c = \underset{i,\, j}{\mathrm{median}}\; F_c(i, j), \qquad z^{\mathrm{max}}_c = \max_{i,\, j} F_c(i, j).$$
Each descriptor vector $z^{p}$, $p \in \{\mathrm{avg}, \mathrm{med}, \mathrm{max}\}$, is then passed through a shared two-layer multi-layer perceptron (MLP) with nonlinear activation to produce three channel attention vectors:
$$a^{p} = \sigma\!\left( W_2\, \delta\!\left( W_1 z^{p} \right) \right), \qquad p \in \{\mathrm{avg}, \mathrm{med}, \mathrm{max}\},$$
where $\delta$ denotes the ReLU function, $\sigma$ denotes the Sigmoid activation, and $W_1$ and $W_2$ are shared MLP weight matrices.
The three attention vectors derived from average, median, and max pooling are combined by element-wise summation to obtain the final channel attention weight vector:
$$A_c = a^{\mathrm{avg}} + a^{\mathrm{med}} + a^{\mathrm{max}}.$$
The obtained channel attention is then broadcast and applied to the input feature map $F$ through element-wise multiplication, yielding the channel-refined feature map $F'$:
$$F' = A_c \otimes F.$$
Through the above process, the channel-attention branch suppresses irrelevant or unstable channel activation and improves the robustness of feature weighting under reflections, brightness fluctuation, and local disturbance.
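The robustness argument for the median descriptor can be checked directly: in the sketch below (an illustration, not the paper's implementation), a single simulated specular highlight drives the max and mean descriptors up while the median barely moves.

```python
import numpy as np

def mecs_channel_descriptors(f):
    """Avg / median / max pooled channel descriptors used by the MECS channel branch.

    f : feature map of shape (C, H, W); returns three vectors of shape (C,).
    """
    flat = f.reshape(f.shape[0], -1)
    return flat.mean(axis=1), np.median(flat, axis=1), flat.max(axis=1)

# A flat channel corrupted by one overexposed pixel (a simulated reflection spike).
f = np.ones((1, 4, 4))
f[0, 0, 0] = 100.0
z_avg, z_med, z_max = mecs_channel_descriptors(f)
print(z_avg[0], z_med[0], z_max[0])  # 7.1875 1.0 100.0
```

The average descriptor is pulled from 1.0 to about 7.2 and the max saturates at 100, whereas the median stays at 1.0, which is why it is a robust central-tendency statistic for reflective scenes.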
3.3.2. Multi-Scale Spatial Attention Mechanism
After channel refinement, the MECS module further introduces a multi-scale spatial attention mechanism to enhance sensitivity to locally informative regions. Let $F'$ denote the channel-weighted feature map. A set of $n$ depthwise separable convolution operators $\{\Phi_k\}_{k=1}^{n}$ with different receptive fields is applied to extract spatial features at multiple scales:
$$F'' = \sum_{k=1}^{n} \Phi_k(F').$$
The aggregated feature map $F''$ is then compressed by a $1 \times 1$ convolution to generate a single-channel spatial attention map:
$$M_s = \sigma\!\left( \mathrm{Conv}_{1 \times 1}(F'') \right).$$
Finally, $M_s$ is broadcast along the channel dimension and multiplied with $F'$ to obtain the spatially enhanced output:
$$F_{\mathrm{out}} = M_s \otimes F'.$$
By using convolution branches with diverse receptive fields, the spatial-attention mechanism improves the perception of local structures at different scales, which is particularly beneficial for small floating objects with weak boundaries and varying apparent sizes. Together, the median-enhanced channel branch and the multi-scale spatial branch provide a lightweight yet effective refinement mechanism for neck-stage feature fusion in YOLOv11-IMP.
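The multi-scale gating idea can be sketched with plain NumPy. Here, box (mean) filters stand in for the learned depthwise separable convolutions, and a channel mean stands in for the $1 \times 1$ squeeze; both substitutions are simplifying assumptions, so this shows the data flow rather than the trained module.

```python
import numpy as np

def box_filter(x, k):
    """Depthwise k x k mean filter with zero padding (stand-in for a learned depthwise conv)."""
    c, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].mean(axis=(1, 2))
    return out

def multiscale_spatial_attention(f, kernels=(3, 5, 7)):
    """Sum multi-scale responses, squeeze channels to one map, gate with Sigmoid."""
    agg = sum(box_filter(f, k) for k in kernels)   # multi-branch aggregation F''
    logits = agg.mean(axis=0, keepdims=True)       # channel squeeze (1x1-conv stand-in)
    m = 1.0 / (1.0 + np.exp(-logits))              # spatial attention map M_s, shape (1, H, W)
    return m * f                                   # broadcast gating along channels

rng = np.random.default_rng(1)
f = rng.standard_normal((4, 8, 8))
out = multiscale_spatial_attention(f)
print(out.shape)  # (4, 8, 8)
```

Because the gate lies in $(0, 1)$ and is shared across channels, spatial positions are attenuated consistently for all channels, which matches the broadcast multiplication in the final equation.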
3.4. Noisy-Label Audit and Train-Time Strategy
Water-surface frames inevitably contain specular highlights, reflection bands, and ripple-induced distortions, which may introduce imperfect supervision. Typical cases include missing boxes, class ambiguity (e.g., can vs. bottle), and box jitter, as illustrated in
Figure 4, where solid boxes denote clean ground truth and dashed boxes denote noisy labels caused by reflection bands, ripple-blurred edges, class-flip risk, and misaligned annotations. When directly used for optimization, such noisy labels can destabilize early training and degrade localization consistency for small floating debris, a phenomenon widely discussed in recent noisy-label learning studies [
29,
30]. In onboard scenarios, additional motion blur and exposure fluctuation may further increase annotation uncertainty, which motivates the use of a lightweight training-time robustness strategy without modifying the inference path [
2].
To reduce the influence of these artifacts while keeping the detector architecture unchanged, this paper adopts a lightweight train-time noisy-label screening strategy, as shown in
Figure 5. At epoch $t$, after a short warm-up stage of $T_w$ epochs, a mini-batch $\mathcal{B}_t$ produces per-sample detection losses $\{\ell_i\}_{i \in \mathcal{B}_t}$. The corresponding noise-aware batch loss is defined as
$$\mathcal{L}_t = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} w_i\, \ell_i,$$
where $w_i$ denotes a rank-based sample weight.
Each sample is first ranked by the small-loss criterion and, when available, can be further refined using a prediction-agreement score, such as the consistency between the main prediction head and an EMA or auxiliary head. This yields an ordering $\pi$ from relatively clean to relatively suspicious instances. Let $K_t = \lceil \rho_t\, |\mathcal{B}_t| \rceil$, where $\rho_t$ is the keep ratio at epoch $t$, and let $\lambda \in [0, 1)$ denote the soft down-weighting factor (a small value close to zero). The sample weights are then assigned as
$$w_i = \begin{cases} 1, & \operatorname{rank}_\pi(i) \le K_t, \\ \lambda, & \text{otherwise}, \end{cases}$$
so that only the top-$\rho_t$ fraction contributes with full weight, while the remaining samples are softly suppressed.
The keep ratio $\rho_t$ follows a simple linear schedule from $\rho_{\mathrm{start}}$ to $\rho_{\mathrm{end}}$ over the training process. Denoting the total number of epochs by $T$ and the warm-up length by $T_w$, the schedule is written as
$$\rho_t = \rho_{\mathrm{start}} + \left( \rho_{\mathrm{end}} - \rho_{\mathrm{start}} \right) \cdot \frac{t - T_w}{T - T_w}, \qquad T_w < t \le T.$$
Through this design, earlier iterations place greater emphasis on cleaner supervision, while later iterations gradually retain more informative but slightly harder samples. Since the strategy only affects sample weighting during training, no architectural change is introduced and the inference cost remains unchanged. The resulting cleaner supervision signal also provides a more stable basis for the subsequent regression optimization introduced in the next subsection.
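The schedule and rank-based weighting above amount to a few lines of code. The sketch below uses illustrative values for the keep ratios and down-weighting factor, and assumes all samples are kept during warm-up (the paper does not state the warm-up weighting explicitly):

```python
import math

def keep_ratio(t, T, T_w, rho_start, rho_end):
    """Linear keep-ratio schedule: rho_start right after warm-up, rho_end at epoch T."""
    if t <= T_w:
        return 1.0  # warm-up: keep everything (assumption)
    return rho_start + (rho_end - rho_start) * (t - T_w) / (T - T_w)

def rank_weights(losses, rho, lam):
    """Small-loss criterion: full weight for the lowest-loss rho fraction, lam for the rest."""
    k = math.ceil(rho * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    w = [lam] * len(losses)
    for i in order[:k]:     # the K_t cleanest samples get full weight
        w[i] = 1.0
    return w

losses = [0.2, 1.5, 0.4, 3.0]            # per-sample detection losses in one mini-batch
w = rank_weights(losses, rho=0.5, lam=0.1)
print(w)  # [1.0, 0.1, 1.0, 0.1]
```

With `rho=0.5`, the two smallest-loss samples receive weight 1.0 while the two suspicious high-loss samples are softly suppressed to `lam`, exactly the piecewise weighting in the equation above.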
3.5. WIoU-V3 Loss Function Design
In water-surface small-object detection, targets typically exhibit small size, blurred boundaries, and irregular shapes. These characteristics are further aggravated by ripples, specular reflections, and background clutter, which can induce center deviations and shape mismatch between predicted boxes and ground-truth boxes. Under such conditions, relying solely on the overlap ratio in conventional IoU-based objectives may provide insufficient sensitivity to spatial misalignment, thereby limiting gradient quality for precise localization.
To address this issue, this paper adopts the improved weighted IoU loss WIoU-v3, which introduces two mechanisms on top of the standard IoU framework [
31]: a center-deviation penalty that explicitly measures the displacement between the predicted box center and the ground-truth center, and a dynamic focusing mechanism that suppresses the influence of abnormal samples during optimization. These designs aim to improve gradient stability and localization quality for tiny floating objects under blurred edges and structural distortion.
As illustrated in
Figure 6, floating debris instances may present varied shapes and orientations, and the predicted box can deviate from the ground truth in center location even when the IoU remains non-zero. This issue is particularly critical for small objects, because a visually acceptable overlap may still correspond to a noticeable localization error. To account for this effect, WIoU-v3 first introduces a center-deviation penalty factor
$$R_{\mathrm{WIoU}} = \exp\!\left( \frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left( W_g^2 + H_g^2 \right)^{*}} \right),$$
where $(x, y)$ denotes the center of the predicted bounding box, $(x_{gt}, y_{gt})$ denotes the center of the ground-truth box, and $W_g$ and $H_g$ are the width and height of the smallest enclosing box, respectively (the superscript $*$ indicates that the term is detached from gradient computation).
Based on this factor, the weighted IoU loss is written as
$$\mathcal{L}_{\mathrm{WIoUv1}} = R_{\mathrm{WIoU}} \cdot \mathcal{L}_{\mathrm{IoU}}, \qquad \mathcal{L}_{\mathrm{IoU}} = 1 - \mathrm{IoU}.$$
In addition, water-surface datasets may contain low-quality samples caused by reflection interference, ripple-blurred boundaries, or imperfect annotations. These samples may lead to unstable gradients when extreme IoU values dominate the optimization process. To alleviate this issue, WIoU-v3 further introduces an abnormality-suppression mechanism by defining the ratio between the current IoU loss and a reference IoU loss level:
$$\beta = \frac{\mathcal{L}^{*}_{\mathrm{IoU}}}{\overline{\mathcal{L}}_{\mathrm{IoU}}} \in [0, +\infty);$$
a modulation factor $\alpha$ and a compression coefficient $\delta$ are then used to construct a focused scaling factor
$$r = \frac{\beta}{\delta\, \alpha^{\beta - \delta}},$$
where $\alpha$ and $\delta$ are predefined hyperparameters for dynamic reweighting. The final WIoU-v3 loss is thus given by
$$\mathcal{L}_{\mathrm{WIoUv3}} = r \cdot \mathcal{L}_{\mathrm{WIoUv1}}.$$
Through the above formulation, WIoU-v3 preserves the geometric alignment advantages of IoU-based regression while reducing the disruptive influence of abnormal samples and center misalignment. This design is particularly suitable for small-object localization in reflective water-surface scenes, where slight positional errors may lead to noticeable detection bias. Combined with the preceding train-time noisy-label screening strategy, it provides a more stable optimization basis for accurate bounding-box regression in YOLOv11-IMP.
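The full loss can be sketched for a single box pair as follows. This is an illustrative scalar implementation (no gradient detachment, since there is no autograd here); the values `alpha=1.9` and `delta=3.0` follow the defaults suggested in the WIoU paper, and the running-mean reference level is passed in as a plain number.

```python
import math

def iou(b1, b2):
    """IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def wiou_v3_loss(pred, gt, iou_loss_mean, alpha=1.9, delta=3.0):
    """WIoU-v3 sketch: center-deviation penalty R times dynamic focusing factor r."""
    l_iou = 1.0 - iou(pred, gt)
    # centers and smallest-enclosing-box size (treated as detached constants)
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_g, cy_g = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])
    r_wiou = math.exp(((cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2) / (wg ** 2 + hg ** 2))
    beta = l_iou / iou_loss_mean                   # outlier degree vs. reference level
    r = beta / (delta * alpha ** (beta - delta))   # dynamic focusing factor
    return r * r_wiou * l_iou

# A centered-but-smaller prediction is penalized less than a center-shifted one.
gt = (0.0, 0.0, 10.0, 10.0)
centered = (1.0, 1.0, 9.0, 9.0)
shifted = (2.0, 2.0, 11.0, 11.0)
print(wiou_v3_loss(centered, gt, 0.5) < wiou_v3_loss(shifted, gt, 0.5))  # True
```

The comparison at the end illustrates the center-deviation penalty: the shifted box incurs both a lower IoU and a nonzero $R_{\mathrm{WIoU}}$ exponent, so its loss exceeds that of the concentric box.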
4. Model Training and Experimental Deployment
4.1. Software Environment and Training Configuration
All training and evaluation experiments were conducted on a workstation platform to ensure stability and reproducibility. The implementation was developed in Python 3.8 with PyTorch 2.4.1 (cu121).
Table 1 summarizes the hardware and software configuration used for model training and offline evaluation. In addition to workstation-based experiments, embedded deployment on Jetson Nano was further profiled in terms of latency, throughput, and energy-per-frame when power monitoring was available (Section Energy–Latency Profile on Embedded Hardware).
During training, the Adam optimizer was adopted together with a cosine annealing learning-rate scheduler. The initial learning rate was set to 0.001, the batch size was set to 32, and the total number of training epochs was 300. The input images were resized to a unified resolution before being fed into the network, and the bounding-box regression branch employed the WIoU-v3 loss to improve localization quality for tiny objects under noisy supervision. In order to enhance robustness under complex water-surface backgrounds and alleviate category imbalance in the custom dataset, Mosaic augmentation and category-balanced sampling were further introduced during training.
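Under the stated settings, the optimizer and scheduler wiring can be sketched as follows; the placeholder module stands in for the detector, since only the optimizer type, initial learning rate, schedule, and epoch count are specified above.

```python
import torch

# Placeholder module standing in for the YOLOv11-IMP network.
model = torch.nn.Conv2d(3, 16, kernel_size=3)

# Adam with an initial learning rate of 0.001, as stated in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Cosine annealing across the full 300-epoch schedule.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... forward/backward passes over batches of 32 would run here ...
    optimizer.step()   # no-op here (no gradients), kept for call ordering
    scheduler.step()   # anneal the learning rate once per epoch
```

By the end of the schedule the learning rate has annealed to (near) zero; `scheduler.get_last_lr()` can be used to log it per epoch.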
For fair comparison, the proposed method and the compared baseline detectors were trained under a unified experimental protocol as much as possible, including consistent dataset partitions, image preprocessing, and evaluation criteria. This setting helps ensure that the performance differences mainly arise from architectural design and optimization strategy rather than inconsistencies in training conditions.
4.2. Datasets
To comprehensively evaluate the proposed method, two datasets were employed in this study, including a public benchmark dataset and a custom real-scene dataset. The public dataset was used to provide a standardized evaluation basis for water-surface floating-object localization, while the custom dataset was introduced to assess category diversity, robustness under complex interference, and practical generalization ability. The overall data selection criteria considered scene diversity, target representativeness, annotation validity, and the preservation of realistic water-surface disturbances, such as reflections, ripples, clutter, and partial occlusion.
4.2.1. CAS Water-Surface Floating Object Dataset
In the initial stage of experimentation, we adopted a publicly available water-surface floating-object dataset released by the Chinese Academy of Sciences (CAS), referred to as IWHR_AI_Lable_Floater_V1 [
23]. The dataset contains 3000 high-resolution images, each annotated with a single category label, floater. It was selected because it provides a representative public benchmark for floating-object detection and contains a large proportion of small targets under realistic water-surface interference.
Statistical analysis indicates that 94.5% of annotated targets occupy less than 10% of the image area, reflecting typical characteristics of small-object detection. The dataset also covers diverse target poses and scales and includes annotation uncertainty caused by real water-surface disturbances such as reflections and ripples, which makes precise localization non-trivial. These properties make the CAS dataset suitable for evaluating the sensitivity of a detector to tiny floating objects and its localization stability under reflective interference.
Figure 7 visualizes two key distribution patterns of the annotations: (a) shows the spatial density distribution of object centers, and (b) shows the distribution of object width–height ratios. These statistics confirm that the dataset is both representative and challenging for floating-debris detection, thus providing a reliable basis for public-benchmark evaluation. The CAS dataset was divided into training, validation, and test subsets at a ratio of 8:1:1, and all compared methods were evaluated under the same partition protocol.
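The 8:1:1 partition can be reproduced with a simple shuffled split. The helper name and fixed seed below are illustrative choices, not details from the paper:

```python
import random

def split_811(items, seed=0):
    """Shuffle and partition a list of samples into train/val/test
    subsets at an 8:1:1 ratio (illustrative helper)."""
    rng = random.Random(seed)   # fixed seed for a reproducible partition
    items = list(items)
    rng.shuffle(items)
    n_train = int(len(items) * 0.8)
    n_val = int(len(items) * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

For the 3000-image CAS dataset this yields 2400, 300, and 300 images for training, validation, and testing, respectively.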
4.2.2. Custom Water-Surface Debris Dataset
To address the limitations of existing public datasets in terms of category diversity and multi-scenario generalization, we constructed a custom dataset for water-surface debris detection. This dataset consists of 3000 real-world images covering 11 representative classes, including can, leaf, branch, plastic_bottle, plastic_box, and glass_bottle. It was introduced as a complement to the public benchmark so that the proposed method could be further evaluated in more diverse and practically relevant conditions.
Images were collected from multiple water environments, including still water, flowing water, strong reflective surfaces, and partially occluded regions, so as to preserve realistic variations in illumination, texture, clutter, and background dynamics. During dataset construction, samples were retained according to four main principles: representative debris categories frequently observed in real scenes, sufficient diversity of target scale and pose, valid manual annotation quality, and preservation of challenging interference factors such as reflections, ripples, and boundary ambiguity. Instead of deliberately removing difficult samples, these disturbances were retained as much as possible to better reflect practical detection conditions.
As shown in Figure 8, the top-left region indicates a naturally imbalanced category distribution, where classes such as can occur more frequently, while rare classes such as ball and bad_fruit are underrepresented. This long-tail distribution is consistent with the frequency characteristics of real floating debris. The right region provides representative annotation samples for 11 classes across different materials, sizes, and water-surface backgrounds, further demonstrating the diversity and complexity of the collected data.
Compared with common detection benchmarks and baseline evaluation settings such as Faster R-CNN [
6], YOLOv7 [
27], and RetinaNet [
32], the proposed custom dataset emphasizes broader category coverage, stronger water-surface interference, and more realistic scene variability. Therefore, it serves as a complementary benchmark for evaluating robustness and generalization in reflective water-surface environments.
Similar to the CAS dataset, the custom dataset was also divided into training, validation, and test subsets at a ratio of 8:1:1, and all compared models were evaluated under the same partition protocol.
4.3. Experimental Deployment Configuration
To further validate real-world detection performance and deployment feasibility, an embedded inference platform based on the NVIDIA Jetson Nano was implemented and tested in natural river environments. The deployment system integrates the Jetson Nano as the onboard inference unit together with auxiliary components such as a GPS positioning module and representative floating-object samples for controlled field evaluation.
As shown in Figure 9, the system operates without external GPU support, reflecting a realistic resource-constrained deployment setting. The purpose of this deployment experiment is not only to verify detection feasibility in real scenes but also to assess whether the proposed model can maintain acceptable runtime characteristics under practical embedded constraints. Accordingly, in addition to accuracy-related results, latency, throughput, and energy-per-frame (when power monitoring is available) are reported in Section Energy–Latency Profile on Embedded Hardware. This configuration provides a deployment-oriented basis for evaluating the practical applicability of YOLOv11-IMP in real-world water-surface monitoring and cleanup scenarios.
5. Results
To evaluate the proposed YOLOv11-IMP for small-object detection in complex water-surface environments, experiments were conducted on both the public CAS floating-object dataset and the custom multi-category water-surface debris dataset. For a comprehensive comparison, YOLOv11-IMP was evaluated against multiple representative detectors, including mainstream YOLO variants, a two-stage detector, an anchor-free detector family, and a transformer-based detector. This comparison setting was designed to cover different detection paradigms and evaluation perspectives, including detection accuracy, inference efficiency, model complexity, and deployment feasibility on embedded hardware.
5.1. Detection Accuracy Comparison
Figure 10 presents representative qualitative detection results of YOLOv11-IMP and several mainstream detectors on eight typical categories of water-surface debris, namely branch, can, grass, fruit, bottle, milk_box, plastic_bag, and floater (CAS). Under challenging conditions such as specular reflections, aquatic-plant occlusion, blurred boundaries, and low resolution, YOLOv11-IMP generally produces tighter bounding boxes, fewer missed detections, and fewer fragmented predictions. These visual results indicate that the proposed design improves robustness in reflective and cluttered water-surface scenes.
Quantitative results in Table 2 further confirm the above trend. YOLOv11-IMP achieves the highest class-wise precision in 6 out of 13 categories and obtains the best overall AP75 of 81.5%. Compared with Faster R-CNN, YOLOv8, and the original YOLOv11 baseline, the proposed method provides superior detection accuracy while maintaining a lightweight computational footprint of 7.3 GFLOPs. This result suggests that the proposed improvements effectively enhance small-object discriminability without introducing excessive deployment burden.
Figure 11 compares the normalized confusion matrices of YOLOv11 and YOLOv11-IMP on the custom dataset. Compared with the baseline model, YOLOv11-IMP exhibits reduced confusion among visually similar categories. In particular, misclassifications between plastic_box and milk_box, grass and background, and leaf and can are alleviated, resulting in stronger diagonal dominance and more compact off-diagonal distributions. These patterns indicate improved category discriminability for small and weak-boundary objects.
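For reference, the row normalization behind such matrices can be sketched as follows (a generic helper, not the paper's evaluation code): each row is divided by its sum, so entry (i, j) gives the fraction of true-class-i samples predicted as class j, and the diagonal reads as per-class recall.

```python
import numpy as np

def row_normalize(cm):
    """Row-normalize a confusion matrix so each row sums to 1."""
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)
```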
The observed gains can be attributed to the complementary effects of the proposed components. GCSA strengthens global channel–spatial interaction during multi-scale fusion, which improves context awareness and suppresses background-induced false responses. This effect is particularly beneficial in cluttered regions, where reflective patterns or vegetation-like textures may otherwise activate the detector incorrectly. MECS further enhances local detail retention and weak-boundary perception by combining robust channel statistics with multi-scale spatial refinement, which is beneficial for categories with strong local interference and blurred contours. In addition, the WIoU-v3 regression objective improves box alignment quality by reducing center deviation and shape mismatch, thereby improving localization stability for tiny reflective targets, such as can and floater.
Notably, these components exhibit a complementary effect on plastic_bag, a particularly challenging category due to soft contours, frequent occlusion, and partial transparency. In this case, GCSA provides stronger global contextual guidance, MECS improves local response concentration, and WIoU-v3 stabilizes bounding-box regression. Their joint contribution helps reduce false positives in confusing regions and improve class-wise performance on difficult categories.
Overall, YOLOv11-IMP achieves balanced improvement in detection accuracy, category discriminability, and computational efficiency. These results support the effectiveness of the proposed design for robust small-object detection in complex water-surface environments, with further deployment-oriented profiling reported in Section Energy–Latency Profile on Embedded Hardware.
5.2. Module Ablation and Efficiency Analysis
To quantify the contribution of each proposed component, controlled ablation experiments were conducted starting from the original YOLOv11 baseline. The GCSA module, the MECS module, the WIoU-v3 regression objective, and the train-time noisy-label screening strategy were progressively introduced to form a series of enhanced variants. Detection accuracy, model complexity, and inference efficiency were then jointly evaluated. Unless otherwise stated, all ablation results were obtained under the same training protocol and on the same split of the custom water-surface debris dataset.
As shown in Table 3 and Figure 12, the baseline YOLOv11 achieves 86.1% AP75 with 6.9 GFLOPs and 3.3 ms measured latency. After introducing GCSA, the model size increases from 30 M to 32 M parameters with only a slight computational increase, while AP75 rises to 86.9%. This indicates that global channel–spatial interaction improves robustness against background interference during feature fusion.
When MECS is introduced, AP75 further increases to 87.4% with nearly unchanged FLOPs, indicating that improved weak-boundary perception and local-detail retention can benefit reflective water-surface detection. When GCSA and MECS are jointly used, AP75 rises to 88.3% with only modest overhead, demonstrating that global contextual modeling and local detail enhancement are complementary.
After enabling WIoU-v3, AP75 further increases to 89.5% while FLOPs remain at 7.3 G, showing that improved regression guidance can substantially benefit tiny-object localization. Finally, the train-time noisy-label screening strategy yields the best AP75 of 89.9% without changing the inference path. Therefore, model complexity and runtime characteristics remain unchanged compared with the previous variant, while the optimization process becomes more robust to reflection- and ripple-induced label uncertainty.
Overall, GCSA, MECS, and WIoU-v3 provide progressive and consistent accuracy gains with controllable computational overhead, and the train-time noisy-label screening strategy further improves accuracy without increasing inference cost. These results support the effectiveness of the proposed design for practical deployment under resource-constrained conditions.
Energy–Latency Profile on Embedded Hardware
To support deployment on battery-powered electric platforms, the inference efficiency of representative models was additionally profiled on an NVIDIA Jetson Nano. Inference latency and throughput were measured under the same input resolution and batch setting as used in the deployment tests. When power monitoring was available, the average module power during steady-state inference was recorded and the energy consumption per frame was computed as
\[ E = \bar{P} \cdot t, \]
where $\bar{P}$ denotes the average power in watts and $t$ denotes the measured latency in milliseconds, so that $E$ is expressed in millijoules per frame. If direct power measurements were unavailable for some configurations, only latency and throughput were reported.
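Since watts multiplied by milliseconds gives millijoules directly, the per-frame energy metric reduces to a single product, as in this minimal helper (the function name is chosen for illustration):

```python
def energy_per_frame_mj(avg_power_w, latency_ms):
    """Energy per frame E = P_avg * t; watts x milliseconds = millijoules."""
    return avg_power_w * latency_ms
```

For example, a module drawing 5 W on average at 10 ms per frame spends 50 mJ per detection.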
As shown in Table 4, YOLOv11-IMP maintains a favorable accuracy–efficiency balance for embedded inference. The added attention and regression components introduce only modest overhead in latency and energy per frame, which is important for long-duration missions on electric platforms. This analysis complements the ablation results in Table 3 by providing a deployment-oriented efficiency view beyond FLOPs, since practical embedded performance can also be influenced by memory access patterns, kernel efficiency, and device power states.
6. Conclusions
This paper addresses small-object detection in reflective water-surface environments, where weak boundaries, dynamic reflections, ripples, and embedded deployment constraints jointly degrade detection accuracy and runtime stability for onboard perception. To address these challenges, a lightweight detector, YOLOv11-IMP, is developed based on YOLOv11. The proposed framework enhances the baseline model through complementary designs, including an attention-enhanced high-level backbone stage based on SPPF and C2PSA, a Global Channel–Spatial Attention (GCSA) module for cross-scale feature interaction, a Median-Enhanced Channel–Spatial Attention (MECS) module for weak-boundary refinement, a WIoU-v3 regression objective for improved localization, and a train-time noisy-label screening strategy for more robust optimization without changing the inference path.
Experiments on the public CAS floating-object dataset and a custom multi-category water-surface debris dataset demonstrate that YOLOv11-IMP achieves clear advantages in small-object detection under complex water-surface interference. Compared with the YOLOv11 baseline, the proposed method improves AP75 by 3.3% while maintaining lightweight complexity at 7.3 GFLOPs. In addition, the model reduces confusion among visually similar categories and produces tighter localization for small, weak-boundary, and reflection-disturbed targets. The embedded profiling results on NVIDIA Jetson Nano further indicate that the proposed detector maintains favorable runtime characteristics under the tested setting, supporting its practical applicability to onboard perception in water-quality monitoring and electric surface-cleaning scenarios.
Overall, YOLOv11-IMP provides a favorable balance among detection accuracy, computational efficiency, and deployment feasibility, making it a practical solution for energy-constrained onboard perception in battery-powered electric unmanned surface vehicles and electrified surface-cleaning platforms. This balance is particularly important for long-duration autonomous inspection and cleanup missions, where perception reliability and embedded efficiency directly affect mission endurance and operational effectiveness. In future work, semi-supervised learning, domain adaptation, and broader real-scene validation will be explored to reduce annotation cost and further improve robustness under unseen water conditions.