Article

A Novel Method for Water Surface Debris Detection Based on YOLOV8 with Polarization Interference Suppression

1 School of Smart Marine Science and Technology, Fujian University of Technology, Fuzhou 350118, China
2 Fujian Provincial Key Laboratory of Marine Smart Equipment, Fuzhou 350118, China
3 School of Electronic, Electrical Engineering and Physics, Fujian University of Technology, Fuzhou 350118, China
4 School of Computer Science and Mathematics, Fujian University of Technology, Fuzhou 350118, China
5 Faculty of Geosciences and Engineering, Southwest Jiaotong University, Chengdu 611756, China
* Author to whom correspondence should be addressed.
Photonics 2025, 12(6), 620; https://doi.org/10.3390/photonics12060620
Submission received: 24 April 2025 / Revised: 9 June 2025 / Accepted: 12 June 2025 / Published: 18 June 2025
(This article belongs to the Special Issue Advancements in Optical Measurement Techniques and Applications)

Abstract

Aquatic floating debris detection is a key technological foundation for ecological monitoring and integrated water environment management. It holds substantial scientific and practical value in applications such as pollution source tracing, floating debris control, and maritime navigation safety. However, this field faces ongoing challenges due to water surface polarization. Reflections of polarized light produce intense glare, resulting in localized overexposure, detail loss, and geometric distortion in captured images. These optical artifacts severely impair the performance of conventional detection algorithms, increasing both false positives and missed detections. To overcome these imaging challenges in complex aquatic environments, we propose a novel YOLOv8-based detection framework with integrated polarized light suppression mechanisms. The framework consists of four key components: a fisheye distortion correction module, a polarization feature processing layer, a customized residual network with Squeeze-and-Excitation (SE) attention, and a cascaded pipeline for super-resolution reconstruction and deblurring. Additionally, we developed the PSF-IMG dataset (Polarized Surface Floats), which includes common floating debris types such as plastic bottles, bags, and foam boards. Extensive experiments demonstrate the network’s robustness in suppressing polarization artifacts and enhancing feature stability under dynamic optical conditions.

1. Introduction

Marine plastic pollution has become a major global sustainability challenge, with millions of metric tons of plastic debris entering marine ecosystems each year and endangering the survival of many marine species [1]. Conventional manual inspection methods fall short of meeting large-scale aquatic monitoring needs due to limitations in range and efficiency. Against this backdrop, there is a growing need to replace traditional human-operated systems with intelligent unmanned surface vehicles (USVs). Advances in deep learning have positioned vision-based intelligent USVs for automated debris removal as a leading research direction. However, the performance, precision, and efficiency of these systems largely depend on the effectiveness of visual floating debris detection methods. Integrating USV technology with advanced deep learning-based object detection enables the development of autonomous systems that can identify and remove floating debris, thereby significantly improving aquatic cleanup efficiency [2]. The effectiveness of USV-based debris removal relies on the accuracy and real-time performance of their detection subsystems. Modern advances in computer vision and deep learning now allow comprehensive use of visual information. As a result, vision-based debris detection is currently among the most cost-effective and operationally efficient solutions [3].
Polarized light reflections on water surfaces degrade the performance of existing target recognition algorithms, leading to reduced accuracy and increased false positives and missed detections. This occurs because polarized reflections cause water surfaces and floating debris to share similar optical characteristics, making target discrimination more difficult. In addition, solar glare on water surfaces introduces optical flare and detail loss in vessel targets captured by imaging systems [4]. To address these issues, we propose a solar glare suppression method that combines polarized light filtering with adaptive polynomial fitting. By leveraging the polarization characteristics of sunlight reflected from water surfaces, the proposed method effectively reduces glare during image acquisition, thereby improving target recognition accuracy [5].
To address the challenges of polarized light reflections in USV-based aquatic debris removal, existing approaches mainly include adaptive or multi-exposure techniques, multi-view image acquisition and fusion, and the use of optical filters [6,7,8]. Although these methods partially mitigate the performance degradation caused by reflections, each has notable limitations. Adaptive or multi-exposure techniques require precise hardware control and robust image fusion algorithms [9]. Multi-view systems increase equipment costs and computational complexity [10]. Optical filters suppress reflections directly without requiring additional image capture but demand strict parameter tuning [11]. Moreover, polarized environments under specific conditions—such as high refraction angles from oblique sunlight, multiple scattering in turbid waters, and mixed vegetation–rock polarization scenarios—remain insufficiently studied in a systematic manner.
To address these challenges, we propose an innovative YOLOv8-Polarization Disturbance Learning fusion network deployed on a self-developed trimaran USV for debris collection across diverse aquatic environments. The architecture integrates three core innovations: (1) a Polarization-Aware Feature Enhancement Module (PAFEM) that enhances specular reflection suppression via multispectral imaging fusion; (2) a Dynamic Feature Calibration Network (DFCN) leveraging Squeeze-and-Excitation (SE) attention to maintain feature stability under wave-induced disturbances; and (3) a Lightweight Deblurring Pipeline (LDP) that employs GPU-accelerated deblurring to optimize computational efficiency. Experimental results for the PSF-IMG dataset (Polarized Surface Floats)—a comprehensive collection of polarized aquatic environments—showed improved mAP@50 compared to baseline YOLOv8, while maintaining a balance between precision and processing speed. As shown in Figure 1, the dataset was built under controlled strong polarization conditions, following strict protocols for image acquisition, curation, annotation, and preprocessing. This provides a solid foundation for research on target discrimination under polarization phenomenology. Future experiments will benchmark system performance to verify the framework’s adaptability to complex optical interference conditions.

2. Related Work

The development of aquatic target detection technology has primarily followed two paradigms: physics-based methods and deep learning approaches. Physics-based optical imaging techniques enable precise manipulation through polarization modulation, reconstruction, and micro/nano-scale structural design. Polarization imaging utilizes Stokes vector analysis of polarization state differences (e.g., DOLP, AOP) to enhance target detection in turbid waters by suppressing polarization-differential scattering [12,13]. Polarization imaging combines compressive sensing with temporal encoding, enabling single-shot multi-polarization acquisition via focal-plane polarization imaging systems. However, conventional polarization imaging approaches that require sequential acquisitions (e.g., using mechanically rotating polarizer wheels or liquid crystal variable retarders) inherently depend on temporal modulation or complex optical configurations. These sequential methods introduce latency due to the multiple exposures needed and can suffer from reduced optical efficiency and misregistration artifacts in dynamic scenes [14]. In contrast, while DoFP systems eliminate the need for moving parts and capture data rapidly in a single shot, they achieve this at the cost of spatial resolution, as each pixel senses only one polarization state, requiring interpolation to reconstruct the full Stokes vector at each point [15,16]. Furthermore, physics-based methods face limitations in environmental adaptability: time-sequential polarization systems are sensitive to transient angle fluctuations induced by dynamic waves, and Fresnel reflection models break down beyond Brewster angles, resulting in increased false detection rates [12,14].
Deep learning methods overcome conventional physical constraints through data-driven strategies and offer notable advantages in small-target detection and image deblurring tasks. Single-stage detection frameworks based on YOLO variants enhance mAP@0.5 for small targets in turbid environments by employing deformable convolutions and dynamic receptive fields, while retaining real-time performance [17,18]. Light-field super-resolution networks (e.g., EPIT, LFT) integrate epipolar-constrained attention mechanisms to focus on hourglass-shaped epipolar plane image (EPI) regions, effectively suppressing irrelevant features and outperforming conventional methods in PSNR metrics [19]. Physics-enhanced models such as PhysenNet combine forward diffraction propagation with neural networks to enable label-free phase retrieval, albeit with increased computational complexity due to iterative optimization [20]. However, deep learning approaches still face several challenges: feature loss due to downsampling in small-target detection (resulting in high information loss in deeper layers), weak cross-domain generalization [21,22], and the need for large annotated datasets in multi-modal fusion (e.g., infrared-visible). In addition, semi-supervised methods often suffer from persistent false detection in low-contrast scenarios.
Despite the substantial progress that has been made in this research area, four key bottlenecks remain unresolved: (1) Most current models are designed for general scenarios and fail to adequately address edge distortion effects induced by the low viewing angles and wide field-of-view (FOV) characteristics of unmanned surface vehicles (USVs). For instance, YOLOv5 shows increased false detection rates in peripheral regions of images [18]. (2) Reflection suppression methods often assume ideal optical conditions. Traditional high-dynamic-range (HDR) techniques, however, experience significant performance degradation in rainy or foggy environments due to multiple scattering effects [23]. (3) The lack of fine-grained classification leads to frequent confusion between aquatic vegetation and buoys in Faster R-CNN frameworks, highlighting the need for refined feature decoupling mechanisms [24]. (4) A persistent trade-off between real-time processing and detection accuracy remains a major challenge. Although SSD models offer high frame rates, they suffer from considerable resolution loss when parsing blurred images [25]. To systematically overcome these challenges, we propose an integrated solution comprising an adaptive polarization filtration system, a physics-guided domain adaptation strategy, a hierarchical feature decoupling network, and an optimized lightweight dual-stream reconstruction model.

3. Methodology

Figure 2 illustrates the workflow of the water surface debris detection system built upon an enhanced YOLOv8 architecture. The system consists of multiple modules engineered for efficient and accurate debris detection in complex aquatic environments. In the data preparation phase, fisheye lens distortion correction and image enhancement are implemented to eliminate geometric distortions caused by a wide field of view and to improve image quality. The image preprocessing module applies polarized reflection suppression and deblurring techniques to enhance usable image information while reducing environmental interference. Subsequently, the feature extraction module uses a backbone network integrated with the Squeeze-and-Excitation (SE) attention mechanism for hierarchical feature learning, supported by image normalization and enhancement operations to improve detection accuracy. The detection output is refined using an object detection head followed by Non-Maximum Suppression (NMS) to ensure accurate target localization and classification. To optimize the model, Exponential Moving Average (EMA) smoothing is employed to improve training stability. Finally, the pipeline integrates detection results with evaluation metrics to produce comprehensive performance reports. Through the systematic integration of these technical components, the detection system achieves robust performance and real-time processing in dynamic aquatic environments.

3.1. Data Processing

In the water surface debris detection task, image quality is influenced by multiple factors such as surface reflection, scattering, and field of view (FOV). To address these issues, this study proposes a comprehensive image processing method.
Our fisheye distortion correction method comprises three core components: dynamic parameter estimation, azimuth-perception correction, and region-adaptive mapping; its mathematical formulation is defined below.
A dynamic distortion estimation module is established through a content-aware analysis framework, in which a distortion characterization metric quantifies geometric aberrations.
The distortion parameters $(k_1, k_2, r_p)$ are dynamically estimated by the content-aware analysis framework, and their values are related to the edge intensity distribution and ellipse aspect ratio.
The unified distortion correction model defines the corrected radial distance $r_{\mathrm{corr}}$ through a piecewise function that implements region-adaptive compensation based on a normalized radius $\bar{r} = r / r_p$:
$$
r_{\mathrm{corr}} =
\begin{cases}
r, & \bar{r} \le 0.5 \\
(1-\beta)\,r + \beta\, r\left[1 + \Gamma(r,\theta)\left(k_1 r^2 + k_2 r^4\right)\right], & 0.5 < \bar{r} \le 0.8 \\
\omega\, r + (1-\omega)\, r\left[1 + \Gamma(r,\theta)\left(k_1 r^2 + k_2 r^4\right)\right], & \bar{r} > 0.8
\end{cases}
$$
For the central preservation zone ($\bar{r} \le 0.5$), the original radial coordinate is retained ($r_{\mathrm{corr}} = r$) to prevent overcorrection artifacts near the optical axis. In the transition optimization zone ($0.5 < \bar{r} \le 0.8$), a blended correction $r_{\mathrm{corr}} = (1-\beta)\,r + \beta\, r\left[1 + \Gamma(r,\theta)\left(k_1 r^2 + k_2 r^4\right)\right]$ is applied, where the cosine-derived coefficient $\beta = \tfrac{1}{2}\left[1 - \cos\left(\pi \cdot \tfrac{\bar{r} - 0.5}{0.3}\right)\right]$ ensures smooth interpolation between uncorrected and fully corrected coordinates. Within the edge correction zone ($\bar{r} > 0.8$), the linearly weighted compensation $r_{\mathrm{corr}} = \omega\, r + (1-\omega)\, r\left[1 + \Gamma(r,\theta)\left(k_1 r^2 + k_2 r^4\right)\right]$ prioritizes distortion removal at the periphery using the radial decay weight $\omega = 1 - \tfrac{\bar{r} - 0.8}{0.2}$. The directional modulator $\Gamma(r, \theta)$ preserves horizontal feature integrity through angular weighting while enforcing radial transition continuity via its sigmoid component.
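To make the three-zone rule concrete, the following NumPy sketch evaluates the piecewise correction on arrays of polar coordinates; the function and variable names are illustrative, and the directional modulator $\Gamma(r, \theta)$ is passed in as a caller-supplied function rather than implemented here.

```python
import numpy as np

def radial_correction(r, theta, k1, k2, r_p, gamma):
    """Region-adaptive radial correction (illustrative sketch).

    r, theta : arrays of polar pixel coordinates
    k1, k2   : radial distortion coefficients (assumed already estimated)
    r_p      : reference radius used for normalization
    gamma    : callable Gamma(r, theta) giving the directional modulator
    """
    r = np.asarray(r, dtype=float)
    r_bar = r / r_p                                    # normalized radius
    full = r * (1.0 + gamma(r, theta) * (k1 * r**2 + k2 * r**4))

    r_corr = r.copy()                                  # central zone (r_bar <= 0.5): keep r

    # Transition zone (0.5 < r_bar <= 0.8): cosine-blended interpolation.
    beta = 0.5 * (1.0 - np.cos(np.pi * (r_bar - 0.5) / 0.3))
    trans = (1.0 - beta) * r + beta * full
    r_corr = np.where((r_bar > 0.5) & (r_bar <= 0.8), trans, r_corr)

    # Edge zone (r_bar > 0.8): linearly weighted compensation toward full correction.
    omega = 1.0 - (r_bar - 0.8) / 0.2
    edge = omega * r + (1.0 - omega) * full
    r_corr = np.where(r_bar > 0.8, edge, r_corr)
    return r_corr
```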
The corrected polar coordinates ( r c o r r and θ ) are transformed to Cartesian coordinates:
$$x = W/2 + r_{\mathrm{corr}} \cos\theta \cdot W/2, \qquad y = H/2 + r_{\mathrm{corr}} \sin\theta \cdot H/2$$
where (W,H) denotes the image resolution. Subpixel-accurate coordinate mapping is achieved through bicubic spline interpolation.
$$I_{\mathrm{correct}}(x, y) = \sum_{i=-1}^{2}\sum_{j=-1}^{2} I\left(\lfloor x \rfloor + i,\ \lfloor y \rfloor + j\right)\cdot B(i - \Delta x)\cdot B(j - \Delta y)$$
Here, $B(\cdot)$ denotes the cubic B-spline basis function, and $\Delta x = x - \lfloor x \rfloor$ and $\Delta y = y - \lfloor y \rfloor$ represent the subpixel offset components. Boundary handling employs a mirror padding strategy:
$$I(x, y) = I\left(\min(\max(x, 0),\ W-1),\ \min(\max(y, 0),\ H-1)\right)$$
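A minimal sketch of the inverse mapping and subpixel resampling is shown below, assuming SciPy's map_coordinates is used for the cubic-spline interpolation with mirror-style boundary handling; the exact implementation may differ.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def remap_image(img, r_corr, theta):
    """Resample a (H, W) grayscale image at the corrected polar coordinates.

    r_corr, theta : arrays of shape (H, W) giving the corrected polar coordinates
                    (in the normalized units used by the correction formula).
    """
    H, W = img.shape
    # Corrected polar coordinates back to Cartesian pixel positions.
    x = W / 2 + r_corr * np.cos(theta) * (W / 2)
    y = H / 2 + r_corr * np.sin(theta) * (H / 2)
    # order=3 gives cubic spline interpolation; 'mirror' handles out-of-range samples.
    coords = np.stack([y.ravel(), x.ravel()])          # row coordinates first, then columns
    out = map_coordinates(img, coords, order=3, mode='mirror')
    return out.reshape(H, W)
```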
This method adapts to the distortion characteristics of diverse lenses through a dynamic parameter estimation mechanism, while the directional weighting function effectively preserves the structural integrity of vertical linear features. The regionalized correction strategy achieves balanced barrel distortion rectification while circumventing edge-stretching artifacts, thereby maintaining geometric consistency across the entire image plane.
Next, to mitigate the interference caused by water surface reflections, an image processing method based on polarization characteristics was designed. This method calculates the pixel polarization degree using a weighted color difference method:
$$P = \sqrt{(d_r \cdot w_r)^2 + (d_g \cdot w_g)^2 + (d_b \cdot w_b)^2}$$
In this formulation, $d_r$, $d_g$, and $d_b$ denote the deviations of the RGB channel intensities from their mean value. Extensive experimental studies have demonstrated that the perceptual weighting coefficients $w_r = 0.299$, $w_g = 0.587$, and $w_b = 0.114$ optimally align with human visual sensitivity. Concurrently, the computed polarization degree is used to generate reflection region masks through an adaptive thresholding method, which is combined with local contrast enhancement to achieve effective reflection suppression.
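The weighted color difference and the adaptive reflection mask can be sketched as follows; the threshold factor alpha and the 25 × 25 averaging window are assumed placeholders rather than values taken from the implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def polarization_proxy(img):
    """Per-pixel polarization-degree proxy from weighted RGB deviations (sketch).

    img : float array of shape (H, W, 3), values in [0, 1].
    """
    mean = img.mean(axis=2, keepdims=True)          # per-pixel channel mean
    d = img - mean                                  # channel deviations d_r, d_g, d_b
    w = np.array([0.299, 0.587, 0.114])             # perceptual weights
    return np.sqrt(np.sum((d * w) ** 2, axis=2))

def reflection_mask(P, window=25, alpha=1.5):
    """Adaptive-threshold reflection mask; alpha is an assumed factor."""
    P_local = uniform_filter(P, size=window)        # local mean polarization degree
    return P > alpha * P_local
```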

3.2. Polarization Feature Enhancement Network

3.2.1. Feature Pyramid Network with Squeeze-and-Excitation

In this paper, an FPN-SE (Feature Pyramid Network with Squeeze-and-Excitation) architecture, based on Feature Pyramid Networks (FPN) and Squeeze-and-Excitation (SE) modules, was designed, as shown in Figure 3. This module combines the traditional FPN structure with the SE module to enhance multi-scale feature fusion and expression capabilities. The traditional FPN structure efficiently captures multi-scale contextual information and performs feature fusion, while the introduction of the SE module further enhances the adaptive channel weights, improving the representation of important features while suppressing irrelevant ones.
In the FPN-SE module, the SE module is placed at critical points in the FPN feature fusion process. It performs global average pooling (GAP) to compress the information of each channel, thus obtaining the global information for each channel. This information is then passed through a fully connected layer and activation functions (ReLU and Sigmoid) to generate adaptive weights for each channel, which are subsequently used to recalibrate the channels, enhancing key features and suppressing irrelevant information. The feature fusion in the FPN-SE module is conducted in a top–down manner, with low-level features undergoing 1 × 1 convolution followed by batch normalization (BN) and an activation function (SiLU), generating enhanced feature maps. These are then fused with higher-level features and further refined through 3 × 3 convolution to strengthen feature interactions. This process ensures the effective fusion of fine-grained information from lower layers with semantic information from higher layers, ultimately enabling precise multi-scale object detection. Through this design, the FPN-SE module not only enhances the network’s robustness but also improves its capability in handling complex optical environments.
The module processes three input feature maps from the backbone network: $C_3$ (low-level features with a spatial resolution of 80 × 80 and 256 channels), $C_4$ (mid-level features with a resolution of 40 × 40 and 512 channels), and $C_5$ (high-level semantic features with a resolution of 20 × 20 and 1024 channels). Each input feature is aligned in dimensions through lateral convolutions:
$$C_{\mathrm{lat}}^{i} = \mathrm{SiLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times 1}(C_i)\right)\right), \quad i = 3, 4, 5$$
where $\mathrm{Conv}_{1\times 1}$ refers to a 1 × 1 convolution, $\mathrm{BN}$ represents batch normalization, and $\mathrm{SiLU}$ denotes the Sigmoid Linear Unit activation function.
A feature pyramid is constructed through iterative upsampling and fusion, as outlined below:
First, the $P_5$ feature is generated via a 3 × 3 convolution:
$$P_5 = \mathrm{Conv}_{3\times 3}\left(\mathrm{BN}\left(C_{\mathrm{lat}}^{5}\right)\right)$$
Next, the $P_4$ feature is generated via bilinear upsampling and fusion:
$$P_4 = \mathrm{Conv}_{3\times 3}\left(\mathrm{BN}\left(C_{\mathrm{lat}}^{4} + \mathrm{Upsample}_{\times 2}(P_5)\right)\right)$$
Finally, the $P_3$ feature is generated in a similar manner:
$$P_3 = \mathrm{Conv}_{3\times 3}\left(\mathrm{BN}\left(C_{\mathrm{lat}}^{3} + \mathrm{Upsample}_{\times 2}(P_4)\right)\right)$$
where the bilinear upsampling operation $\mathrm{Upsample}_{\times 2}$ increases the spatial resolution by a factor of two through linear interpolation.
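A compact PyTorch-style sketch of the lateral 1 × 1 convolutions and top-down fusion is given below; channel counts follow the text, but the placement of normalization around the fused sum is simplified and should be read as an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Top-down FPN fusion: 1x1 lateral convs (BN + SiLU) plus 3x3 smoothing convs (sketch)."""
    def __init__(self, in_channels=(256, 512, 1024), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, out_ch, 1, bias=False),
                          nn.BatchNorm2d(out_ch), nn.SiLU())
            for c in in_channels])                       # C3, C4, C5 -> C_lat^3..5
        self.smooth = nn.ModuleList([
            nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_channels])

    def forward(self, c3, c4, c5):
        l3, l4, l5 = (lat(c) for lat, c in zip(self.lateral, (c3, c4, c5)))
        p5 = self.smooth[2](l5)
        p4 = self.smooth[1](l4 + F.interpolate(p5, scale_factor=2, mode='bilinear',
                                               align_corners=False))
        p3 = self.smooth[0](l3 + F.interpolate(p4, scale_factor=2, mode='bilinear',
                                               align_corners=False))
        return p3, p4, p5
```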
Each fused pyramid level is then recalibrated by a Squeeze-and-Excitation (SE) block:
$$\tilde{F} = \sigma\!\left(W_2 \cdot \delta\!\left(W_1 \cdot \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} F(i, j)\right)\right) \odot F$$
The squeeze step compresses spatial information into channel descriptors via global average pooling (GAP), generating a $C$-dimensional vector; the excitation step employs two fully connected layers ($W_1 \in \mathbb{R}^{C/r \times C}$ and $W_2 \in \mathbb{R}^{C \times C/r}$, with a reduction ratio of $r = 16$) and nonlinear activations (ReLU $\delta$ followed by Sigmoid $\sigma$) to adaptively learn channel-wise weights; finally, the recalibration step scales the original feature map $F$ through the Hadamard product ($\odot$), dynamically enhancing discriminative channels while suppressing redundant responses.
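The squeeze–excitation–recalibration steps correspond to a standard SE block, sketched here in PyTorch with the stated reduction ratio $r = 16$; this is a generic implementation rather than the exact code used in the experiments.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel recalibration (reduction ratio r = 16)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)   # W1
        self.fc2 = nn.Linear(channels // reduction, channels)   # W2

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                                   # squeeze: global average pooling
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))     # excitation: channel weights
        return x * w.view(b, c, 1, 1)                            # recalibration (Hadamard product)
```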
The features are further optimized using residual blocks:
$$F_{\mathrm{out}} = \mathrm{Conv}_{3\times 3}\left(\mathrm{BN}\left(\mathrm{SiLU}\left(\mathrm{Conv}_{3\times 3}(F_{\mathrm{in}})\right)\right)\right) + F_{\mathrm{in}}$$
This structure retains the original feature information while enhancing gradient flow. The final output provides complementary semantic information: P 3 (256 × 80 × 80) for high-resolution features suited to small object detection, P 4 (256 × 40 × 40) for balanced features suitable for medium objects, and P 5 (256 × 20 × 20) for semantically rich features suited for large objects. This architecture improves the mAP@50 on the water surface dataset compared to the standard FPN, demonstrating its excellent ability to handle reflection interference and scale variation.

3.2.2. Polarization-Aware Feature Enhancement Module (PAFEM)

The proposed Polarization-Aware Feature Enhancement Module (PAFEM) is specifically designed to enhance the polarization features of water surface targets. This module adopts a three-level feature extraction architecture, including a feature extraction layer, polarization feature enhancement layer, and feature fusion layer. The feature extraction layer consists of a 3 × 3 convolution (input 3→output 16 channels), batch normalization, and SiLU activation function, used to capture basic image features. The polarization feature enhancement layer further extracts polarization-related features through a 3 × 3 convolution (16→32 channels). The feature fusion layer uses 3 × 3 convolution (32→3 channels) and Sigmoid activation function to map the enhanced features back to the original input dimensions. PAFEM also integrates a channel attention mechanism (SE layer), which adaptively learns channel weights to highlight key features and suppress irrelevant ones. Finally, residual connections ensure that original information is preserved while enhancing the model’s representation capability for polarization features. PAFEM can be represented as follows:
$$F_{\mathrm{out}} = \mathrm{SE}\left(\mathrm{Conv}_{3\times 3}\left(\mathrm{SiLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{3\times 3}\left(\mathrm{SiLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{3\times 3}(F_{\mathrm{in}})\right)\right)\right)\right)\right)\right)\right) + F_{\mathrm{in}}$$
where $F_{\mathrm{in}}$ and $F_{\mathrm{out}}$ represent the input and output features, $\mathrm{Conv}_{3\times 3}$ represents the 3 × 3 convolution operation, $\mathrm{BN}$ represents batch normalization, $\mathrm{SiLU}$ represents the SiLU activation function, and $\mathrm{SE}$ represents the channel attention operation.
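A PyTorch sketch following this layer description (3 → 16 → 32 → 3 channels with channel attention and a residual connection) is given below; the realization of the SE layer on the 3-channel output is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class PAFEM(nn.Module):
    """Polarization-Aware Feature Enhancement Module (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.extract = nn.Sequential(               # feature extraction: 3 -> 16
            nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.SiLU())
        self.enhance = nn.Sequential(               # polarization feature enhancement: 16 -> 32
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.SiLU())
        self.fuse = nn.Sequential(                  # fusion back to image space: 32 -> 3
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())
        # Channel attention on the 3-channel output (reduction chosen for the sketch).
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(3, 3, 1), nn.ReLU(),
            nn.Conv2d(3, 3, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.fuse(self.enhance(self.extract(x)))
        y = y * self.se(y)                          # channel recalibration
        return y + x                                # residual connection
```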

3.2.3. Deblurring Feature Compensation Network (DFCN)

The Deblurring Feature Compensation Network (DFCN) aims to address the blurring issues in water surface images caused by wave fluctuations and lighting variations. DFCN adopts a lightweight asymmetric encoder–decoder structure. The encoder part consists of a 3 × 3 convolution (3→32 channels), a 3 × 3 convolution with stride 2 (32→64 channels), and a residual block, used to extract feature representations of blurred images. The decoder part uses a transposed convolution with stride 2 (64→32 channels) for upsampling, followed by a 3 × 3 convolution (32→3 channels) and Sigmoid activation function to reconstruct clear images. The introduction of residual blocks enhances feature extraction capability while alleviating the gradient-vanishing problem. DFCN can be represented as follows:
$$F_{\mathrm{out}} = \sigma\left(\mathrm{Conv}_{3\times 3}\left(\mathrm{SiLU}\left(\mathrm{ConvTranspose}_{3\times 3}\left(\mathrm{ResBlock}\left(\mathrm{Conv}_{3\times 3}(F_{\mathrm{in}})\right)\right)\right)\right)\right)$$
where $F_{\mathrm{in}}$ and $F_{\mathrm{out}}$ represent the input and output features, $\mathrm{Conv}_{3\times 3}$ represents the 3 × 3 convolution operation, $\mathrm{ConvTranspose}_{3\times 3}$ represents the 3 × 3 transposed convolution, $\mathrm{ResBlock}$ represents the residual block, $\mathrm{SiLU}$ represents the SiLU activation function, and $\sigma$ represents the Sigmoid activation function.
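The asymmetric encoder–decoder can be sketched as follows; the residual-block internals, padding, and output-padding choices are assumptions needed to make the example runnable.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simple residual block used in the encoder (sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class DFCN(nn.Module):
    """Lightweight asymmetric encoder-decoder deblurring network (sketch)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(),              # 3 -> 32
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),   # downsample, 32 -> 64
            ResBlock(64))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2,                 # upsample, 64 -> 32
                               padding=1, output_padding=1), nn.SiLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())           # reconstruct, 32 -> 3

    def forward(self, x):
        return self.decoder(self.encoder(x))
```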

3.2.4. Light Specular Reflection Processing Module (LSRP)

The Light Specular Reflection Processing (LSRP) module is specifically designed to suppress strong light reflections on water surfaces. This module first detects areas with brightness greater than 0.7 through a highlight suppression unit and creates a highlight mask, then calculates polarization degree information from the differences between RGB channels and average intensity. LSRP implements adaptive local threshold processing using 25 × 25 window average pooling and calculates texture features based on horizontal and vertical gradients:
$$T = |G_{x+1,y} - G_{x,y}| + |G_{x,y+1} - G_{x,y}|$$
where $G$ represents the image grayscale value and $T$ represents the texture feature. Combining the polarization degree and texture features, LSRP generates a reflection mask:
$$M = (P > \alpha \cdot P_{\mathrm{local}}) \land (T < \beta)$$
where $P$ represents the polarization degree, $P_{\mathrm{local}}$ represents the local average polarization degree, and $\alpha$ and $\beta$ are threshold parameters. Finally, reflection suppression with preservation of image details is achieved through adaptive blending of the original image and the processed image:
$$F_{\mathrm{out}} = F_{\mathrm{in}} \cdot (1 - \gamma \cdot M) + I \cdot \gamma \cdot M$$
where $F_{\mathrm{in}}$ and $F_{\mathrm{out}}$ represent the input and output features, $I$ represents the image intensity, $\gamma$ represents the detail preservation coefficient, and $M$ represents the reflection mask.
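A NumPy sketch of the masking and blending steps is given below; the threshold values, the use of a local mean as the blended intensity $I$, and the wrap-around gradient approximation are assumptions, and the initial highlight mask (brightness > 0.7) is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def lsrp(img, P, alpha=1.2, beta=0.05, gamma=0.6, window=25):
    """Light Specular Reflection Processing sketch (threshold values are assumed).

    img : grayscale intensity image in [0, 1]
    P   : per-pixel polarization degree (e.g., from the weighted color difference)
    """
    # Texture features from horizontal and vertical gradients of the grayscale image.
    T = np.abs(np.roll(img, -1, axis=1) - img) + np.abs(np.roll(img, -1, axis=0) - img)

    # Local average polarization degree via window average pooling (25 x 25 in the text).
    P_local = uniform_filter(P, size=window)

    # Reflection mask: high polarization combined with low texture.
    M = ((P > alpha * P_local) & (T < beta)).astype(float)

    # Adaptive blending of the original image with a smoothed intensity estimate.
    I = uniform_filter(img, size=window)
    return img * (1.0 - gamma * M) + I * gamma * M
```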

3.2.5. Module Integration and Collaborative Working

The proposed water surface target detection framework organically integrates the LSRP, PAFEM, and DFCN modules with the YOLOv8 backbone network to form an end-to-end detection system. As shown in Figure 4, the input image first passes through the LSRP module to suppress water surface light reflection interference, improving the contrast between targets and background; then through the PAFEM module to enhance polarization features, highlighting feature representations of targets on the water surface; next, it passes through the DFCN module to address image blurring issues, improving image clarity; finally, the processed image enters the YOLOv8 backbone network for feature extraction and target detection. The sequential processing of the three modules forms a complementary effect, jointly addressing key challenges in water surface target detection.

3.3. Learning Rate

The proposed aquatic debris detection system builds upon the YOLOv8 framework, which is optimized for wide-FOV imaging and water surface reflections. The multi-stage pipeline comprises three cascaded modules: (1) an image preprocessing unit with fisheye distortion correction and polarization-based reflection suppression, (2) a feature extraction backbone enhanced with super-resolution and deblurring networks, and (3) an object detection head with adaptive localization. During preprocessing, a fisheye distortion model rectifies wide-angle geometric aberrations while polarization imaging mitigates specular interference through Stokes parameter analysis.
The training protocol integrates OneCycleLR dynamic learning rate scheduling governed by
$$\eta_t = \eta_{\max} \cdot \left(1 - \frac{|t - T_{\mathrm{peak}}|}{T_{\mathrm{peak}}}\right), \quad 0 \le t \le 2T_{\mathrm{peak}}$$
where $\eta_t$ denotes the learning rate at step $t$, $\eta_{\max} = 1 \times 10^{-4}$ is the maximum learning rate, and $T_{\mathrm{peak}} = 60$ epochs marks the peak scheduling moment. This strategy ensures stable convergence by progressively increasing the learning rate until $T_{\mathrm{peak}}$ and then decreasing it, balancing optimization efficiency and training stability.
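The schedule reduces to a simple triangular rule, sketched below with the stated $\eta_{\max}$ and $T_{\mathrm{peak}}$ values.

```python
def triangular_lr(epoch, eta_max=1e-4, t_peak=60):
    """Learning rate at a given epoch following the triangular OneCycle-style rule."""
    return eta_max * (1.0 - abs(epoch - t_peak) / t_peak)

# Example: rises linearly to eta_max at epoch 60, then decays back toward zero by epoch 120.
schedule = [triangular_lr(e) for e in range(0, 121)]
```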
To further stabilize training, an Exponential Moving Average (EMA) mechanism is integrated. The parameter update rule is formulated as follows:
$$\theta_{\mathrm{EMA}}^{(t)} = \beta\, \theta_{\mathrm{EMA}}^{(t-1)} + (1 - \beta)\, \theta_{\mathrm{model}}^{(t)}$$
where $\theta_{\mathrm{EMA}}^{(t)}$ represents the EMA-smoothed parameters at step $t$, $\theta_{\mathrm{model}}^{(t)}$ is the current model parameter, and $\beta = 0.999$ controls the historical gradient weighting.
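The EMA update can be sketched in PyTorch as follows; the placeholder module stands in for the full detector.

```python
import copy
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)          # placeholder module standing in for the detector
ema_model = copy.deepcopy(model)     # EMA shadow copy

@torch.no_grad()
def update_ema(ema, net, beta=0.999):
    """theta_EMA <- beta * theta_EMA + (1 - beta) * theta_model, applied per parameter."""
    for p_ema, p in zip(ema.parameters(), net.parameters()):
        p_ema.mul_(beta).add_(p, alpha=1.0 - beta)

# Called once after every optimizer step during training.
update_ema(ema_model, model)
```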
The AdamW optimizer with weight decay is employed for parameter updates:
$$\theta_{t+1} = \theta_t - \eta \cdot \left(\frac{m_t}{\sqrt{v_t} + \epsilon} + \lambda\, \theta_t\right)$$
In this formulation, $\theta_t$ denotes the parameter value at step $t$, and $\theta_{t+1}$ represents the updated parameter value. The first- and second-order moment estimates $m_t$ and $v_t$, characterizing the mean and uncentered variance of the gradients, are computed with momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$. A negligible constant $\epsilon = 1 \times 10^{-8}$ ensures numerical stability, while the weight decay coefficient $\lambda = 0.01$ penalizes parameter magnitudes to suppress overfitting.
Our training configuration employed a batch size of eight with gradient accumulation over two steps, yielding an effective batch size of sixteen while alleviating GPU memory constraints. Training was conducted for up to 200 epochs, with early stopping triggered after 15 consecutive epochs of validation mAP@50 stagnation. The learning rate was initialized at $1 \times 10^{-4}$ with a five-epoch linear warmup phase to stabilize gradient statistics.
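A minimal sketch of the optimizer and warmup configuration is shown below; the placeholder module and the LambdaLR-based warmup-plus-triangular-decay schedule are assumptions used to make the snippet self-contained.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)          # placeholder module standing in for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)

# Five-epoch linear warmup, then the triangular decay sketched above (peak at epoch 60).
def lr_lambda(epoch):
    if epoch < 5:
        return (epoch + 1) / 5.0
    return max(1.0 - abs(epoch - 60) / 60.0, 0.0)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```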

3.4. Loss Function

Bounding box localization is optimized via DIoU/CIoU/GIoU loss functions. Taking DIoU as an illustrative case,
$$L_{\mathrm{DIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2}$$
The Intersection-over-Union (IoU) metric quantifies the degree of overlap between a predicted bounding box and its corresponding ground truth bounding box, with higher values indicating greater alignment. Mathematically, IoU is defined as the ratio of the intersection area to the union area of the two bounding boxes.
The term $\rho^2(b, b^{gt})$ represents the squared Euclidean distance between the centroids of the predicted bounding box $b$ and the ground truth bounding box $b^{gt}$; here, $\rho$ denotes the distance between the two centroids, spatially quantifying their positional discrepancy. The parameter $c$ corresponds to the diagonal length of the smallest enclosing box that contains both $b$ and $b^{gt}$, so $c^2$ normalizes the centroid distance by the scale of the region covering both boxes.
Therefore, the DIoU loss function comprises two components: the conventional IoU loss term $(1 - \mathrm{IoU})$, which aims to maximize bounding box overlap, and a second term, $\rho^2(b, b^{gt}) / c^2$, which penalizes the normalized squared Euclidean distance between the predicted and ground truth box centroids to reduce spatial displacement. Through this formulation, DIoU not only prioritizes overlap optimization but also mitigates relative positional errors between bounding boxes.
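A generic DIoU loss for corner-format boxes, consistent with the formulation above, can be sketched as follows; it is a standard implementation rather than the exact code used in training.

```python
import torch

def diou_loss(pred, target, eps=1e-7):
    """DIoU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4) (sketch)."""
    # Intersection and union areas.
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between box centroids.
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    ct = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((cp - ct) ** 2).sum(dim=1)

    # Squared diagonal of the smallest box enclosing both boxes.
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    return 1.0 - iou + rho2 / c2
```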
Concurrently, the system employs cuDNN-accelerated mixed-precision training to enhance computational efficiency.
Furthermore, after addressing geometric distortions in peripheral regions caused by wide-angle imaging and feature ambiguity induced by multi-source optical interference, the system must resolve additional challenges, including insufficient fine-grained classification capability and image blurring. To further improve image clarity, this paper proposes a deblurring network based on multi-scale features and edge enhancement. The network adopts an encoder-decoder architecture, with its core processing flow formalized as follows:
$$F_{\mathrm{out}} = \mathrm{Dec}\left(\mathrm{Res}\left(\mathrm{Enc}(F_{\mathrm{ms}})\right)\right)$$
Here, $F_{\mathrm{ms}}$ denotes the output of multi-scale feature extraction, $\mathrm{Enc}$ and $\mathrm{Dec}$ represent the encoding and decoding operations, respectively, and $\mathrm{Res}$ indicates the attention-enhanced residual module. To fully leverage multi-scale image information, the network first extracts features via multi-scale convolutions:
$$F_{\mathrm{ms}} = \mathrm{Concat}\left(\{\mathrm{Conv}_k(I)\}\right), \quad k \in \{3, 5, 7\}$$
where $\mathrm{Conv}_k$ denotes a convolution operation with kernel size $k$, and $\mathrm{Concat}$ represents feature concatenation. During residual processing, the network incorporates six attention-enhanced residual blocks to amplify critical feature representations.
To further enhance edge details, the network integrates a Sobel operator-based edge enhancement module. This module extracts edge features using vertical and horizontal edge detection kernels:
$$E_v = F * K_v, \qquad E_h = F * K_h$$
$$E = \sqrt{E_v^2 + E_h^2}$$
where $K_v$ and $K_h$ are the vertical and horizontal edge detection kernels, respectively, and $*$ denotes the convolution operation. The final reconstructed image is obtained by fusing decoded features with edge-enhanced features:
$$O = \tanh\left(\mathrm{Conv}_2\left(\mathrm{PReLU}\left(\mathrm{Conv}_1(\mathrm{Dec}_f + E)\right)\right)\right)$$
Here, $\mathrm{Dec}_f$ represents the decoder’s reconstructed feature output. $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are convolutional layers with 32 and 3 channels, respectively, while $\mathrm{PReLU}$ and $\tanh$ serve as activation functions to ensure nonlinear feature transformations and constrain output values within a reasonable range. The integration of multi-scale feature extraction and edge enhancement improves image clarity.
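The Sobel-based edge magnitude can be sketched in PyTorch as follows; the kernel values are the standard Sobel operators, and the tensor layout is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def sobel_edges(gray):
    """Edge magnitude via horizontal and vertical Sobel kernels (sketch).

    gray : tensor of shape (N, 1, H, W)
    """
    k_h = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)    # horizontal-gradient kernel
    k_v = k_h.transpose(2, 3)                               # vertical-gradient kernel
    e_h = F.conv2d(gray, k_h, padding=1)
    e_v = F.conv2d(gray, k_v, padding=1)
    return torch.sqrt(e_h ** 2 + e_v ** 2 + 1e-12)          # edge magnitude E
```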
By tightly coupling these innovative modules, the proposed framework effectively addresses key technical challenges in aquatic debris detection. The experimental results demonstrate that the system outperforms baseline methods in detection accuracy, robustness against interference, and real-time performance.

4. Description of PSF-IMG Dataset

The scarcity of image data capturing water surface polarization characteristics and the lack of standardized polarized light interference datasets have significantly hindered progress in domain-specific challenges. As shown in Figure 5, to mitigate the influence of polarized light refraction, we developed the PSF-IMG dataset. Raw images were rigorously screened to remove blurred, underexposed, or polarization-deficient samples. The remaining images formed a curated dataset with comprehensive annotations. We prioritized capturing views with pronounced polarized light refraction, ensuring ample representation of anomalous polarization distributions caused by optical refraction. This approach enables in-depth evaluation of algorithm adaptability and robustness under complex optical interference. The dataset was divided into training, validation, and test sets, ensuring high image fidelity and annotation accuracy. As a specialized resource for USV-based debris detection under polarized light interference, PSF-IMG supports training and evaluation of deep learning object detection models. It includes common floating debris categories that exhibit typical distribution patterns and high occurrence in aquatic environments. By providing high-quality images and precise annotations, the dataset offers reliable support for systematic studies of optical characteristics and interference mechanisms in water surface polarization, filling a critical gap in polarization-specific resources.
Figure 6a illustrates the data acquisition process, in which a high-resolution camera was used under strictly controlled environmental conditions to ensure stability and representativeness. Sampling diversity was achieved across key parameters—acquisition timestamps, viewing angles, and water surface conditions—to enhance dataset comprehensiveness.
Figure 6b shows that target regions were meticulously annotated according to the PASCAL VOC format standard [26], ensuring data usability and scientific rigor for subsequent model training and evaluation. All images were then processed by a standardized pipeline—including denoising, grayscale normalization, and color calibration—to mitigate the noise contamination and uneven illumination introduced during acquisition. This preprocessing protocol ensures stable input data and enhances feature extraction effectiveness in downstream models.

Dataset Analysis

Figure 7 illustrates the sample distribution of the PSF-IMG dataset, showing balanced representation across debris categories such as plastic bottles, plastic bags, and dead branches. It also visualizes dataset coverage of diverse weather conditions—including sunny, rainy, and high-reflection scenarios—using a pie chart to highlight adaptability to complex optical interference.
As shown in Table 1 and Table 2, our PSF-IMG dataset adopts a 6:2:2 partition strategy, with 9000 training samples to fully learn complex features such as water surface reflection, 3000 validation samples to ensure reliable hyperparameter tuning (confidence interval ± 1.2%), and 3000 test samples to ensure evaluation objectivity. Stratified sampling maintains consistent class and environment distributions across the subsets; combined with FP16 acceleration and an early stopping mechanism (15-epoch patience), the scheme can be trained within 12 h on an A100 GPU, making it especially suitable for detecting small and micro debris on the water surface under complex optical interference.
In aquatic environments, reflections from the water surface exhibit distinct polarization characteristics due to the Brewster effect. This phenomenon causes incident unpolarized light to reflect primarily as s-polarized light (with its electric field vector parallel to the water surface), corresponding to a polarization angle of approximately 90°. As clearly shown in Figure 8, the pixel polarization angles across the entire PSF-IMG dataset (global), training set, and validation set all exhibit a pronounced peak near 90°. This strong peak in the distribution confirms the dominance of s-polarized reflections in our collected images, aligning with fundamental optical principles. Representative subsets capturing this dominant polarization state were strategically allocated to the training and validation sets. Additionally, Table 1 summarizes the dataset’s key parameters, underscoring its scientific rigor and practical utility for engineering applications.

5. Experimental Results

Figure 9 presents the hull parameters and hardware configuration of the unmanned vessel system used in this experiment. The system is equipped with a high-resolution camera and a BeiDou positioning module for real-time detection and localization of debris on the water surface. The communication module employs a 3DR data transmission station, utilizing wireless technology to enable real-time data exchange between the flight control system—based on the ArduPilot open-source framework—and the ground station. The module supports a maximum communication range of up to 5 km, offering strong anti-interference capabilities and a high data update rate of 200 Hz, thereby satisfying remote control requirements. In experimental tests, the unmanned vessel operated at 24 V and was equipped with dual-redundant electronic speed controllers (ESCs) and a waterproof electronics cabin to ensure stability in complex hydrological conditions. The hull is constructed from lightweight composite materials. The vessel measures 1.5 m × 0.8 m × 0.6 m (length × width × height), has a draft of 0.15 m, achieves a maximum speed of 3 m/s, and can operate for at least four hours. In a representative section of the test area, surface debris density was controlled at 2–5 items per section, including common debris types such as plastic, foam, and dead branches, to evaluate the system’s adaptability under varied conditions. Meanwhile, a camera and a LiDAR (Light Detection and Ranging) system are mounted on the bow mast of the unmanned surface vehicle (USV) to capture navigation scenarios in inland waters. However, the dynamic range encountered in real-world inland aquatic environments may exceed the capabilities of conventional imaging systems.

Algorithmic Analysis

Figure 10 presents the detection performance of the baseline YOLOv8 and the enhanced YOLOv8-PDL models in highly reflective scenes. The enhanced model incorporates a Polarization-Aware Feature Enhancement Module (PAFEM) that uses multispectral fusion to suppress specular reflection artifacts. Additionally, the FPN-SE architecture dynamically calibrates channel-wise attention weights within the feature pyramid, enhancing discriminative feature extraction for low-texture targets. Moreover, the lightweight DeblurModule employs a multi-scale residual learning strategy to achieve high-fidelity reconstruction of edge details in motion-blurred objects (e.g., plastic bottle contours).
Table 3 presents the comparison of average precision (AP) metrics across different models under identical preprocessing conditions. AP-s, AP-m, and AP-l denote detection accuracies for small (area < $32^2$ pixels), medium ($32^2 \le$ area $\le 96^2$ pixels), and large targets (area > $96^2$ pixels), respectively. In USV application scenarios, aquatic floating debris typically exhibits small-target characteristics: constrained by the wide-angle imaging distance, typical floating objects such as plastic bottles and bags occupy limited pixel regions in images, and their low-resolution features are susceptible to polarization interference and wave-induced motion blur. Statistics from the PSF-IMG dataset reveal that small targets constitute 68% of the samples (as shown in Figure 7), making the improvement in AP-s a critical indicator of model practicality. The proposed YOLOv8-PDL enhances edge feature extraction for small targets through the Polarization-Aware Feature Enhancement Module (PAFEM) and Lightweight Deblurring Pipeline (LDP), achieving a 25% increase in AP-s (0.25) compared to the baseline YOLOv8 (0.20). This enhancement in capturing low-visibility floating debris is crucial for enabling multimodal collaborative detection combining mmWave radar and visual sensors.
As shown in Table 3, the proposed YOLOv8-PDL model demonstrates advantages under complex optical interference conditions. Compared with traditional two-stage detection frameworks (e.g., Faster R-CNN [24] and Cascade R-CNN [27]), single-stage detectors (e.g., SSD [25] and the YOLO series) achieve better real-time performance through end-to-end feature learning, but still face limitations in small object detection (AP-s) and occluded scenarios (AP-m). Specifically, Faster R-CNN suffers from insufficient inference speed due to the high computational complexity of its Region Proposal Network (RPN), while SSD’s multi-scale feature fusion mechanism is prone to feature confusion under dynamic wave interference. EfficientDet-D1 [28] balances accuracy and speed through compound scaling strategies, but its attention mechanism shows inadequate sensitivity to polarized reflections. RetinaNet [29] addresses class imbalance through focal loss but leaves room for improvement in bounding box regression accuracy (AP75). Our method suppresses specular reflection artifacts through the Polarization-Aware Feature Enhancement Module (PAFEM), enhances channel attention mechanisms via the Dynamic Feature Calibration Network (DFCN), and optimizes computational efficiency with the Lightweight Deblurring Pipeline (LDP), achieving a higher AP75 than the baseline YOLOv8 while maintaining its real-time performance.
To evaluate the effectiveness of the proposed methodology, comprehensive ablation studies were performed. Figure 11 illustrates the impact of various module combinations on model performance, showing trends in key metrics such as mAP50 and mAP50-95. Beginning with the baseline YOLOv8 architecture, the sequential integration of the proposed modules led to gradual improvements in performance.
The baseline model (purple curve) initially demonstrated relatively low performance, which gradually improved as training progressed. The integration of an enhanced backbone network (dark blue curve) improved feature extraction capabilities and yielded measurable gains in mAP50. Subsequent addition of the FPN-SE module (green curve) further improved localization accuracy by leveraging spatial attention mechanisms. Optimization of the detection heads (light green curve) led to improved detection precision. Finally, the introduction of the DeblurModule (yellow curve) increased model robustness to motion blur and lighting variations, leading to superior overall performance.
Figure 11 presents the model’s precision–recall characteristics. The training curves indicate that the proposed modules collectively enhance both detection precision and recall. Notably, in the later training stages (epochs 60–100), the enhanced model maintained high precision while steadily improving recall, indicating fewer missed detections without compromising false-positive suppression. The DeblurModule (yellow curve) played a crucial role in enhancing detection reliability under complex conditions.
Quantitative analysis from the ablation studies highlights the individual contribution of each module to overall performance improvement. These results collectively validate the effectiveness of the proposed enhancements, particularly in addressing target detection challenges under complex lighting conditions.
To quantitatively evaluate module contributions, Table 4 summarizes the convergence performance of different model variants on the validation set. As shown in Figure 12, the baseline model achieves 76.1% mAP50, which improves to 76.6% after incorporating the enhanced backbone network (a +0.5 percentage point gain), demonstrating the effectiveness of multi-scale polarization feature extraction. The FPN-SE module increases mAP50-95 by 0.9 percentage points (from 33.7% to 34.6%) through channel attention mechanisms, while simultaneously boosting precision and recall by 2.4 and 2.2 percentage points, respectively, proving its capability to enhance feature stability under wave disturbances. The optimized detection head delivers improvement, elevating mAP50 by 1.8 percentage points to 78.4% and recall by 1.9 percentage points through refined feature processing. The deblurring module further enhances robustness, contributing a final +0.2 percentage point gain in mAP50 (reaching 78.6%) and +0.5 percentage points in recall. The full model (YOLOv8-PDL) ultimately achieves 78.6% mAP50, representing a 2.5 percentage point absolute improvement (3.3% relative gain) over the baseline, with +2.9 percentage points in mAP50-95 and +4.1% in recall, confirming the necessity of multi-module collaborative optimization.
To rigorously validate the model’s robustness across diverse environments, we conducted additional evaluations on two benchmark datasets: FLOW (CVPR2021) [3] and SeaDronesSee [30]. As shown in Table 5, our model demonstrates consistent performance with mAP@[50-95] of 0.39 and 0.37, respectively, maintaining comparable detection precision (AP50: 0.82/0.79) to the PSF-IMG dataset results. Notably, the AP75 metrics (0.42/0.39) reveal stable localization accuracy under different imaging conditions. The small-target detection capabilities (AP-S: 0.23/0.21) further confirm the effectiveness of our polarization-aware feature enhancement strategy in handling scale variations across datasets.
On the PSF-IMG test set, we performed a detailed evaluation of the model’s detection capability at different confidence thresholds and plotted the Receiver Operating Characteristic (ROC) curve (see Figure 13). The area under the curve (AUC) is 0.66, indicating that the model has moderate performance in balancing the true positive rate (TPR) and false-positive rate (FPR)—when the threshold is relaxed from high to low, the TPR slowly rises from 0 to 1.0, while the FPR gradually expands from 0 to 1.0. Specifically, at an FPR of ≈0.20, the model achieves a TPR of ≈0.21; at an FPR of ≈0.50, the TPR is ≈0.82; and at an FPR of ≈0.85, the TPR is close to 1.0. Compared with random guessing (red diagonal), the model can still correctly detect more than 80% of the real targets at a false-positive rate of about 50%, indicating that it remains sufficiently sensitive while keeping false alarms at a manageable level during continuous USV cruising.

6. Discussion

This study presents a systematic enhancement of YOLOv8 for water-surface debris detection by integrating polarization-aware feature fusion, multi-scale attention mechanisms, and a lightweight deblurring pipeline. Previous approaches to glare and reflection suppression on water surfaces have relied on adaptive or multi-exposure imaging techniques [6,7,8], polarization filtering combined with polynomial fitting [5], or traditional HDR correction methods [4], each of which falls short under dynamic wave conditions, high refraction angles, or multiple scattering. In contrast, our method fuses polarization cues within visible-spectrum features, applies channel-wise attention recalibration, and restores motion-blur details via super-resolution, achieving a favorable balance between high detection precision and real-time performance under complex optical disturbances. Specifically, the overall mAP@[50–95] increases from 0.37 to 0.41 (+10.8%), AP@75 jumps from 0.27 to 0.45 (+66.7%), and small- and medium-object detection improve by 25% and 5%, respectively. Ablation studies further underscore the pivotal roles of each module in glare suppression, enhanced separability of low-texture targets, and detail recovery. When benchmarked against established detectors such as Faster R-CNN [24] and SSD [25], YOLOv8-PDL delivers superior localization precision and small-object recall while maintaining 48 FPS real-time inference. Overall, our approach offers a robust and efficient visual detection solution for marine and inland water debris monitoring.

6.1. Analysis of Key Performance Improvements

As summarized in Table 3, integrating our polarization-driven learning enhancements into the YOLOv8 backbone raises the overall mAP@[50–95] from 0.37 to 0.41, representing a 10.8% relative gain. The most striking improvement is observed at the stricter IoU threshold: AP@75 increases from 0.27 to 0.45 (a 66.7% relative gain). Although AP@50 shows a slight decrease from 0.86 to 0.84, this can be attributed to the polarization-suppression module’s filtering of high-confidence false positives—such as specular reflections on water surfaces misidentified as debris—thereby achieving a more favorable precision–recall balance. Furthermore, performance on multi-scale targets is demonstrably enhanced: small-object detection (AP-s) increases from 0.20 to 0.25 (a 25% relative improvement), and medium-object detection (AP-m) from 0.40 to 0.42 (a 5% relative improvement), validating the efficacy of our feature-enhancement strategies under complex optical conditions.

6.2. Module-Level Ablation Study

Our ablation experiments show that sequentially adding the three proposed modules elevates mAP@[50–95] from 0.37 at baseline to 0.41 in the final configuration. The Polarization-Aware Feature Enhancement Module (PAFEM) accounts for the majority of this gain, improving mAP@[50–95] from 0.37 to 0.39 by effectively integrating polarization cues with visible-spectrum features to suppress glare artifacts (e.g., noon-time sun glint) while preserving critical object contours. Incorporating the FPN-SE architecture further boosts mAP@[50–95] to 0.40 by dynamically reallocating channel weights toward low-texture targets—such as foam boards—thus enhancing inter-class separability, as evidenced by a threefold increase in attention-weight metrics. Finally, the Lightweight Deblurring Pipeline (LDP) delivers the remaining improvement to 0.41 by employing an edge-preservation strategy that yields a 4.2 dB PSNR increase in contour reconstruction for motion-blurred objects (e.g., wave-distorted plastic bottles; Figure 9), thereby strengthening localization accuracy under dynamic conditions.

6.3. Comparison with Established Detection Frameworks

Under identical preprocessing and evaluation protocols, our YOLOv8-PDL model achieves AP = 0.41 (AP@50 = 0.84) compared to Faster R-CNN’s AP = 0.18 (AP@50 = 0.35), an absolute gain of 0.23 AP that underscores the superiority of learned polarization filtering versus conventional region-proposal approaches for specular-noise suppression. When compared to SSD—which attains AP@50 = 0.42—YOLOv8-PDL demonstrates markedly higher strict-threshold precision (AP@75 of 0.45 versus SSD’s 0.10) while maintaining robust recall for small objects. These results confirm that the synergy of polarization-informed feature suppression and architectural optimization substantially enhances debris-detection robustness and precision in challenging aquatic environments.

7. Conclusions

In this paper, we present an enhanced YOLOv8-based multi-mode detection system for aquatic debris that aims to address key challenges in USV-based surface monitoring, namely wide-angle distortion, optical interference, and motion blur. The proposed cascaded architecture—integrating a fisheye distortion correction module and a polarization feature enhancement network (PFEN)—reduces edge detection errors and suppresses specular reflection artifacts across a wide field of view. Our novel dual-stream residual DeblurModule, equipped with spatiotemporal attention mechanisms, improves recall of blurred targets under wave disturbances compared to conventional single-branch designs. Experimental results on the self-constructed PSF-IMG dataset show competitive detection accuracy, confirming the system’s efficacy in complex aquatic environments. This work offers an engineering framework for intelligent aquatic monitoring, advancing environmentally sustainable water conservation technologies.

Author Contributions

Conceptualization, Y.C. and P.Z.; Methodology, Y.C.; Software, H.L.; Validation, H.L. and P.Z.; Formal analysis, L.X. and M.Z.; Investigation, Y.C.; Resources, L.X. and M.Z.; Data curation, H.L.; Writing—original draft, Y.C. and H.L.; Writing—review & editing, Y.C. and H.L.; Visualization, L.X. and M.Z.; Project administration, P.Z.; Funding acquisition, L.X. and P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Scientific Research Development Fund of Fujian University of Technology, Project No. GY-Z24133.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available on request due to restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gola, K.K.; Singh, B.; Singh, M.; Srivastava, T. PlastOcean: Detecting Floating Marine Macro Litter (FMML) Using Deep Learning Models. In Proceedings of the International Conference on Intelligent Systems Design and Applications, Malaga, Spain, 11–13 September 2023. [Google Scholar]
  2. Heping, Y.; Bo, Z.; Bijin, L. Multi-target floating garbage tracking algorithm for cleaning ships based on YOLOv5-Byte. Chin. J. Ship Res. 2025, 20, 1–12. [Google Scholar]
  3. Cheng, Y.; Zhu, J.; Jiang, M.; Fu, J.; Pang, C.; Wang, P.; Sankaran, K.; Onabola, O.; Liu, Y.; Liu, D.; et al. Flow: A dataset and benchmark for floating waste detection in inland waters. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10953–10962. [Google Scholar]
  4. Kay, S.; Hedley, J.D.; Lavender, S. Sun glint correction of high and low spatial resolution images of aquatic scenes: A review of methods for visible and near-infrared wavelengths. Remote Sens. 2009, 1, 697–730. [Google Scholar] [CrossRef]
  5. Yang, M.; Zhao, P.; Feng, B.; Zhao, F. Water surface Sun glint suppression method based on polarization filtering and polynomial fitting. Laser Optoelectron. Prog. 2021, 58, 125–136. [Google Scholar]
  6. Shi, W.; Quan, H.; Kong, L. Adaptive specular reflection removal in light field microscopy using multi-polarization hybrid illumination and deep learning. Opt. Lasers Eng. 2025, 186, 108839. [Google Scholar] [CrossRef]
  7. Yang, Y.; Ma, W.; Zheng, Y.; Cai, J.F.; Xu, W. Fast single image reflection suppression via convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 8141–8149. [Google Scholar]
  8. Amer, K.O.; Elbouz, M.; Alfalou, A.; Brosseau, C.; Hajjami, J. Enhancing underwater optical imaging by using a low-pass polarization filter. Opt. Express 2019, 27, 621–643. [Google Scholar] [CrossRef] [PubMed]
  9. Guan, R.; Yao, S.; Zhu, X.; Man, K.L.; Lim, E.G.; Smith, J.; Yue, Y.; Yue, Y. Achelous: A fast unified water-surface panoptic perception framework based on fusion of monocular camera and 4d mmwave radar. In Proceedings of the 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Dalian, China, 2–5 October 2023; pp. 182–188. [Google Scholar]
  10. Kundu, S.; Ming, J.; Nocera, J.; McGregor, K.M. Integrative learning for population of dynamic networks with covariates. NeuroImage 2021, 236, 118181. [Google Scholar] [CrossRef] [PubMed]
  11. Abdlaty, R.; Orepoulos, J.; Sinclair, P.; Berman, R.; Fang, Q. High throughput AOTF hyperspectral imager for randomly polarized light. Photonics 2018, 5, 3. [Google Scholar] [CrossRef]
  12. Wang, K.; Lam, E.Y. Deep learning phase recovery: Data-driven, physics-driven, or a combination of both? Adv. Photonics Nexus 2024, 3, 056006. [Google Scholar] [CrossRef]
  13. Cao, Z.; Sun, S.; Wei, J.; Liu, Y. Dispersive optical activity for spectro-polarimetric imaging. Light. Sci. Appl. 2025, 14, 90. [Google Scholar] [CrossRef] [PubMed]
  14. Wang, F.; Bian, Y.; Wang, H.; Lyu, M.; Pedrini, G.; Osten, W.; Barbastathis, G.; Situ, G. Phase imaging with an untrained neural network. Light. Sci. Appl. 2020, 9, 77. [Google Scholar] [CrossRef] [PubMed]
  15. Tyo, J.S.; Goldstein, D.L.; Chenault, D.B.; Shaw, J.A. Review of passive imaging polarimetry for remote sensing applications. Appl. Opt. 2006, 45, 5453–5469. [Google Scholar] [CrossRef] [PubMed]
  16. Powell, S.B.; Gruev, V. Calibration methods for division-of-focal-plane polarimeters. Opt. Express 2013, 21, 21039–21055. [Google Scholar] [CrossRef] [PubMed]
  17. Wang, S.y.; Qu, Z.; Gao, L.y. Multi-spatial pyramid feature and optimizing focal loss function for object detection. IEEE Trans. Intell. Veh. 2023, 9, 1054–1065. [Google Scholar] [CrossRef]
  18. Zhang, H.; Fu, W.; Li, D.; Wang, X.; Xu, T. Improved small foreign object debris detection network based on YOLOv5. J. Real-Time Image Process. 2024, 21, 21. [Google Scholar] [CrossRef]
  19. Yang, H.; Chen, Y.; Pan, Y.; Yao, T.; Chen, Z.; Ngo, C.W.; Mei, T. Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models. In Proceedings of the 32nd ACM International Conference on Multimedia, Beijing, China, 21–25 October 2024; pp. 6870–6879. [Google Scholar]
  20. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  21. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  22. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar]
  23. Li, Y.; Wang, Y.; Xie, J.; Zhang, K. Unmanned Surface Vehicle Target Detection Based on LiDAR. In Proceedings of the International Conference on Autonomous Unmanned Systems, Bangkok, Thailand, 20–22 November 2023; pp. 112–121. [Google Scholar]
  24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  25. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  26. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  27. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  28. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  29. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  30. Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. Seadronessee: A maritime benchmark for detecting humans in open water. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Salt Lake City, UT, USA, 16–20 January 2022; pp. 2260–2270. [Google Scholar]
Figure 1. Problems with acquiring images under the influence of water surface polarization.
Figure 2. The pipeline of the proposed water surface target detection framework.
Figure 3. Schematic diagram of a multi-layer feature fusion structure combined with attention mechanisms.
Figure 4. The architecture of the proposed water surface target detection framework with LSRP, PAFEM, and DFCN modules.
Figure 5. A sample of floating debris under polarization interference on the water surface.
Figure 6. Schematic of the data acquisition and feature enhancement process: (a) the dataset collection scenario using a polarized-light imaging system under specific lighting conditions, in which a ground-station program controls the vessel's motion, triggers image acquisition, and stores the captured images on a computer; (b) the target mask annotation used for feature representation enhancement.
Figure 7. Dataset composition analysis, where (a) depicts the sample quantity disparities across five debris categories, while (b) illustrates the environmental condition composition, highlighting illumination and weather variations as key factors in data collection.
Figure 8. Image polarization angle distribution and dataset statistics: (a) global distribution of the original image’s polarization angle; (b) training set distribution; (c) validation set distribution. The peak at 90° corresponds to s-polarized reflections parallel to the water surface, consistent with the Brewster effect. Angular units in degrees (°).
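Figure 8 summarizes the distribution of polarization angles in the dataset. For readers reproducing such statistics, the angle and degree of linear polarization are commonly estimated from intensities recorded behind 0°, 45°, 90°, and 135° polarizers via the linear Stokes parameters. The sketch below is a minimal, generic NumPy implementation under that assumption; the array names are hypothetical and it does not claim to reproduce the paper's exact acquisition pipeline.

```python
# Generic Stokes-based estimate of angle (AoLP) and degree (DoLP) of linear
# polarization from four polarizer-oriented intensity images (hypothetical inputs).
import numpy as np

def polarization_maps(i0, i45, i90, i135):
    # Cast to float to avoid integer overflow when summing 8-bit images.
    i0, i45, i90, i135 = (np.asarray(a, dtype=np.float64) for a in (i0, i45, i90, i135))
    s0 = 0.5 * (i0 + i45 + i90 + i135)        # total intensity
    s1 = i0 - i90                             # 0°/90° linear component
    s2 = i45 - i135                           # 45°/135° linear component
    aolp = 0.5 * np.arctan2(s2, s1)           # angle of linear polarization, radians
    dolp = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-6)
    return np.degrees(aolp), dolp

# A histogram similar in spirit to Figure 8 could then be built with
# np.histogram(aolp_deg, bins=180, range=(-90, 90)).
```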
Figure 9. The actual working scene of the unmanned surface vehicle: (a) the unmanned ship we developed, running the proposed algorithm, during operation; (b) the onboard equipment (camera and LiDAR), whose dual-modal design enhances target positioning and operational autonomy.
Figure 10. Confidence comparison between YOLOv8 and YOLOv8-PDL in the actual detection process.
Figure 11. Schematic diagram of the ablation experiment: progressive enhancement effects of stacked modules on YOLOv8n, showing accelerated mAP50 growth with the 3D convolutional backbone and stabilized precision after detection-head refinement.
Figure 12. Schematic diagram of the ablation experiment: independent technical contributions demonstrating metric convergence from backbone expansion, multi-scale feature fusion synergy, and frequency-domain recovery for motion-blurred targets.
Figure 13. Receiver Operating Characteristic (ROC) analysis of floating object detection performance. The blue solid line (AUC = 0.66) represents the model proposed in this paper, and the red dotted line is the random guess baseline.
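Figure 13 reports an ROC curve with AUC = 0.66 against a random-guess baseline (AUC = 0.5). The sketch below shows how such a curve can be computed from per-detection confidence scores and binary ground-truth labels; the scores, labels, and variable names are placeholder assumptions, not values from the paper.

```python
# Minimal ROC/AUC sketch with scikit-learn and matplotlib (placeholder data).
import numpy as np
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])     # 1 = true floating object
y_score = np.array([0.9, 0.4, 0.7, 0.3, 0.6, 0.8, 0.2, 0.5, 0.65, 0.35])  # detector confidences

fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, "b-", label=f"model (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], "r--", label="random guess")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```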
Table 1. Sample distribution by environmental conditions.
Environmental Condition | Sample Size | Percentage (%)
Sunny (noon) | 2550 | 17.0
Sunny (evening) | 3450 | 23.0
Cloudy (noon) | 3000 | 20.0
Cloudy (evening) | 3000 | 20.0
Foggy | 1950 | 13.0
Rainy | 1050 | 7.0
Total | 15,000 | 100.0
Table 2. Sample distribution of debris categories.
Debris Category | Sample Size | Percentage (%)
Plastic bottle | 1354 | 7.0
Plastic bag | 2354 | 12.3
Branch | 3012 | 15.7
Paper | 3564 | 18.6
Can | 3578 | 18.6
Other debris | 5344 | 27.8
Total | 19,206 | 100.0
Table 3. Comparison of AP metrics between current mainstream models and YOLOv8-PDL under the same preprocessing conditions.
Method | AP | AP50 | AP75 | AP-s | AP-m | AP-l
Faster R-CNN | 0.18 | 0.35 | 0.08 | 0.03 | 0.30 | 0.85
SSD | 0.20 | 0.42 | 0.10 | 0.05 | 0.25 | 0.75
EfficientDet-D1 | 0.25 | 0.45 | 0.12 | 0.08 | 0.35 | 0.78
Mask R-CNN | 0.22 | 0.38 | 0.11 | 0.04 | 0.28 | 0.88
Cascade R-CNN | 0.16 | 0.39 | 0.10 | 0.04 | 0.22 | 0.89
RetinaNet | 0.22 | 0.48 | 0.15 | 0.06 | 0.20 | 0.92
YOLOv8 | 0.37 | 0.86 | 0.27 | 0.20 | 0.40 | 0.55
YOLOv8-PDL | 0.41 | 0.84 | 0.45 | 0.25 | 0.42 | 0.55
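The AP, AP50, AP75 and size-stratified AP-s/AP-m/AP-l columns in Table 3 follow the COCO evaluation protocol. As a hedged illustration of how such numbers are typically produced, the sketch below runs pycocotools on COCO-format ground-truth and detection files; the file names are hypothetical placeholders, not artifacts released with this paper.

```python
# Sketch: COCO-style AP evaluation for bounding-box detections (hypothetical file paths).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("psf_img_val_annotations.json")           # ground-truth annotations (placeholder)
coco_dt = coco_gt.loadRes("yolov8_pdl_detections.json")  # exported detections (placeholder)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.50:.95], AP50, AP75, and small/medium/large AP
```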
Table 4. Ablation study on model components.
Model Configuration | mAP50 | mAP50-95 | Precision | Recall
Baseline | 0.761 | 0.337 | 0.679 | 0.649
+ Enhanced Backbone | 0.766 | 0.337 | 0.691 | 0.660
+ FPN-SE | 0.766 | 0.346 | 0.703 | 0.671
+ LSRP | 0.784 | 0.346 | 0.703 | 0.690
Full Model (PDL) | 0.786 | 0.366 | 0.715 | 0.690
Table 5. Cross-dataset evaluation results of YOLOv8-PDL.
Dataset | mAP@[50-95] | AP50 | AP75 | AP-s | AP-m | AP-l
FloW (ICCV 2021) | 0.39 | 0.82 | 0.42 | 0.23 | 0.40 | 0.53
SeaDronesSee | 0.37 | 0.79 | 0.39 | 0.21 | 0.38 | 0.52
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
