Diffusion-Enhanced Underwater Debris Detection via Improved YOLOv12n Framework

Tao, Jianghan; Zhao, Fan; Chen, Yijia; Liu, Yongying; Xue, Feng; Song, Jian; Wu, Hao; Chen, Jundong; Li, Peiran; Xu, Nan

doi:10.3390/rs17233910

Open AccessArticle

Diffusion-Enhanced Underwater Debris Detection via Improved YOLOv12n Framework

by

Jianghan Tao

¹,

Fan Zhao

^2,*

,

Yijia Chen

²,

Yongying Liu

²

,

Feng Xue

³,

Jian Song

²,

Hao Wu

⁴,

Jundong Chen

⁵,

Peiran Li

⁶ and

Nan Xu

⁷

¹

Graduate School of Global Environmental Studies, Sophia University, Tokyo 102-8554, Japan

²

Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa 277-8561, Japan

³

Graduate School of Information, Production and Systems, Waseda University, Kitakyushu 808-0135, Japan

⁴

Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0033, Japan

⁵

Data Science and AI Innovation Research Promotion Center, Shiga University, Hikone 522-8522, Japan

⁶

Hitotsubashi Institute for Advanced Study, Hitotsubashi University, Tokyo 186-0004, Japan

⁷

Department of Urban Informatics, School of Architecture and Urban Planning, Shenzhen University, Shenzhen 518060, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(23), 3910; https://doi.org/10.3390/rs17233910

Submission received: 23 September 2025 / Revised: 27 November 2025 / Accepted: 30 November 2025 / Published: 2 December 2025

(This article belongs to the Special Issue Multi-Source Data Fusion and Feature Extraction for Underwater Target Detection)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

We propose UDD-YOLO, a novel framework that pioneers the use of a Cold Diffusion model for underwater image enhancement, significantly boosting debris detection accuracy by restoring structural details lost to blur and scattering.
The model achieves state-of-the-art performance while remaining lightweight, outperforming numerous advanced detectors and demonstrating superior robustness to common underwater image degradation.

What are the implications of the main findings?

The integration of diffusion models establishes a new, more effective paradigm for image enhancement in underwater vision, moving beyond traditional methods to directly address core image quality issues.
This work provides a practical and efficient solution for accurate, real-time debris detection, with direct applicability for deployment on autonomous underwater vehicles to support large-scale marine environmental monitoring.

Abstract

Detecting underwater debris is important for monitoring the marine environment but remains challenging due to poor image quality, visual noise, object occlusions, and diverse debris appearances in underwater scenes. This study proposes UDD-YOLO, a novel detection framework that, for the first time, applies a diffusion-based model to underwater image enhancement, introducing a new paradigm for improving perceptual quality in marine vision tasks. Specifically, the proposed framework integrates three key components: (1) a Cold Diffusion module that acts as a pre-processing stage to restore image clarity and contrast by reversing deterministic degradation such as blur and occlusion—without injecting stochastic noise—making it the first diffusion-based enhancement applied to underwater object detection; (2) an AMC2f feature extraction module that combines multi-scale separable convolutions and learnable normalization to improve representation for targets with complex morphology and scale variation; and (3) a Unified-IoU (UIoU) loss function designed to dynamically balance localization learning between high- and low-quality predictions, thereby reducing errors caused by occlusion or boundary ambiguity. Extensive experiments are conducted on the public underwater plastic pollution detection dataset, which includes 15 categories of underwater debris. The proposed method achieves a mAP50 of 81.8%, with 87.3% precision and 75.1% recall, surpassing eleven advanced detection models such as Faster R-CNN, RT-DETR-L, YOLOv8n, and YOLOv12n. Ablation studies verify the function of every module. These findings show that diffusion-driven enhancement, when coupled with feature extraction and localization optimization, offers a promising direction for accurate, robust underwater perception, opening new opportunities for environmental monitoring and autonomous marine systems.

Keywords:

deep learning; underwater debris detection; diffusion model; underwater imaging; object detection; YOLOv12n

1. Introduction

The escalating issue of marine and underwater debris pollution presents a profound threat to aquatic ecosystems, biodiversity, economic activities, and global water security [1,2,3,4]. Accurate and efficient monitoring of underwater debris, particularly plastics, is essential for environmental assessment, policy formulation, and the implementation of targeted cleanup initiatives. Traditionally, monitoring methods such as trawling, optical surveys via remotely operated vehicles (ROVs), and acoustic mapping have been employed for seabed assessment [5]. However, these methods are costly, labor-intensive, and spatially constrained, rendering them unsuitable for large-scale, long-term deployment. Furthermore, trawling causes significant habitat disruption [6], while optical mapping via ROVs faces visibility and mobility limitations in complex underwater terrains [7,8,9]. In light of these limitations, the geoscience and remote sensing communities are increasingly turning to automated, non-intrusive technologies. Unmanned systems and computer vision methods offer promising, cost-effective alternatives for aquatic debris monitoring [10,11,12,13,14]. The need for such innovative strategies is underscored by the growing body of legislation emphasizing long-term marine debris surveillance to support effective coastal management [15,16].

Deep learning, particularly object detection techniques, has emerged as a cornerstone for automated environmental surveillance [17,18,19,20]. The YOLO (You Only Look Once) family of single-stage detectors is widely acknowledged for its speed-accuracy trade-off [21,22,23,24], enabling efficient deployment on low-power platforms [21], including unmanned aerial vehicles (UAVs) [25,26] and amphibious underwater imaging systems such as the Aerial-aquatic Speedy Scanner (AASS) [27,28]. However, underwater object detection remains challenging because raw underwater images often suffer from severe degradation. Color distortion, low contrast, scattering-induced blur, and illumination-related noise reduce the visibility of key structures [29], leading detectors to miss objects or generate false positives. Although traditional enhancement techniques have been applied to mitigate these issues [30], they often fail to recover fine-grained details needed for reliable detection, especially in turbid or complex underwater scenes.

A number of studies have attempted to address these challenges through computer vision-based debris detection and enhanced YOLO architectures. Early methods employed acoustic sensors such as side-scan sonar, which cover wide areas but lack the resolution necessary for discriminating debris types and fail to detect soft materials like plastic bags [31]. Optical imaging on ROVs improved visual clarity, yet early applications relied heavily on manual annotation, resulting in limited efficiency and accuracy [32]. With advances in deep learning, CNNs became common for underwater object detection [33,34,35]. Two-stage models like Faster R-CNN provide strong accuracy [36] but are computationally heavy, while one-stage YOLO detectors [37,38,39,40] strike a balance between speed and performance. Many versions of YOLO have been modified for underwater tasks by using new backbones, attention modules, or feature fusion [41]. Among these enhanced YOLO variants, the latest improvement trends lie in feature alignment, multi-scale fusion, and contextual modeling. For instance, CM-YOLO [42] introduced a context-enhanced multi-scale fusion framework that aligns multi-level features to improve detection in complex scenes. CFFDNet [43] proposed a complementarity-aware multi-modal fusion design that integrates optical and SAR features to enhance cross-modal feature representation. Similar feature alignment techniques have been applied in other recent detectors, including DMFI-YOLO [44], RG-YOLO [45], and EFP-YOLO [46]. These models demonstrate that aligned and context-aware features improve robustness in object detection scenarios.

While existing enhanced YOLO studies have shown improved performance in some tasks, they still rely on high-quality input images [47,48]. Many studies use clear water datasets or basic image enhancement tools like histogram equalization or the Dark Channel Prior [49]. These tools do not work well in highly degraded conditions. Turbidity, unstable lighting, and occlusion make the image less clear and hide important features [50]. Perspective changes, messy backgrounds, and deformable waste objects also make detection harder [51]. This shows the need for effective image restoration methods, which serve as a foundation for reliable underwater object detection.

A growing number of studies have attempted to address image degradation by enhancing images prior to detection. GAN-based methods such as CycleGAN, UGAN, and their variants are commonly used to adjust color and contrast and are often paired with YOLO or Faster R-CNN. For example, [52] proposed a method that integrates a GAN-based color correction model with an object detection model to analyse the effect of enhancement on underwater object detection. Another work by [53] shows that enhanced underwater images may still yield poor detection results due to label quality issues. While these methods improve image quality, they may introduce color shifts or texture artifacts that can affect small-object detection. Transformer-based enhancement methods have also been proposed, with advantages in their self-attention mechanisms enabling them to construct a comprehensive global context [54,55,56]. However, these Transformer-based enhancement models usually require large datasets and high computational cost, and still struggle to restore fine structural details lost to scattering and blur. These limitations suggest that current GAN- and Transformer-based enhancement methods do not fully address blurring issues in small-object detection scenarios such as underwater debris detection.

Given these challenges, more flexible generative restoration frameworks have recently gained attention. Diffusion model is a new type of generative method for image enhancement [57]. Denoising Diffusion Probabilistic Models (DDPMs) learn to reverse a noise-adding process and have demonstrated success in super-resolution, inpainting, and color correction [49,58,59]. However, DDPMs rely on Gaussian noise, which does not accurately model underwater distortions. Cold Diffusion is a new version that changes this. It removes the need for random noise and uses fixed distortions like blur as the forward process. The model then learns how to undo these distortions [60]. This method fits underwater images well because blur and low contrast are common problems in these settings. Until now, few studies have used diffusion models in underwater remote sensing, especially for tasks like object detection [61]. Most work is still limited to general image tasks. This study applies Cold Diffusion in a detection pipeline. Unlike basic enhancement tools, Cold Diffusion can create lost texture and structure, making the image clearer. The detector then achieves better input and works more accurately [62]. This method helps solve the long-standing issue of poor image quality in underwater remote sensing.

To fill in the existing gaps in high-quality image enhancement and accurate object detection, this study presents UDD-YOLO, a modified YOLOv12n model for detecting underwater debris. First, it adds a Cold Diffusion model as a pre-processing step. This helps fix many of the image problems mentioned before. Then, it uses an improved detection network. This network has two new parts: (1) The original backbone is replaced with the AMC2f module. This part helps find debris of different sizes by improving multi-scale feature extraction. (2) The loss function is changed to UIoU, which improves the accuracy of bounding boxes, especially for irregular or partly hidden objects. Together, the proposed model is robust yet lightweight, which is suitable for edge-level deployment.

The main contributions of this study are as follows:

Application of diffusion models: A Cold Diffusion model is introduced as a pre-processing module as part of a complete underwater detection pipeline. This application is demonstrated to be effective in mitigating severe image degradation and boosting downstream detection accuracy.
An advanced lightweight detector: An advanced lightweight detector based on a tailored YOLOv12n architecture is proposed. The architecture incorporates an AMC2f feature extraction module and a UIoU loss function to collaboratively enhance detection performance for challenging underwater targets while keeping computational cost at a reasonable level. The effectiveness of each innovative component is validated through extensive ablation studies, and the model’s superior performance is demonstrated through comparative experiments against a wide range of mainstream detectors.
In-depth robustness analysis: A series of systematic tests was carried out to assess the model’s resilience to common underwater perturbations, including variations in brightness, Gaussian blur, scattering (haze), and lens distortion. The results confirm its stability and potential for practical field applications.

This paper is organized as follows. Section 2 describes the data and methods, including the open-source debris dataset, the proposed detection framework, its three key improvements, the training settings, and the evaluation metrics. Section 3 presents the experimental results, covering the ablation analysis, comparisons with other common algorithms, and comparisons with different image enhancement methods. Section 4 discusses the experimental findings and provides additional evaluations on model robustness and generalizability. Section 5 concludes the study and outlines potential directions for future work.

2. Materials and Methods

Figure 1 shows the overall framework of this study. A model integrating a Cold Diffusion module, an AMC2f module, and a UIoU loss function was developed. The model was then evaluated against other detectors, including YOLO variants, Faster R-CNN, RT-DETR-L, and MobileNetV2-SSD, as well as under different image enhancement methods. Its robustness was assessed under various degraded visual conditions such as brightness change, Gaussian blur, scattering, and lens distortion. Its transferability was examined through cross-dataset tests to determine whether the model could generalize to new underwater scenes.

2.1. Dataset

In this study, a public dataset for underwater plastic pollution detection [47] was used. The dataset was made for detecting and monitoring many kinds of plastic and other human-made debris underwater. It has a large set of images that show polluted aquatic environments and capture human waste in different underwater scenes. To deal with the poor quality of underwater images, the Dark Channel Prior (DCP) method was used as a pre-processing step. This improved the image contrast and made it easier to recognize and detect debris.

The dataset contains 15 classes of debris: mask, can, cellphone, electronics, glass bottle (gbottle), glove, metal, misc, net, plastic bag (pbag), plastic bottle (pbottle), general plastic, rod, sunglasses, and tire. Figure 2 presents examples with bounding boxes that indicate the position and size of objects.

For training, images were set to 640 × 640 pixels. The initial learning rate was 0.01. Optimization used Stochastic Gradient Descent (SGD) with momentum 0.937 and weight decay 0.0005. To enlarge the training set and improve model robustness, Mosaic augmentation was applied [63]. This method randomly chooses several images, applies flipping and scaling, and then combines them into one image. This created more variation in background and object layouts and helped the model learn better in real underwater scenes with clutter and occlusion.

2.2. Model Architecture

The core of UDD-YOLO is a detection network built on the lightweight YOLOv12n architecture. YOLOv12n is used as the baseline because it has a good balance of speed and performance. It is also suitable for deployment on underwater platforms with limited resources [64]. Figure 3 shows that our model keeps the main structure of YOLOv12n but adds three changes for underwater debris detection. The changes are (1) an image restoration preprocessor with Cold Diffusion to improve input quality; (2) an AMC2f module to improve multi-scale feature extraction; and (3) a Unified-IoU (UIoU) loss function to make object localization more precise.

2.2.1. Cold Diffusion Model

A fundamental challenge in underwater object detection is the degradation of image quality. As shown in Figure 2, underwater images often suffer from low contrast, color casts, and, most critically, blurring caused by light scattering and absorption. These issues obscure the essential structural details and textures of debris, significantly impairing the feature extraction capabilities of detection models like YOLOv12n and leading to reduced accuracy. Traditional enhancement techniques, such as filtering or histogram equalization, mainly adjust pixel intensity distributions, which improves contrast or color balance. However, they are unable to reconstruct lost fine structural details or severely blurred edges, resulting in limited improvement for downstream detection. To address this limitation, we introduce an advanced image enhancement preprocessor based on the principles of diffusion models. Our approach is inspired by the powerful generative framework of Denoising Diffusion Probabilistic Models (DDPM). On the basis of DDPM, a Cold Diffusion paradigm is adapted, which replaces stochastic Gaussian noise with a deterministic degradation operator that simulates the underwater blur process. By learning to reverse this specific degradation during training, the model recovers sharper edges, clearer object boundaries, and fine textures that are typically lost to scattering and absorption. Because the degradation is deterministic rather than random, this formulation matches the physical characteristics of underwater imaging more closely. In practice, the restored structural information improves the feature extraction capability of YOLOv12n, particularly for small or low-contrast debris objects. Compared with commonly used enhancement methods, Cold Diffusion not only increases perceptual clarity but also generates features that are more aligned with the detector’s feature space, leading to consistent performance gains in our experiments.

Foundational Theory: Denoising Diffusion Probabilistic Models (DDPM)

DDPM is defined as a forward diffusion process

q (x_{0})

that gradually corrupts the original data

x_{0}

over T time steps [58]. At each timestep, the data is perturbed with small noise following a predefined variance schedule, ultimately converting the clean input into an approximate Gaussian distribution. In the forward process, a clean image

x_{0}

is gradually perturbed over

T

steps by adding Gaussian noise according to a predefined variance schedule. A commonly used closed-form expression is

q (x_{t}| x_{0}) = N (x_{t}; \sqrt{1 - β_{t}} x_{0}, β_{t} I)

(1)

A parameterized reverse process is designed to recover the original data from the degraded input

p_{θ} (x_{0}) = \int p_{θ} (x_{0 : T}) d x_{1 : T}

.

p_{θ} (x_{0}| x_{t}) = N (x_{0}; μ_{θ} (x_{t}, t), Σ_{θ} (x_{t}, t))

(2)

To ensure that the forward process produces data

x_{T}

that approximates a standard normal distribution, the noise schedule hyperparameters

β_{t}

in the diffusion process are carefully designed. As a result, the prior distribution

p (x_{T})

at the final timestep T is typically set to a standard Gaussian. The reverse process, parameterized by a neural network, is trained to recover the denoising path from the prior distribution back to the original data.

Cold Diffusion Module for Underwater Restoration

However, the Gaussian noise assumption in classical DDPMs does not match the real degradation in underwater imagery. Underwater images are mainly affected by blur, haze, and color shift caused by light scattering, rather than by pixel-wise independent Gaussian noise. Training a model to invert Gaussian noise may therefore fail to fully capture the structured distortions that are most relevant for underwater perception.

To address this mismatch, we adopt the Cold Diffusion paradigm, which replaces stochastic Gaussian noise with deterministic degradation operators that explicitly mimic underwater distortions. Instead of sampling noise, the forward trajectory is defined as

x_{t} = D_{t} (x_{0}),

(3)

where

D_{t} (\cdot)

denotes a deterministic transform that applies progressively stronger blur and contrast reduction to the clean image

x_{0}

as

t

increases. The reverse model

f_{θ} (x_{t}, t)

is then trained to restore the degraded image toward its clean counterpart step by step. In this formulation, the diffusion mechanism is preserved, but the forward degradation is now physically motivated. Rather than undoing Gaussian noise, the model learns to invert realistic underwater degradations, making the restored images more structurally faithful and visually consistent for downstream detection. The training objective for the Cold Diffusion module is defined as a reconstruction loss:

L_{C D} (θ) = E_{x_{0, t}} [∥ x_{0} - f_{θ} (x_{t}, t) ∥_{1}],

(4)

where

f_{θ} (x_{t}, t)

predicts a refined image at timestep

t

that should approximate the original clean image

x_{0}

.

Based on the above formulation, we implement the Cold Diffusion module as a front-end restoration network for UDD-YOLO. Figure 4 shows the overall design of the network. The network adopts an encoder–decoder architecture with skip connections, similar to U-Net structures commonly used in image restoration. The encoder gradually downsamples the input

x_{t}

to extract hierarchical features capturing both global context (e.g., large-scale haze and illumination gradients) and local structures (e.g., debris edges and textures). The decoder then upsamples and fuses these features, using skip connections to preserve fine details that are critical for small objects such as fishing nets or plastic bags. A lightweight embedding of the timestep

t

is injected into intermediate layers, enabling the same network to adapt its behavior along different points of the degradation-restoration trajectory.

Furthermore, by combining this pre-processing strategy with YOLOv12n, we effectively decouple the enhancement and detection stages, enabling the detection network to concentrate on identifying key patterns within clean inputs, while the DDPM module specializes in mitigating visual distortions introduced by underwater environments. Empirical evaluations in Section 3 demonstrate that this joint framework achieves superior detection accuracy compared to baseline methods without pre-processing.

During the training phase of the Cold Diffusion module (illustrated in Figure 5), a clean underwater image

x_{0}

is first sampled from the data distribution. A timestep

t \in {1, \dots, T}

is then randomly drawn, and the corresponding degraded image

x_{t}

is obtained using the timestep-dependent degradation operator

D_{t} (\cdot)

defined. In our implementation,

D_{t}

gradually increases the blur strength and decreases the image contrast as

t

grows, so that larger timesteps correspond to more severe underwater degradations, consistent with the progressive scattering effects in real scenes.

Instead of injecting Gaussian noise as in classical DDPMs, we construct a degraded image

x_{t}

through a timestep-dependent deterministic operator

D_{t} (\cdot)

:

x_{t} = D_{t} (x_{0}) = C_{t} (B_{t} (x_{0})),

(5)

where

B_{t} (x_{0}) = x_{0} * K_{σ_{t}}

applies a Gaussian blur with a standard deviation

σ_{t}

that increases with the timestep

t

, and

C_{t} (y) = α_{t} y + (1 - α_{t}) \overset{ˉ}{y}

denotes a contrast attenuation operator with a decaying contrast factor

α_{t}

and global mean intensity

\overset{ˉ}{y}

. Both

σ_{t}

and

α_{t}

are scheduled as simple monotonic functions of the timestep

t

, where

σ_{t}

increases and

α_{t}

decreases with

t

. Therefore,

D_{t}

gradually increases the blur strength and decreases the image contrast as

t

grows, so that larger timesteps correspond to more severe underwater degradations, consistent with the progressive scattering and attenuation effects in real underwater scenes. The pair

(x_{t}, t)

is then fed into the restoration network

f_{θ}

, which is optimized using the Cold Diffusion reconstruction loss

L_{C D} (θ)

introduced in this section.

This training strategy encourages the model to learn a stable reverse path that incrementally removes blur and restores local contrast, enabling it to recover fine debris details from strongly degraded underwater image inputs.

2.2.2. AMC2f Module

Although the YOLOv12n baseline is known for its lightweight and efficient design, its standard feature extraction modules still exhibit limitations in multi-scale object perception. This challenge is especially critical in underwater marine debris monitoring tasks, where objects such as plastic bottles, fishing nets, tires, and face masks differ drastically in size, shape, material, and packaging. Occlusion by algae or sediment and partial embedding in the seabed make feature extraction across scales more difficult. With fixed receptive fields in conventional convolution modules of the YOLOv12n architecture, small objects may lose fine local details, whereas large objects may lose their global structure.

To solve these problems, this study introduces a new module called AMC2f (Adapter-enhanced Multi-cognitive C2f), inspired by the Mona (Multi-cognitive Visual Adapter) mechanism [65]. The internal architecture of the Mona block is shown on the right side of Figure 6. Mona begins with Layer Normalization, followed by a Down Projection operation to reduce channel dimensionality. Its core multi-scale perception is enabled by three parallel depth-wise convolution branches with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, which capture receptive fields at different scales. Their outputs are fused via averaging and further processed by a 1 × 1 convolution, a GeLU activation, and an Up Projection layer, with residual connections ensuring stable gradient propagation. Although Mona provides multi-scale modeling and helps capture semantic information across different receptive fields, its use within underwater scenes still faces challenges. Underwater debris often has blurred boundaries and irregular shapes, and smaller receptive fields may fail to detect vague or low-contrast targets. Similarly, while the original A2C2f module in YOLOv12n has some global modeling ability, it extracts features in a coarse way and cannot capture local details well in underwater scenes.

Because of these limitations, our proposed AMC2f module modifies the feature processing pipeline rather than simply inserting a Mona block into the C2f framework. As shown on the left side of Figure 6, input features first pass through a conventional convolutional block (CBS) before entering the core of the module, which consists of several sequential Mona blocks. The outputs from these blocks are concatenated with a skip connection from the original input and then processed by another CBS block. This result is scaled by a learnable parameter denoted as self.gamma, which adaptively scales the enhanced feature branch before it is merged with the identity branch. Self.gamma enables content-adaptive refinement according to the degradation level of underwater inputs without increasing inference complexity. Finally, the refined features are added back to the main feature stream through a residual connection. These task-specific modifications are not present in the original Mona design and were introduced to address the edge smoothing, scale variation, and high-frequency noise commonly observed in underwater debris images, making AMC2f a task-adapted extension rather than a simple combination of the Mona block and C2f.

By stacking multiple Mona blocks to benefit from their multi-scale processing capability and incorporating the adaptive self.gamma gating mechanism, AMC2f provides a stronger and more flexible feature modeling capacity than using a single Mona block alone. The residual design supports stable learning even with poor-quality underwater images, while the combination of shallow and deep features enhances robustness against occlusion, shape variations, and background clutter. These characteristics make AMC2f particularly effective for underwater debris detection, where objects exhibit diverse sizes, irregular textures, and can be partly hidden by marine snow, sediments, or algae.

2.2.3. Unified-IoU (UIoU) Loss Function

To solve the common problems in underwater object detection, such as large differences in object scale, frequent occlusion, low visibility, and background clutter, a new bounding box regression loss called Unified Intersection-over-Union (UIoU) is proposed. UIoU is based on normal IoU losses but adds adaptive weighting and geometric consistency constraints. This helps improve object localization and separation in noisy underwater scenes.

Unlike IoU, GIoU, or CIoU, which treat all localization errors the same, UIoU changes the regression gradient based on the predicted IoU score. For high-IoU predictions, the loss tightens the bounding box toward the ground truth. This makes small localization errors more visible and pushes the model to adjust more carefully. For low-quality predictions, the loss softens the penalization by expanding the predicted box and reducing its contribution to the objective function, which helps to stabilize training and prevent gradient explosion in early epochs.

To guide the training process, a cosine annealing schedule is employed to dynamically adjust the scaling ratio. In the early stages of training, the model emphasizes low-quality predictions to accelerate convergence and ensure difficult samples are not neglected. To handle occlusion and overlapping objects, UIoU uses a Focal-inv mechanism. It is similar to Focal Loss but works in the opposite way. It lowers the weight of low-confidence predictions and raises the weight of high-IoU predictions. This makes the model focus on reliable detections and reduce false positives.

UIoU also combines different IoU variants and adds more geometric information, such as IoU score, center distance, aspect ratio, box shape, and orientation. These constraints make bounding box regression stronger and more accurate when detecting debris like nets, tires, and plastic bags. UIoU improves bounding box regression with dynamic scaling. At the start of training, the model makes coarse box adjustments. Later, it makes fine adjustments. A scaling factor called ratio changes from 2.0 to 0.5 during training, which controls the resizing of predicted boxes. This is defined in the following equation:

w^{'} = w \times r a t i o

(6)

h^{'} = h \times r a t i o

(7)

Here, w and h denote the original width and height of the predicted bounding box, and

\tilde{B}

denotes the scaled box. The basic IoU regression term is

L_{b a s e} = 1 - I o U (\tilde{B}, B^{*}),

(8)

where

I o U (\tilde{B}, B^{*})

can follow advanced IoU variants so that center distance, aspect ratio, shape and orientation constraints are implicitly incorporated. To emphasize high-IoU and high-confidence predictions while reducing the effect of noisy predictions, the Focal-inv weight is defined by combining the IoU score and the classification confidence

p

:

s = α \cdot I o U + (1 - α) p,

(9)

where α balances the contribution of IoU and confidence. Finally, the Unified-IoU loss is written as follows:

L_{U I o U} = (α \cdot I o U + (1 - α) p)^{γ} [1 - I o U (\tilde{B}, B^{*})]

(10)

Here, γ is the focusing parameter of the Focal-inv term. This adjustment enhances the model’s focus on hard examples during training. Specifically, the coefficient

α

controls the contribution of the IoU score, while γ governs the influence of the confidence score. UIoU further adds constraints like center point alignment, width-height ratio consistency, and angular similarity. These help the model match object shapes, even for irregular debris under occlusion or complex positions.

The detection framework starts with pre-processing. A Cold Diffusion model restores underwater images and improves clarity and contrast. This reduces the effects of turbidity and poor lighting and gives a better base for feature extraction. In the backbone, the usual YOLOv12n modules are replaced with the AMC2f module. AMC2f improves multi-scale perception and extracts stronger features from both small and large debris. Finally, UIoU is used during training for bounding box regression. Its adaptive weighting and scaling make localization more precise while keeping good generalization.

These three strategies operate synergistically: the Cold Diffusion model improves image quality and supplies high-fidelity input for feature extraction; the AMC2f module extracts high-quality features from enhanced images; and the UIoU loss function optimizes localization based on these features. Together, they transform the lightweight YOLOv12n baseline into an efficient, accurate, and robust underwater object detection system, meticulously tailored for the complexities of underwater environments. The effectiveness of this integrated framework will be thoroughly validated in the subsequent experimental section.

2.3. Network Training and Optimization

Table 1 lists the experimental setup used in this study. The experiments ran on Ubuntu 22.04 with 120 GB memory and an NVIDIA GeForce RTX 4090 GPU (24 GB). The system had a 16-core Intel(R) Xeon(R) Gold 6430 processor. Training and testing were carried out with PyTorch 2.1.0, and GPU support was provided through CUDA 12.1. The Python version was 3.10.

The Cold Diffusion restoration network is first trained independently on the UDD training split using only the reconstruction losses. After convergence, its parameters are frozen, and the network is used as a fixed preprocessing module in front of the detector. During the subsequent training of UDD-YOLO, gradients from the detection loss are not back-propagated into the diffusion module; only the YOLOv12n backbone, neck, head, and the parameters related to UIoU are updated. Both stages use the same underlying training images from the UDD training set. The Cold Diffusion stage adopts simple geometric augmentations, such as random cropping and horizontal flipping, whereas the detector training follows the standard YOLO configuration with Mosaic-based augmentation, scale jittering, and label smoothing. This two-stage strategy allows the diffusion model to specialize in underwater restoration while keeping the detector optimization focused on robust feature learning and localization.

To ensure a fair comparison, all baseline detectors were trained under a unified training protocol, as shown in Table 2. For all YOLO-family models, the input resolution was fixed to 640

\times

640, and training was performed for 200 epochs with a batch size of 32 and 16 data-loading workers. Stochastic Gradient Descent (SGD) was adopted as the optimizer, with an initial learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005, following the official YOLO training configuration. Mosaic-based data augmentation, including random flipping and scaling, was applied consistently across all YOLO-based experiments, together with label smoothing to improve generalization. Other representative baselines (Faster R-CNN, MobileNetv2-SSD, RT-DETR-L, and Mamba-YOLO-T) were retrained on the dataset using the same input resolution, number of epochs, and data augmentation policy, while keeping their remaining hyperparameters aligned with their official implementations. No early stopping was used; all models were trained for the full 200 epochs, ensuring comparable training budgets across methods.

In all experiments, each raw underwater image is first passed through a lightweight Dark Channel Prior (DCP)-based dehazing step to suppress strong surface reflections and global water haze. The resulting DCP-corrected image is then fed into the subsequent enhancement module and finally into the YOLOv12n-based detector. In our main configuration, the enhancement module is the proposed Cold Diffusion restoration network, which refines the DCP output and provides structurally clearer inputs for UDD-YOLO.

2.4. Evaluation Metrics for Object Detection

The effectiveness of an object detection model is evaluated using metrics that assess classification accuracy and localization precision [47]. Precision and Recall are the main indicators of detection quality. Precision is defined as the ratio of true positives (TP) to all predicted positives. It shows how well the model reduces false positives (FPs). Recall is the proportion of true positives relative to the total actual positives. It shows how well the model finds relevant objects [48]. Localization is assessed using Intersection over Union (IoU), which is the overlap area divided by the union area of the predicted and ground-truth boxes. A higher IoU means better localization [66].

Precision, Recall, and IoU focus on single predictions or single classes. To check overall performance, Average Precision (AP) and mean Average Precision (mAP) are used. AP refers to the area under the Precision–Recall curve for a single class. mAP is obtained by averaging AP across all classes, providing an overall measure of detection performance.

The formulas for these metrics are as follows:

P r e c i s i o n = \frac{T P}{T P + F P}

(11)

R e c a l l = \frac{T P}{T P + F N}

(12)

I o U = \frac{A r e a o f O v e r l a p}{A r e a o f U n i o n}

(13)

A P (y, y^{*}) = \frac{1}{N} \sum_{C} a r e a (P r)

(14)

m A P = \frac{1}{n} \sum_{i = 1}^{n} A P_{i}

(15)

In this study, mAP@50 is adopted as the primary evaluation metric. This variant assesses mean Average Precision at a fixed IoU threshold of 0.5, striking a balance between strict spatial accuracy and the flexibility required for real-world object variability [67]. The mAP@50 metric is particularly suitable for underwater environments, where object shapes, scales, and occlusion levels vary substantially. By aggregating AP across categories at this threshold, mAP@50 captures both inter-class generalization and intra-class consistency, offering a robust indicator of model effectiveness in complex scenarios. We also report the mAP@[0.5:0.95], which averages AP over IoU thresholds from 0.5 to 0.95 with a step of 0.05. Compared with mAP@50, this metric provides a stricter and more fine-grained evaluation of localization quality.

In addition to these accuracy-oriented indicators, we also report the computational complexity and runtime efficiency of each detector. Specifically, we use FLOPs, expressed in units of GFLOPs (denoted as FLOPs(G)), to quantify the number of floating-point operations required for a single forward pass at an input resolution of 640 × 640. Furthermore, FPS (frames per second) is adopted to measure the actual inference with a batch size of 1. Note that FLOPs(G) and FPS are computed only for the detector backbone, neck, and head, excluding the DCP and Cold Diffusion enhancement modules.

Collectively, these metrics provide a rigorous framework for evaluating object detection models. While Precision, Recall, and IoU assess specific aspects of detection quality, AP and mAP offer broader insights across categories. GFLOPs and FPS characterize how suitable each model is for real-time deployment on resource-constrained underwater platforms. Together, they facilitate a nuanced and balanced evaluation of a model’s strengths, weaknesses, and real-world applicability in underwater debris detection.

3. Results

3.1. Ablation Experiments

To quantitatively assess both the separate and combined effects of the proposed components, ablation experiments were carried out on three main parts: (①) the UIoU loss function, (②) the AMC2f feature extraction module, and (③) the Cold Diffusion-based image enhancement mechanism. The experimental results and corresponding performance metrics are summarized in Table 3.

This baseline performance reflects the efficiency-oriented design of YOLOv12n, which prioritizes fast inference and compact size over complex contextual reasoning. Its high precision (80.8%) stems from its conservative bounding box regression, which favors confident, well-localized detections. However, the lower recall (68.0%) suggests a tendency to miss ambiguous or partially occluded debris, especially in underwater scenes with turbidity or overlapping objects. This trade-off is characteristic of lightweight detectors lacking deeper multi-scale or semantic modeling mechanisms, which are crucial for recovering difficult targets in visually degraded environments. The parameter count of 2.56 M remains relatively low, but this configuration does not yet benefit from targeted optimization modules (e.g., UIoU or AMC2f) that improve generalization under uncertainty or visual degradation. Despite its efficiency, this configuration exhibited limited robustness in handling visual noise, scale variation, and boundary ambiguity, which are challenges commonly encountered in underwater debris detection.

As shown in Table 3, each proposed module contributes incrementally to performance improvement, and the full model achieves the highest mAP@50. The three components are the UIoU loss function, the AMC2f feature extractor, and the Cold Diffusion pre-processing procedure. In the YOLOv12n-1 configuration, the standard IoU-based loss was replaced with the proposed UIoU loss function. This modification resulted in a slight reduction in precision to 78.8%, while recall improved to 71.4% and mAP@50 increased to 77.9%. The number of parameters and the model size of YOLOv12n-1 remain unchanged compared with the YOLOv12 baseline.

Building upon this, the YOLOv12n-2 variant incorporated UIoU together with the AMC2f module. This configuration achieved notable gains across all key metrics, with precision reaching 82.4%, recall increasing to 77.3%, and mAP@50 improving to 81.0%. Compared with YOLOv12n-1, YOLOv12n-2 added only 3616 parameters, and the overall model size remains 5.5 MB, preserving its lightweight design.

The complete integration of all proposed components further improved overall performance. With the inclusion of Cold Diffusion pre-processing, the model achieved the best results, reaching 87.3% precision, 75.1% recall, and 81.8% mAP@50. This configuration retains the same number of parameters and the same model size as YOLOv12n-2.

Figure 7 illustrates the qualitative impact of the proposed components using representative examples of underwater debris. By comparing the confidence scores and bounding boxes across each row, it can be observed that the baseline YOLOv12n model shows limited confidence (average scores of 0.33, 0.64, and 0.72) and incomplete object localization (bounding boxes not fully aligned with the object). After introducing the UIoU loss in the YOLOv12n-1 configuration, the bounding boxes become more stable, and the confidence scores increase moderately, indicating improved localization precision. With the addition of the AMC2f module in the YOLOv12n-2 configuration, feature representation is further strengthened, yielding higher confidence and more complete object boundaries even under low contrast or partial occlusion. The full model with Cold Diffusion pre-processing achieves the highest confidence scores and the most accurate detections across all examples.

In summary, the ablation study validates the individual and combined effectiveness of the proposed components. The UIoU loss enhances localization precision through quality-aware regression; the AMC2f module improves multi-scale feature representation; and the Cold Diffusion mechanism mitigates visual degradation in underwater imagery. Together, these components form an effective framework designed to address the specific challenges of underwater debris detection, achieving high accuracy without sacrificing computational efficiency.

3.2. Comparison with Different Detectors

To assess the performance of UDD-YOLO, it was compared against several widely used detectors. These involved two-stage models like Faster R-CNN, transformer-based detectors like RT-DETR, SSD networks, and lightweight YOLO variants. All models used the same dataset, image resolution, and training setup to ensure fairness in comparison. Table 4 reports the results, covering precision, recall, mAP@50, parameter count, model size, FPS and FLOPs. In addition, all metrics are measured for the detector (backbone + neck + head) only, with an input size of 640 × 640 and batch size = 1 on a single NVIDIA GeForce RTX 4090 GPU; the DCP and Cold Diffusion preprocessing modules are not included in these complexity metrics.

Among the two-stage models, Faster R-CNN achieved a respectable precision of 75.6%, recall of 73.0%, and mAP@50 of 75.5%, but suffered from high computational complexity, with over 28 million parameters and a model size of 108.7 MB. Despite its accurate region proposal mechanism, the large model size and slower inference make it less suitable for real-time deployment in resource-constrained underwater environments.

Lightweight one-stage detectors like MobileNetv2-SSD and YOLOv3-tiny offer fast inference and compact size, but they deliver relatively lower detection performance, with mAP@50 of 54.3% and 65.5%. These shortcomings arise from limited representational ability and weaker robustness when detecting small or partly occluded underwater debris.

Recent YOLO family models, including YOLOv5n, YOLOv8n, YOLOv9t, YOLOv10n, and YOLOv11n, achieved balanced results in accuracy and efficiency. Among them, YOLOv8n reached a peak mAP@50 of 78.2% and precision of 86.2%, highlighting the advantage of its updated head and neck design. YOLOv12n, the direct baseline for our study, reported mAP@50 of 76.8%, but had limited recall (68.0%) due to weaker adaptability to image degradation and inconsistency in underwater scenes.

As presented in Table 4, the proposed UDD-YOLO surpasses multiple advanced baseline detectors with higher mAP@50, demonstrating strong detection ability under real underwater conditions. The proposed model demonstrated the best overall performance, achieving a precision of 87.3%, recall of 75.1%, and mAP@50 of 81.8%, while maintaining the same parameter size (2.56 M) and model footprint (5.5 MB) as YOLOv12n. This improvement validates the effectiveness of the three integrated components: the Cold Diffusion pre-processing helps correct brightness variations and recover structural details lost due to blur or scattering, enhancing the overall image quality prior to detection. Meanwhile, the adaptive self.gamma mechanism in AMC2f improves the model’s responsiveness to different lighting conditions. Additionally, AMC2f’s multi-branch structure enables robust feature extraction across varying levels of blur. When local details are obscured, the larger 5 × 5 and 7 × 7 kernels can still capture global patterns and object contours, compensating for the limited effectiveness of smaller kernels like 3 × 3.

As representative recent architectures, RT-DETR-L and Mamba-YOLO-T provide strong baselines in this comparison. RT-DETR-L leverages a transformer encoder–decoder with global self-attention and query-based decoding, which is well-suited for complex object interactions. However, on our dataset, it achieves only 75.1% mAP@50 with 25.7 M parameters, whereas UDD-YOLO attains 81.8% mAP@50 with only 2.56 M parameters. This indicates that a carefully tailored lightweight architecture with diffusion-based pre-processing can surpass a heavy transformer detector in both accuracy and real-time efficiency for underwater debris detection. Similarly, Mamba-YOLO-T exploits Mamba-based state space modeling and obtains 76.5% mAP@50 with 5.99 M parameters but still lags behind UDD-YOLO in precision, recall, and overall mAP@50. These comparisons highlight that, despite the advanced global modeling capabilities of transformer- and Mamba-based detectors, the proposed UDD-YOLO achieves a more favorable accuracy-efficiency trade-off by explicitly addressing underwater degradation.

In addition, our model achieves a favorable trade-off between detection accuracy and computational cost. Although integrating the enhanced modules slightly increases the FLOPs(G) (0.5 G) compared with the original YOLOv12n baseline, the overall complexity still remains in the lightweight regime and is clearly lower than that of the larger detectors considered in the comparison. At the same time, UDD-YOLO maintains real-time inference with competitive FPS, indicating that the performance gain mainly comes from more effective feature representation.

Figure 8 compares the detection outputs of various detectors, showing that several baselines either miss the target or produce less accurate bounding boxes under degraded underwater conditions. In contrast, the proposed UDD-YOLO generates clearer and more precisely localized detections with a higher confidence score, demonstrating its advantage in handling blur, low contrast, and complex backgrounds.

In conclusion, the proposed method outperforms all comparison models across key detection metrics while maintaining a lightweight and deployable architecture. These results affirm the practicality and scalability of our approach for real-time underwater garbage detection tasks, particularly in environments with complex visual noise, occlusions, and diverse object geometries.

3.3. Comparison with Different Image Enhancement Methods

To further clarify the effect of the proposed Cold Diffusion enhancement on detection performance, we conducted a comparison experiment by pairing a fixed YOLOv12n-2 in Section 3.1 with different underwater enhancement strategies. In this experiment, the detector weights are kept unchanged, and only the input pre-processing is modified at inference time. Specifically, we evaluate six configurations on the test set: (1) YOLOv12n-2 with raw underwater images, (2) YOLOv12n-2 with global histogram equalization (HE), (3) YOLOv12n-2 with Contrast Limited Adaptive Histogram Equalization (CLAHE), (4) YOLOv12n-2 with Dark Channel Prior (DCP)-based dehazing, (5) YOLOv12n-2 with a representative GAN-based enhancement method [64], and (6) YOLOv12n-2 with the proposed Cold Diffusion enhancement (UDD-YOLO).

For all configurations in this subsection, the input to each enhancement module (HE, CLAHE, DCP-only, StyleGAN3, and Cold Diffusion) is the same DCP-corrected image, and the enhanced result is then fed to the same YOLOv12n-2 detector. The detection metrics, including precision, recall, mAP@50, and mAP@[0.5:0.95], are reported in Table 5.

YOLOv12n-2 uses the raw underwater images as input and therefore reflects the intrinsic robustness of the baseline detector. As shown in Table 5, this configuration achieves 80.9% precision, 75.3% recall, 80.1% mAP@50, and 60.1% mAP@[0.5:0.95]. Histogram Equalization (HE) is a global contrast enhancement technique that redistributes the intensity of the histogram to approximate a uniform distribution, thereby stretching frequently occurring gray levels over a wider range. When applied before detection, YOLOv12n-2 + HE yields 81.1% precision, 74.1% recall, 80.3% mAP@50, and 58.3% mAP@[0.5:0.95]. Compared with the raw baseline, mAP@50 remains similar, but recall and mAP@[0.5:0.95] decrease, indicating that HE tends to amplify background noise and haze. As a result, the detector misses more objects, especially small or distant debris.

Contrast Limited Adaptive Histogram Equalization (CLAHE) performs histogram equalization in small tiles and clips the histogram to avoid over-amplifying noise, aiming to enhance local contrast under non-uniform illumination. With CLAHE, YOLOv12n-2 attains 83.6% precision, 74.6% recall, 80.8% mAP@50, and 61.2% mAP@[0.5:0.95]. The higher precision and slightly improved mAP@50 and mAP@[0.5:0.95] compared with the raw baseline suggest that some low-visibility debris becomes more salient and better localized. However, there is a reduction in recall compared to the YOLOv12n-2 (RAW).

Dark Channel Prior (DCP)-based enhancement estimates a transmission map and ambient light under the assumption that at least one color channel is very dark in haze-free patches, thereby removing veiling light and sharpening edges. When combined with YOLOv12n-2, DCP improves precision to 82.4% and slightly increases mAP@50 to 81.0%, while recall increases to 76.3%. This modest mAP gain indicates that suppressing underwater haze and enhancing boundaries can help the detector localize clearer debris.

StyleGAN3-based underwater enhancement uses a generator-discriminator framework to learn a mapping from raw to visually improved images, focusing on correcting color casts and improving perceptual quality. In Table 5, YOLOv12n-2 + StyleGAN3 reaches 86.5% precision, 75.6% recall, 81.5% mAP@50, and 62.1% mAP@[0.5:0.95]. The notably higher precision suggests that GAN-based enhancement makes debris visually more salient and increases the detector’s confidence. However, the recall remains lower than the baseline. Consequently, the overall gain over raw images is modest, and the enhancement is not uniformly beneficial across all debris types.

UDD-YOLO integrates a Cold Diffusion-based restoration module that performs deterministic blur-and-contrast modeling tailored to underwater degradation. With this complete design, UDD-YOLO achieves 87.3% precision, 75.1% recall, 81.8% mAP@50, and 62.3% mAP@[0.5:0.95], which is the best overall precision–recall trade-off. Compared with the YOLOv12n-2 (Raw) baseline, UDD-YOLO improves precision by 6.4 points, mAP@50 by 1.7 points, and mAP@[0.5:0.95] by 2.2 points, while maintaining a comparable recall level.

3.4. Effect of Different IoU-Based Regression Losses

To evaluate the effect of different IoU-based regression losses under a fair setting, we conduct a controlled experiment where only the bounding-box regression loss is changed while keeping the other architectures and training protocol identical. Specifically, we adopt the UDD-YOLO as the base model and compare five loss variants: the default CIoU loss [68], GIoU [69], SIoU [70], and the adopted UIoU (UDD-YOLO). All models are trained on the dataset with the same input resolution (640

\times

640), number of epochs, optimizer, and data augmentation strategy as described in Section 2.3. The comparison results, including precision, recall, mAP@50, and mAP@[0.5:0.95], are summarized in Table 6.

CIoU explicitly incorporates the overlap area, center distance, and aspect ratio into a unified loss and is widely used in recent YOLO variants. When the original loss is replaced by CIoU, UDD-YOLO attains 84.9% precision, 75.0% recall, 81.1% mAP@50, and 60.9% mAP@[0.5:0.95]. Compared with the original loss, this corresponds to gains of +1.1 points in precision, +1.0 points in recall, +0.3 points in mAP@50, and +0.5 points in mAP@[0.5:0.95].

GIoU extends the standard IoU by penalizing the area of the smallest enclosing box, aiming to accelerate convergence when the predicted and ground-truth boxes do not overlap. With GIoU, UDD-YOLO achieves 84.1% precision, 74.7% recall, 81.3% mAP@50, and 60.3% mAP@[0.5:0.95]. Although mAP@50 is slightly higher than that of CIoU, the recall and mAP@[0.5:0.95] are lower than those of both CIoU and the original loss.

Under SIoU, UDD-YOLO obtains 85.6% precision, 74.4% recall, 80.7% mAP@50, and 61.8% mAP@[0.5:0.95]. Compared with the original loss, SIoU markedly improves precision and increases mAP@[0.5:0.95] by 1.4 points, but slightly reduces recall and mAP@50. This pattern suggests that SIoU favors high-quality, well-aligned boxes and improves performance at stricter IoU thresholds, yet may sacrifice some detections of ambiguous or low-contrast debris, leading to a modest drop in overall detection rate.

The proposed UIoU integrates cosine-annealed scaling and Focal-inv weighting into an IoU-style loss, dynamically emphasizing high-IoU, high-confidence predictions while suppressing noisy low-IoU samples. With UIoU, UDD-YOLO achieves 87.3% precision, 75.1% recall, 81.8% mAP@50, and 62.3% mAP@[0.5:0.95], which are the best overall results among all tested losses. Relative to the original loss, UIoU improves precision by 3.5 points, recall by 1.1 points, mAP@50 by 1.0 points, and mAP@[0.5:0.95] by 1.9 points. Compared with CIoU, UIoU still gains 2.4 points in precision and 1.4 points in mAP@[0.5:0.95]. These improvements confirm that quality-aware reweighting and dynamic scaling are particularly effective for underwater debris, where many objects are small or partially occluded and accurate boundary regression is crucial.

4. Discussion

4.1. Key Modifications Contributing to Performance Improvement

The ablation study, detectors comparison, image enhancement methods comparison, and loss function comparison presented in Section 3.1, Section 3.2, Section 3.3 and Section 3.4 provide insight into why the proposed components contribute to improved detection performance under complex underwater conditions.

For the loss function, UIoU enhances localization robustness by dynamically reweighting predictions according to their estimated quality. Compared with CIoU, GIoU, and SIoU, UIoU provides a more stable optimization signal under irregular object shapes and partial occlusion. CIoU benefits from explicit geometric constraints, but its improvements under strict IoU thresholds remain limited in cluttered underwater scenes. GIoU adjusts boxes more aggressively when predictions and targets do not overlap, yet its advantage does not fully translate to high-IoU localization in underwater debris detection, where targets are small and frequently embedded in complex backgrounds. SIoU improves orientation and aspect-ratio consistency, but its gains remain moderate compared with the quality-aware formulation of UIoU. These comparisons collectively highlight why UIoU is better aligned with underwater debris detection, where bounding boxes often feature irregular shapes and uncertain boundaries.

The AMC2f module strengthens multi-scale feature representation by capturing both fine textures and larger structural cues. This enables the detector to maintain confident predictions even in low-contrast scenes or under partial occlusion, explaining the consistent improvements observed across the ablation results. Its adaptive multi-branch structure ensures that both coarse and fine features are preserved, providing stable representations for downstream localization and classification.

The Cold Diffusion enhancement mechanism contributes substantially to the final performance. By restoring edge sharpness and local contrast, it provides clearer visual cues that improve feature extraction and facilitate more complete object delineation. Unlike general enhancement tools such as HE, CLAHE, DCP, or GAN-based enhancement, Cold Diffusion avoids brightness inconsistency, halo artifacts, and generated textures. Instead, it performs task-oriented structural restoration, resulting in more consistent recall and higher precision across diverse underwater conditions.

Together, these three components form a complementary framework: Cold Diffusion improves the input domain, AMC2f extracts robust multi-scale features, and UIoU refines localization with high stability. Their combined effects explain the substantial improvements reported in Section 3 and demonstrate the suitability of the proposed design for underwater debris detection.

4.2. Robustness Against Degraded Input Data

In real-world underwater environments, image quality is frequently degraded due to various factors such as illumination variation, suspended particles, optical distortion, and lens fogging. To evaluate the robustness of the proposed detection model under such adverse conditions, a series of controlled degradation tests was performed on the marine debris dataset. The performance under these conditions is visualized in Figure 9, with subplots (a–d) corresponding to brightness, blur, scattering, and distortion perturbations, respectively.

4.2.1. Brightness Perturbation

To assess the model’s robustness under varying illumination conditions, brightness perturbations were applied by converting RGB images into the LAB color space and systematically modifying the L-channel, which encodes luminance information. By applying offset values of ΔL = {−30, −20, −10, 0, +10, +20, +30}, a wide spectrum of underwater lighting scenarios was simulated, ranging from extremely dark to overly bright environments.

As depicted in Figure 9a, all models exhibited performance degradation at extreme luminance levels, primarily due to reduced contrast and loss of edge definition. However, the proposed UDD-YOLO model consistently achieved the highest mAP@50 across all brightness conditions, indicating superior adaptability to illumination inconsistencies. This resilience may be attributed to two key design elements: (1) the Cold Diffusion pre-processing module, which enhances local contrast and restores details lost due to underexposure or overexposure, and (2) the adaptive γ-gating mechanism within the AMC2f module, which dynamically adjusts the model’s feature response based on varying input intensities. Such adaptability is essential for real-world deployment, especially in underwater missions where natural light fluctuates with depth, turbidity, and artificial light interference [71].

4.2.2. Gaussian Blur Perturbation

Gaussian blur with standard deviations σ = {0.5, 1.0, 1.5, 2.0, 2.5} was applied to emulate vision degradation resulting from underwater housing fogging or particulate matter in turbid water columns. This operation acts as a low-pass filter, diminishing sharp transitions and thereby degrading edge definition, as described mathematically below:

G (x, y) = \frac{1}{2 π σ^{2}} e^{- \frac{x^{2} + y^{2}}{2 σ^{2}}}

(16)

As illustrated in Figure 9b, while detection accuracy of all models declined with increased blur intensity, the proposed model retained superior performance. This suggests that the model’s architecture effectively captures global semantic information, which is less sensitive to local texture degradation [72].

4.2.3. Scattering and Haze Simulation

Underwater visibility is often compromised by the presence of suspended particles and dissolved organic matter, which cause forward scattering and haze effects. To simulate this phenomenon, scattering noise was applied using a synthetic scattering function, with coefficients β = {0.5, 1.0, 1.5, 2.0, 2.5}, thereby degrading image clarity in a controlled manner.

As shown in Figure 9c, detection accuracy declined across all models with increasing haze levels due to contrast attenuation and edge softening. The proposed model maintained a notably higher detection performance, outperforming other baselines under all tested conditions. This can be largely credited to the Cold Diffusion module, which acts as an effective pre-enhancement filter by recovering degraded structural and textural information from the noisy inputs. Additionally, the multi-scale convolution branches within AMC2f help preserve coarse-to-fine object representations even in low-visibility situations, reinforcing detection accuracy in hazy scenes such as estuarine zones, harbor waters, or offshore environments impacted by sediment plumes [68].

4.2.4. Lens Distortion

Wide-angle and fisheye lenses commonly used in underwater photography introduce nonlinear spatial distortion, leading to geometric warping effects that challenge conventional object detection networks. To replicate such conditions, increasing levels of radial distortion were simulated using coefficients k = {0.0001, 0.0002, 0.0005, 0.0010, 0.0020}, mimicking real-world optical aberrations caused by underwater dome ports or wide-angle housings.

In Figure 9d, it is evident that most baseline detectors suffered from substantial drops in accuracy as distortion increased, with object boundaries being stretched or compressed. UDD-YOLO, however, exhibited consistent performance, showcasing its ability to tolerate moderate-to-severe geometric distortion. This robustness stems from the spatial feature alignment mechanism implicitly introduced through the AMC2f’s multi-scale structure and the residual feature aggregation in Mona blocks, which together allow the network to recalibrate feature maps and mitigate spatial inconsistencies. This property enhances its applicability in practical marine surveys where diverse optics and camera configurations are unavoidable.

4.3. Generalizability and Transferability

The proposed model demonstrates superior robustness under multiple types of image degradation. This capability enhances its applicability to real-world underwater garbage detection tasks, where environmental variables and imaging imperfections are unavoidable. The integration of Cold Diffusion pre-processing, the AMC2f module, and the Unified-IoU loss is likely responsible for this enhanced robustness across diverse visual challenges.

To further evaluate the generalizability and transferability of the proposed UDD-YOLO model in diverse underwater environments, we conducted additional experiments on two independent underwater litter datasets collected from different geographic and hydrological conditions: the Riverbed Litter Dataset from the Hongqi River in Yunnan, China [26], and the Seafloor Debris Dataset from Koh Tao, Thailand [5]. Remarkably, UDD-YOLO achieved a mean Average Precision (mAP) of 0.83 on the riverbed dataset, surpassing the best-performing model, RBL-YOLO (mAP = 0.80), proposed in the original study. Similarly, on the Koh Tao seafloor dataset, UDD-YOLO reached an mAP of 0.94, outperforming the previously reported optimal model SFD-YOLO (mAP = 0.91). These results demonstrate the strong transferability and robustness of UDD-YOLO across datasets with distinct underwater conditions, including variations in water turbidity, object morphology, and background complexity.

Notably, both RBL-YOLO and SFD-YOLO were developed by modifying the YOLOv8s architecture, which is significantly larger in model size compared to UDD-YOLO. In contrast, our proposed model maintains comparable or superior detection performance with only approximately one-fourth the model size, highlighting its potential for deployment on resource-constrained platforms such as Autonomous Underwater Vehicles (AUVs), remotely operated systems, or handheld detection devices. This compact yet powerful architecture underscores the practical advantages of UDD-YOLO in real-world underwater ecological monitoring scenarios, particularly where computational and storage resources are limited.

5. Conclusions

This study presents UDD-YOLO, a novel underwater object detection framework that leverages diffusion-driven image enhancement to address the persistent challenges of poor visibility, occlusion, and morphological diversity in underwater environments. By integrating a Cold Diffusion pre-processing module, an AMC2f feature extraction backbone, and a UIoU loss function, the proposed method significantly enhances both visual quality and detection robustness in complex marine scenes.

Comprehensive tests on a public underwater plastic pollution dataset show that the proposed model outperforms other methods over eleven advanced detection frameworks, such as Faster R-CNN, RT-DETR-L, YOLOv8n, and YOLOv12n. The model reached mAP@50 of 81.8%, with 87.3% precision and 75.1% recall, while maintaining a compact design of only 2.56 M parameters and 5.5 MB in size. These results validate the effectiveness of each integrated module: the Cold Diffusion module restores structural and textural fidelity without introducing stochastic noise, the AMC2f module enables multi-scale and morphology-aware representation, and the UIoU loss strengthens localization performance under occlusion and boundary ambiguity.

The proposed architecture not only achieves high detection accuracy while keeping computational efficiency, which makes it suitable for real-time use on edge devices like underwater drones or autonomous robots. Furthermore, the diffusion-based enhancement strategy marks a paradigm shift in underwater computer vision by introducing generative restoration as a pre-detection process.

Future research will focus on expanding this framework to multi-object tracking and instance segmentation tasks in real underwater environments, as well as integrating adaptive diffusion models that can dynamically respond to scene-specific degradation levels. The promising results of this study lay the foundation for intelligent and scalable systems for marine monitoring, contributing to the long-term goal of environmental preservation and sustainable ocean resource management.

Author Contributions

Conceptualization, F.Z. and J.T.; methodology, J.T. and F.Z.; software, J.T., F.Z., H.W. and J.C.; validation, J.T., Y.L., H.W. and N.X.; formal analysis, J.T.; investigation, J.T., F.Z., Y.C., H.W. and J.C.; resources, F.Z., J.C., P.L. and N.X.; data curation, J.T., F.Z. and Y.C.; writing—original draft preparation, J.T.; writing—review and editing, F.Z., Y.C., Y.L., J.S., F.X., J.C., P.L. and N.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Japan Science and Technology Agency’s SPRING Program (JST SPRING) (JPMJSP2108), the National Natural Science Foundation of China (42571378), and the Natural Science Foundation of Jiangsu Province (BK20240258).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ROVs	Remotely operated vehicles
YOLO	You Only Look Once
UAVs	Unmanned aerial vehicles
SSS	Speedy Sea Scanner
P-SSS	Portable Speedy Sea Scanner
DDPMs	Denoising Diffusion Probabilistic Models
DCP	Dark Channel Prior
SGD	Stochastic Gradient Descent
UIoU	Unified Intersection-over-Union
DDPM	Denoising Diffusion Probabilistic Models
ELBO	Evidence Lower Bound
MSE	Mean squared error
AMC2f	Adapter-enhanced Multi-cognitive C2f
Mona	Multi-cognitive Visual Adapter
CBS	Conventional convolutional block
IoU	Intersection over Union
AP	Average Precision
mAP	Mean Average Precision
AUVs	Autonomous Underwater Vehicles

References

Canals, M.; Pham, C.K.; Bergmann, M.; Gutow, L.; Hanke, G.; van Sebille, E.; Angiolillo, M.; Buhl-Mortensen, L.; Cau, A.; Ioakeimidis, C.; et al. The quest for seafloor macrolitter: A critical review of background knowledge, current methods and future prospects. Environ. Res. Lett. 2021, 16, 023001. [Google Scholar] [CrossRef]
Zhao, F.; Huang, B.; Wang, J.; Shao, X.; Wu, Q.; Xi, D.; Liu, Y.; Chen, Y.; Zhang, G.; Ren, Z.; et al. Seafloor debris detection using underwater images and deep learning-driven image restoration: A case study from Koh Tao, Thailand. Mar. Pollut. Bull. 2025, 214, 117710. [Google Scholar] [CrossRef]
Jiang, S.; Xu, N.; Li, Z.; Huang, C. Satellite derived coastal reclamation expansion in China since the 21st century. Glob. Ecol. Conserv. 2021, 30, e01797. [Google Scholar] [CrossRef]
Yu, P.; Ou, Y.; Chen, Y.; Zhang, H.; Deng, X.; Xu, N. Nature-based solutions in coastal urbanization: Addressing environmental and socio-economic challenges. Earth Crit. Zone 2025, 100032. [Google Scholar] [CrossRef]
Madricardo, F.; Ghezzo, M.; Nesto, N.; Mc Kiver, W.J.; Faussone, G.C.; Fiorin, R.; Riccato, F.; Mackelworth, P.C.; Basta, J.; De Pascalis, F.; et al. How to deal with seafloor marine litter: An overview of the state-of-the-art and future perspectives. Front. Mar. Sci. 2020, 7, 505134. [Google Scholar] [CrossRef]
Shao, X.; Chen, H.; Zhao, F.; Magson, K.; Chen, J.; Li, P.; Wang, J.; Sasaki, J. Multi-label classification for multi-temporal, multi-spatial coral reef condition monitoring using vision foundation model with adapter learning. Mar. Pollut. Bull. 2025, 223, 119054. [Google Scholar] [CrossRef]
Pierdomenico, M.; Casalbore, D.; Chiocci, F.L. Massive benthic litter funnelled to deep sea by flash-flood generated hyperpycnal flows. Sci. Rep. 2019, 9, 5330. [Google Scholar] [CrossRef]
Xu, N.; Ma, Y.; Yang, J.; Wang, X.H.; Wang, Y.; Xu, R. Deriving tidal flat topography using ICESat-2 laser altimetry and sentinel-2 imagery. Geophys. Res. Lett. 2022, 49, e2021GL096813. [Google Scholar] [CrossRef]
Xu, N.; Xu, H.; Li, W.; Lu, H.; Song, Y.; Yao, J.; Ma, Y.; Ren, H.; He, T.; Mo, F.; et al. A large-scale estimation method for beach slopes using ICESat-2 altimeter: A case study of New Zealand. Int. J. Appl. Earth Obs. Geoinf. 2025, 142, 104768. [Google Scholar] [CrossRef]
Sharma, H.; Kumar, H.; Mangla, S.K. Enablers to computer vision technology for sustainable E-waste management. J. Clean. Prod. 2023, 412, 137396. [Google Scholar] [CrossRef]
Zhao, F.; Xu, D.; Ren, Z.; Shao, X.; Wu, Q.; Liu, Y.; Wang, J.; Song, J.; Chen, Y.; Zhang, G.; et al. Mamba-based super-resolution and semi-supervised YOLOv10 for freshwater mussel detection using acoustic video camera: A case study at Lake Izunuma, Japan. Ecol. Inform. 2025, 90, 103324. [Google Scholar] [CrossRef]
Hu, L.; Xu, N.; Liang, J.; Li, Z.; Chen, L.; Zhao, F. Advancing the Mapping of Mangrove Forests at National-Scale Using Sentinel-1 and Sentinel-2 Time-Series Data with Google Earth Engine: A Case Study in China. Remote Sens. 2020, 12, 3120. [Google Scholar] [CrossRef]
Zhang, Q.; Zhang, Z.; Xu, N.; Li, Y. Fully automatic training sample collection for detecting multi-decadal inland/seaward urban sprawl. Remote Sens. Environ. 2023, 298, 113801. [Google Scholar] [CrossRef]
Chen, J.; Li, P.; Lee, J.; Shao, X.; Zhao, F.; Sasaki, J. Optimization of oblique drone photogrammetry for avoiding sun glint in submerged seagrass mapping. Estuar. Coast. Shelf Sci. 2025, 322, 109356. [Google Scholar] [CrossRef]
UNEP. From Pollution to Solution: A Global Assessment of Marine Debris and Plastic Pollution—Synthesis. 2021. Available online: https://wedocs.unep.org/20.500.11822/36965 (accessed on 19 October 2021).
Xu, N.; Wang, Y.; Huang, C.; Jiang, S.; Jia, M.; Ma, Y. Monitoring coastal reclamation changes across Jiangsu Province during 1984–2019 using landsat data. Mar. Policy 2022, 136, 104887. [Google Scholar] [CrossRef]
Xuan, W.; Wang, J.; Qi, H.; Chen, Z.; Zheng, Z.; Zhong, Y.; Xia, J.; Yokoya, N. DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding. arXiv 2025, arXiv:2505.21076. [Google Scholar] [CrossRef]
Wang, J.; Ma, A.; Chen, Z.; Zheng, Z.; Wan, Y.; Zhang, L.; Zhong, Y. EarthVQANet: Multi-task visual question answering for remote sensing image understanding. ISPRS J. Photogramm. Remote Sens. 2024, 212, 422–439. [Google Scholar] [CrossRef]
Wang, J.; Zheng, Z.; Chen, Z.; Ma, A.; Zhong, Y. Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5481–5489. [Google Scholar] [CrossRef]
Wang, J.; Zhong, Y.; Ma, A.; Zheng, Z.; Wan, Y.; Zhang, L. LoveNAS: Towards multi-scene land-cover mapping via hierarchical searching adaptive network. ISPRS J. Photogramm. Remote Sens. 2024, 209, 265–278. [Google Scholar] [CrossRef]
Li, Y.; Zhou, P.; Zhou, G.; Wang, H.; Lu, Y.; Peng, Y. A comprehensive survey of visible and infrared imaging in complex environments: Principle, degradation and enhancement. Inf. Fusion 2025, 119, 103036. [Google Scholar] [CrossRef]
Zhao, F.; Chen, Y.; Xi, D.; Liu, Y.; Wang, J.; Tabeta, S.; Mizuno, K. Enhanced hermit crabs detection using super-resolution reconstruction and improved YOLOv8 on UAV-captured imagery. Mar. Environ. Res. 2025, 210, 107313. [Google Scholar] [CrossRef]
You, S.; Li, B.; Chen, Y.; Ren, Z.; Liu, Y.; Wu, Q.; Tao, J.; Zhang, Z.; Zhang, C.; Xue, F.; et al. Rose-Mamba-YOLO: An enhanced framework for efficient and accurate greenhouse rose monitoring. Front. Plant Sci. 2025, 16, 1607582. [Google Scholar] [CrossRef]
Wang, J.; Xuan, W.; Qi, H.; Liu, Z.; Liu, K.; Wu, Y.; Chen, H.; Song, J.; Xia, J.; Zheng, Z.; et al. DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response. arXiv 2025, arXiv:2505.21089. [Google Scholar]
Chen, J.; Sasaki, J. Mapping of Subtidal and Intertidal Seagrass Meadows via Application of the Feature Pyramid Network to Unmanned Aerial Vehicle Orthophotos. Remote Sens. 2021, 13, 4880. [Google Scholar] [CrossRef]
Zhao, F.; Liu, Y.; Wang, J.; Chen, Y.; Xi, D.; Shao, X.; Tabeta, S.; Mizuno, K. Riverbed litter monitoring using consumer-grade aerial-aquatic speedy scanner (AASS) and deep learning based super-resolution reconstruction and detection network. Mar. Pollut. Bull. 2024, 209, 117030. [Google Scholar] [CrossRef] [PubMed]
Fan, Z.; Wang, J.; Liu, Y.; Chen, Y.; Shao, X.; Xi, D.; Ma, B.; Chen, Y.; Mizuno, K. Consumer-grade aerial-aquatic speedy scanner (AASS) for efficient underwater monitoring with deep learning. In Proceedings of the OCEANS 2024-Halifax, Halifax, NS, Canada, 23–26 September 2024; IEEE: Piscataway, NJ, USA; pp. 1–4. [Google Scholar]
Zhao, F.; Shao, X.; Wang, J.; Chen, Y.; Xi, D.; Liu, Y.; Chen, J.; Sasaki, J.; Mizuno, K. A novel underwater Holothurians monitoring system using consumer-grade amphibious UAV with Mamba-based Super-Resolution Reconstruction and enhanced YOLOv10. Mar. Environ. Res. 2025, 212, 107510. [Google Scholar] [CrossRef]
Majchrowska, S.; Mikołajczyk, A.; Ferlin, M.; Klawikowska, Z.; Plantykow, M.A.; Kwasigroch, A.; Majek, K. Deep learning-based waste detection in natural and urban environments. Waste Manag. 2022, 138, 274–284. [Google Scholar] [CrossRef]
Wu, T.-W.; Zhang, H.; Peng, W.; Lü, F.; He, P.-J. Applications of convolutional neural networks for intelligent waste identification and recycling: A review. Resour. Conserv. Recycl. 2023, 190, 106813. [Google Scholar] [CrossRef]
Chai, Y.; Yu, H.; Xu, L.; Li, D.; Chen, Y. Deep learning algorithms for sonar imagery analysis and its application in aq-uaculture: A review. IEEE Sens. J. 2023, 23, 28549–28563. [Google Scholar] [CrossRef]
Jakovljevic, G.; Govedarica, M.; Alvarez-Taboada, F. A Deep Learning Model for Automatic Plastic Mapping Using Unmanned Aerial Vehicle (UAV) Data. Remote Sens. 2020, 12, 1515. [Google Scholar] [CrossRef]
Lei, J.; Wang, H.; Lei, Z.; Li, J.; Rong, S. CNN–Transformer Hybrid Architecture for Underwater Sonar Image Segmentation. Remote Sens. 2025, 17, 707. [Google Scholar] [CrossRef]
Nga, Y.Z.; Rymansaib, Z.; Treloar, A.A.; Hunter, A. Automated Recognition of Submerged Body-like Objects in Sonar Images Using Convolutional Neural Networks. Remote Sens. 2024, 16, 4036. [Google Scholar] [CrossRef]
Cheng, L.; Liu, Y.; Zhang, B.; Hu, Z.; Zhu, H.; Luo, B. Direction of Arrival Joint Prediction of Underwater Acoustic Communication Signals Using Faster R-CNN and Frequency–Azimuth Spectrum. Remote Sens. 2024, 16, 2563. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar] [CrossRef]
Klukowski, D.; Lubczonek, J.; Adamski, P. A Method of Simplified Synthetic Objects Creation for Detection of Underwater Objects from Remote Sensing Data Using YOLO Networks. Remote Sens. 2025, 17, 2707. [Google Scholar] [CrossRef]
Zhang, M.; Xu, S.; Song, W.; He, Q.; Wei, Q. Lightweight Underwater Object Detection Based on YOLO v4 and Multi-Scale Attentional Feature Fusion. Remote Sens. 2021, 13, 4706. [Google Scholar] [CrossRef]
Xu, Z.; Wang, R.; Cao, T.; Guo, W.; Shi, B.; Ge, Q. AquaPile-YOLO: Pioneering Underwater Pile Foundation Detection with Forward-Looking Sonar Image Processing. Remote Sens. 2025, 17, 360. [Google Scholar] [CrossRef]
Qu, S.; Cui, C.; Duan, J.; Lu, Y.; Pang, Z. Underwater small target detection under YOLOv8-LA model. Sci. Rep. 2024, 14, 16108. [Google Scholar] [CrossRef]
González-Sabbagh, S.P.; Robles-Kelly, A. A survey on underwater computer vision. ACM Comput. Surv. 2023, 55, 268. [Google Scholar] [CrossRef]
Min, L.; Dou, F.; Zhang, Y.; Shao, D.; Li, L.; Wang, B. CM-YOLO: Context Modulated Representation Learning for Ship Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4202414. [Google Scholar] [CrossRef]
Hu, J.; Li, Y.; Zhi, X.; Shi, T.; Zhang, W. Complementarity-Aware Feature Fusion for Aircraft Detection via Unpaired Opt2SAR Image Translation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5628019. [Google Scholar] [CrossRef]
Yu, X.; Liu, Y.; Hu, H.; Li, X.; Bo, M.; Zhang, D.; Zhou, Z. DMFI-YOLO: Dynamic multi-scale feature interaction for enhanced underwater object detection based on YOLO. Multimed. Syst. 2025, 31, 258. [Google Scholar] [CrossRef]
Zheng, Z.; Yu, W. RG-YOLO: Multi-scale feature learning for underwater target detection. Multimed. Syst. 2025, 31, 26. [Google Scholar] [CrossRef]
Lyu, L.; Liu, Y.; Xu, X.; Yan, P.; Zhang, J. EFP-YOLO: A quantitative detection algorithm for marine benthic organisms. Ocean Coast. Manag. 2023, 243, 106770. [Google Scholar] [CrossRef]
Samal, A. An Image Dataset of Plastics Found Underwater for Detection & Monitoring. Kaggle. 2021. Available online: https://www.kaggle.com/datasets/arnavs19/underwater-plastic-pollution-detection/data (accessed on 27 November 2025).
Dadboud, F.; Patel, V.; Mehta, V.; Bolic, M.; Mantegh, I. Single-stage UAV detection and classification with YOLOv5: Mosaic data augmentation and PANet. In Proceedings of the 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Washington, DC, USA, 16–19 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–8. [Google Scholar]
Wang, Y.; Song, W.; Fortino, G.; Qi, L.-Z.; Zhang, W.; Liotta, A. An experimental-based review of image enhancement and image restoration methods for underwater imaging. IEEE Access 2019, 7, 140233–140251. [Google Scholar] [CrossRef]
Zhou, W.; Zheng, F.; Yin, G.; Pang, Y.; Yi, J. Yolotrashcan: A deep learning marine debris detection network. IEEE Trans. Instrum. Meas. 2022, 72, 5002012. [Google Scholar] [CrossRef]
Xiao, Y.; Lepetit, V.; Marlet, R. Few-shot Object Detection and Viewpoint Estimation for Objects in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3090–3106. [Google Scholar] [CrossRef]
Katayama, T.; Song, T.; Shimamoto, T.; Jiang, X. GAN-based Color Correction for Underwater Object Detection. In Proceedings of the OCEANS 2019 MTS/IEEE SEATTLE, Seattle, WA, USA, 27–31 October 2019. [Google Scholar] [CrossRef]
Lucas, E.; Awad, A.; Geglio, A.; Moradi, S.; Saleem, A.; Havens, T.; Galloway, A.; Paheding, S. Underwater Image Enhancement and Object Detection: Are Poor Object Detection Results On Enhanced Images Due to Missing Human Labels? In Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops, Tucson, AZ, USA, 28 February–4 March 2025; CVF Open Access. pp. 1520–1525. [Google Scholar]
Qing, Y.; Wang, Y.; Yan, H.; Xie, X.; Wu, Z. Unformer: A Transformer-Based Approach for Adaptive Multiscale Feature Aggregation in Underwater Image Enhancement. IEEE Trans. Artif. Intell. 2025, 6, 1024–1037. [Google Scholar] [CrossRef]
Wang, B.; Xu, H.; Jiang, G.; Yu, M.; Ren, T.; Luo, T.; Zhu, Z. UIE-Convformer: Underwater Image Enhancement Based on Convolution and Feature Fusion Transformer. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 1952–1968. [Google Scholar] [CrossRef]
Shen, Z.; Xu, H.; Luo, T.; Song, Y.; He, Z. UDAformer: Underwater image enhancement based on dual attention transformer. Comput. Graph. 2023, 111, 77–88. [Google Scholar] [CrossRef]
Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.-H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 105. [Google Scholar] [CrossRef]
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems 33, Virtual, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Virtual, 18–24 July 2021; pp. 8162–8171. Available online: https://proceedings.mlr.press/v139/nichol21a.html (accessed on 18 February 2021).
Bansal, A.; Borgnia, E.; Chu, H.M.; Li, J.; Kazemi, H.; Huang, F.; Goldblum, M.; Geiping, J.; Goldstein, T. Cold diffusion: Inverting arbitrary image transforms without noise. In Proceedings of the Advances in Neural Information Processing Systems 36, New Orleans, LA, USA, 10–16 December 2023; pp. 41259–41282. [Google Scholar]
Elmezain, M.; Saoud, L.S.; Sultan, A.; Heshmat, M.; Seneviratne, L.; Hussain, I. Advancing underwater vision: A survey of deep learning models for underwater object recognition and tracking. IEEE Access 2025, 13, 17830–17867. [Google Scholar] [CrossRef]
Cao, Z.; Cao, S.; Wu, X.; Hou, J.; Ran, R.; Deng, L.J. DDRF: Denoising diffusion model for remote sensing image fusion. arXiv 2023, arXiv:2304.04774. [Google Scholar] [CrossRef]
Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
Hoffman, M.D.; Johnson, M.J. ELBO surgery: Yet another way to carve up the variational evidence lower bound. In Proceedings of the Workshop in Advances in Approximate Bayesian Inference, NIPS, Barcelona, Spain, 9 December 2016; Volume 1. [Google Scholar]
Yin, D.; Hu, L.; Li, B.; Zhang, Y.; Yang, X. 5% > 100%: Breaking performance shackles of full fine-tuning on visual recognition tasks. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 20071–20081. [Google Scholar]
Flusser, J.; Farokhi, S.; Hoschl, C.; Suk, T.; Zitova, B.; Pedone, M. Recognition of images degraded by gaussian blur. IEEE Trans. Image Process. 2015, 25, 790–806. [Google Scholar] [CrossRef] [PubMed]
Shen, Y.; Zhang, F.; Liu, D.; Pu, W.; Zhang, Q. Manhattan-distance IOU loss for fast and accurate bounding box regression and object detection. Neurocomputing 2022, 500, 99–114. [Google Scholar] [CrossRef]
Wang, X.; Song, J. ICIoU: Improved loss based on complete intersection over union for bounding box regression. IEEE Access 2021, 9, 105686–105695. [Google Scholar] [CrossRef]
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar] [CrossRef]
Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar] [CrossRef]
Li, H.; Zhao, F.; Xue, F.; Wang, J.; Liu, Y.; Chen, Y.; Wu, Q.; Tao, J.; Zhang, G.; Xi, D.; et al. Succulent-YOLO: Smart UAV-Assisted Succulent Farmland Monitoring with CLIP-Based YOLOv10 and Mamba Computer Vision. Remote Sens. 2025, 17, 2219. [Google Scholar] [CrossRef]
Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef]

Figure 1. Framework of underwater debris detection.

Figure 2. Examples from the underwater plastic pollution detection dataset.

Figure 3. The framework of the UDD-YOLO for underwater debris detection. The system comprises two main stages: (1) a Cold Diffusion module. (2) an enhanced YOLOv12n detector featuring the AMC2f module and UIoU loss.

Figure 4. Examples of Network architecture of the denoising diffusion probabilistic model.

Figure 5. Denoising Diffusion Probabilistic Model (DDPM) forward process (left to right) and reverse process (right to left).

Figure 6. Architecture of the AMC2f module. Here, self.gamma is a learnable parameter, while other architectural choices, including kernel sizes, are fixed parameters.

Figure 7. Ablation study results showing the impact of different components.

Figure 8. Detection results are compared among different models.

Figure 9. Robustness evaluation of object detection models under various image degradations. (a) Brightness variation. (b) Gaussian blur. (c) Scattering and haze. (d) Lens distortion.

Table 1. Experimental setup.

Parameter	Value
Operating system	Ubuntu 22.04
RAM	120 GB
GPU	NVIDIA GeForce RTX 4090
GPU memory	24 GB
CPU	16-core Intel(R) Xeon(R) Gold 6430
Framework	PyTorch 2.10
CUDA version	12.1
Python version	3.10

Table 2. Training experimental setup for all detectors.

Hyperparameter	Value
Input resolution	640 × 640
Training epochs	200
Batch size	32
Data-loading workers	16
Optimizer	Stochastic Gradient Descent (SGD)
Initial learning rate	0.01
Momentum	0.937
Weight decay	0.0005

Table 3. Performance comparison of proposed components in ablation studies.

Model	①	②	③	P (%)	R (%)	mAP50 (%)	Parameters	Size (MB)
YOLOv12n	×	×	×	80.8	68.0	76.8	2,559,653	5.5
YOLOv12n-1	√	×	×	78.8	71.4	77.9	2,559,653	5.5
YOLOv12n-2	√	√	×	82.4	77.3	81.0	2,556,037	5.5
UDD-YOLO	√	√	√	87.3	75.1	81.8	2,556,037	5.5

A tick mark (√) shows that the module was applied, while a cross (×) means the module was not applied.

Table 4. Comparison of detection model performance on the test dataset.

Model	P (%)	R (%)	mAP50 (%)	Parameters	Size (MB)	FPS	FLOPs (G)
Faster R-CNN	75.6	73.0	75.5	28,418,758	108.7	2.7	342.8
MobileNetv2-SSD	54.5	53.1	54.3	5,405,452	21.4	58.6	197.3
RT-DETR-L	74.5	72.6	75.1	25,669,405	53.6	210.9	57.9
Mamba-YOLO-T	83.0	70.3	76.5	5,986,933	12.3	115.8	92.4
YOLOv3tiny	78.5	60.6	65.5	12,135,374	24.4	379.4	17.6
YOLOv5n	76.6	70.8	75.6	2,505,869	5.3	218.7	21.4
YOLOv8n	86.2	68.7	78.2	3,008,573	6.3	227.6	9.1
YOLOv9t	80.0	67.8	76.2	1,973,709	4.7	197.4	10.8
YOLOv10n	76.1	68.9	74.0	2,268,093	5.8	185.2	11.2
YOLOv11n	81.3	70.0	76.9	2,585,077	5.5	342.7	7.2
YOLOv12n	80.8	68.0	76.8	2,556,923	5.5	247.9	9.7
UDD-YOLO	87.3	75.1	81.8	2,556,037	5.5	232.1	10.2

Table 5. Comparison of detection performance with different enhancement methods.

Model	P (%)	R (%)	mAP50 (%)	mAP@0.5:0.95 (%)
YOLOv12n-2 (Raw)	80.9	75.3	80.1	60.1
YOLOv12n-2 + HE	81.1	74.1	80.3	58.3
YOLOv12n-2 + CLAHE	83.6	74.6	80.8	61.2
YOLOv12n-2 + DCP	82.4	76.3	81.0	60.3
YOLOv12n-2 + StyleGAN3	86.5	75.6	81.5	62.1
UDD-YOLO	87.3	75.1	81.8	62.3

Table 6. Comparison of detection performance with different loss functions.

Model	P (%)	R (%)	mAP50 (%)	mAP@0.5:0.95 (%)
UDD-YOLO with CIoU	83.9	74.0	81.1	60.9
UDD-YOLO with GIoU	84.1	74.7	81.3	60.3
UDD-YOLO with SIoU	85.6	74.4	80.7	61.8
UDD-YOLO	87.3	75.1	81.8	62.3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Tao, J.; Zhao, F.; Chen, Y.; Liu, Y.; Xue, F.; Song, J.; Wu, H.; Chen, J.; Li, P.; Xu, N. Diffusion-Enhanced Underwater Debris Detection via Improved YOLOv12n Framework. Remote Sens. 2025, 17, 3910. https://doi.org/10.3390/rs17233910

AMA Style

Tao J, Zhao F, Chen Y, Liu Y, Xue F, Song J, Wu H, Chen J, Li P, Xu N. Diffusion-Enhanced Underwater Debris Detection via Improved YOLOv12n Framework. Remote Sensing. 2025; 17(23):3910. https://doi.org/10.3390/rs17233910

Chicago/Turabian Style

Tao, Jianghan, Fan Zhao, Yijia Chen, Yongying Liu, Feng Xue, Jian Song, Hao Wu, Jundong Chen, Peiran Li, and Nan Xu. 2025. "Diffusion-Enhanced Underwater Debris Detection via Improved YOLOv12n Framework" Remote Sensing 17, no. 23: 3910. https://doi.org/10.3390/rs17233910

APA Style

Tao, J., Zhao, F., Chen, Y., Liu, Y., Xue, F., Song, J., Wu, H., Chen, J., Li, P., & Xu, N. (2025). Diffusion-Enhanced Underwater Debris Detection via Improved YOLOv12n Framework. Remote Sensing, 17(23), 3910. https://doi.org/10.3390/rs17233910

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Diffusion-Enhanced Underwater Debris Detection via Improved YOLOv12n Framework

Highlights

Abstract

1. Introduction

2. Materials and Methods

2.1. Dataset

2.2. Model Architecture

2.2.1. Cold Diffusion Model

Foundational Theory: Denoising Diffusion Probabilistic Models (DDPM)

Cold Diffusion Module for Underwater Restoration

2.2.2. AMC2f Module

2.2.3. Unified-IoU (UIoU) Loss Function

2.3. Network Training and Optimization

2.4. Evaluation Metrics for Object Detection

3. Results

3.1. Ablation Experiments

3.2. Comparison with Different Detectors

3.3. Comparison with Different Image Enhancement Methods

3.4. Effect of Different IoU-Based Regression Losses

4. Discussion

4.1. Key Modifications Contributing to Performance Improvement

4.2. Robustness Against Degraded Input Data

4.2.1. Brightness Perturbation

4.2.2. Gaussian Blur Perturbation

4.2.3. Scattering and Haze Simulation

4.2.4. Lens Distortion

4.3. Generalizability and Transferability

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI