Physics-Guided Temporal Underwater Image Enhancement Using Implicit Neural Representations and Diffusion Models

Zhang, Fubin; Zhang, Zichi; Zhang, Feihu; Tian, Xinbo

doi:10.3390/jmse14090798


¹ School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China

² No.1 Test and Training Area, Army Test and Training Base, Baicheng 137001, China

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(9), 798; https://doi.org/10.3390/jmse14090798

Submission received: 26 March 2026 / Revised: 23 April 2026 / Accepted: 24 April 2026 / Published: 27 April 2026

(This article belongs to the Section Ocean Engineering)


Abstract

Underwater image enhancement is affected by light absorption, scattering, and non-uniform illumination, typically resulting in color shifts, contrast reduction, and blurred details. Existing diffusion-based underwater image enhancement methods possess strong generative capabilities but face three major issues: lack of explicit modeling of the underwater imaging process, unstable conditional input under complex degradation, and insufficient temporal consistency in consecutive frames. To address these challenges, this work proposes a physics-guided temporal diffusion framework for underwater image enhancement, PG-TIE, which integrates physics-guided priors, implicit reconstruction, and temporal diffusion within a unified framework. Specifically, we first design a physics-guided prior generation module (PGPG) to explicitly estimate transmission maps and background light, providing interpretable guidance for subsequent restoration. Next, the implicit neural reconstruction branch (INTR) constructs stable conditional features to reduce diffusion difficulty. Finally, the temporal physics-guided diffusion Transformer branch (TPDT) incorporates physics priors, conditional features, and temporal memory to enhance both single-frame quality and consistency across consecutive frames. Experimental results on the UIEBD, LSUI, and U45 datasets demonstrate that PG-TIE achieves superior performance across reference and non-reference metrics. On UIEBD, it achieves a PSNR of 24.14 dB, SSIM of 0.8905, LPIPS of 0.1015, and FID of 24.32; on LSUI, PSNR 27.93 dB, SSIM 0.9521, LPIPS 0.1007, and FID 26.38; and on U45, URanker 2.812, MUSIQ 52.152, UCIQE 0.606, UIQM 3.176, and NIQE 3.897. Additionally, PG-TIE achieves a tLPIPS, Warping Error, and Flicker Index of 0.031, 0.284, and 4.72, respectively, validating its temporal stability in consecutive frames. Ablation studies further confirm the effectiveness of PGPG, INTR, TPDT, and their key components.

Keywords:

underwater image enhancement; diffusion model; physics-guided prior; implicit neural reconstruction; temporal diffusion

1. Introduction

Underwater image enhancement (UIE) is an important preprocessing step in underwater visual perception and is widely used in marine robotics, underwater object detection, ecological monitoring, and underwater surveys. Due to light absorption and scattering in water, underwater images usually suffer from color casts, low contrast, blurred details, and spatially non-uniform degradation, which severely affect subsequent visual tasks. Akkaynak and Treibitz showed that underwater image degradation has more complex wavelength dependence and spatial variation than atmospheric scenes [1]. Based on this, Akkaynak et al. proposed Sea-thru, which restores color by explicitly estimating underwater medium parameters, demonstrating the importance of physics-based restoration in underwater vision [2]. In addition to imaging-oriented restoration studies, recent investigations in underwater and air–water visible light propagation further highlight the importance of medium-dependent attenuation, wavelength selectivity, and environment-specific path-loss behavior in aquatic optical systems. Almonacid et al. [3] analyzed the path-loss performance of underwater visible light communication schemes under several water conditions, while Martínez et al. [4] further examined the performance of an air–water visible light communication system across different transmission environments. Although these works are not directly focused on underwater image enhancement, they provide additional physical evidence that optical propagation in water is strongly affected by medium properties and interface conditions, which further supports the need for physics-guided modeling in underwater image restoration.

Early UIE methods mainly relied on imaging models and handcrafted priors. Peng et al. performed restoration using blur and illumination cues [5]. Ancuti et al. proposed multi-scale fusion and later combined it with color balance to improve image naturalness and contrast [6,7]. Although these methods are interpretable and relatively simple, they often depend on strong assumptions such as uniform water or simplified scattering, which limits their performance in complex real scenes and may cause inaccurate color correction, over-enhancement, and insufficient detail recovery [2,5].

With the development of deep learning, many end-to-end UIE methods have been proposed. Li et al. introduced the UIEBD dataset, providing an important benchmark for data-driven underwater enhancement [8]. Wang et al. proposed UIEC²-Net based on dual color-space modeling [9], and Li et al. further improved restoration through medium transmission-guided multi-color-space embedding [10]. Transformer-based methods such as U-Shape Transformer, information selection Transformer, UMCTN, and LGT further enhanced feature representation and restoration quality [11,12,13,14].

Recently, diffusion models have shown strong potential in image restoration and have gradually been introduced into underwater image enhancement (UIE). Tang et al. combined Transformer architectures with diffusion models for underwater enhancement [15]. FDNet further improved restoration quality through Fourier guidance and dual-path diffusion modeling [16]. Other studies explored frequency-domain diffusion and large-scale generative restoration, demonstrating the effectiveness of diffusion-based approaches in complex underwater scenarios [17,18,19,20,21]. In addition, recent cross-domain studies in intelligent engineering systems have further highlighted the importance of temporal dependency modeling and hybrid feature learning for robust perception under complex conditions. For example, a TCN-based predictive maintenance framework demonstrated that temporal modeling can effectively capture discriminative patterns in degraded and dynamic operating environments, while a hybrid-feature-pool-based deep learning framework showed that multi-source feature aggregation can improve robustness in challenging scenarios [22,23,24,25]. Although these studies are not directly designed for underwater image enhancement, they provide useful methodological insights for constructing more stable conditional representations and temporally consistent restoration frameworks. However, existing diffusion-based UIE methods still have three major limitations. First, they often insufficiently incorporate explicit physical priors such as transmission maps and background light. Second, they commonly use degraded images or shallow features as conditions, which may be unstable under complex degradation. Third, they rarely consider temporal consistency across consecutive frames, which may lead to color shifts and texture flicker in continuous underwater scenes [15,16,17,20]. To address these issues, we propose PG-TIE, a physics-guided temporal diffusion framework for underwater image enhancement.

The framework unifies physical prior generation, implicit condition reconstruction, and temporal diffusion modeling to improve both restoration quality and temporal coherence. Specifically, the Physics-Guided Prior Generation (PGPG) module estimates transmission maps and global background light to provide interpretable physical priors. The Implicit Neural Transformer Reconstruction (INTR) branch reconstructs conditional images through implicit neural representations, improving the stability of diffusion inputs. The Temporal Physics-Guided Diffusion Transformer (TPDT) branch further integrates physical priors, condition representations, and temporal memory into the diffusion process, enhancing both single-frame quality and cross-frame consistency.

Compared with existing methods, the novelty of PG-TIE does not lie in a simple accumulation of known modules, but in a unified formulation targeted at three unresolved issues in diffusion-based underwater image enhancement: insufficient physical interpretability, unstable conditioning under severe degradation, and weak temporal coherence across consecutive frames (see Figure 1). In particular, PGPG introduces explicit transmission and background-light estimation as restoration-oriented priors, INTR converts degraded observations and physical cues into a more stable residual condition, and TPDT integrates temporal memory into the denoising pathway rather than treating temporal information as an external post-processing cue. This design makes the proposed framework methodologically different from prior physics-guided or single-frame diffusion models and enables PG-TIE to improve both restoration quality and cross-frame stability within one architecture.

The main contributions of this work are as follows:

We formulate underwater image enhancement as a jointly constrained diffusion restoration problem that explicitly couples physical prior generation, condition stabilization, and temporal-consistency modeling within a unified framework.
We propose a physics-guided conditional construction strategy composed of PGPG and INTR, where transmission/background-light priors are first estimated and then transformed into a stable residual condition for diffusion restoration.
We develop TPDT, a temporal physics-guided diffusion Transformer that embeds temporal memory and multi-frequency interaction into the denoising process, thereby improving not only single-frame enhancement quality but also cross-frame consistency in continuous underwater scenes.

2. Related Work

The core objective of underwater image enhancement is to recover true color, clear texture, and structural information from images captured in complex aquatic environments. Owing to light absorption and scattering, underwater images often suffer from color casts, low contrast, blurred details, and spatially non-uniform degradation. Existing methods can generally be divided into four categories: traditional physical model-based methods, deep learning-based methods, diffusion model-based methods, and physics-guided diffusion methods.

2.1. Underwater Image Enhancement Based on Traditional Physical Models

Traditional UIE methods are mainly derived from underwater light propagation and image formation models. Akkaynak and Treibitz revisited underwater image formation from the perspective of attenuation coefficient space, revealing the fundamental differences between underwater and atmospheric degradation and providing an important theoretical basis for underwater restoration [1]. Based on this, Akkaynak et al. proposed Sea-thru, which restores natural color by estimating scene distance and medium parameters [2]. In addition, Peng et al. improved image quality using blur and illumination cues [5], while Ancuti et al. enhanced underwater images and videos through multi-scale fusion and color balance strategies [6,7].

These methods are interpretable and physically meaningful, but they usually rely on strong assumptions such as uniform water and simplified scattering conditions, which limits their generalization in complex real underwater scenes [1,2].

2.2. Deep Learning-Based Underwater Image Enhancement Methods

With the development of deep learning, UIE has gradually shifted from physics-based inversion to data-driven restoration. Li et al. established the UIEB benchmark, which significantly promoted the development of learning-based underwater enhancement [8]. On this basis, Wang et al. proposed UIEC²-Net based on dual color-space modeling [9], and Li et al. further introduced medium transmission-guided multi-color-space embedding [10]. These methods show that deep networks can effectively improve color correction, detail recovery, and perceptual quality.

Transformer architectures have also been widely applied in UIE. Representative methods include U-Shape Transformer [11], information selection Transformer [12], UMCTN, LGT, and CDF-UIE [13,14,26]. In addition, PhISH-Net and PUGAN indicate that incorporating physics-inspired mechanisms into deep models can further improve restoration quality and interpretability [27,28].

However, most deep learning-based UIE methods mainly focus on single-frame enhancement and pay limited attention to temporal consistency. Moreover, their explicit modeling of underwater degradation mechanisms remains insufficient, which may result in unstable restoration, local artifacts, or visually unnatural outputs in complex scenarios [9,11,12].

2.3. Diffusion Model-Based Underwater Image Enhancement Methods

Tang et al. proposed a Transformer-based diffusion model for underwater enhancement [15]. Zhu et al. introduced FDNet, which improves restoration quality through Fourier-guided dual-path diffusion modeling [16]. Song et al. further explored frequency-domain latent diffusion [17], while Zhao et al. proposed a diffusion adjustment framework based on wavelet–Fourier interaction [18]. In addition, SeaDiff, DATDM, and CLIP-guided diffusion methods further expanded the use of diffusion models in underwater enhancement [19,20,29,30].

The main advantage of diffusion models lies in their powerful distribution modeling ability and progressive restoration process. Nevertheless, current diffusion-based UIE methods still suffer from two key limitations. First, they generally lack explicit modeling of physical priors such as transmission and background light, making restoration rely mainly on statistical correlations rather than degradation mechanisms [15,16,17]. Second, they are mostly designed for single-frame enhancement and insufficiently consider temporal dependency across consecutive frames, which may lead to brightness fluctuations, color drifting, and texture flickering in video enhancement [19,20].

2.4. Physics-Guided Diffusion Enhancement Methods

To improve the physical interpretability of diffusion-based restoration, recent studies have introduced physical priors into learning and diffusion frameworks. Pham et al. proposed a physics-prior-based deep unfolding network and later extended it to a prior-learning deep unfolding framework, demonstrating that physical priors can provide stable constraints under complex degradation [31,32]. Zhao et al. further proposed a physics-aware diffusion model to strengthen physical modeling in the restoration process [33]. These studies indicate that physical guidance can improve the rationality and generalization of underwater enhancement results.

However, existing physics-guided diffusion methods still have notable limitations. Most methods simply use physical priors as additional conditions while paying insufficient attention to the robustness of the conditional representations themselves. When the input is severely degraded, unstable condition representations may still reduce restoration quality. In addition, most existing studies remain focused on single-frame enhancement and lack effective modeling of temporal dependency and cross-frame information propagation, making it difficult to suppress flickering and color drifting in continuous scenes [31,32,33].

Based on the above analysis, we propose PG-TIE, which unifies physical prior generation, implicit condition reconstruction, and temporal diffusion modeling within a single framework. Compared with existing physics-guided diffusion methods, our method emphasizes the joint modeling of explicit physical priors, robust condition representations, and temporal consistency. Specifically, PGPG estimates transmission maps and background light, INTR reconstructs robust conditional representations, and TPDT integrates physical guidance, temporal memory, and diffusion restoration to improve both single-frame quality and cross-frame stability.

In summary, existing UIE research has evolved from traditional physical modeling to deep learning-based enhancement, diffusion-based restoration, and physics-guided diffusion modeling. However, important challenges remain in physical prior utilization, condition robustness, and temporal consistency. PG-TIE is proposed in this context to provide a unified framework with stronger physical interpretability, generative capability, and temporal stability.

3. Method

In this paper, we propose a physics-guided temporal diffusion framework for underwater image enhancement, termed PG-TIE (Physics-Guided Temporal Image Enhancement). The proposed method takes the underwater image formation model as a prior constraint, enhances conditional reconstruction capability through implicit neural representations, and models dynamic consistency across consecutive frames via a temporal physics-guided diffusion Transformer [34,35,36,37]. The overall framework consists of three core components: the Physics-Guided Prior Generation module (PGPG), the Implicit Neural Transformer Reconstruction branch (INTR), and the Temporal Physics-Guided Diffusion Transformer branch (TPDT). These three components work collaboratively, enabling the model to simultaneously improve color restoration, structure reconstruction, perceptual quality, and temporal stability under complex underwater degradation conditions [33,38].

3.1. Overall Framework

The framework consists of the Physics-Guided Prior Generation module (PGPG), the Implicit Neural Transformer Reconstruction branch (INTR), and the Temporal Physics-Guided Diffusion Transformer branch (TPDT). PGPG estimates physical priors such as the transmission map and background light, INTR reconstructs robust conditional representations, and TPDT generates the final enhanced result under the joint constraints of physical guidance and temporal memory.

As shown in Figure 2, given the current degraded underwater image $I_t$ and the previous-frame image $I_{t-1}$, the PGPG module first estimates the physical priors associated with the underwater imaging process, including the transmission map $m_t$ and the global background light $G_t$. Then, the INTR branch constructs a robust conditional reconstruction result $x_c$ from the input image and the estimated physical priors, which is used as the explicit conditional input for diffusion restoration. Finally, under the guidance of time step $t$, the TPDT branch combines the current-frame features, previous-frame memory, physics-guided features, and conditional image information to generate the final enhanced image $\hat{J}_t$.

Different from conventional single-frame enhancement networks, PG-TIE is designed to jointly model physical consistency, conditional representation capability, and temporal consistency. Specifically, PGPG provides interpretable physical priors, enabling the network to perceive the degradation degree and background scattering intensity in different regions; INTR improves the robustness of condition representations through implicit neural representations, thereby reducing the difficulty of diffusion restoration; TPDT further suppresses flickering and color drifting across consecutive frames through explicit temporal memory and multi-frequency interaction mechanisms while improving single-frame restoration quality [38,39].

For notational convenience, the set of physical priors generated by PGPG is denoted as

$$P_t = \{m_t, G_t\} \tag{1}$$

The conditional image generated by INTR is denoted as $x_c$, and the final enhanced image produced by TPDT is denoted as $\hat{J}_t$. Accordingly, the overall framework can be formulated as

$$\hat{J}_t = F_{TPDT}(I_t, I_{t-1}, x_c, P_t) \tag{2}$$

3.2. Physics-Guided Prior Generation Module (PGPG)

From the perspective of optical imaging, an underwater image can be represented by a revised Koschmieder image formation model [40,41] as follows:

$$I^c(x) = J^c(x) \cdot m^c(x) + (1 - m^c(x)) \cdot G^c \tag{3}$$

where $c \in \{R, G, B\}$ denotes the color channel, $x$ denotes the pixel location, $I^c$ denotes the observed degraded underwater image, $J^c$ denotes the latent clear image, $m^c$ denotes the transmission map, and $G^c$ denotes the global background light. The objective of PGPG is to estimate $m^c$ and $G^c$ from the input image, thereby providing interpretable physical priors for the subsequent restoration process [42,43].
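To make the roles of the symbols in Equation (3) concrete, the following minimal NumPy sketch applies the formation model to a synthetic clear image and then solves it algebraically for $J$. The function names `degrade` and `invert`, the clamp value `m_min`, and the sample background-light values are illustrative assumptions, not part of PG-TIE. The closed-form inversion is shown only to illustrate the model; PG-TIE itself restores the image with a diffusion branch rather than by direct inversion.

```python
import numpy as np

def degrade(J, m, G):
    """Equation (3): I = J * m + (1 - m) * G, applied per channel.

    J: clear image, shape (H, W, 3), values in [0, 1]
    m: transmission map, shape (H, W, 3), values in (0, 1]
    G: global background light, shape (3,)
    """
    return J * m + (1.0 - m) * G

def invert(I, m, G, m_min=0.05):
    """Solve Equation (3) for J; m is clamped to avoid dividing by values near zero."""
    m = np.maximum(m, m_min)
    return (I - (1.0 - m) * G) / m

rng = np.random.default_rng(0)
J = rng.uniform(0.0, 1.0, size=(4, 4, 3))     # synthetic clear image
m = rng.uniform(0.3, 0.9, size=(4, 4, 3))     # synthetic transmission map
G = np.array([0.1, 0.5, 0.6])                 # illustrative blue-green background light

I = degrade(J, m, G)        # simulated degraded observation
J_rec = invert(I, m, G)     # recovers J when m and G are known exactly
```

With exact $m$ and $G$ the inversion is lossless, which is why the accuracy of the estimated priors matters for any downstream restoration.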

In this work, PGPG is further decomposed into three internal components: MC, GC, and GBB [31,32,42]. Specifically, MC and GC correspond to the transmission estimation branch and the background-light estimation branch, respectively. Both adopt depthwise separable convolution structures to accomplish spatial information extraction and channel information fusion with relatively low computational cost [43,44]. GBB is used for global guidance modeling in the background-light estimation branch. It performs Gaussian blur preprocessing on the input image to suppress local detail interference and highlight global illumination characteristics [41,45].

Accordingly, the two physical estimation branches of PGPG can be formulated as

$$m = M_{MC}(I), \quad G = G_{GC}(GBB(I)) \tag{4}$$

Here, $M_{MC}(\cdot)$ denotes the transmission estimation subnet constructed by MC, and $G_{GC}(\cdot)$ denotes the background-light estimation subnet constructed by GC. Since the background light mainly reflects the global degradation of the image rather than local texture content, introducing GBB in the background-light branch is reasonable. It effectively filters local edges and noise that may interfere with background-light estimation [43,45].

During the training phase, given an input image $I$ and its corresponding reference image $GT$, the estimated physical priors can be used to reconstruct the degraded image as

$$\tilde{I} = GT \cdot m + (1 - m) \cdot G \tag{5}$$

Due to the absence of direct supervision for $m$ and $G$, PGPG is optimized using image-level reconstruction supervision. Specifically, the pixel reconstruction loss is defined as

$$\mathcal{L}_{pix}^{PG} = \| \tilde{I} - I \|_1 \tag{6}$$

and the perceptual loss is defined as

$$\mathcal{L}_{per}^{PG} = \| \psi(\tilde{I}) - \psi(I) \|_2 \tag{7}$$

where $\psi(\cdot)$ denotes the pretrained perceptual feature extraction network. Accordingly, the total loss of PGPG is defined as

$$\mathcal{L}_{PGPG} = \lambda_{pix}^{1} \mathcal{L}_{pix}^{PG} + \lambda_{per}^{2} \mathcal{L}_{per}^{PG} \tag{8}$$

The output of PGPG not only provides interpretable physical priors for the subsequent network, but also captures the variations in scattering strength and background light across different regions under complex degradation conditions, forming the physical model starting point for the entire PG-TIE framework [31,32,42].
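The indirect supervision in Equations (5) and (6) can be sketched as follows: the reference image is re-degraded with the estimated priors and compared against the observed input. This is a minimal illustration, assuming a mean-reduced L1 norm and synthetic data; the name `pgpg_pixel_loss` is hypothetical, and the perceptual term of Equation (7) is omitted because it requires a pretrained network.

```python
import numpy as np

def pgpg_pixel_loss(I, GT, m, G):
    """Equations (5)-(6): re-degrade the reference with (m, G), compare with the input."""
    I_tilde = GT * m + (1.0 - m) * G
    return np.abs(I_tilde - I).mean()   # mean-reduced L1 (an assumed reduction)

rng = np.random.default_rng(1)
GT = rng.uniform(size=(8, 8, 3))                   # synthetic reference image
m_true = rng.uniform(0.3, 0.9, size=(8, 8, 3))     # synthetic ground-truth transmission
G_true = np.array([0.1, 0.5, 0.6])                 # synthetic background light
I = GT * m_true + (1.0 - m_true) * G_true          # observed degraded input

loss_true = pgpg_pixel_loss(I, GT, m_true, G_true)
loss_wrong = pgpg_pixel_loss(I, GT, np.full_like(m_true, 0.5), G_true)
```

The loss vanishes only when the estimated priors reproduce the observed degradation, so minimizing it pushes the two branches toward physically plausible $m$ and $G$ without ever requiring ground-truth prior maps.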

3.3. Implicit Neural Transformer Reconstruction Branch (INTR)

Although physical priors can provide strong constraints for image restoration, relying solely on degraded inputs and simple conditional mappings is insufficient to fully handle the non-uniform blur, local color shifts, and structural deficits present in complex underwater images. Therefore, we introduce the Implicit Neural Transformer Reconstruction branch (INTR) to learn more stable continuous condition representations, providing more expressive conditional inputs for diffusion restoration.

INTR adopts the implicit neural representation paradigm, treating the image as a continuous mapping from spatial coordinates to RGB color values as follows:

$$F_\theta : \mathbb{R}^2 \to \mathbb{R}^3 \tag{9}$$

Specifically, a visual encoder $E(\cdot)$ first extracts features from the input image $I$ to obtain the feature map $Z = E(I)$, while the pixel coordinates $P = (x, y)$ are mapped to a high-dimensional space through a Fourier positional encoding function $\Gamma(P)$ to enhance the network’s ability to capture high-frequency details as follows:

$$P' = \Gamma(P) = [\sin(2^0 \pi P), \cos(2^0 \pi P), \dots, \sin(2^{L-1} \pi P), \cos(2^{L-1} \pi P)] \tag{10}$$
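Equation (10) can be written directly in NumPy. The sketch below is illustrative only: the function name `fourier_encode`, the choice `L=4`, and the normalization of coordinates to $[-1, 1]$ are assumptions, since the text does not specify them.

```python
import numpy as np

def fourier_encode(P, L=4):
    """Equation (10). P: (N, 2) coordinates; returns (N, 4 * L) features.

    For each frequency 2^k * pi, k = 0..L-1, the sine and cosine of both
    coordinates are appended in the order given in Equation (10).
    """
    P = np.asarray(P, dtype=np.float64)
    feats = []
    for k in range(L):
        angle = (2.0 ** k) * np.pi * P
        feats.append(np.sin(angle))
        feats.append(np.cos(angle))
    return np.concatenate(feats, axis=-1)

# Pixel centers of a 2 x 2 image, normalized to [-1, 1] (an assumed convention).
coords = np.array([[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]])
enc = fourier_encode(coords, L=4)   # shape (4, 16)
```

Each 2-D coordinate expands to $4L$ values, and higher $k$ gives faster-varying components, which is what lets a small MLP represent fine detail.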

Next, the visual features $E(I)$, the positional encoding $P'$, and the physical priors $\phi(P_t)$ provided by PGPG are fused and input to an MLP for implicit reconstruction to obtain the reconstructed image $\hat{I}$ as follows:

$$\hat{I} = F_\theta(E(I) \oplus P' \oplus \phi(P_t)) \tag{11}$$

where $\oplus$ denotes the concatenation operation, and $\phi(P_t)$ represents the guidance features derived from physical priors.

The training objective of INTR is to make the implicitly reconstructed image $\hat{I}$ as close as possible to the reference image $GT$, and thus its reconstruction loss is defined as

$$\mathcal{L}_{INTR} = \| \hat{I} - GT \|_1 \tag{12}$$

Considering that $\hat{I}$ mainly serves as an intermediate conditional representation rather than the final restored output, we adopt a residual conditional fusion strategy to construct the diffusion condition as follows:

$$x_c = I + \hat{I} \tag{13}$$

The motivation is twofold. First, the degraded input $I$ preserves the original scene layout, low-level structure, and observation constraints under the underwater degradation process. Second, the implicit reconstruction $\hat{I}$ complements the missing color and detail information that is difficult to preserve in the degraded observation alone. Therefore, the residual fusion combines structural consistency from the input image with restorative cues from the implicit reconstruction, yielding a more robust conditional representation than using either component alone. This design is further validated in the ablation study in Section 4.6.3, where it is compared against $\hat{I}$-only conditioning, learned fusion, and priors-only conditioning.
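The data flow of Equations (11) and (13) can be sketched per pixel as below. This is a rough illustration under several assumptions: the encoder features `Z` and prior features `prior` are random stand-ins for $E(I)$ and $\phi(P_t)$, the two-layer ReLU MLP and all layer sizes are arbitrary, and the weights are untrained, so only the shapes and the fusion step are meaningful.

```python
import numpy as np

def mlp(x, weights):
    """Tiny ReLU MLP standing in for F_theta; the final layer is linear."""
    for W, b in weights[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = weights[-1]
    return x @ W + b

rng = np.random.default_rng(0)
H, W_img, C_feat, L = 4, 4, 8, 2

I = rng.uniform(size=(H, W_img, 3))          # degraded input
Z = rng.normal(size=(H, W_img, C_feat))      # stand-in for encoder features E(I)
prior = rng.uniform(size=(H, W_img, 6))      # stand-in for phi(P_t): m (3) + broadcast G (3)

# Normalized pixel coordinates and their Fourier encoding (Equation (10)).
ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W_img), indexing="ij")
P = np.stack([xs, ys], axis=-1)                       # (H, W, 2)
freqs = (2.0 ** np.arange(L)) * np.pi                 # (L,)
ang = P[..., None, :] * freqs[:, None]                # (H, W, L, 2)
pe = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).reshape(H, W_img, -1)

# E(I) (+) P' (+) phi(P_t): concatenate along the channel axis.
inp = np.concatenate([Z, pe, prior], axis=-1)
d_in = inp.shape[-1]
weights = [(rng.normal(scale=0.1, size=(d_in, 32)), np.zeros(32)),
           (rng.normal(scale=0.1, size=(32, 3)), np.zeros(3))]

I_hat = mlp(inp, weights)    # Equation (11); untrained, so values are arbitrary
x_c = I + I_hat              # Equation (13): residual conditional fusion
```

Because the MLP is evaluated independently at every coordinate, the same weights could in principle be queried at any resolution, which is the usual appeal of implicit representations.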

Figure 3. The first row shows the degraded input image, the second row shows the transmission map estimated by PGPG ($m^c$), the third row shows the estimated global background light ($G^c$, visualized using color blocks), and the fourth row shows the implicit reconstruction result output by INTR ($\hat{I}$). It can be observed that PGPG effectively extracts physical priors related to the degradation, while INTR generates clearer and more structurally stable conditional representations based on these priors, providing reliable inputs for the subsequent TPDT diffusion branch.

For a more intuitive demonstration, Figure 3 presents the intermediate results for several sample images. It can be seen that the transmission maps produced by PGPG reflect local variations in degradation strength across different regions, while the background-light estimation captures the overall scattering and illumination trend. Based on these physical priors, the implicit reconstruction output of INTR exhibits more stable color distribution and structural representation than the original degraded input, effectively complementing missing details and semantic information. These condition representations not only preserve the original scene content but also provide more discriminative priors for the TPDT branch, thereby reducing the difficulty of diffusion restoration and improving the final enhancement quality.

3.4. Temporal Physics-Guided Diffusion Transformer Branch (TPDT)

The TPDT branch is the core restoration branch of this work. Its objective is to use physical priors and implicit condition representations as joint guidance for a conditional diffusion model, achieving high-quality underwater image restoration while further improving temporal consistency across consecutive frames. Unlike conventional single-frame diffusion networks, TPDT processes the current frame, the previous frame, the conditional image, and the physical priors simultaneously, thereby modeling the dynamic degradation process in the temporal dimension.

3.4.1. Branch Overview

The core idea of TPDT is to unify physical consistency, temporal consistency, and multi-frequency representation capability within a single diffusion restoration framework. Physical consistency is ensured by the transmission map and background-light priors provided by PGPG, temporal consistency is enforced through previous-frame features and an explicit Memory channel, and multi-frequency representation capability is achieved by MF-MU, which completes cross-frame modulation and alignment across different frequency bands. Overall, TPDT is not a simple addition of an external temporal module to the diffusion model; instead, it deeply integrates temporal memory, physical priors, and the diffusion restoration process into a single feature propagation pathway.

3.4.2. Input and Feature Embedding

At time step $t$, TPDT receives three inputs: the current frame $I_t$, the previous frame $I_{t-1}$, and the conditional image generated by INTR, $x_c$. Each input is first mapped into a unified feature space through a three-layer convolution embedding as follows:

$$F_t = \mathrm{Conv}_3(I_t), \quad F_{t-1} = \mathrm{Conv}_3(I_{t-1}), \quad F_c = \mathrm{Conv}_3(x_c) \tag{14}$$

where $\mathrm{Conv}_3(\cdot)$ denotes the embedding function composed of three convolution layers.

Meanwhile, the physical priors output by PGPG are projected as physics-guided features $G_t = \phi(P_t)$ and embedded with the temporal step encoding $\gamma(t)$, serving as conditional input to TPDT. From this point on, $G_t$ denotes this projected feature rather than the raw background light of Equation (1). In this work, we refer to the path that processes the current frame and the conditional image as the lower (main) branch, and the path that processes the previous-frame temporal memory as the upper branch.

3.4.3. Upper Branch: Multi-Frequency Temporal Memory Construction and Propagation

The upper branch extracts temporal context from the previous frame that is relevant to current-frame restoration and propagates it across layers using a multi-frequency scheme. First, the previous-frame feature $F_{t-1}$ is encoded by a lightweight Transformer encoder to obtain the semantic feature $U_{t-1}$ as follows:

$$U_{t-1} = \mathrm{Enc}(F_{t-1}, G_t, \gamma(t)) \tag{15}$$

Next, $U_{t-1}$ is decomposed into $K$ frequency bands as follows:

$$B_t^k = B_k(U_{t-1}), \quad k = 1, \dots, K \tag{16}$$

where $B_k(\cdot)$ denotes the $k$-th frequency band extraction operator, which can be implemented by learnable filters, DCT, or DFT, to capture dynamic frequency responses at different time steps. Temporal modulation is applied to each sub-band as follows:

$$\tilde{B}_t^k = \alpha_k(\gamma(t)) \odot B_t^k \tag{17}$$

where $\alpha_k(\gamma(t))$ is a dynamic modulation coefficient generated by the temporal embedding, and $\odot$ denotes channel-wise scaling.
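Of the three implementation options named for $B_k(\cdot)$, the DFT variant is the easiest to sketch without learnable parameters. The example below is one possible reading of Equations (16) and (17), assuming radial binary masks with equal-width band edges and a placeholder for $\alpha_k(\gamma(t))$; the function names `split_bands` and `modulate` are hypothetical, and the paper does not state which operator or band layout it actually uses.

```python
import numpy as np

def split_bands(U, K=3):
    """Split a (C, H, W) feature map into K radial frequency bands via the 2-D DFT.

    The K binary masks partition the spectrum, so the bands sum back to U.
    """
    C, H, W = U.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    edges = np.linspace(0.0, radius.max() + 1e-9, K + 1)   # equal-width edges (assumed)
    spec = np.fft.fft2(U, axes=(-2, -1))
    bands = []
    for k in range(K):
        mask = (radius >= edges[k]) & (radius < edges[k + 1])
        bands.append(np.fft.ifft2(spec * mask, axes=(-2, -1)).real)
    return bands

def modulate(bands, alpha):
    """Equation (17): scale band k by its per-channel coefficient alpha[k], shape (C,)."""
    return [a[:, None, None] * B for a, B in zip(alpha, bands)]

rng = np.random.default_rng(0)
U_prev = rng.normal(size=(4, 8, 8))     # stand-in for the encoded feature U_{t-1}
bands = split_bands(U_prev, K=3)
alpha = np.ones((3, 4))                 # placeholder for alpha_k(gamma(t)); ones = identity
mod = modulate(bands, alpha)
```

Because the masks partition the spectrum, setting every coefficient to one reproduces the original feature, so the modulation can only reweight frequency content rather than discard it by construction.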

At the $l$-th layer, all frequency bands are aligned to the current lower-branch features using cross-modal attention to obtain aligned multi-frequency context $A_k^{(l)}$. The memory is updated via a gated residual mechanism as follows:

$$H_k^{(l)} = \rho_k^{(l)} \odot H_k^{(l-1)} + \tau_k^{(l)} \odot A_k^{(l)} \tag{18}$$

where $\rho_k^{(l)}$ and $\tau_k^{(l)}$ are control coefficients generated by the temporal embedding, balancing the historical memory and current injection. Finally, the multi-frequency context is aggregated as follows:

$$C^{(l)} = \psi(\mathrm{Concat}[A_1^{(l)}, \dots, A_K^{(l)}]) \tag{19}$$

where $\psi(\cdot)$ denotes a $1 \times 1$ convolution and channel fusion. This aggregated feature serves as the temporal context input for the $l$-th TPDT block.
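Equations (18) and (19) reduce to simple tensor operations once the aligned context $A_k^{(l)}$ is available. The sketch below takes $A_k^{(l)}$ as given (the cross-attention alignment is not reproduced), assumes the gates $\rho$ and $\tau$ are per-channel vectors, and writes the $1 \times 1$ convolution as a per-pixel linear map over channels. The names `update_memory`, `aggregate`, and `W_fuse` are illustrative.

```python
import numpy as np

def update_memory(H_prev, A, rho, tau):
    """Equation (18) for one band: gated blend of old memory and aligned context."""
    return rho[:, None, None] * H_prev + tau[:, None, None] * A

def aggregate(A_bands, W_fuse):
    """Equation (19): concatenate bands on the channel axis, then a 1x1 convolution.

    A 1x1 convolution is a per-pixel linear map over channels, written as einsum.
    W_fuse has shape (C_out, K * C).
    """
    stacked = np.concatenate(A_bands, axis=0)            # (K*C, H, W)
    return np.einsum("oc,chw->ohw", W_fuse, stacked)

rng = np.random.default_rng(0)
K, C, H, W = 2, 3, 4, 4
A = [rng.normal(size=(C, H, W)) for _ in range(K)]       # aligned context per band (given)
H_prev = [np.zeros((C, H, W)) for _ in range(K)]         # empty memory at the first layer
rho = np.full(C, 0.7)                                    # weight on historical memory
tau = np.full(C, 0.3)                                    # weight on the current injection
H_new = [update_memory(h, a, rho, tau) for h, a in zip(H_prev, A)]

W_fuse = np.concatenate([np.eye(C)] * K, axis=1)         # (C, K*C): simply sums the bands
C_l = aggregate(A, W_fuse)
```

With a learned `W_fuse` the network can weight each band differently per output channel; the identity-stacked matrix used here is just the simplest case to check by hand.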

3.4.4. Lower Branch: Current-Frame Restoration with Physics-Guided Temporal Context

The lower branch is responsible for the current-frame restoration task. Let $S_t^{(l-1)}$ denote the output feature from the $(l-1)$-th layer; then the input to the $l$-th TPDT-Block includes the previous-layer feature $S_t^{(l-1)}$, the multi-frequency context from the upper branch $C^{(l)}$, the physics-guided feature $G_t$, and the temporal embedding $\gamma(t)$. The mapping relationship can be expressed as

$$S_t^{(l)} = B^{(l)}(S_t^{(l-1)}, C^{(l)}, G_t, \gamma(t)) \tag{20}$$

where $B^{(l)}(\cdot)$ represents the $l$-th TPDT-Block. Each block consists of three internal components: PG-SA, CSG-FFN, and PAPM, which are responsible for physics-guided attention aggregation, cross-scale gating, and physics-aware correction, respectively.

3.4.5. Conditional Diffusion Construction and Inference

Within TPDT, we adopt a conditional diffusion framework to generate the enhanced image. Let

z_{0}

denote the target noise-free feature for restoration. The forward diffusion process is defined as

q (z_{t} ∣ z_{0}) = N (z_{t}; \sqrt{{\bar{α}}_{t}} z_{0}, (1 - {\bar{α}}_{t}) I)

(21)

where

{\bar{α}}_{t} = \prod_{i = 1}^{t} α_{i}

is the cumulative noise decay factor. During training, the time step t is sampled uniformly from {1, …, T}, where T is the total number of diffusion steps, and Gaussian noise

ϵ \sim N (0, I)

is added to obtain the noisy feature as follows:

z_{t} = \sqrt{{\bar{α}}_{t}} z_{0} + \sqrt{1 - {\bar{α}}_{t}} ϵ

(22)
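The forward noising step in Equation (22) can be sketched in a few lines. The linear schedule endpoints and T = 1000 are the values reported in Section 4.4; the function name and the 1-based step indexing are assumptions made for this illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)     # linear variance schedule (Section 4.4)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # cumulative product; index t-1 holds step t

def q_sample(z0, t, rng):
    """Eq. (22): draw z_t from q(z_t | z_0) for an integer step t in {1, ..., T}."""
    eps = rng.standard_normal(z0.shape)
    a_bar = alpha_bars[t - 1]
    z_t = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps
    return z_t, eps
```

The returned noise is the regression target of the diffusion loss, and the cumulative factor decreases monotonically with t, so later steps carry less of the clean feature.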

The reverse diffusion process reconstructs

z_{0}

under the condition set

C_{t} = \{x_{c}, G_{t}, C^{(1)}, \dots, C^{(L)}\}

(23)

The reverse conditional distribution is

p_{θ} (z_{t - 1} ∣ z_{t}, C_{t}) = N (z_{t - 1}; μ_{θ} (z_{t}, t, C_{t}), σ_{t}^{2} I)

(24)

where

μ_{θ} (\cdot)

is predicted by the TPDT-based denoiser. Note that our denoiser is TPDT-specific and does not use the conventional U-Net noise estimation; this ensures consistency with our unified architecture.
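Equation (24) is written in terms of a predicted mean, whereas the training loss in Section 3.7.1 is written in terms of predicted noise. The text does not state how the two are linked; the sketch below assumes the standard DDPM parameterization, in which the mean is obtained from the noise estimate, and checks numerically that it coincides with the true posterior mean when the noise estimate is exact. Function names are illustrative.

```python
import numpy as np

betas = np.linspace(1e-4, 2e-2, 1000)  # schedule from Section 4.4
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def mean_from_eps(z_t, eps_pred, t):
    """Assumed DDPM relation: mu = (z_t - (1 - alpha_t) / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)."""
    a_t, a_bar_t = alphas[t - 1], alpha_bars[t - 1]
    return (z_t - (1.0 - a_t) / np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_t)

def posterior_mean(z_t, z0, t):
    """Mean of q(z_{t-1} | z_t, z_0); used only to check the identity (requires t >= 2)."""
    a_t, a_bar_t, a_bar_prev = alphas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]
    c0 = np.sqrt(a_bar_prev) * (1.0 - a_t) / (1.0 - a_bar_t)
    ct = np.sqrt(a_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    return c0 * z0 + ct * z_t
```

Under this assumption, a denoiser trained to predict noise fully determines the mean of the reverse transition, so the two formulations describe the same model.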

3.4.6. Output Fusion and Decoding

To simultaneously retain hierarchical semantic information and multi-frequency temporal context, TPDT adopts a hierarchical decoding structure to merge features as follows:

{\hat{J}}_{t} = Dec (Concat [S_{t}^{(L)}, C^{(1)}, \dots, C^{(L)}])

(25)

where

Dec (\cdot)

denotes the decoding network composed of sampling and convolution layers. This design enables the model to leverage both deep semantic information and shallow multi-frequency temporal context to produce the final enhanced image.

3.5. TPDT-Block Internal Mechanism

Each block consists of three sequential modules: the physics-guided self-attention (PG-SA), the cross-scale gating feed-forward network (CSG-FFN), and the physics-aware modulation module (PAPM) as shown in Figure 4. PG-SA aggregates attention based on physics-guided features

G_{t}

, CSG-FFN performs cross-scale and temporal gating for different feature levels, and PAPM enhances physical consistency while suppressing artifacts.

Given the input to the l-th TPDT-Block,

X = S_{t}^{(l - 1)}

, it is first normalized and combined with the temporal embedding projection as follows:

\tilde{X} = Norm (X) + Proj (γ (t))

(26)

The features then sequentially pass through PG-SA, CSG-FFN, and PAPM, and are combined using residual connections as follows:

S_{t}^{(l)} = X + PG-SA (\tilde{X}, G_{t}) + CSG-FFN (\cdot) + PAPM (\cdot, G_{t})

(27)

This design enables each TPDT-Block to inject physical priors, select cross-scale contextual information, and correct local details in a single pass.

3.5.1. Physics-Guided Self-Attention (PG-SA)

The objective of PG-SA is to aggregate attention along physics-guided features, encouraging the network to focus on physically consistent regions. Given the input

\tilde{X}

and physics-guided feature

G_{t}

, the queries, keys, and values are computed via linear projections as follows:

Q = W_{Q} \tilde{X}, K = W_{K} [\tilde{X} ∣ G_{t}], V = W_{V} \tilde{X}

(28)

where

[\cdot ∣ \cdot]

denotes concatenation along channels. A depthwise separable convolution generates local positional encodings L. To suppress responses inconsistent with the physical model, channel-wise gating is applied as follows:

α = σ (W \cdot GAP (V)), V^{'} = α ⊙ V

(29)

Finally, the PG-SA output is computed as

PG-SA (\tilde{X}, G_{t}) = W_{O} (Softmax (Q K^{⊤} / \sqrt{d} + Π (L)) V^{'})

(30)

where

Π (L)

represents the positional encoding matrix, and

W_{O}

is the output projection. Compared with conventional attention, PG-SA explicitly leverages physics-guided features

G_{t}

, making it more suitable for modeling non-uniform degradation in underwater regions.
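A single-head sketch of Equations (28) to (30) on flattened tokens is given below. It uses a row-vector convention (tokens times weight matrices), treats the positional term as an optional additive bias on the attention logits, and applies the global average pooling over tokens; these choices, together with all names and shapes, are assumptions for illustration and not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pg_sa(x_tilde, g, w_q, w_k, w_v, w_gate, w_o, pos_bias=None):
    """x_tilde: (N, C); g: (N, Cg); w_q, w_v: (C, d); w_k: (C + Cg, d);
    w_gate: (d, d); w_o: (d, C); pos_bias: (N, N) or None."""
    q = x_tilde @ w_q
    k = np.concatenate([x_tilde, g], axis=1) @ w_k   # keys see the physics-guided features
    v = x_tilde @ w_v
    alpha = sigmoid(v.mean(axis=0) @ w_gate)          # Eq. (29): GAP over tokens, then gate
    v_gated = alpha[None, :] * v
    logits = q @ k.T / np.sqrt(q.shape[1])
    if pos_bias is not None:
        logits = logits + pos_bias                    # positional term of Eq. (30)
    attn = softmax(logits, axis=-1)
    return attn @ v_gated @ w_o, attn
```

Only the keys are conditioned on the physics-guided features, so the prior changes where each query attends without altering what is aggregated.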

3.5.2. Cross-Scale Gating Feed-Forward Network (CSG-FFN)

The objective of CSG-FFN is to fuse features at different scales and adaptively weight their importance based on multi-frequency context

C^{(l)}

and temporal embedding

γ (t)

. First, let Y denote the residual output from PG-SA. Multi-scale features are extracted as

U_{s} = ϕ_{s} (Y), s \in \{1, 2, 3\}

(31)

where

ϕ_{s} (\cdot)

denotes the depthwise separable convolution branch at the s-th scale.

Next, scale gating weights are generated based on multi-frequency context and temporal embedding as follows:

g_{s} = σ (W_{s} \cdot GAP (C^{(l)}) + W_{t s} \cdot γ (t))

(32)

The weighted feature at scale s is then computed as

{\hat{U}}_{s} = g_{s} ⊙ U_{s}

(33)

Finally, features from all scales are aggregated as follows:

Z = \sum_{s} {\hat{U}}_{s}, CSG-FFN = {Conv}_{1 \times 1} (GELU (Z))

(34)

This module allows the network to dynamically allocate representation capacity between local texture restoration and global color correction according to the current degraded scene.
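The gating logic of Equations (31) to (34) can be sketched as below. The depthwise separable branches are replaced by simple per-channel mean filters of sizes 1, 3, and 5 with circular padding, which is a stand-in chosen only to keep the example short; the weight shapes and function names are likewise assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def box_filter(x, size):
    """Stand-in for a branch phi_s: per-channel mean filter with circular padding."""
    r = size // 2
    out = np.zeros_like(x)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += np.roll(x, shift=(dy, dx), axis=(1, 2))
    return out / (size * size)

def csg_ffn(y, context, gamma, w_s, w_ts, w_out, sizes=(1, 3, 5)):
    """y, context: (C, H, W); gamma: (D,); w_s: (S, C, C); w_ts: (S, C, D); w_out: (C, C)."""
    pooled = context.mean(axis=(1, 2))                     # GAP of the temporal context
    z = np.zeros_like(y)
    for s, size in enumerate(sizes):
        u_s = box_filter(y, size)                          # Eq. (31)
        g_s = sigmoid(w_s[s] @ pooled + w_ts[s] @ gamma)   # Eq. (32), shape (C,)
        z += g_s[:, None, None] * u_s                      # Eqs. (33) and (34), summed over s
    return np.einsum("oc,chw->ohw", w_out, gelu(z))        # 1x1 convolution after GELU
```

Each gate is a per-channel scalar, so the temporal context and the time embedding decide how much each receptive-field size contributes without introducing spatially varying weights.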

3.5.3. Physics-Aware Modulation Module (PAPM)

PAPM is designed to enhance interpretable detail restoration and suppress artifacts while maintaining physical consistency. Its inputs are the output Z of CSG-FFN and the physics-guided features

G_{t}

. First, a global physics description vector is obtained by channel-wise statistics of

G_{t}

as follows:

c = GAP ({Conv}_{1 \times 1} (G_{t})), w = σ (W_{c} c)

(35)

Next, Z and

G_{t}

are concatenated and passed through two lightweight convolution branches to produce enhancement features

E^{+}

and suppression features

E^{-}

as follows:

E^{+} = ψ_{1} ([Z ∣ G_{t}]), E^{-} = ψ_{2} ([Z ∣ G_{t}])

(36)

Finally, the outputs of the enhancement and suppression branches,

M^{+}

and

M^{-}

, are combined to obtain the PAPM output as follows:

PAPM (Z, G_{t}) = Z + (E^{+} ⊙ M^{+} ⊙ w) - (E^{-} ⊙ M^{-} ⊙ (1 - w))

(37)

Here, the enhancement branch emphasizes structural details consistent with the physical model, while the suppression branch eliminates noise and artifacts inconsistent with physical priors. This design enables PAPM to improve local naturalness in complex scenes.
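Equations (35) and (37) can be sketched as follows. The text does not define how the two mask-like terms multiplying the branch features are produced, so they are treated here as given inputs; that treatment, the weight shapes, and the function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def physics_gate(g, w_conv, w_c):
    """Eq. (35): c = GAP(Conv1x1(G_t)), w = sigmoid(W_c c).

    g: (Cg, H, W); w_conv: (C, Cg) acting as a 1x1 convolution; w_c: (C, C).
    """
    c = np.einsum("oc,chw->ohw", w_conv, g).mean(axis=(1, 2))
    return sigmoid(w_c @ c)

def papm(z, e_pos, e_neg, m_pos, m_neg, w):
    """Eq. (37): add the gated enhancement term and subtract the gated suppression term.

    z, e_pos, e_neg, m_pos, m_neg: (C, H, W); w: (C,) gate from physics_gate.
    """
    w = w[:, None, None]
    return z + e_pos * m_pos * w - e_neg * m_neg * (1.0 - w)
```

The gate w trades the two branches off against each other per channel: values near 1 favor enhancement, values near 0 favor suppression, and the input passes through unchanged when both branch features are zero.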

3.6. Multi-Frequency Modulation Unit (MF-MU)

MF-MU decomposes the previous-frame memory

T_{in}

into multiple frequency bands and dynamically modulates them using temporal embeddings, as illustrated in Figure 5. The modulated bands are then passed through an attention mechanism together with the current-frame semantic feature

F_{n}

to produce the output

T_{out}

, which serves as the horizontal temporal memory for the subsequent TPDT-Block.

In the upper branch of TPDT, MF-MU aligns multi-frequency temporal information from the previous frame with the current-frame semantic features to achieve stable memory propagation across layers. Compared with directly propagating single-frame features, multi-frequency modulation better separates high-frequency dynamic changes from low-frequency variations, thereby improving temporal consistency across frames.

Let the input temporal memory be

T_{in}

and the current semantic feature be

F_{n}

. First,

T_{in}

is decomposed into K frequency bands as follows:

B_{k} = F_{k} (T_{in}), k = 1, \dots, K

(38)

Next, temporal embeddings

γ (t)

are used to perform dynamic modulation on each frequency band as follows:

{\tilde{B}}_{k} = α_{k} (γ (t)) ⊙ B_{k}, k = 1, \dots, K

(39)

The modulated frequency bands are concatenated as follows:

B = Concat [{\tilde{B}}_{1}, \dots, {\tilde{B}}_{K}]

(40)

B is then projected to the Query and Value spaces, while the current-frame semantic feature

F_{n}

is projected to the Key space as follows:

Q = {Proj}_{Q} (B), V = {Proj}_{V} (B), K = {Proj}_{K} (F_{n})

(41)

and the similarity matrix is computed via scaled dot-product attention as follows:

S = Softmax (Q K^{⊤} / \sqrt{d})

(42)

The contextual representation is obtained as

C = S V

, which is then projected back to the original spatial space and added to the residual memory as follows:

T_{out} = Replicate (\hat{C}) + T_{in}

(43)

Through this process, MF-MU allows high-frequency components to remain sensitive to dynamic motions and scattering changes while keeping low-frequency components stable in brightness and global color, making it particularly suitable for cross-frame modeling in underwater video enhancement.
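The attention and residual steps of Equations (39) to (43) can be sketched on flattened tokens as follows. The bands are assumed to be already extracted, the projection back to the memory dimensions is written as a plain linear map standing in for the operation in Equation (43), and all names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mf_mu(bands, alphas, f_n, t_in, w_q, w_k, w_v, w_back):
    """bands: list of K arrays (N, C); alphas: list of K arrays (C,);
    f_n, t_in: (N, C); w_q, w_v: (K * C, d); w_k: (C, d); w_back: (d, C)."""
    # Eqs. (39) and (40): modulate each band, then concatenate along channels
    b = np.concatenate([a[None, :] * bk for a, bk in zip(alphas, bands)], axis=1)
    q, v = b @ w_q, b @ w_v                    # Eq. (41): query and value from the bands
    k = f_n @ w_k                              # Eq. (41): key from the semantic feature
    s = softmax(q @ k.T / np.sqrt(q.shape[1])) # Eq. (42)
    c = s @ v
    return c @ w_back + t_in                   # Eq. (43): residual memory update
```

The residual connection means that the unit degrades gracefully: if the attended context contributes nothing, the incoming memory is propagated unchanged.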

3.7. Loss Functions and Optimization Objectives

To achieve high-quality, physically consistent, and temporally stable underwater image enhancement, the proposed method employs a joint optimization strategy to train PGPG, INTR, and TPDT end to end. Consistent with the experimental setup in Section 4, the overall loss consists of four components: diffusion loss

L_{dif}

, implicit reconstruction loss

L_{rec}

, physics-guided prior loss

L_{phy}

, and temporal consistency loss

L_{temp}

.

3.7.1. Diffusion Loss

The core objective of TPDT is to accurately predict the noise component added to the noisy feature, given the condition set

C_{t}

. Following the standard conditional diffusion model training, the diffusion loss is defined as

L_{dif} = E_{z_{0}, ϵ, t} [{∥ ϵ - ϵ_{θ} (z_{t}, t, C_{t}) ∥}_{2}^{2}]

(44)

where

z_{t}

represents the noisy feature at time step t,

ϵ

denotes the true Gaussian noise, and

ϵ_{θ} (\cdot)

is the noise predicted by the TPDT denoiser.

3.7.2. Implicit Reconstruction Loss

To supervise INTR in learning stable condition representations, the implicit reconstruction loss is applied to the reconstructed image

\hat{I}

as follows:

L_{rec} = {∥ \hat{I} - GT ∥}_{1}

(45)

which directly corresponds to Section 3.3, ensuring that the conditional image

x_{c}

maintains high quality and reduces the difficulty of diffusion restoration.

3.7.3. Physics-Guided Prior Loss

The physics-guided prior loss is used to supervise the transmission map and background-light output of the PGPG module, enabling a more accurate description of underwater degradation. Considering that PGPG is already supervised through reconstructed images, its loss is defined as

L_{phy} = L_{PGPG} = λ_{pix}^{1} L_{pix}^{PG} + λ_{per}^{2} L_{per}^{PG}

(46)

This formulation ensures consistency in loss definitions between method and experiments, avoiding ambiguity from multiple definitions of

L_{PGPG}

,

L_{phy}

, and

L_{rec}

.

3.7.4. Temporal Consistency Loss

To suppress flickering across consecutive frames, an explicit smoothness constraint is applied to the horizontal Memory channels in TPDT. Let

M_{t}^{k}

denote the feature of the k-th Memory channel at time t; then the temporal consistency loss is defined as

L_{temp} = \sum_{k} {∥ M_{t}^{k} - M_{t - 1}^{k} ∥}_{2}^{2}

(47)

This constraint encourages the memory features to evolve smoothly across consecutive frames, enhancing temporal stability in the restored results.

3.7.5. Total Loss Function

In summary, the final training objective of PG-TIE is

L_{total} = λ_{1} L_{dif} + λ_{2} L_{rec} + λ_{3} L_{phy} + λ_{4} L_{temp}

(48)

where

λ_{1}, λ_{2}, λ_{3}, λ_{4}

are the weighting coefficients for each loss term (their values are reported in Section 4.4). This total loss formulation is consistent with the experimental setup in Section 4. By jointly optimizing these four objectives, PG-TIE achieves high-quality restoration, stable conditional representations, physical consistency, and temporal coherence.
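As a small worked example of Equations (47) and (48), the temporal term and the weighted sum can be written as below. The default weights are the values reported in Section 4.4; the function names and the representation of the memory channels as a list of arrays are assumptions for illustration.

```python
import numpy as np

def temporal_loss(mem_t, mem_prev):
    """Eq. (47): sum over memory channels k of the squared L2 distance between frames."""
    return float(sum(np.sum((a - b) ** 2) for a, b in zip(mem_t, mem_prev)))

def total_loss(l_dif, l_rec, l_phy, l_temp, weights=(1.0, 1.0, 0.5, 0.1)):
    """Eq. (48) with the loss weights reported in Section 4.4."""
    w1, w2, w3, w4 = weights
    return w1 * l_dif + w2 * l_rec + w3 * l_phy + w4 * l_temp
```

With these weights, the diffusion and reconstruction terms dominate the objective, while the physics-guided and temporal terms act as lighter regularizers.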

4. Experimental Results and Analysis

This section presents a systematic evaluation of the proposed PG-TIE framework. First, the datasets used, comparison methods, and evaluation metrics are introduced. Next, qualitative and quantitative results are provided, followed by analyses of specific challenging scenes and temporal consistency. The evaluation is divided into four parts, comparing the proposed method with representative existing methods. Finally, ablation studies and sample analysis are conducted to verify the effectiveness and rationality of the key designs.

4.1. Datasets

To comprehensively evaluate PG-TIE under different conditions, three representative underwater image enhancement datasets are employed for training and testing: UIEBD, LSUI, and U45. Each dataset contains paired or real-world reference images, with varying degrees of degradation, enabling evaluation of restoration quality, generalization, and robustness.

UIEBD is a widely used paired underwater image enhancement dataset, containing various real underwater degradation scenarios such as color casts, non-uniform lighting, low contrast, and scattering effects. Typically, 890 images are selected for training, and 90 images for testing. To further evaluate model generalization on real-world scenes, 60 additional images are sampled from UIEBD to construct the Test-C60 subset for single-image evaluation under challenging conditions.

Considering the limited coverage of UIEBD, LSUI is employed for extended training. LSUI contains 4279 paired images with more complex degradations, including severe blurring, mixed scattering, brightness variations, and complex color casts. A total of 3879 images are used for training, and 400 images for testing, forming the Test-LSUI subset for more detailed analysis. Test-LSUI is further divided into subsets based on scene type: LSUI-Blur, LSUI-Mixed, and LSUI-Lowlight, to evaluate restoration performance under different degradation conditions.

Additionally, U45 is used as a real-world unpaired (no-reference) dataset [46]. It includes 45 images collected from natural underwater environments, with complex degradations such as haze, low illumination, backscattering, and color distortion. Since these images lack ground truth, they are mainly used to evaluate visual enhancement and cross-domain generalization in real scenarios.

For temporal consistency evaluation, we additionally construct a dedicated video test set from a private underwater AUV motion-detection video dataset collected by our research group. Due to project confidentiality and data privacy constraints, the raw video data cannot be publicly released. Nevertheless, the protocol used for temporal evaluation is fully specified in this work. The temporal benchmark contains 15 video sequences and 450 total frames, with 30 consecutive frames sampled from each sequence at 10 fps. To better characterize the difficulty of temporal enhancement under different motion conditions, the sequences are further divided into low-motion, medium-motion, and high-motion subsets, with 5 sequences in each group. The grouping is based on the average inter-frame motion magnitude, corresponding to less than 2 pixels/frame, 2–5 pixels/frame, and greater than 5 pixels/frame, respectively. This private video benchmark is used only for temporal-consistency evaluation and does not overlap with the paired image training/test sets [20,39].
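The grouping rule above can be expressed as a short function. The thresholds are those stated in the text; the function name and the assignment of the boundary values 2 and 5 to the medium group are assumptions that follow from the "less than" and "greater than" wording.

```python
def motion_group(avg_motion_px_per_frame):
    """Assign a sequence to a motion subset from its average inter-frame motion magnitude."""
    if avg_motion_px_per_frame < 2.0:
        return "low"
    if avg_motion_px_per_frame <= 5.0:
        return "medium"
    return "high"

# Benchmark size stated in the text: 15 sequences of 30 frames each.
NUM_SEQUENCES = 15
FRAMES_PER_SEQUENCE = 30
TOTAL_FRAMES = NUM_SEQUENCES * FRAMES_PER_SEQUENCE
```

How the average motion magnitude itself is measured (for example, from dense optical flow) is not specified in this passage and is left outside the sketch.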

4.2. Compared Methods

To comprehensively validate the performance advantages of PG-TIE, nine representative underwater image enhancement methods are selected as comparison baselines. These methods cover the main technical directions in current UIE tasks, including CNN-based and Transformer-based methods, methods incorporating physical priors, lightweight specialized networks, and recent approaches based on generative and diffusion models [38,42,47,48,49].

Specifically, the compared methods include: UWCNN, UIEC²-Net, and SC-Net, which belong to classical supervised enhancement frameworks; Water-Net and U-shape, which integrate physical priors or hybrid driving strategies; UIEWD and U-color, which focus on high-efficiency network design and specialized enhancement modules; and DM-water and WF-Diff, representing state-of-the-art generative and diffusion model-based methods. These methods are uniformly evaluated for quantitative and qualitative comparisons, providing a comprehensive benchmark to verify the overall advantages of PG-TIE in physical prior guidance, implicit reconstruction, and diffusion-based enhancement [20,33,38].

To ensure a fairer comparison, we further distinguish between retrained baselines and reference results adopted from published papers. In this work, UWCNN, UIEC²-Net, U-shape, U-color, DM-water, and WF-Diff are retrained under the same training split as PG-TIE using official or publicly available implementations. By contrast, Water-Net, SC-Net, and UIEWD are retained as reference results from the original papers or benchmark reports, because their training code or checkpoints were unavailable or could not be reliably reproduced in our environment. These reference results are explicitly marked as such to avoid mixing them with retrained results without clarification.

For the retrained methods, all training and evaluation are conducted on the same data split and benchmark protocol as PG-TIE. In particular, the paired datasets UIEBD and LSUI are used with the same train–test partition, and all retrained methods are re-evaluated on the same test images using the same metric scripts. For the diffusion-based retrained baselines, DM-water and WF-Diff, we further report the sampler and inference-step settings used in evaluation to make the computational budget more transparent and comparable.

4.3. Evaluation Metrics

To evaluate restoration performance from multiple perspectives, we report both reference-based and no-reference metrics depending on the dataset. For paired datasets (UIEBD, LSUI, and Test-C60), we use four metrics: PSNR, SSIM, LPIPS, and FID. PSNR evaluates pixel-wise fidelity between the restored and reference images; SSIM evaluates structural similarity; LPIPS assesses perceptual similarity in feature space; and FID measures the distance between the distributions of restored and reference images, reflecting global realism and naturalness.
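Of the four reference-based metrics, PSNR has a closed form that is easy to state. The sketch below uses the standard definition; the function name and the assumption that images are scaled to the range [0, 1] are illustrative and not taken from the text.

```python
import numpy as np

def psnr(restored, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    restored = np.asarray(restored, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((restored - reference) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

For instance, a uniform error of 0.1 on a [0, 1] image gives a mean squared error of 0.01 and therefore a PSNR of 20 dB, which puts the reported values of roughly 24 to 28 dB into perspective.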

For unpaired real-world datasets such as U45, we employ UIQM, UCIQE, URanker, MUSIQ, and NIQE. UIQM and UCIQE are specialized no-reference metrics commonly used in underwater image enhancement, primarily reflecting color correction, contrast improvement, and sharpness. URanker and MUSIQ evaluate image-level visual quality and perceptual relevance; NIQE assesses naturalness based on statistical deviations, providing complementary evidence of restoration quality.

To assess temporal consistency, three metrics are employed: tLPIPS, Warping Error, and Flicker Index [50]. tLPIPS measures perceptual consistency across consecutive enhanced frames, Warping Error quantifies residual misalignment after motion compensation, and Flicker Index evaluates the variation of pixel brightness across consecutive frames. Lower values of all three metrics indicate better temporal consistency, allowing a more comprehensive assessment of the method in real video scenarios [50,51].
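To make the idea of a motion-compensated residual concrete, the sketch below warps the previous enhanced frame to the current one with a given flow field and averages the squared difference over valid pixels. The exact definition used for Warping Error is not given in this passage, so the nearest-neighbor warping, the flow convention, and the function names are all assumptions; the optical-flow estimator is taken as an external input.

```python
import numpy as np

def warp_nearest(prev_frame, flow):
    """Backward-warp prev_frame (H, W) to the current frame.

    flow has shape (H, W, 2) holding (dx, dy): for each current pixel, the offset
    to its location in the previous frame. Returns the warped frame and a mask of
    pixels whose source location lies inside the image.
    """
    h, w = prev_frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = prev_frame[np.clip(src_y, 0, h - 1), np.clip(src_x, 0, w - 1)]
    return warped, valid

def warping_error(cur_frame, prev_frame, flow):
    """Mean squared difference between the current frame and the warped previous frame."""
    warped, valid = warp_nearest(prev_frame, flow)
    return float(np.mean((cur_frame[valid] - warped[valid]) ** 2))
```

If the enhanced frames differ only by the motion described by the flow, the residual is zero; any remaining value reflects appearance changes introduced by the enhancement itself, which is what a temporal-consistency metric is meant to capture.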

4.4. Implementation Details

PG-TIE is implemented using Python 3.9 and PyTorch 1.12.0, and all experiments are conducted on a workstation equipped with four NVIDIA RTX 4090 GPUs. The model is trained using the AdamW optimizer with

β_{1} = 0.9

,

β_{2} = 0.999

, and an initial learning rate of

2 \times 10^{- 4}

. A cosine annealing learning-rate schedule is adopted throughout training. The batch size is set to 8, and the total number of training iterations is 300 K. For paired training, we use the UIEBD and LSUI training splits, corresponding to 890 and 3879 image pairs, respectively; over the combined 4769 pairs, this amounts to approximately 503 epochs under the above batch setting. All input images are resized to

256 \times 256

. Data augmentation includes random cropping, horizontal flipping, and random rotation to improve generalization.
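The epoch count quoted above follows directly from the iteration budget, as the short calculation below shows. It assumes that the batch size of 8 is the total batch across all GPUs, which is the reading under which the reported figure is reproduced; the function name is illustrative.

```python
def epoch_equivalents(iterations, batch_size, num_samples):
    """Number of passes over the training data implied by an iteration budget."""
    return iterations * batch_size / num_samples

train_pairs = 890 + 3879                      # UIEBD + LSUI training pairs
epochs = epoch_equivalents(300_000, 8, train_pairs)
print(f"{train_pairs} training pairs, about {epochs:.0f} epochs")
```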

For diffusion training, the total diffusion length is set to

T = 1000

. We adopt a linear variance schedule with

β_{t}

linearly increasing from

1 \times 10^{- 4}

to

2 \times 10^{- 2}

, where

α_{t} = 1 - β_{t}

and

{\bar{α}}_{t} = \prod_{s = 1}^{t} α_{s}

. During training, the time step t is uniformly sampled. At inference time, DDIM sampling is used by default with 50 steps [52,53,54], which is selected based on the efficiency–performance trade-off analysis reported in Table 1. For high-resolution testing, patch-wise inference and multi-scale fusion are employed to reduce boundary artifacts and improve stability on locally degraded regions.
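A 50-step DDIM sampler over the schedule described above can be sketched as follows. The evenly spaced step selection and the fully deterministic update (no added noise) are assumptions, since the text specifies only the sampler family and the number of steps; the denoiser is represented by its noise prediction, passed in as an array.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)            # linear schedule stated above
alpha_bars = np.cumprod(1.0 - betas)

def ddim_timesteps(num_steps=50, total=T):
    """Evenly spaced, descending, 1-based steps (one common choice; the spacing is assumed)."""
    stride = total // num_steps
    return np.arange(total, 0, -stride)

def ddim_step(z_t, eps_pred, t, t_prev):
    """Deterministic DDIM update from step t to t_prev; t_prev = 0 returns the z_0 estimate."""
    a_t = alpha_bars[t - 1]
    a_prev = alpha_bars[t_prev - 1] if t_prev > 0 else 1.0
    z0_hat = (z_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
    return np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps_pred
```

In use, the sampler would iterate over consecutive pairs of the selected steps, calling the denoiser once per step, so the inference cost scales with the 50 selected steps rather than with the 1000 training steps.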

In terms of architecture scale, the TPDT backbone adopts a 4-stage hierarchical design with 8 TPDT blocks in total. The base channel dimension is set to 64, the number of attention heads is 8, the feed-forward expansion ratio is 4, and the MF-MU decomposes temporal memory into 3 frequency bands. The full PG-TIE model contains approximately 27.8 M trainable parameters and requires approximately 118 GFLOPs for a

256 \times 256

input [47,55,56].

The overall optimization objective follows Section 3.7 and consists of four terms: diffusion loss

L_{dif}

, implicit reconstruction loss

L_{rec}

, physics-guided prior loss

L_{phy}

, and temporal consistency loss

L_{temp}

. In our experiments, the loss weights are set to

λ_{1} = 1.0

,

λ_{2} = 1.0

,

λ_{3} = 0.5

, and

λ_{4} = 0.1

.

In addition, we report the practical computational cost of the default configuration. The peak training memory is approximately 18.6 GB per GPU, the total training time is approximately 42 h, and the average inference time is approximately 0.17 s per

256 \times 256

image under DDIM-50. For high-resolution testing, patch-wise inference and multi-scale fusion are employed, with an average runtime of approximately 0.68 s per

512 \times 512

image equivalent. Unless otherwise specified, all quantitative comparisons in the main text are conducted using the same image resolution and default inference configuration.

For fairness control, all retrained methods are trained and evaluated under the same benchmark protocol as PG-TIE. Specifically, the paired datasets UIEBD and LSUI use the same train/test split, namely, 890/90 images for UIEBD and 3879/400 images for LSUI, while U45 is used only for no-reference real-world evaluation. During training, all retrained methods use the same input resolution of 256 × 256 and follow the same basic preprocessing strategy. During evaluation, all methods are tested on the same benchmark image sets and are assessed using the same implementations of PSNR, SSIM, LPIPS, FID, UCIQE, UIQM, MUSIQ, URanker, and NIQE.

For diffusion-based methods, inference fairness is further controlled by explicitly specifying the sampler and sampling steps. In the main comparison, the two retrained diffusion baselines, DM-water and WF-Diff, are evaluated using DDIM sampling with 50 steps, which is also the default inference configuration of PG-TIE. This setting provides a comparable inference budget for diffusion-based comparison and reduces the risk that the observed differences in LPIPS or FID are caused by different sampling schedules. In addition, the patch-wise inference and multi-scale fusion mentioned above are used only for additional high-resolution qualitative demonstrations, rather than for the main quantitative comparison reported in Table 2.

4.5. Comparison with the State of the Art

To validate the comprehensive performance of PG-TIE, experiments are conducted on UIEBD, LSUI, and U45 datasets, comparing against nine representative methods. Quantitative comparisons, visual quality evaluations, and task-specific challenging scenarios are analyzed to demonstrate advantages in restoration performance and temporal consistency.

4.5.1. Performance Advantages and Generalization Capability of PG-TIE

UIEBD and LSUI use reference-based metrics including PSNR, SSIM, LPIPS, and FID, while U45 uses no-reference metrics including URanker, MUSIQ, UCIQE, UIQM, and NIQE. A higher value for URanker, MUSIQ, UCIQE, and UIQM indicates better performance, whereas a lower value for NIQE, LPIPS, and FID indicates better restoration quality.

Table 2 presents the overall comparison between PG-TIE and the nine compared methods on UIEBD, LSUI, and U45. On the paired datasets, PG-TIE achieves the best or near-best overall performance on UIEBD and LSUI, particularly in LPIPS and FID, demonstrating that the integration of physics-guided priors and implicit reconstruction significantly enhances perceptual quality and structural consistency [33,38,50]. Specifically, on UIEBD, PG-TIE achieves an FID of 24.32, an LPIPS of 0.1015, a PSNR of 24.14 dB, and an SSIM of 0.8905. Compared with the strong WF-Diff baseline, FID decreases from 27.85 to 24.32 and LPIPS from 0.1248 to 0.1015. Compared with DM-water, PG-TIE also achieves improvements in FID, LPIPS, PSNR, and SSIM, indicating that, under real-world complex degradations, PGPG and INTR provide reliable priors and stable conditions that help TPDT restore reference-consistent distributions.

On the LSUI dataset, PG-TIE also achieves the best comprehensive results: FID, LPIPS, PSNR, and SSIM reach 26.38, 0.1007, 27.93 dB, and 0.9521, respectively. Compared with WF-Diff, PG-TIE reduces FID and LPIPS and increases PSNR and SSIM. Compared with DM-water, PG-TIE improves PSNR from 27.65 to 27.93 and SSIM from 0.8867 to 0.9521, demonstrating that, in complex blurred and mixed scattering scenarios, TPDT effectively modulates restoration and structure alignment [21,46].

On the unpaired real-world dataset U45, PG-TIE also demonstrates strong generalization ability [21,46]. Specifically, PG-TIE achieves URanker, MUSIQ, UCIQE, UIQM, and NIQE scores of 2.812, 52.152, 0.606, 3.176, and 3.897, respectively. Compared with other methods, PG-TIE obtains the best URanker, UCIQE, and UIQM, and the lowest NIQE, indicating that PG-TIE not only preserves naturalness under unknown reference conditions but also produces visually pleasant color restoration and contrast enhancement [46,51].

Overall, the results show that PG-TIE consistently outperforms other methods across the three datasets. The combination of physical priors, implicit conditional reconstruction, and TPDT temporal diffusion provides simultaneous gains in perceptual quality, reference consistency, and temporal stability, highlighting its advantages over conventional CNN-based, hybrid, and diffusion-based approaches.

Technical Interpretation of the Overall Performance Gains

The superior overall performance of PG-TIE is closely related to the complementary roles of PGPG, INTR, and TPDT. Specifically, PGPG explicitly estimates the transmission map and background light, allowing the network to perceive spatially varying scattering strength and illumination bias rather than relying only on statistical correlations. Based on these physical priors, INTR constructs a more stable conditional representation than directly using the degraded image alone, which is especially important when the input suffers from severe color distortion, blur, or local information loss. Furthermore, TPDT integrates physics-guided features, conditional reconstruction, and temporal context into the diffusion process so that the restoration is guided not only by appearance similarity but also by degradation-aware and temporally coherent feature propagation. Therefore, the improvements in PSNR/SSIM mainly reflect stronger structural fidelity, while the gains in LPIPS/FID indicate that the restored images are also closer to the perceptual and distributional characteristics of the reference images.

4.5.2. Analysis of Qualitative Results

Figure 6 presents visual comparisons of different methods on various challenging underwater images. It can be observed that some CNN-based methods can improve brightness and contrast to a certain extent, but they still suffer from color shifts, loss of local details, or over-smoothing of textures. Some generative or diffusion-based methods fail to fully restore complex textures in challenging regions. In contrast, PG-TIE produces visually natural results with more stable restoration of target green hues and fine structural details. Across both paired and unpaired datasets, PG-TIE achieves closer alignment with reference images and more natural overall appearance [50,51].

Figure 7 shows the qualitative comparison on challenging underwater images. Whereas the compared methods exhibit issues such as detail loss, color casts, or insufficient contrast, the images generated by PG-TIE exhibit accurate color correction and rich details, with visual results closely aligned with the reference images and high naturalness.

In local regions, PG-TIE can restore natural color while maintaining the texture of coral, sand, and rock surfaces, avoiding over-enhancement, artificial colorization, and blurring artifacts [47,56,57]. This demonstrates that the physical priors provided by PGPG effectively guide restoration, the implicit representation learned by INTR can compensate for high-frequency details lost in degraded images, and the TPDT branch further ensures consistency and stability across challenging local regions [33,34,35,36].

4.5.3. Analysis on Special Challenge Subsets

To further analyze the adaptability of PG-TIE under different degradation scenarios, three specialized challenge subsets are constructed from the LSUI test set: LSUI-Blur, LSUI-Mixed, and LSUI-Lowlight. These subsets are designed to evaluate enhancement performance under blur, mixed scattering, and low-light conditions, respectively. The corresponding quantitative results are presented in Table 3.

Evaluation metrics include FID, LPIPS, PSNR, and SSIM, where higher PSNR and SSIM and lower FID and LPIPS indicate better performance [50,56].

It can be observed that PG-TIE achieves the best performance across all three subsets, demonstrating that the proposed method maintains strong restoration capability not only on average but also under different complex degradation scenarios.

For the LSUI-Blur subset, PG-TIE achieves FID, LPIPS, PSNR, and SSIM of 25.40, 0.102, 28.10, and 0.935, respectively, outperforming other methods. Compared with WF-Diff, PG-TIE reduces FID from 28.10 to 25.40 and LPIPS from 0.118 to 0.102, and increases PSNR from 26.85 to 28.10 and SSIM from 0.892 to 0.935, indicating effective restoration of structural and local texture information under severe blur. The implicit representation of INTR provides complementary high-frequency details, and the TPDT branch ensures stable reconstruction under challenging conditions.

For the LSUI-Mixed subset, PG-TIE similarly achieves optimal results: FID, LPIPS, PSNR, and SSIM are 26.80, 0.108, 27.85, and 0.912, respectively. Compared with WF-Diff, FID and LPIPS decrease while PSNR and SSIM increase, showing that PG-TIE can recover natural colors and structure despite multiple simultaneous degradations including scattering, color casts, and contrast reduction. The PG-SA and CSG-FFN modules further enhance restoration capability in complex scenarios by guiding multi-scale feature attention.

For the LSUI-Lowlight subset, PG-TIE again shows superior performance: FID, LPIPS, PSNR, and SSIM reach 29.10, 0.115, 26.95, and 0.905, respectively [37,45,53]. Compared with WF-Diff, PG-TIE improves all four metrics, demonstrating its ability to enhance brightness, contrast, and perceptual quality while suppressing noise. The PAPM module further stabilizes restoration under low-light conditions by combining enhancement and suppression branches for fine-grained local adjustment.

As shown in Table 4, on the Test-C60 subset, PG-TIE achieves the best performance across all reference-based metrics. In particular, FID and LPIPS reach 32.45 and 0.1354, respectively, showing a significant improvement over other methods. This demonstrates that PG-TIE can effectively restore fine structures and severely degraded regions in challenging images, leveraging the physical priors provided by PGPG. The implicit representation from INTR provides stable condition features under severe degradation, reducing the difficulty of diffusion-based restoration.

From the analysis on the four specialized challenge subsets, Table 3 indicates that PG-TIE consistently performs better under blur, mixed-scattering, and low-light conditions, even for extremely challenging scenarios. Table 4 further demonstrates that PG-TIE also achieves superior restoration in small-sample subsets, highlighting the cooperative effect of physical prior generation, implicit reconstruction, and diffusion restoration in improving model adaptability to complex underwater degradation.

4.5.4. Temporal Consistency Evaluation

To evaluate temporal consistency, we use a dedicated temporal test set constructed from the private underwater AUV motion-detection video dataset described in Section 4.1. All compared methods are evaluated on the same 15 video sequences and 450 frames, with 30 consecutive frames per sequence sampled at 10 fps. The benchmark includes low-motion, medium-motion, and high-motion subsets, which allows us to assess temporal stability under different levels of inter-frame scene change.

For metric computation, tLPIPS is measured between consecutive enhanced frames and then averaged over all adjacent frame pairs and all sequences. Warping Error is computed after dense optical-flow-based motion compensation using the same warping protocol for all compared methods. Flicker Index is computed along each enhanced sequence and then averaged over the full temporal test set.
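The protocol above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the flow field is taken as given (the optical-flow estimator is not specified here), warping uses nearest-neighbour sampling without an occlusion mask, and the flicker measure is assumed to be the mean absolute change of the per-frame mean intensity, which may differ from the exact Flicker Index definition used for Table 5. tLPIPS is omitted because it requires a pretrained LPIPS network. The function names are hypothetical.

```python
import numpy as np


def warp_backward(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp `frame` (H, W, C) with a backward flow (H, W, 2) storing (dx, dy).

    Nearest-neighbour sampling with border clamping keeps the sketch short;
    a full evaluation would normally use bilinear sampling and occlusion masks.
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]


def warping_error(frames: np.ndarray, flows: np.ndarray) -> float:
    """Mean squared error between frame t and frame t-1 warped onto frame t.

    frames: (T, H, W, C); flows[t - 1] maps pixels of frame t into frame t-1.
    """
    errors = [
        np.mean((frames[t] - warp_backward(frames[t - 1], flows[t - 1])) ** 2)
        for t in range(1, len(frames))
    ]
    return float(np.mean(errors))


def flicker_index(frames: np.ndarray) -> float:
    """Mean absolute change of the per-frame mean intensity (assumed definition)."""
    mean_intensity = frames.reshape(len(frames), -1).mean(axis=1)
    return float(np.mean(np.abs(np.diff(mean_intensity))))
```

Both functions return per-sequence values; averaging them over all sequences gives the test-set figure described in the text.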

Table 5 reports the temporal consistency comparison across different methods on the continuous underwater video sequences. PG-TIE achieves the best performance on all three metrics, with a tLPIPS of 0.031, a Warping Error of 0.284, and a Flicker Index of 4.72. Compared with other methods, these results indicate that PG-TIE produces more stable perceptual transitions, lower motion-compensated residual error, and less visible temporal fluctuation across consecutive frames.

In addition to the quantitative comparison, we provide qualitative temporal examples. These examples are selected from representative private AUV sequences and are presented in a privacy-preserving manner without exposing sensitive scene metadata [21,46]. The examples include forward AUV motion with apparent depth variation, moving foreground objects, and lateral camera motion under non-uniform illumination. Compared with the baseline methods, PG-TIE produces more stable color transitions, reduced temporal flicker, and fewer local texture fluctuations across consecutive frames, especially in regions affected by motion and depth change.

4.6. Ablation Study

To validate the effectiveness of each component of PG-TIE, ablation experiments are conducted on the UIEBD test set. The results are shown in Figure 5. The ablation design follows a hierarchical-to-integrated strategy, in which individual components are analyzed by sequentially removing or isolating modules. Specifically, the PGPG module is divided into the sub-components MC, GC, and GBB to verify their contribution to the physics-guided priors; INTR and the remaining components (PG-SA, CSG-FFN, PAPM, MF-MU, and the Memory channel) are then evaluated in turn to verify their contribution to the complete framework [32,42,43].

Within PGPG, the MC, GC, and GBB components are used to extract physical priors for transmission-map and background-light estimation. GBB is responsible for extracting global priors for the background branch, ensuring stability in background-light estimation. In Table 6, the ablation first focuses on the PGPG internal components rather than the entire pipeline. After the PGPG design is validated, the ablation extends to INTR and TPDT, including PG-SA, CSG-FFN, PAPM, MF-MU, and the Memory channel, to verify the contribution of each key design to the overall performance [41,45].

4.6.1. Effectiveness of PGPG Internal Components

PGPG consists of a transmission map branch and a background-light branch. Within PGPG, MC and GC correspond to two physical prior estimation paths, while GBB extracts global priors for the background branch as shown in Figure 8. The module outputs the transmission map (

m^{c}

) and background-light (

G^{c}

), which are then used to supervise the reconstruction of degraded images via the physics-guided image model.
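To make the role of the two priors concrete, the sketch below re-renders a degraded image from a clean estimate using the widely used simplified underwater formation model, I = J · m + G · (1 − m), and measures how far the result is from the observation. This is an assumption for illustration: the exact physics-guided image model and loss are defined in Section 3 and may contain additional terms. The function names are hypothetical.

```python
import numpy as np


def synthesize_degraded(clean, transmission, background_light):
    """Re-render a degraded image with the simplified formation model.

    clean:            (H, W, 3) restored image J
    transmission:     (H, W, 3) per-channel transmission map m^c in [0, 1]
    background_light: (3,)      per-channel background light G^c
    """
    g = np.asarray(background_light, dtype=float).reshape(1, 1, 3)
    # Direct signal attenuated by m, plus backscatter weighted by (1 - m).
    return clean * transmission + g * (1.0 - transmission)


def physics_residual(observed, clean, transmission, background_light):
    """Mean absolute difference between the observation and its re-rendering."""
    rendered = synthesize_degraded(clean, transmission, background_light)
    return float(np.mean(np.abs(observed - rendered)))
```

With a transmission of 1 the model returns the clean image unchanged, and with a transmission of 0 it returns pure background light, which matches the physical interpretation of the two quantities.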

In the ablation experiments, MC, GC, and GBB are the internal components of the PGPG module; INTR forms the implicit reconstruction branch; PG-SA, CSG-FFN, and PAPM are the key modules of the TPDT block; and MF-MU and the Memory channel are responsible for temporal propagation and frame-wise feature alignment. Evaluation metrics include SSIM, FID, UIQM, NIQE, PSNR, and LPIPS, where higher SSIM, UIQM, and PSNR and lower FID, NIQE, and LPIPS indicate better performance.

To further analyze the contribution of the PGPG internal components, we first ablate its three submodules: MC, GC, and GBB. As shown in Table 6, adding each component individually improves model performance. Specifically, SSIM increases from 0.7245 to 0.7812, FID decreases from 86.54 to 68.30, and UIQM increases from 1.854 to 2.186, while the reported NIQE changes from 4.682 to 4.954. These results indicate that the performance gain of PGPG arises from the cooperative effect of multiple components rather than from a single prior.

MC enhances the basic prior estimation for the transmission map; GC incorporates additional global priors to enrich feature representation; and GBB extracts global priors for the background branch, stabilizing background-light estimation [34,35,36,37]. Therefore, PGPG effectively provides physical priors, and its three internal components jointly contribute to subsequent implicit reconstruction and diffusion-based enhancement.

4.6.2. Physical Validation of PGPG Priors

Although the PGPG module is designed according to the underwater image formation model, its optimization in the original framework is mainly based on image reconstruction and perceptual supervision. Therefore, beyond restoration effectiveness, it is necessary to further examine whether the estimated transmission map and background light are physically meaningful.

We first provide qualitative validation of the estimated priors on representative underwater scenes with different degradation patterns. The estimated transmission maps are spatially consistent with degradation severity: regions with stronger haze, scattering, or lower visibility generally correspond to lower transmission values, whereas relatively clear foreground structures exhibit higher transmission responses. In addition, the estimated background light remains globally smooth within a scene and mainly captures the dominant illumination and scattering tendency, rather than local texture or object-level details. These observations are consistent with the physical interpretation of transmission and background light in underwater imaging.

To further assess the physical plausibility of the estimated priors, we perform quantitative proxy validation using relative depth/attenuation cues. Specifically, for samples where depth tendency or visibility ordering can be reasonably approximated, we compute the correlation between the estimated transmission and the corresponding depth/attenuation proxy. The results show that the estimated transmission exhibits a meaningful negative correlation with scene depth, with a Pearson correlation coefficient of −0.61 and a Spearman rank correlation coefficient of −0.67. This indicates that the estimated transmission preserves the expected monotonic trend with increasing scene depth and attenuation.
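The two correlation coefficients can be computed as in the following minimal NumPy sketch. The synthetic depth values and the exponential attenuation used in the usage lines are illustrative assumptions, not data from the experiments; `pearson`, `average_ranks`, and `spearman` are hypothetical helper names.

```python
import numpy as np


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equally sized samples."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def average_ranks(values) -> np.ndarray:
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    values = np.asarray(values, dtype=float).ravel()
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def spearman(x, y) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: transmission decays exponentially with a depth proxy.
depth_proxy = np.linspace(0.5, 10.0, 50)
transmission = np.exp(-0.3 * depth_proxy)
r_pearson = pearson(transmission, depth_proxy)
r_spearman = spearman(transmission, depth_proxy)
```

In this noise-free toy case the relationship is strictly monotonic, so the Spearman coefficient is exactly −1 up to rounding, while the Pearson coefficient is negative but weaker because the decay is not linear. Real estimates, as reported above, fall well short of these idealized values.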

We also evaluate the stability of the estimated background light across consecutive frames and different scenes. The estimated background light shows an average inter-frame deviation of 0.014 and a coefficient of variation of 4.8%, indicating that it is substantially more stable than texture-level features and mainly reflects low-frequency illumination and scattering statistics. The within-scene standard deviation of normalized background light is 0.011, while the cross-scene standard deviation is 0.036. This behavior is physically reasonable, since the background light should remain relatively stable within the same scene while varying across different water conditions and illumination environments.

These results, together with the visual evidence in Figure 3 and the ablation gains brought by MC, GC, and GBB in Table 6, support the claim that the priors estimated by PGPG are not merely unconstrained latent variables, but are correlated with physically relevant degradation trends in underwater scenes.

4.6.3. Ablation on Conditional Fusion Strategy

In the proposed framework, the INTR branch is designed to generate a stable conditional representation for the diffusion restoration process. In the original design, the conditional image is constructed by residual fusion, i.e.,

x_{c} = I + \hat{I},

(49)

where I is the degraded input image and

\hat{I}

is the implicit reconstruction generated by INTR. The motivation of this design is that the degraded input preserves the original scene layout and observation constraints, while the reconstructed image complements the missing color and detail information. Their combination is expected to provide a more robust condition for the subsequent TPDT branch.

To further validate this design, we compare different conditional construction strategies while keeping all other modules unchanged. Specifically, we consider: (1) using

\hat{I}

directly as the condition; (2) residual summation,

x_{c} = I + \hat{I}

; (3) concatenation-based learned fusion,

{Conv}_{1 \times 1} ([I, \hat{I}])

; (4) gated fusion with learnable fusion weights; and (5) using only the physics-guided priors

(m, G)

as conditions. The quantitative results are summarized in Table 1.
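The three image-level fusion variants can be written compactly as follows. This NumPy sketch is illustrative only: the 1 × 1 convolution is expressed as a per-pixel linear map over concatenated channels, and the gated variant assumes a sigmoid gate that blends the two inputs, which is one common form and not necessarily the one used in the experiments. All function and parameter names are hypothetical.

```python
import numpy as np


def residual_fusion(degraded, reconstructed):
    """Strategy (2): x_c = I + I_hat."""
    return degraded + reconstructed


def concat_conv1x1_fusion(degraded, reconstructed, weight, bias):
    """Strategy (3): 1x1 convolution over the channel-wise concatenation.

    degraded, reconstructed: (H, W, C)
    weight: (C_out, 2 * C), bias: (C_out,)  (learned in a real model)
    """
    stacked = np.concatenate([degraded, reconstructed], axis=-1)  # (H, W, 2C)
    return stacked @ weight.T + bias


def gated_fusion(degraded, reconstructed, gate_logits):
    """Strategy (4), assumed form: sigmoid gate blending the two inputs."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    return gate * degraded + (1.0 - gate) * reconstructed
```

Note that residual summation is a special case of the concatenation-based variant: with `weight = [Id, Id]` and zero bias, the 1 × 1 convolution reproduces `I + I_hat` exactly, so the learned variant can only help if training finds a better mixing than this fixed choice.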

As shown in Table 1, directly using

\hat{I}

as the condition leads to weaker restoration performance than residual fusion, indicating that the reconstructed image alone is insufficient to preserve the original scene constraints under complex underwater degradation. Using only the physics-guided priors also yields inferior performance, suggesting that although the priors provide meaningful degradation cues, they cannot fully replace image-level structural and semantic conditions. The two learned fusion variants achieve competitive results, but their gains over simple residual summation are limited, while introducing additional parameters and optimization complexity.

Overall, the residual fusion strategy

x_{c} = I + \hat{I}

provides the best trade-off between restoration quality, conditioning stability, and implementation simplicity. This result supports the use of residual conditional construction in the proposed framework.

The results in Table 1 show that the residual fusion strategy achieves the best overall performance among all compared variants. Compared with using

\hat{I}

only, residual fusion improves SSIM from 0.8842 to 0.8905 and reduces FID from 27.48 to 24.32, indicating that retaining the degraded input as a structural reference is beneficial for stable conditional guidance. Compared with priors-only conditioning, the improvement is more significant, confirming that the physics-guided priors alone are insufficient to provide complete image-level restoration cues. Although learned fusion strategies, including concatenation-based fusion and gated fusion, achieve performance close to residual fusion, their improvements are limited and do not justify the additional fusion complexity in our setting. Therefore, the simple residual construction

x_{c} = I + \hat{I}

is adopted in the final model.

4.6.4. Effectiveness of the Overall Framework

After PGPG is validated, INTR, PG-SA, CSG-FFN, PAPM, MF-MU, and the Memory channel are introduced sequentially to verify their contributions to the complete framework. As shown in Table 6, model performance improves consistently with each added component [33,44,55].

The inclusion of INTR enhances structural fidelity, increasing SSIM from 0.7812 to 0.8156, reducing FID from 68.30 to 42.15, improving UIQM to 2.452, and reducing NIQE to 4.632. This demonstrates that implicit reconstruction effectively complements high-frequency details lost in degraded images and reduces discrepancies between generated results and reference distributions [18,47,56,57].

Subsequent addition of PG-SA and CSG-FFN stabilizes temporal propagation. With PG-SA, SSIM improves to 0.8345 and FID decreases to 36.82; adding CSG-FFN further raises SSIM to 0.8562 and lowers FID to 31.54, indicating that attention and gating mechanisms guide diffusion and enhance cross-frame consistency.

Finally, incorporating PAPM, MF-MU, and the Memory channel further improves perceptual quality and temporal stability. With PAPM, SSIM reaches 0.8685, FID decreases to 30.54, and UIQM rises to 3.982.

After introducing MF-MU and the Memory channel, model performance further improves. With MF-MU, SSIM increases to 0.8812 and FID decreases to 25.85. By additionally incorporating the Memory channel, PG-TIE achieves optimal results: SSIM of 0.8905, FID of 24.32, UIQM of 3.165, and NIQE of 3.654. Compared with setups without temporal memory, the Memory channel further enhances temporal consistency across consecutive frames and ensures smooth restoration. This demonstrates that temporal memory propagation is crucial for maintaining high-quality and temporally consistent video restoration [20,39].

From the ablation analysis in Table 6, several conclusions can be drawn: (1) the internal components of PGPG (MC, GC, and GBB) complement each other and collectively form a reliable physical prior generation module; (2) building upon this, sequentially adding INTR, PG-SA, CSG-FFN, PAPM, MF-MU, and the Memory channel progressively improves model performance; and (3) the cooperative effect of all key modules enables PG-TIE to achieve overall high-quality restoration with stable temporal consistency.

4.6.5. Impact of Sampling Strategy and Number of Steps

Figure 9 illustrates the diffusion process: the top row depicts the forward diffusion process, and the bottom row shows the reverse sampling (restoration) process. In the forward process, the image is gradually perturbed by noise as the time step increases. During the reverse process, the model progressively recovers a clear, enhanced output from the noisy state.

We compare DDIM and DDPM sampling strategies under different numbers of sampling steps. Similar to previous studies on diffusion models, both the choice of sampling strategy and the number of steps significantly affect the trade-off between restoration quality and inference efficiency [52,53,54,58]. Therefore, we further analyze the impact of these factors on PG-TIE performance. Specifically, we compare standard DDPM sampling with accelerated DDIM sampling, testing step counts of

K \in \{20, 50\}

for DDIM and

50, 250, 1000

steps for DDPM. The results are summarized in Table 7.

From Table 7, both the sampling strategy and the step count noticeably influence the final restoration quality. For DDIM, increasing the step count from 20 to 50 improves all five no-reference metrics: URanker increases from 2.645 to 2.812, MUSIQ from 50.12 to 53.15, UCIQE from 0.632 to 0.658, and UIQM from 3.150 to 3.342, while NIQE decreases from 3.980 to 3.654. This indicates that although fewer steps reduce inference time, the model then cannot sufficiently recover complex degradations and fine textures, which lowers perceptual quality [52,53,54].

For DDPM, increasing the number of steps also improves performance, but at the cost of significantly higher computation [52,58]. With only 50 steps, DDPM restoration is insufficient and yields poor scores: URanker 1.250, MUSIQ 35.40, and NIQE 7.540. With 250, 500, and 1000 steps, performance gradually improves, with the best results at 1000 steps: URanker 2.825, UCIQE 0.662, UIQM 3.355, and NIQE 3.640.

In practical terms, while high-step DDPM yields slightly better restoration, the inference cost is high. In contrast, DDIM with 50 steps achieves results very close to DDPM-1000, with URanker, MUSIQ, UCIQE, UIQM, and NIQE values of 2.812, 53.15, 0.658, 3.342, and 3.654, respectively. This shows that DDIM provides a favorable trade-off between restoration quality and computational efficiency, making it suitable for subsequent experiments.
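The reason DDIM can use far fewer steps is that its update is deterministic and can be applied on a sub-sequence of the training timesteps. The sketch below shows a generic deterministic DDIM sampler (η = 0) in NumPy. It is not the PG-TIE sampler: the linear noise schedule, the 1000 training steps, and the unconditional `eps_model(x, t)` signature are assumptions for illustration, whereas the actual model also conditions on the physics priors, the conditional image, and temporal memory.

```python
import numpy as np


def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)


def ddim_sample(eps_model, x_T, alpha_bar, num_steps):
    """Deterministic DDIM sampling (eta = 0) over `num_steps` timesteps.

    eps_model(x, t) must return the predicted noise for state x at timestep t.
    """
    # Evenly spaced sub-sequence of timesteps, from the noisiest down to 0.
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the last step the sample is treated as noise-free (alpha_bar = 1).
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = eps_model(x, t)
        # Estimate the clean image, then move it to the previous noise level.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x
```

Each step costs one network evaluation, so a 50-step DDIM run needs roughly one twentieth of the network calls of a 1000-step DDPM run, which is the efficiency argument made in the text.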

4.6.6. Contribution of Loss Function Components

To further analyze the contribution of the different loss components to model performance, ablation experiments are conducted while keeping the network architecture unchanged. The results are shown in Table 8. Specifically, the loss components defined in Section 3, including diffusion loss (

L_{dif}

), implicit reconstruction loss (

L_{rec}

), physics-guided prior loss (

L_{phy}

), and temporal consistency loss (

L_{temp}

), are sequentially ablated.

To highlight the impact of the physics-guided and temporal components, this study first uses the combination of

L_{dif}

and

L_{rec}

as the baseline and then progressively adds

L_{phy}

and

L_{temp}

for comparison.
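The ablation protocol just described amounts to switching individual terms of the training objective on and off. A minimal sketch follows, assuming the total objective is a weighted sum of the four terms; the weights `w_phy` and `w_temp` and the configuration names are hypothetical, since the actual weighting is defined in Section 3.

```python
def total_loss(l_dif, l_rec, l_phy=0.0, l_temp=0.0, *,
               use_phy=True, use_temp=True, w_phy=1.0, w_temp=1.0):
    """Combine the loss terms; the baseline always keeps L_dif + L_rec."""
    total = l_dif + l_rec
    if use_phy:
        total = total + w_phy * l_phy    # physics-guided prior loss
    if use_temp:
        total = total + w_temp * l_temp  # temporal consistency loss
    return total


# The four settings compared in the loss ablation.
ABLATION_CONFIGS = {
    "baseline": dict(use_phy=False, use_temp=False),
    "baseline + L_phy": dict(use_phy=True, use_temp=False),
    "baseline + L_temp": dict(use_phy=False, use_temp=True),
    "full": dict(use_phy=True, use_temp=True),
}
```

Each configuration would be trained separately with the same architecture, so any difference in the reported metrics can be attributed to the loss terms alone.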

Using the baseline loss (

L_{dif} + L_{rec}

), the model achieves an FID of 38.45, LPIPS of 0.1754, PSNR of 22.15 dB, and SSIM of 0.8256, indicating that the restored images still contain noticeable color shifts and structural inconsistencies.

As observed from Table 8, adding the physics-guided prior loss (

L_{phy}

) significantly improves performance: FID decreases to 27.12, LPIPS falls to 0.1156, PSNR increases to 24.68 dB, and SSIM rises to 0.8812. In comparison, adding only the temporal consistency loss (

L_{temp}

) on top of the baseline reduces FID to 34.20, lowers LPIPS to 0.1542, and raises PSNR to 22.95 dB and SSIM to 0.8420. While the improvement is smaller than with

L_{phy}

, temporal consistency still plays an important role in enhancing video coherence.

When both

L_{phy}

and

L_{temp}

are applied together, the model achieves the best results: FID of 24.32, LPIPS of 0.1015, PSNR of 25.14, and SSIM of 0.8905. Compared with the baseline, this combined loss improves all four metrics, demonstrating a strong complementary effect between physics-guided priors and temporal consistency. Specifically,

L_{phy}

primarily enhances the fidelity and physical plausibility of restored images, while

L_{temp}

further reinforces temporal coherence and smoothness. Together, they allow the model to generate natural, visually coherent frames and maintain temporal consistency across consecutive frames.

5. Conclusions

This work addresses common issues in complex underwater environments, such as color shifts, reduced contrast, loss of fine structures, and unstable temporal enhancement. To tackle these challenges, we propose a physics-guided temporal diffusion framework for underwater image enhancement, PG-TIE. Starting from the underwater imaging model, the method integrates physics-guided priors, implicit reconstruction, and temporal diffusion within a unified framework, aiming to simultaneously improve the physical plausibility, perceptual naturalness, structural fidelity, and temporal consistency of the enhanced results.

Specifically, we first design the physics-guided prior generation module (PGPG) to estimate transmission maps and background light for subsequent restoration. The implicit neural reconstruction (INTR) branch then refines the degraded images, providing stable and interpretable conditional features. Finally, the temporal physics-guided diffusion Transformer branch (TPDT) incorporates physics priors, conditional images, and temporal memory to enhance both single-frame quality and temporal stability across consecutive frames.

Experimental results on UIEBD, LSUI, and U45 datasets demonstrate that PG-TIE consistently achieves superior performance across multiple metrics, particularly in terms of structural fidelity, temporal consistency, and perceptual quality under real-world scenarios. Ablation studies further confirm the effectiveness of PGPG, INTR, TPDT, and their key internal components, showing that the performance gain arises from the cooperative interaction among physics priors, conditional reconstruction, and temporal diffusion.

Overall, this work provides two main contributions: (1) it demonstrates that incorporating physics priors and temporal modeling can significantly improve restoration quality, especially for sequential underwater enhancement; and (2) it highlights the importance of temporal consistency for continuous frame enhancement, particularly in real video applications, where cross-frame consistency and single-frame quality are both critical. PG-TIE thus offers a unified framework to jointly achieve high-quality, physically plausible, and temporally consistent underwater image enhancement from single-frame input.

While the proposed method achieves strong experimental results, several challenges remain, such as the computational cost of the diffusion model and the trade-off between step number and restoration quality. Future research directions include exploring lighter and more efficient diffusion frameworks, further integrating long-range temporal modeling for continuous video enhancement, and incorporating cross-domain or self-supervised learning to improve robustness across diverse underwater conditions and imaging devices.

Author Contributions

Conceptualization, F.Z. (Fubin Zhang), Z.Z. and F.Z. (Feihu Zhang); methodology, F.Z. (Feihu Zhang); software, Z.Z.; validation, F.Z. (Feihu Zhang), X.T. and Z.Z.; investigation, F.Z. (Fubin Zhang), X.T. and Z.Z.; data curation, F.Z. (Fubin Zhang) and Z.Z.; writing—original draft preparation, F.Z. (Feihu Zhang), X.T. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (Grant No. 2023QYXX).

Data Availability Statement

All data that support the findings of this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Akkaynak, D.; Treibitz, T.; Shlesinger, T.; Tamir, R.; Loya, Y.; Iluz, D. What Is the Space of Attenuation Coefficients in Underwater Computer Vision? In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 568–577. [Google Scholar] [CrossRef]
Akkaynak, D.; Treibitz, T.; Soc, I.C. Sea-thru: A Method For Removing Water From Underwater Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 1682–1691. [Google Scholar] [CrossRef]
Almonacid, L.; Játiva, P.P.; Azurdia-Meza, C.A.; Dujovne, D.; Soto, I.; Firoozabadi, A.D.; Gutierrez Gaitan, M. On the Path Loss Performance of Underwater Visible Light Communication Schemes Evaluated in Several Water Environments. In Proceedings of the 2023 South American Conference on Visible Light Communications, SACVLC 2023, Santiago, Chile, 8–10 November 2023; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2023; pp. 12–16. [Google Scholar] [CrossRef]
Martínez, G.; Játiva, P.P.; Gutiérrez Gaitán, M.; Azurdia Meza, C.; Boettcher, N.; Zabala-Blanco, D. On the Performance of an Air-Water Visible Light Communication System. In Proceedings of the Advanced Research in Technologies, Information, Innovation and Sustainability ARTIIS 2024, Santiago de Chile, Chile, 21–23 October 2024; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2025; Volume 2349, pp. 380–394. [Google Scholar] [CrossRef]
Peng, Y.T.; Cosman, P.C. Underwater Image Restoration Based on Image Blurriness and Light. IEEE Trans. Image Process. 2017, 26, 1579–1594. [Google Scholar] [CrossRef]
Ancuti, C.; Ancuti, C.O.; Haber, T.; Bekaert, P. Enhancing Underwater Images and Videos by Fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2023; pp. 81–88. [Google Scholar]
Ancuti, C.O.; Ancuti, C.; De Vleeschouwer, C.; Bekaert, P. Color Balance and Fusion for Underwater Image Enhancement. IEEE Trans. Image Process. 2018, 27, 379–393. [Google Scholar] [CrossRef] [PubMed]
Li, C.Y.; Guo, C.L.; Ren, W.Q.; Cong, R.M.; Hou, J.H.; Kwong, S.; Tao, D.C. An Underwater Image Enhancement Benchmark Dataset and Beyond. IEEE Trans. Image Process. 2020, 29, 4376–4389. [Google Scholar] [CrossRef]
Wang, Y.D.; Guo, J.C.; Gao, H.; Yue, H.H. UIEC⌃2-Net: CNN-based underwater image enhancement using two color space. Signal Process. Image Commun. 2021, 96, 116250. [Google Scholar] [CrossRef]
Li, C.Y.; Anwar, S.; Hou, J.H.; Cong, R.M.; Guo, C.L.; Ren, W.Q. Underwater Image Enhancement via Medium Transmission-Guided Multi-Color Space Embedding. IEEE Trans. Image Process. 2021, 30, 4985–5000. [Google Scholar] [CrossRef]
Peng, L.T.; Zhu, C.L.; Bian, L.H. U-Shape Transformer for Underwater Image Enhancement. IEEE Trans. Image Process. 2023, 32, 3066–3079. [Google Scholar] [CrossRef]
Zhuang, J.B.; Zheng, Y.; Guo, B.L.; Yan, Y.Y. Globally Deformable Information Selection Transformer for Underwater Image Enhancement. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 19–32. [Google Scholar] [CrossRef]
Han, G.J.; Yu, S.; Zhu, H.B.; Zhu, Y.Y. UMCTN: Real-World Underwater Image Enhancement Based on Transformer with Multikernel Convolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4207715. [Google Scholar] [CrossRef]
Shang, J.S.; Li, Y.; Xing, H.; Yuan, J.Y. LGT: Luminance-guided transformer-based multi-feature fusion network for underwater image enhancement. Inf. Fusion 2025, 118, 102977. [Google Scholar] [CrossRef]
Tang, Y.; Kawasaki, H.; Iwaguchi, T. Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy. In Proceedings of the 31st ACM International Conference on Multimedia (MM), Ottawa, ON, Canada, 29 October–3 November 2023; ACM: New York, NY, USA, 2023; pp. 5419–5427. [Google Scholar] [CrossRef]
Zhu, Z.; Li, X.B.; Ma, Q.W.; Zhai, J.S.; Hu, H.F. FDNet: Fourier transform guided dual-channel underwater image enhancement diffusion network. Sci. China Technol. Sci. 2024, 68, 1100403. [Google Scholar] [CrossRef]
Song, J.Y.; Xu, H.Y.; Jiang, G.Y.; Yu, M.; Chen, Y.Y.; Luo, T.; Song, Y. Frequency domain-based latent diffusion model for underwater image enhancement. Pattern Recognit. 2025, 160, 111198. [Google Scholar] [CrossRef]
Zhao, C.; Cai, W.L.; Dong, C.Y.; Hu, C.W. Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 8281–8291. [Google Scholar] [CrossRef]
Bi, H.Y.; Chen, L.; Cao, J.C.; Wang, J.Y.; Sun, J.H.; Rao, Y.; Dong, J.Y. SeaDiff: Underwater Image Enhancement With Degradation-Aware Diffusion Model. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 12212–12226. [Google Scholar] [CrossRef]
Hu, W.; Chen, S.T.; Luo, T.Y.; Zhang, L.J.; Zhang, H.J.; Liu, Z.X.; Zhang, S.W.; Xu, J.X. DATDM: Dynamic attention transformer diffusion model for underwater image enhancement. Alex. Eng. J. 2025, 126, 591–604. [Google Scholar] [CrossRef]
Wang, H.; Koser, K.; Ren, P. Large Foundation Model Empowered Discriminative Underwater Image Enhancement. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5609317. [Google Scholar] [CrossRef]
Khan, A.; Junaid, A.; Siddique, M.F.; Iqbal, A.; Samkari, H.S.; Allehyani, M.F.; Husnain, G. Smart Predictive Maintenance: A TCN-Based System for Early Fault Detection in Industrial Machinery. Machines 2026, 14, 164. [Google Scholar] [CrossRef]
Zaman, W.; Siddique, M.F.; Kim, J.M. Centrifugal Pump Fault Detection with Hybrid Feature Pool and Deep Learning. In Proceedings of the 2023 20th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Bhurban, Murree, Pakistan, 22–25 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar] [CrossRef]
Siddique, M.F.; Zaman, W.; Khalid, M.; Hamdan, B.; Kim, J.M. A Multistage Transfer Learning Framework for Intelligent Fault Diagnosis of Rotating Machinery under Variable Operating Conditions. Sci. Rep. 2026. [Google Scholar] [CrossRef]
Ullah, S.; Siddique, M.F.; Kim, J.M. Multi-Sensor Observer-Based Residual Learning with Auto-Permutation Feature Importance for Fault Diagnosis of Multistage Centrifugal Pumps under Variable Pressures. Sci. Rep. 2025, 15, 45735. [Google Scholar] [CrossRef] [PubMed]
Zhang, H.P.; Xu, H.L.; Yu, X.S.; Zhang, X.Y.; Gao, X.J.; Wu, C.D. CDF-UIE: Leveraging Cross-Domain Fusion for Underwater Image Enhancement. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4203715. [Google Scholar] [CrossRef]
Chandrasekar, A.; Sreenivas, M.; Biswas, S.; Soc, I.C. PhISH-Net: Physics Inspired System for High Resolution Underwater Image Enhancement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1495–1505. [Google Scholar] [CrossRef]
Cong, R.M.; Yang, W.Y.; Zhang, W.; Li, C.Y.; Guo, C.L.; Huang, Q.M.; Kwong, S. PUGAN: Physical Model-Guided Underwater Image Enhancement Using GAN with Dual-Discriminators. IEEE Trans. Image Process. 2023, 32, 4472–4485. [Google Scholar] [CrossRef]
Liu, S.X.; Li, K.Q.; Ding, Y.L.; Qi, Q. Underwater image enhancement by diffusion model with customized CLIP-classifier. Pattern Recognit. 2026, 171, 112232. [Google Scholar] [CrossRef]
Cao, J.Z.; Zeng, Z.K.; Zhang, X.; Zhang, H.; Fan, C.L.; Jiang, G.Y.; Lin, W.S. Unveiling the underwater world: CLIP perception model-guided underwater image enhancement. Pattern Recognit. 2025, 162, 111395. [Google Scholar] [CrossRef]
Pham, T.T.; Mai, T.T.N.; Lee, C. Deep unfolding network with physics-based priors for underwater image Enhancement. In Proceedings of the 30th IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 46–50. [Google Scholar] [CrossRef]
Pham, T.T.; Yu, H.; Mai, T.T.N.; Lee, C. Physics-driven prior learning-based deep unrolling for underwater image enhancement. Eng. Appl. Artif. Intell. 2025, 162, 112472. [Google Scholar] [CrossRef]
Zhao, C.; Dong, C.Y.; Cai, W.L.; Wang, Y.Y. Learning a Physical-Aware Diffusion Model Based on Transformer for Underwater Image Enhancement. IEEE Trans. Geosci. Remote Sens. 2026, 64, 4202714. [Google Scholar] [CrossRef]
Chen, Y.B.; Liu, S.F.; Wang, X.L. Learning Continuous Image Representation with Local Implicit Image Function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 8624–8634. [Google Scholar] [CrossRef]
Chen, H.W.; Xu, Y.S.; Hong, M.F.; Tsai, Y.M.; Kuo, H.K.; Lee, C.Y. Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 18257–18267. [Google Scholar] [CrossRef]
Chen, X.; Pan, J.; Dong, J.X. Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 25627–25636.
Yang, S.Z.; Ding, M.X.; Wu, Y.M.; Li, Z.H.; Zhang, J. Implicit Neural Representation for Cooperative Low-light Image Enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 12872–12881.
Chen, L.F.; Xu, Z.H.; Wei, C.; Xu, Y.X. BDMUIE: Underwater image enhancement based on Bayesian diffusion model. Neurocomputing 2025, 620, 129274.
Fan, G.D.; Zhou, S.N.; Hua, Z.; Li, J.J.; Zhou, J.C. LLaVA-based semantic feature modulation diffusion model for underwater image enhancement. Inf. Fusion 2026, 126, 129274.
Peng, Y.T.; Cao, K.M.; Cosman, P.C. Generalization of the Dark Channel Prior for Single Image Restoration. IEEE Trans. Image Process. 2018, 27, 2856–2868.
Zheng, Y.; Zhan, J.H.; He, S.G.; Dong, J.Y.; Du, Y. Curricular Contrastive Regularization for Physics-aware Single Image Dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 5785–5794.
Chao, D.; Li, Z.M.; Zhu, W.B.; Li, H.B.; Zheng, B.; Zhang, Z.B.; Fu, W.J. AMSMC-UGAN: Adaptive Multi-Scale Multi-Color Space Underwater Image Enhancement with GAN-Physics Fusion. Mathematics 2024, 12, 1551.
Demir, O.; Aktas, M.; Eksioglu, E.M. Joint Optimization in Underwater Image Enhancement: A Training Framework Integrating Pixel-Level and Physical-Channel Techniques. IEEE Access 2025, 13, 22074–22085.
Chen, Y.P.; Dai, X.Y.; Liu, M.C.; Chen, D.D.; Yuan, L.; Liu, Z.C. Dynamic Convolution: Attention over Convolution Kernels. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2020; pp. 11027–11036.
Ma, H.P.; Huang, J.Y.; Shen, C.X.; Jiang, Z.H. Retinex-inspired underwater image enhancement with information entropy smoothing and non-uniform illumination priors. Pattern Recognit. 2025, 162, 111411.
Chen, Y.W.; Pei, S.C. Domain Adaptation for Underwater Image Enhancement via Content and Style Separation. IEEE Access 2022, 10, 90523–90534.
Cheng, Z.; Fan, G.D.; Zhou, J.C.; Gan, M.; Chen, C.L.P. FDCE-Net: Underwater Image Enhancement With Embedding Frequency and Dual Color Encoder. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 1728–1744.
Li, Y.Y.; Mi, Z.T.; Wang, Y.L.; Jiang, S.Y.; Fu, X.P. TAFormer: A Transmission-Aware Transformer for Underwater Image Enhancement. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 601–616.
Li, Z.H.; Chen, Q.C.; Miao, J.M. SPMFormer: Simplified Physical Model-based transformer with cross-space loss for underwater image enhancement. Knowl. Based Syst. 2025, 322, 113694.
Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 586–595.
Yang, M.; Sowmya, A. An Underwater Color Image Quality Evaluation Metric. IEEE Trans. Image Process. 2015, 24, 6062–6071.
Özdenizci, O.; Legenstein, R. Restoring Vision in Adverse Weather Conditions with Patch-Based Denoising Diffusion Models. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10346–10357.
Yi, X.P.; Xu, H.; Zhang, H.; Tang, L.F.; Ma, J.Y. Diff-Retinex: Rethinking Low-light Image Enhancement with A Generative Diffusion Model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 12268–12277.
Yang, Z.Y.; Liu, B.L.; Xiong, Y.P.; Yi, L.; Wu, G.B.; Tang, X.J.; Liu, Z.Q.; Zhou, J.J.; Zhang, X. DocDiff: Document Enhancement via Residual Diffusion Models. In Proceedings of the 31st ACM International Conference on Multimedia (MM), Ottawa, ON, Canada, 29 October–3 November 2023; ACM: New York, NY, USA, 2023; pp. 2795–2806.
Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 5718–5729.
Ma, Z.Y.; Oh, C. A Wavelet-based Dual-stream Network for Underwater Image Enhancement. In Proceedings of the 47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2769–2773.
Zhang, W.H.; Li, X.B.; Huan, Y.Z.; Xu, S.I.; Tan, J.W.; Hu, H.F. Underwater image enhancement via frequency and spatial domains fusion. Opt. Lasers Eng. 2025, 186, 108826.
Avrahami, O.; Lischinski, D.; Fried, O. Blended Diffusion for Text-driven Editing of Natural Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 18187–18197.

Figure 1. Efficiency–performance comparison across different benchmarks.

Figure 2. Illustration of the overall architecture of PG-TIE.

Figure 3. Visualization of the intermediate results of PGPG and INTR.

Figure 4. Illustration of the internal structure of the TPDT-Block.

Figure 5. Illustration of the multi-frequency modulation unit (MF-MU).

Figure 6. Visual results on challenge subsets including Test-C60, LSUI-Blur, LSUI-Mixed, and LSUI-Lowlight.

Figure 7. Qualitative comparison on challenging real underwater images from U45. PG-TIE produces more natural colors, higher local contrast, and fewer artifacts than competing methods.

Figure 8. The characteristic map of SAG-model output.

Figure 9. Illustration of the forward and reverse diffusion processes.

Table 1. Ablation study of different conditional fusion strategies in the INTR branch.

Conditional Fusion Strategy	SSIM↑	FID↓	UIQM↑	NIQE↓
Priors only $(m, G)$	0.8769	31.45	2.964	3.983
$\hat{I}$ only	0.8842	27.48	3.081	3.812
$\mathrm{Conv}_{1 \times 1}([I, \hat{I}])$	0.8887	25.16	3.138	3.701
Gated fusion	0.8894	24.88	3.152	3.682
Residual fusion $I + \hat{I}$	0.8905	24.32	3.165	3.654
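
The fusion variants compared in Table 1 can be sketched on toy tensors. This is a minimal illustration, not the authors' implementation: it assumes $I$ and $\hat{I}$ are feature maps of identical shape, uses placeholder weights, and assumes a sigmoid-gated convex combination for "gated fusion", a form the table itself does not specify. All function names are hypothetical.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """Pointwise (1x1) convolution. x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def concat_conv_fusion(i, i_hat, weight, bias):
    """Conv_{1x1}([I, I_hat]): concatenate along channels, then mix with a 1x1 convolution."""
    stacked = np.concatenate([i, i_hat], axis=0)
    return conv1x1(stacked, weight, bias)


def gated_fusion(i, i_hat, weight, bias):
    """ASSUMED form: a sigmoid gate predicted from [I, I_hat] blends the two inputs."""
    logits = conv1x1(np.concatenate([i, i_hat], axis=0), weight, bias)
    gate = 1.0 / (1.0 + np.exp(-logits))
    return gate * i + (1.0 - gate) * i_hat


def residual_fusion(i, i_hat):
    """Residual fusion I + I_hat: parameter-free element-wise sum."""
    return i + i_hat


# Toy inputs: 3 channels, 4x5 spatial grid (sizes chosen only for illustration).
rng = np.random.default_rng(0)
i = rng.random((3, 4, 5))
i_hat = rng.random((3, 4, 5))
weight = rng.standard_normal((3, 6)) * 0.1  # maps 6 concatenated channels to 3
bias = np.zeros(3)

for name, fused in [
    ("concat + 1x1 conv", concat_conv_fusion(i, i_hat, weight, bias)),
    ("gated", gated_fusion(i, i_hat, weight, bias)),
    ("residual", residual_fusion(i, i_hat)),
]:
    print(name, fused.shape)
```

Of the three, only the residual variant adds no learnable parameters, which is worth keeping in mind when reading the small differences between the last three rows of Table 1.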

Table 2. Quantitative comparison on UIEBD, LSUI, and U45 datasets. The symbols ↑ and ↓ indicate that a larger or smaller score is better, respectively.

	UIEBD				LSUI				U45
Methods	FID↓	LPIPS↓	PSNR↑	SSIM↑	FID↓	LPIPS↓	PSNR↑	SSIM↑	URanker↑	MUSIQ↑	UCIQE↑	UIQM↑	NIQE↓
UIEWD	85.12	0.3956	14.65	0.7265	98.49	0.3962	15.43	0.7802	0.549	47.294	0.583	2.458	4.638
UWCNN	94.44	0.3525	15.40	0.7749	100.5	0.3450	18.24	0.8465	0.576	45.703	0.567	2.379	6.317
UIEC²-Net	35.06	0.2033	20.14	0.8215	34.51	0.1432	20.86	0.8867	1.687	47.986	0.591	2.780	5.367
Water-Net	37.48	0.2116	19.35	0.8321	38.90	0.1678	19.73	0.8226	0.896	49.231	0.601	2.957	4.038
SC-Net	33.66	0.2497	20.41	0.8235	158.99	0.2830	22.63	0.9176	0.805	51.384	0.594	2.856	3.964
U-Color	38.25	0.2337	20.71	0.8411	45.06	0.1230	22.91	0.8902	2.675	47.385	0.586	3.104	4.237
U-Shape	46.11	0.2264	21.25	0.8453	28.56	0.1028	24.16	0.9322	1.893	48.563	0.592	3.151	3.895
DM-Water	31.07	0.1436	21.88	0.8194	27.91	0.1138	27.65	0.8867	1.875	51.862	0.634	3.086	4.328
WF-Diff	27.85	0.1248	23.86	0.8730	26.75	0.1096	27.26	0.9437	1.385	48.379	0.619	3.181	4.026
Ours	24.32	0.1015	24.14	0.8905	26.38	0.1007	27.93	0.9521	2.812	52.152	0.606	3.176	3.897

Table 3. Quantitative comparison on challenging LSUI subsets.

Subset	Metric	UIEWD	UWCNN	UIEC²-Net	Water-Net	SC-Net	U-Color	U-Shape	DM-Water	WF-Diff	Ours
LSUI-Blur	FID↓	105.4	101.2	42.15	45.60	165.2	52.10	35.40	32.50	28.10	25.40
	LPIPS↓	0.412	0.385	0.185	0.192	0.310	0.165	0.142	0.125	0.118	0.102
	PSNR↑	15.20	16.50	19.80	18.90	21.50	21.80	23.50	26.50	26.85	28.10
	SSIM↑	0.680	0.720	0.810	0.790	0.880	0.860	0.905	0.865	0.892	0.935
LSUI-Mixed	FID↓	98.50	95.40	38.50	40.20	158.5	48.50	32.10	32.10	30.50	26.80
	LPIPS↓	0.395	0.360	0.165	0.178	0.295	0.152	0.128	0.138	0.125	0.108
	PSNR↑	16.10	17.80	20.50	19.50	22.10	22.40	24.10	25.80	26.20	27.85
	SSIM↑	0.710	0.750	0.835	0.810	0.895	0.875	0.915	0.852	0.885	0.912
LSUI-Lowlight	FID↓	112.5	108.6	48.20	52.10	172.4	55.40	38.60	35.40	34.20	29.10
	LPIPS↓	0.435	0.405	0.210	0.225	0.345	0.185	0.156	0.152	0.145	0.115
	PSNR↑	14.50	15.20	18.50	17.80	20.20	20.80	22.40	24.50	25.10	26.95
	SSIM↑	0.650	0.690	0.780	0.760	0.850	0.840	0.890	0.841	0.862	0.905

Table 4. Performance on the Test-C60 challenge subset.

Subset	Metric	UIEWD	UWCNN	UIEC²-Net	Water-Net	SC-Net	U-Color	U-Shape	DM-Water	WF-Diff	Ours
Test-C60	FID↓	102.4	108.5	45.12	48.35	42.66	51.20	55.48	38.56	39.12	32.45
	LPIPS↓	0.4521	0.4102	0.2654	0.2845	0.3102	0.2956	0.2841	0.1856	0.1652	0.1354
	PSNR↑	13.25	14.10	17.56	16.85	18.20	18.54	19.12	20.45	21.15	23.08
	SSIM↑	0.6542	0.6854	0.7412	0.7256	0.7562	0.7623	0.7845	0.8012	0.8156	0.8624

Table 5. Temporal consistency comparison of different methods on consecutive underwater video sequences. PG-TIE (Ours) obtains the lowest value on all three metrics.

Methods	UIEWD	UWCNN	UIEC²-Net	Water-Net	SC-Net	U-Color	U-Shape	DM-Water	WF-Diff	Ours
tLPIPS↓	0.064	0.086	0.079	0.071	0.074	0.061	0.068	0.055	0.048	0.031
Warping Error↓	0.402	0.512	0.486	0.441	0.458	0.388	0.426	0.351	0.326	0.284
Flicker Index↓	7.480	9.840	9.150	8.270	8.630	7.210	7.950	6.420	5.860	4.720
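
The margins in Table 5 are easier to read as relative reductions against the strongest baseline, which is WF-Diff on all three metrics. The short sketch below recomputes them from the tabulated values; the helper name `relative_reduction` is illustrative only.

```python
# Values copied from Table 5 (lower is better for all three metrics).
wf_diff = {"tLPIPS": 0.048, "Warping Error": 0.326, "Flicker Index": 5.860}
ours = {"tLPIPS": 0.031, "Warping Error": 0.284, "Flicker Index": 4.720}


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


reductions = {name: relative_reduction(wf_diff[name], ours[name]) for name in ours}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower than WF-Diff")
```

With these numbers, the reduction is roughly 35% for tLPIPS, 13% for Warping Error, and 19% for Flicker Index.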

Table 6. Ablation results of PG-TIE, with components enabled cumulatively from the PGPG internal modules to the full framework.

Methods	M^C	G^C	GBB	INTR	PG-SA	CSG-FFN	PAPM	MF-MU	Memory Channel	SSIM↑	FID↓	UIQM↑	NIQE↓
1	×	×	×	×	×	×	×	×	×	0.7245	86.54	1.854	5.682
2	√	×	×	×	×	×	×	×	×	0.7482	79.12	1.982	5.415
3	√	√	×	×	×	×	×	×	×	0.7654	73.45	2.105	5.120
4	√	√	√	×	×	×	×	×	×	0.7812	68.30	2.186	4.954
5	√	√	√	√	×	×	×	×	×	0.8156	42.15	2.452	4.632
6	√	√	√	√	√	×	×	×	×	0.8345	36.82	2.584	4.415
7	√	√	√	√	√	√	×	×	×	0.8562	31.54	2.712	4.256
8	√	√	√	√	√	√	√	×	×	0.8685	28.12	3.054	3.982
9	√	√	√	√	√	√	√	√	×	0.8812	25.85	3.095	3.845
10	√	√	√	√	√	√	√	√	√	0.8905	24.32	3.165	3.654

Table 7. Impact of different sampling strategies and the number of steps on model performance.

Methods	Steps	URanker↑	MUSIQ↑	UCIQE↑	UIQM↑	NIQE↓
DDIM	20	2.645	50.12	0.632	3.150	3.980
DDIM	50	2.812	53.15	0.658	3.342	3.654
DDPM	50	1.250	35.40	0.420	1.850	7.540
DDPM	250	2.450	49.80	0.615	3.100	4.120
DDPM	1000	2.825	53.28	0.662	3.355	3.640
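
Table 7 describes a trade-off between sampling cost and quality. The sketch below compares DDIM at 50 steps with DDPM at 1000 steps using only the tabulated values. It assumes one network evaluation per sampling step, which this section does not state, and the names `table7` and `relative_gap` are hypothetical.

```python
# Values copied from Table 7, in the order of METRICS.
METRICS = ("URanker", "MUSIQ", "UCIQE", "UIQM", "NIQE")
table7 = {
    ("DDIM", 20): (2.645, 50.12, 0.632, 3.150, 3.980),
    ("DDIM", 50): (2.812, 53.15, 0.658, 3.342, 3.654),
    ("DDPM", 50): (1.250, 35.40, 0.420, 1.850, 7.540),
    ("DDPM", 250): (2.450, 49.80, 0.615, 3.100, 4.120),
    ("DDPM", 1000): (2.825, 53.28, 0.662, 3.355, 3.640),
}


def relative_gap(fast, slow):
    """Absolute difference of each metric, as a percentage of the slow sampler's score."""
    return {m: 100.0 * abs(f - s) / s for m, f, s in zip(METRICS, fast, slow)}


fast_key, slow_key = ("DDIM", 50), ("DDPM", 1000)
step_ratio = slow_key[1] / fast_key[1]  # how many times fewer steps the fast sampler uses
gaps = relative_gap(table7[fast_key], table7[slow_key])

print(f"{fast_key[0]}-{fast_key[1]} uses {step_ratio:.0f}x fewer steps than {slow_key[0]}-{slow_key[1]}")
for metric, gap in gaps.items():
    print(f"{metric}: {gap:.2f}% gap")
```

Under these values, DDIM at 50 steps needs 20 times fewer steps than DDPM at 1000 steps while every metric stays within 1% of the 1000-step result, whereas DDPM truncated to 50 steps degrades sharply on all five metrics.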

Table 8. Summary of the effect of each loss component on model performance.

Loss Function Components	FID↓	LPIPS↓	PSNR↑	SSIM↑
Base Loss	38.45	0.1754	22.15	0.8256
Base Loss + $L_{PG}$	27.12	0.1156	24.68	0.8812
Base Loss + $L_{Temporal}$	34.20	0.1542	22.95	0.8420
Base Loss + $L_{PG} + L_{Temporal}$	24.32	0.1015	25.14	0.8905
