Article

Contrastive Feature Disentanglement via Physical Priors for Underwater Image Enhancement

1 School of Computer Science and Technology, Xidian University, Xi’an 710071, China
2 Hangzhou Institute of Technology, Xidian University, Hangzhou 311200, China
3 School of Software and Microelectronics, Northwestern Polytechnical University, Xi’an 710071, China
4 Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(5), 759; https://doi.org/10.3390/rs17050759
Submission received: 16 December 2024 / Revised: 3 February 2025 / Accepted: 20 February 2025 / Published: 22 February 2025
(This article belongs to the Special Issue Ocean Remote Sensing Based on Radar, Sonar and Optical Techniques)

Abstract

Underwater image enhancement (UIE) serves as a fundamental preprocessing step in ocean remote sensing applications, encompassing marine life detection, archaeological surveying, and subsea resource exploration. However, UIE encounters substantial technical challenges due to the intricate physics of underwater light propagation and the inherent homogeneity of aquatic environments. Images captured underwater are significantly degraded through wavelength-dependent absorption and scattering processes, resulting in color distortion, contrast degradation, and illumination irregularities. To address these challenges, we propose a contrastive feature disentanglement network (CFD-Net) that systematically addresses underwater image degradation. Our framework employs a multi-stream decomposition architecture with three specialized decoders to disentangle the latent feature space into components associated with degradation and those representing high-quality features. We incorporate hierarchical contrastive learning mechanisms to establish clear relationships between standard and degraded feature spaces, emphasizing intra-layer similarity and inter-layer exclusivity. Through the synergistic utilization of internal feature consistency and cross-component distinctiveness, our framework achieves robust feature extraction without explicit supervision. Compared to existing methods, our approach achieves a 12% higher UIQM score on the EUVP dataset and outperforms other state-of-the-art techniques on various evaluation metrics such as UCIQE, MUSIQ, and NIQE, both quantitatively and qualitatively.

1. Introduction

Underwater optical imaging has evolved into an indispensable modality in contemporary ocean remote sensing, supporting applications such as marine biology research, coral reef monitoring, underwater archaeology, and autonomous underwater vehicle (AUV) navigation [1]. However, the quality of underwater optical images often degrades due to factors such as light scattering, absorption, and color distortion [2,3]. These degradation mechanisms manifest as compromised visibility, attenuated contrast, and distorted color fidelity, which impede critical tasks like object detection, recognition, and tracking [4,5].
To address these challenges, underwater image enhancement (UIE) techniques aim to restore image clarity and improve visibility. Recent studies [6] highlight the significance of UIE as a preprocessing step for enhancing the performance of vision-based systems in underwater environments. Despite significant progress, existing methods often fail to adequately address the inherent complexities of underwater imaging, particularly when applied to real-world scenarios with diverse environmental conditions.
Compared to in-air image enhancement tasks, such as image denoising [7], dehazing [8], and low-light enhancement [9,10], UIE presents distinctive challenges [2,11,12]. These challenges stem from the unique properties of the underwater environment, including the following: (1) light scattering and absorption: light scattering reduces image clarity, while absorption causes wavelength-dependent color distortions, leading to the predominance of blue and green hues in underwater images; (2) scene homogeneity: the uniformity of underwater scenes, with limited texture and natural contrast, makes it difficult for enhancement algorithms to distinguish between degraded and authentic features; (3) data scarcity: collecting large-scale paired datasets of degraded and clean underwater images is inherently challenging due to the dynamic and unpredictable nature of underwater environments. Most existing datasets rely on synthetic image pairs, which fail to fully capture real-world complexities, worsening the domain shift issue between synthetic and real images.
These inherent challenges have motivated the development of various UIE methodologies, which can be systematically categorized into three main approaches: traditional model-based [13,14,15], deep learning-based [7,8,9,10,16,17,18,19], and physics-guided learning-based methods [11,20,21,22].
Traditional model-based approaches utilize specific mathematical models to simulate the underwater imaging process, relying on prior knowledge for parameter estimation, such as Dark Channel Prior (DCP) [13] and Retinex [14,15]. However, these approaches are often constrained by their inability to adapt to the varied environmental conditions encountered in real-world underwater scenarios.
Deep learning-based methods have significantly advanced image restoration tasks [7,8,9,10]. Such data-driven approaches excel in learning latent features from large datasets and modeling the relationship between degraded and enhanced images. Recent UIE-specific methods [16,17,18,19] have demonstrated promising results, regardless of whether they use paired or unpaired training datasets. However, these methods encounter significant challenges: (1) the domain gap between synthetic and real-world underwater images limits generalization capabilities; (2) the scarcity of paired datasets makes pixel-to-pixel mapping difficult during training; and (3) the black-box nature of deep learning models reduces interpretability, making it harder to trust their outputs in critical applications.
To overcome the limitations associated with paired samples, semisupervised and unsupervised methods have been explored. Generative Adversarial Networks (GANs) [1] and contrastive learning (CL) [18,19] have demonstrated promise by enabling the use of unpaired datasets. However, these approaches often experience instability during training and require physical priors to facilitate cross-domain feature transfer.
Unlike traditional and deep learning-based approaches, physics-guided learning-based methods embed physical priors into neural network architectures, improving interpretability and enhancing generalization. These methods leverage physical information to guide network design [11,20,21] and establish optimal solution boundaries [22]. While promising, these approaches often lack mechanisms to fully leverage unpaired data or to handle the inherent variability of underwater environments.
In this paper, we propose a novel hybrid framework that combines supervised and unsupervised learning paradigms along with physical priors in a unified architecture to overcome the limitations of existing methods. The primary challenge in UIE arises from the difficulty in handling the variability of underwater environments and the lack of paired data for supervised training. While many existing methods either rely on synthetic datasets or do not effectively address real-world variability, our framework uniquely integrates both supervised and unsupervised learning, making it capable of effectively leveraging unpaired data from real-world underwater images.
Our CFD-Net introduces a multi-stream decomposition approach that encodes degraded underwater images into a latent embedding space, separating components, including clean background, transmission maps, and ambient light, into distinct branches. This decomposition helps preserve the physical characteristics of the scene, enabling better alignment with the physical degradation process commonly encountered in underwater environments. By maintaining the realism of each generative component, our method ensures the estimated components are accurate and more reliable for subsequent enhancement. Specifically, we incorporate hierarchical contrastive learning mechanisms within a physics-guided architecture, enabling effective cross-domain feature learning to address the domain gap between synthetic and real-world underwater images. This allows our model to not only learn from synthetic data but also adapt well to the challenging variations of real-world underwater environments. Furthermore, we propose a novel cross-space contrastive loss function, which enhances feature separability within individual spaces and exclusivity between distinct spaces, improving the quality of decomposition and leading to superior image enhancement results.
This is the first approach that employs contrastive learning and physical feature decomposition for UIE, offering a unique perspective on underwater image restoration. By effectively addressing both the challenges of unpaired data and the inherent variability of underwater imaging, our method outperforms existing state-of-the-art techniques in both quantitative and qualitative evaluations. The primary contributions of this work are summarized as follows:
  • We propose a novel unsupervised UIE method leveraging contrastive feature decomposition, offering a new perspective for underwater image enhancement.
  • We introduce a unique cross-space and content contrastive loss, facilitating the simultaneous exploration of intra-similarity within latent spaces and inter-exclusiveness between feature spaces.
  • Comprehensive experiments are conducted to evaluate the proposed method, demonstrating its outstanding performance both quantitatively and qualitatively.
The remainder of this paper is organized as follows: Section 2 discusses related work in UIE. Section 3 details the proposed methodology. Section 4 presents the experimental results and comparisons with state-of-the-art methods. Finally, Section 5 concludes the paper and summarizes key findings.

2. Related Work

2.1. Traditional Methods

Underwater image enhancement predominantly relies on image processing and computer vision techniques. These non-physical methods, distinct from physical or deep learning models, function at the pixel level to enhance visual quality, such as histogram equalization [23], white balancing [24], and Retinex [14]. Ancuti et al. [25] proposed a multi-scale fusion method that combines white balancing with traditional contrast histogram equalization, enhancing the quality of the original image. Hitam et al. [26] introduced a hybrid Contrast Limited Adaptive Histogram Equalization (CLAHE) technique, employing the Euclidean norm to merge outputs from RGB and HSV color models, thereby improving contrast and reducing noise. Additionally, Ren et al. [15] developed the Low-Rank Regularized Retinex Model (LR3M), which combines a low-rank prior with Retinex decomposition to reduce noise in reflection images. While these methods are computationally efficient and easy to implement, they rely heavily on handcrafted priors and often fail to address the physical properties of underwater environments, limiting their effectiveness in complex scenarios.

2.2. Physical Methods

Inspired by the image degradation process, many low-quality restoration methods have been developed [13,27,28,29,30], each reconstructing the degradation process by estimating key parameters like light and transmission. Drawing inspiration from the Dark Channel Prior (DCP) used in dehazing tasks [27], Drews et al. [28] introduced the Underwater DCP (UDCP), which consists of transmission estimation methods tailored for underwater scenarios, primarily focusing on the blue and green channels to account for the rapid attenuation of red light underwater. Similarly, Galdran et al. [29] introduced the Red Channel Prior, which acknowledges the rapid attenuation of red wavelengths in water, thus providing another perspective on underwater image enhancement. Extending this concept, Peng et al. [13] introduced a technique incorporating depth-dependent color transformations for estimating environmental light. This approach enables adaptive color correction within the image formation framework, enhancing image clarity. Lastly, Akkaynak et al. [30] modified the conventional imaging model to address the limitations of standard models designed for atmospheric conditions. Liang et al. [31] introduced the enhancement method CDCR to restore color, detail, and contrast of degraded images. Despite their interpretability and reliance on well-defined mathematical models, physical methods often fail to generalize across diverse underwater environments. They rely on accurate parameter estimation, which is challenging in real-world scenarios with varying light conditions, turbidity levels, and depth variations.

2.3. Learning-Based Methods

Deep learning approaches have significantly advanced underwater image enhancement by overcoming the limitations inherent in traditional methods. Early methods, primarily based on Convolutional Neural Networks (CNNs), concentrated on extracting low-level feature representations for tasks such as color correction and contrast restoration. For example, Wang et al. [32] developed UIE-Net, a CNN-based approach aiming at color correction and dehazing. Similarly, Li et al. [33] introduced UWCNN, which integrates underwater scene priors to synthesize underwater images and reconstruct enhanced outputs without requiring explicit physical parameter estimation.
However, these methods often require large-scale paired datasets of degraded and clean images, which are challenging to obtain in underwater scenarios due to environmental variability and data scarcity. To address this issue, semisupervised and unsupervised learning frameworks have gained prominence in UIE tasks [11,34,35,36]. Notably, Li et al. [34] introduced WaterGAN to generate color-corrected underwater images, while Li et al. [35] adopted the CycleGAN [36] to facilitate unpaired feature domain transfer between degraded and enhanced images, reducing reliance on paired datasets. Chen et al. [11] proposed a generator with embedded physical priors for synthesizing training data and improving real-world underwater image enhancement.
More recently, contrastive learning (CL) has shown significant promise for self-supervised feature representation learning [37,38,39,40], particularly in image enhancement and restoration tasks [18,41,42]. For example, Wu et al. [41] introduced a contrast regularization technique leveraging CL for dehazing tasks. Liu et al. [42] applied a twin adversarial contrastive learning framework for underwater image enhancement. Huang et al. [18] proposed the Semi-UIR model, which integrates unlabeled data using CL to improve underwater image restoration.
Despite these advancements, existing learning-based methods continue to face critical challenges: (1) generative models often fail to fully capture the relationship between factor-related feature distributions and the physical degradation process, leading to suboptimal image restoration; (2) current methods primarily focus on mapping degraded images to clean counterparts but lack mechanisms to enforce independence among decomposed components, such as backgrounds and transmission maps; (3) the domain gap between synthetic and real-world underwater images remains a major barrier to generalization.
To address these limitations, we propose a novel contrastive multi-stream decomposition architecture that leverages physical priors and contrastive learning strategies. By embedding physical knowledge into latent feature spaces and utilizing cross-space contrastive loss functions, our method enhances generalization, improves interpretability, and effectively bridges the domain gap. This approach uniquely integrates the decomposition of degraded components (e.g., clean background, transmission maps, and ambient light layers) with cross-domain feature learning, providing a more robust solution for underwater image enhancement.

3. Proposed Method

To systematically address the inherent challenges of underwater image enhancement, we propose a physics-guided multi-stream decomposition framework providing theoretical foundations. The proposed framework comprises three key components: (1) a physics-guided image formation model that establishes the theoretical foundation and mathematical formulations; (2) a multi-stream decomposition architecture that disentangles various degradation factors; and (3) a hierarchical contrastive learning mechanism that ensures the distinctiveness of features.
In this section, we first introduce the physics-guided image formation model (Section 3.1), then detail our network architecture (Section 3.2), and, finally, present the learning objectives (Section 3.3).

3.1. Physics-Guided Image Formation Model

The underwater image formation process is governed by wavelength-dependent optical principles that characterize light propagation in aquatic media. For each color channel $c \in \{R, G, B\}$, the formation model is:
$I_c(x) = J_c(x)\, t_c(x) + \big(1 - t_c(x)\big) A_c$ (1)
where $I_c(x)$ denotes the observed degraded image, $J_c(x)$ represents the scene radiance, $t_c(x)$ characterizes the transmission map, and $A_c$ indicates the ambient light. The transmission follows the Beer–Lambert law:
$t_c(x) = \exp\big(-\beta_c\, d(x)\big)$ (2)
where $\beta_c$ represents the wavelength-dependent attenuation coefficient for channel $c$ and $d(x)$ denotes the scene depth at position $x$. The attenuation coefficients satisfy $\beta_R > \beta_G > \beta_B$, which explains the characteristic blue-green color cast in underwater images.
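To make the formation model concrete, the following minimal NumPy sketch synthesizes a degraded image from clean radiance and depth using Equations (1) and (2); the attenuation coefficients and ambient light values are illustrative placeholders, not values from this work.

```python
import numpy as np

def synthesize_underwater(J, depth, beta=(0.40, 0.12, 0.08), A=(0.10, 0.55, 0.60)):
    """Per-channel formation model I_c = J_c * t_c + (1 - t_c) * A_c with
    Beer-Lambert transmission t_c = exp(-beta_c * d); beta and A are placeholders."""
    I = np.empty_like(J)
    for c in range(3):  # channel order R, G, B
        t_c = np.exp(-beta[c] * depth)                # transmission for channel c
        I[..., c] = J[..., c] * t_c + (1.0 - t_c) * A[c]
    return I

# A clean image and a depth ramp yield the typical blue-green cast,
# since the red channel (largest beta) attenuates fastest.
J = np.random.rand(240, 320, 3).astype(np.float32)                 # clean radiance in [0, 1]
depth = np.tile(np.linspace(1.0, 10.0, 320, dtype=np.float32), (240, 1))
I = synthesize_underwater(J, depth)
```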
To facilitate efficient network optimization while preserving physical consistency, we derive a simplified formulation based on two key observations from underwater optics. First, the transmission maps across RGB channels exhibit strong correlation due to the continuous nature of the attenuation spectrum:
$T(x) = \mathbb{E}_c[t_c(x)] = \frac{1}{3}\sum_{c \in \{R, G, B\}} t_c(x)$ (3)
where $T(x)$ represents the unified transmission map that captures the average attenuation effect across all channels. Second, the ambient light variation primarily affects intensity rather than spectral distribution:
$A = \mathbb{E}_c[A_c] + \Delta A$ (4)
where $A$ denotes the mean ambient light intensity and $\Delta A$ represents the channel-specific intensity variation. These physical insights lead to our simplified formation model:
$I = J \cdot T + (1 - T) \cdot A$ (5)
This physically grounded model naturally suggests a feature decomposition strategy where the degraded underwater image can be disentangled into three components: clean background $J$ representing features free from degradation, transmission map $T$ capturing depth-dependent attenuation, and ambient light $A$ encoding global illumination. Such decomposition provides clear guidance for contrastive feature learning—features from different components should be mutually exclusive in the latent space, while features within each component should maintain consistency across varying degradation levels.
Based on this formulation, we design a multi-stream architecture with three specialized decoders: the background decoder ($D_b$) estimates the clean image $J$, the transmission decoder ($D_t$) computes the transmission map $T$, and the ambient light decoder ($D_a$) predicts the ambient light $A$. This physically constrained decomposition framework not only ensures the physical validity of the enhanced results but also provides natural supervision for contrastive feature learning, which will be detailed in the following sections.

3.2. Multi-Stream Decomposition Architecture

Given an unpaired underwater image set $X_U = \{U_i\}_{i=1}^{N_1}$ and its corresponding high-quality reference set $X_H = \{H_i\}_{i=1}^{N_2}$, where $N_1$ and $N_2$ represent the sample sizes, we design a multi-stream architecture guided by the physical image formation model. The framework aims to learn disentangled representations that capture distinct physical properties: background features representing scene content, transmission features encoding depth-dependent attenuation, and ambient light features characterizing global illumination patterns. As illustrated in Figure 1, the framework consists of a shared encoder $E$ and three specialized decoders $\{D_b, D_t, D_a\}$.
The feature embedding network: The feature embedding network $E$ extracts multi-scale features guided by the physics of underwater image formation. Given an input image $U$, the hierarchical feature extraction process is formulated as:
$F = \{f_1, f_2, f_3, f_4\} = E(U)$ (6)
where $f_1$ (1/2 scale) preserves fine-grained details for background reconstruction, $\{f_2, f_3\}$ (1/4 and 1/8 scales) capture medium-range context for transmission estimation, and $f_4$ (1/16 scale) encodes global information for ambient light prediction.
Specialized decoders: the framework incorporates three physically motivated decoders, each designed to capture specific aspects of the underwater formation model:
(1) Background decoder $D_b$: following the multiplicative nature of background–transmission interaction in $J \cdot T$, the decoder utilizes features at all scales:
$B = D_b(F) = D_b(\{f_1, f_2, f_3, f_4\})$ (7)
(2) Transmission decoder $D_t$: guided by the spatial continuity of transmission map $T(x) = \exp(-\beta\, d(x))$, the decoder focuses on medium- and high-level features:
$T = D_t(F) = D_t(\{f_2, f_3, f_4\})$ (8)
(3) Ambient light decoder $D_a$: following the global nature of ambient light $A = \mathbb{E}_c[A_c]$, the decoder exclusively uses global semantic features:
$A = D_a(F) = D_a(\{f_4\})$ (9)
where $A$ represents the estimated ambient light, derived from the deepest feature representations.
The network components leverage proven architectures while incorporating task-specific modifications. The transmission decoder $D_t$ and ambient light decoder $D_a$ adopt U-Net-like structures [43] for effective parameter estimation. The background decoder $D_b$ employs a modified ResNet architecture in [44] enhanced with dual-attention mechanisms:
$D_b(f) = \mathrm{Conv}_{1 \times 1}\left(\sum_{i=1}^{4} \alpha_i \cdot \mathrm{ResBlock}_b(f_i)\right)$ (10)
where $\alpha_i$ represents learnable attention weights and $\mathrm{ResBlock}_b$ incorporates both spatial and channel attention for comprehensive feature refinement. This physically guided architecture ensures that each component maintains its specific role while enabling efficient feature extraction and enhancement.
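For readers who prefer code, the sketch below outlines the multi-stream decomposition in PyTorch: a shared four-stage encoder and three decoders that consume different subsets of scales, mirroring Equations (7)–(9). Channel widths, the fusion scheme, and layer choices are simplifying assumptions rather than the exact CFD-Net configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    """Produces features at 1/2, 1/4, 1/8 and 1/16 resolution (widths are placeholders)."""
    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        chans = (3,) + widths
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1), nn.ReLU())
            for i in range(4)
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [f1, f2, f3, f4]

class SimpleDecoder(nn.Module):
    """Upsamples and fuses a subset of encoder scales, then predicts an output map."""
    def __init__(self, in_chans, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_chans), 64, 3, padding=1)
        self.head = nn.Conv2d(64, out_ch, 3, padding=1)

    def forward(self, feats, size):
        up = [F.interpolate(f, size=size, mode="bilinear", align_corners=False) for f in feats]
        return torch.sigmoid(self.head(F.relu(self.fuse(torch.cat(up, dim=1)))))

class CFDNetSketch(nn.Module):
    """Shared encoder E with background (D_b), transmission (D_t) and ambient (D_a) decoders."""
    def __init__(self):
        super().__init__()
        self.encoder = SharedEncoder()
        self.dec_b = SimpleDecoder((32, 64, 128, 256), 3)  # D_b: all scales
        self.dec_t = SimpleDecoder((64, 128, 256), 1)      # D_t: f2..f4
        self.dec_a = SimpleDecoder((256,), 3)              # D_a: f4 only

    def forward(self, u):
        f1, f2, f3, f4 = self.encoder(u)
        size = u.shape[-2:]
        return (self.dec_b([f1, f2, f3, f4], size),
                self.dec_t([f2, f3, f4], size),
                self.dec_a([f4], size))

# B, T, A = CFDNetSketch()(torch.rand(1, 3, 256, 256))
```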

3.3. Training Objectives

The training objectives are designed to tackle underwater image enhancement challenges by incorporating physical priors derived from the image formation model (IFM). The IFM provides a mathematical foundation for decomposing an underwater image U into its constituent components: clean background B , transmission map T , and ambient light A , as formulated in Equation (5). The UIE task is thus formulated as an ill-posed inverse problem of recovering the clean background B from the degraded observation U while simultaneously estimating the transmission map T and ambient light A .
The total loss function combines reconstruction fidelity, physical consistency, and feature learning:
$\mathcal{L}_{\mathrm{total}} = \lambda_{CL} \mathcal{L}_{\mathrm{CL}} + \lambda_{IFM} \mathcal{L}_{\mathrm{IFM}} + \lambda_{adv} \mathcal{L}_{\mathrm{adv}}$ (11)
where $\lambda_{CL}$, $\lambda_{IFM}$, and $\lambda_{adv}$ are hyperparameters that balance the contributions of contrastive feature learning, physical consistency, and adversarial learning, respectively.

3.3.1. Hierarchical Contrastive Learning Function

To effectively disentangle underwater image components while preserving their physical relationships, we propose a hierarchical contrastive learning framework that operates at multiple semantic levels. This framework leverages self-supervised learning to optimize feature representations and guide the decomposition process through carefully designed contrastive objectives. The comprehensive contrastive learning objective integrates three complementary components:
$\mathcal{L}_{\mathrm{CL}} = \mathcal{L}_{\mathrm{spaceCon}}(B, T, A) + \mathcal{L}_{\mathrm{factorCon}}(T, T') + \mathcal{L}_{\mathrm{factorCon}}(A, A') + \mathcal{L}_{\mathrm{detailCon}}(B, U)$ (12)
where $\mathcal{L}_{\mathrm{spaceCon}}$ enforces inter-component distinctiveness by regulating relationships between background, transmission, and ambient light features, $\mathcal{L}_{\mathrm{factorCon}}$ maintains physical consistency by aligning estimated components with their reference counterparts, and $\mathcal{L}_{\mathrm{detailCon}}$ ensures structural fidelity by preserving fine-grained image details during enhancement.
The framework implements these specialized contrastive losses to jointly optimize three fundamental aspects of the decomposition. First, intra-component coherence requires features within each component space (background, transmission, ambient light) to maintain high correlation, ensuring consistent physical interpretations through positive sample pairs drawn from the same component space. Second, inter-component separation demands features from different physical components to remain well separated in the latent space, preventing information leakage and ensuring clean decomposition through negative sample pairs across component spaces. Third, structural preservation guarantees that the enhanced outputs faithfully preserve the essential structural information present in the original underwater image while maintaining physical validity through detail-preserving contrastive constraints.
This hierarchical design enables our framework to learn physically meaningful and well-separated representations for each component while maintaining the intrinsic relationships that characterize underwater image formation. The following sections detail the specific formulation and implementation of each contrastive objective.
Degraded factor contrastive learning: To systematically model degradation-related features, the framework implements a physics-guided simulation of the underwater degradation process using unpaired image pairs. Consider a reference underwater image $I_h^c$, where its physical degradation process is characterized by:
$I_h^c(x) = J_h^c(x)\, T_h^c(x) + \big(1 - T_h^c(x)\big) A_h^c$ (13)
where $J_h^c(x)$ represents the scene radiance, $T_h^c(x)$ denotes the transmission distribution, and $A_h^c$ characterizes the ambient illumination.
Through the introduction of a degradation coefficient $\alpha \in (0, 1)$, the degraded image formation $I_l^c$ is mathematically formulated as:
$I_l^c(x) = J_h^c(x)\big(\alpha T_h^c(x)\big) + \big(1 - \alpha T_h^c(x)\big) A_h^c = J_h^c(x)\, T_l^c(x) + \big(1 - T_l^c(x)\big) A_l^c$ (14)
As established in Equations (13) and (14), images $I_l^c(x)$ and $I_h^c(x)$ share an identical scene radiance $J_h^c(x)$. The degraded variant $I_l^c(x)$ exhibits enhanced attenuation characteristics due to the constraint $\alpha T_h^c(x) < T_h^c(x)$. Consequently, the estimated transmission map $T_p^c(x)$ and ambient light $A_p^c(x)$ should converge to their respective reference values $T_h^c(x)$ and $A_h^c(x)$.
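As a worked illustration of Equation (14), the snippet below re-degrades a reference image by shrinking its transmission with a random coefficient α; the sampling range for α is an assumption made here purely for illustration.

```python
import torch

def simulate_further_degradation(J_h, T_h, A_h, alpha=None):
    """Produce a more strongly attenuated view that shares the scene radiance J_h.
    J_h: radiance (N, 3, H, W); T_h: transmission (N, 1, H, W); A_h: ambient light (N, 3, 1, 1).
    The uniform sampling range for alpha is a placeholder choice."""
    if alpha is None:
        alpha = float(torch.empty(1).uniform_(0.3, 0.9))
    T_l = alpha * T_h                      # alpha * T_h < T_h, i.e., heavier attenuation
    I_l = J_h * T_l + (1.0 - T_l) * A_h    # Equation (14): same J_h, reduced transmission
    return I_l, T_l
```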
To achieve this objective, the framework defines contrastive objectives for the transmission map T and ambient light A estimation:
$P_t(T) = \mathcal{L}_{\mathrm{factorCon}}\big(T_h^c(x), T_p^c(x)\big)$ (15)
$P_a(A) = \mathcal{L}_{\mathrm{factorCon}}\big(A_h^c(x), A_p^c(x)\big)$ (16)
The framework implements a patch-based contrastive learning strategy where spatially corresponding patches within each decomposed component constitute positive pairs, while non-corresponding patches serve as negative samples. For a degraded input image $U$, the framework extracts feature representations $T$ and $A$, with their high-quality counterparts denoted as $T'$ and $A'$. Feature embeddings are computed through a specialized residual encoder $E_{res}$ and multi-layer perceptron $MLP$, yielding representations $v_A^i = MLP(E_{res}(P_A^i))$, $v_{A'}^i = MLP(E_{res}(P_{A'}^i))$, $v_T^i = MLP(E_{res}(P_T^i))$, and $v_{T'}^i = MLP(E_{res}(P_{T'}^i))$.
The factor-aware contrastive loss is formulated as:
$\mathcal{L}_{\mathrm{factorCon}} = -\sum_{i=1}^{N} \log \frac{\exp\left(v_A^i \cdot v_{A'}^i / \tau\right)}{\exp\left(v_A^i \cdot v_{A'}^i / \tau\right) + \sum_{j=1}^{N} \exp\left(v_A^j \cdot v_{A'}^i / \tau\right)} - \sum_{i=1}^{N} \log \frac{\exp\left(v_T^i \cdot v_{T'}^i / \tau\right)}{\exp\left(v_T^i \cdot v_{T'}^i / \tau\right) + \sum_{j=1}^{N} \exp\left(v_T^j \cdot v_{T'}^i / \tau\right)}$ (17)
where $N$ denotes the number of negative samples and $\tau$ represents the temperature coefficient. This formulation ensures that corresponding patches (e.g., $P_A^i$ and $P_T^i$) exhibit stronger feature correlations with their high-quality counterparts $P_{A'}^i$ and $P_{T'}^i$ compared to randomly sampled negative patches, thereby promoting consistent feature learning for transmission and ambient light estimation.
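A compact InfoNCE-style helper of the kind suggested by Equations (17) and (18) is sketched below; it assumes the patch embeddings have already been produced by $E_{res}$ and the MLP, and the L2 normalization is an implementation choice added here. The factor-aware loss is then the sum of one such term for the transmission embeddings and one for the ambient-light embeddings, and the same helper covers the detail-preservation term introduced next.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.77):
    """anchor, positive: (N, D) matched patch embeddings; negatives: (N, K, D).
    Returns the mean -log softmax score of each positive against its K negatives.
    L2 normalization is an implementation choice added in this sketch."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True) / tau       # (N, 1)
    neg_logits = torch.einsum("nd,nkd->nk", anchor, negatives) / tau      # (N, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1)                    # positive is class 0
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)

# L_factorCon ~ info_nce(v_T, v_T_prime, neg_T) + info_nce(v_A, v_A_prime, neg_A)
# L_detailCon ~ info_nce(v_U, v_B, neg_U)
```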
Detail preservation contrastive learning: The structural and perceptual consistency between the enhanced image $B$ and the observed image $U$ is systematically enforced through a detail-preserving contrastive mechanism. Spatially corresponding patches from $B$ and $U$ are designated as positive pairs to maintain local structural fidelity, while non-corresponding patches serve as negative samples. The feature extraction process utilizes the encoder $E_{res}$, yielding patch-level representations $v_U^i = MLP(E_{res}(P_U^i))$ and $v_B^i = MLP(E_{res}(P_B^i))$. The detail preservation contrastive loss is formulated as:
$\mathcal{L}_{\mathrm{detailCon}} = -\sum_{i=1}^{N} \log \frac{\exp\left(v_U^i \cdot v_B^i / \tau\right)}{\exp\left(v_U^i \cdot v_B^i / \tau\right) + \sum_{j=1}^{N} \exp\left(v_U^j \cdot v_B^i / \tau\right)}$ (18)
Inter-component contrastive learning: The framework integrates cross-space contrastive learning to articulate relationships between decomposed components, thereby facilitating more refined decomposition. The clean image B , transmission map T , and ambient light A exhibit distinct physical characteristics: T encodes scene depth and attenuation information, A captures global illumination patterns, and B preserves intrinsic scene structure and texture.
To model these inter-component relationships, we propose a cross-space contrastive loss that regulates the integration among components. The loss function is formulated as:
$\mathcal{L}_{\mathrm{spaceCon}} = -\frac{1}{N_X} \sum_{k=1}^{N_X} \sum_{i=1}^{N_X} \log \frac{\exp\left(f_X^i \cdot f_X^k / \tau\right)}{\sum_{j=1}^{N_Y} \exp\left(f_X^i \cdot f_Y^j / \tau\right)} - \frac{1}{N_Y} \sum_{m=1}^{N_Y} \sum_{j=1}^{N_Y} \log \frac{\exp\left(f_Y^j \cdot f_Y^m / \tau\right)}{\sum_{i=1}^{N_X} \exp\left(f_Y^j \cdot f_X^i / \tau\right)}$ (19)
where $X, Y \in \{B, T, A\}$ and $N_X$ and $N_Y$ denote the cardinalities of positive and negative sample sets, respectively. The feature vectors $f_X^i$ and $f_Y^i$ are extracted using the discriminator encoder $E_{dis}$, which captures component-specific characteristics. The temperature parameter $\tau$ modulates the sensitivity of the similarity metric, following the principles established in [37].
This cross-space contrastive mechanism maximizes intra-component feature similarity while promoting inter-component feature distinctiveness, thereby ensuring effective separation of the physical components in the learned feature space. The selection of positive and negative samples is guided by non-local patch similarity to ensure meaningful contrastive relationships.
The complete contrastive learning objective integrates component-specific constraints:
$\mathcal{L}_{CL} = \mathcal{L}_{\mathrm{factorCon}} + \mathcal{L}_{\mathrm{spaceCon}} + \mathcal{L}_{\mathrm{detailCon}}$ (20)
where each term enforces specific aspects of the physical decomposition model while maintaining feature distinctiveness through carefully selected positive and negative samples.
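The sketch below gives one simplified reading of the cross-space term: features within a component space are pulled together while similarities to another component space populate the denominator. The exact normalization of Equation (19) may differ; this is a schematic, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cross_space_term(f_x, f_y, tau=0.77):
    """Features of component X, shape (N_x, D), should be more similar to each other
    than to features of component Y, shape (N_y, D). Simplified -log ratio form."""
    f_x, f_y = F.normalize(f_x, dim=-1), F.normalize(f_y, dim=-1)
    intra = torch.exp(f_x @ f_x.t() / tau).sum(dim=1)   # similarity within X
    inter = torch.exp(f_x @ f_y.t() / tau).sum(dim=1)   # similarity toward Y (negatives)
    return -torch.log(intra / (intra + inter)).mean()

def space_con(f_b, f_t, f_a, tau=0.77):
    """Symmetric cross-space loss over the background, transmission and ambient spaces."""
    pairs = [(f_b, f_t), (f_b, f_a), (f_t, f_a)]
    return sum(cross_space_term(x, y, tau) + cross_space_term(y, x, tau) for x, y in pairs)
```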
Sampling strategy: The effectiveness of contrastive learning relies on clear distinctions between positive and negative samples while maintaining intra-class similarity. Unlike conventional CL approaches that employ data augmentation [37], we propose a physics-guided sampling strategy based on local patch similarity:
$\mathrm{Distance}\left(p^i, p_\Omega^i\right) = \left\| p^i - p_\Omega^i \right\|_2$ (21)
where $p^i$ denotes the query patch and $p_\Omega^i$ represents candidate patches within support set $\Omega$. Patches with the lowest distance values form positive pairs, while those with the highest values constitute negative pairs.
For detail preservation ($\mathcal{L}_{\mathrm{detailCon}}$), positive pairs are selected from spatially corresponding regions between enhanced background $B$ and input image $U$, which maintain consistent structural details. The factor consistency objective ($\mathcal{L}_{\mathrm{factorCon}}$) forms positive pairs between estimated components $T$, $A$ and their high-quality references $T'$, $A'$, considering depth-dependent attenuation for transmission and illumination consistency for ambient light. For component distinctiveness ($\mathcal{L}_{\mathrm{spaceCon}}$), positive pairs are selected within each component space, while negative pairs are sampled across different components to maximize feature distinctiveness.
This physics-guided sampling strategy ensures both reliable feature learning through intra-component consistency and effective decomposition through inter-component distinctiveness, while maintaining computational efficiency through local patch-based selection.
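A minimal version of this patch-ranking step, assuming patches have already been cropped and stacked upstream, might look as follows; the default counts simply reuse the sample sizes quoted in Section 4.1.

```python
import torch

def select_pairs(query, candidates, num_pos=8, num_neg=256):
    """Rank candidate patches by L2 distance to the query patch (Equation (21)):
    the closest patches become positives, the farthest become negatives.
    query: (1, C, h, w); candidates: (M, C, h, w), cropped upstream."""
    d = torch.cdist(query.flatten(1), candidates.flatten(1)).squeeze(0)  # (M,) distances
    order = torch.argsort(d)                                             # ascending
    positives = candidates[order[:num_pos]]
    negatives = candidates[order[-num_neg:]]
    return positives, negatives
```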

3.3.2. Information Formulation Supervised Function

The decomposition framework leverages physical priors to guide the separation of an underwater image U into its constituent components: background B , transmission map T , and ambient light A . These physics-based constraints ensure both the stability and the physical validity of the decomposition process. The estimated components T , A provide gradient supervision through an L1-based reconstruction loss:
$\mathcal{L}_{\mathrm{IFM}} = \left\| U - \left( B \cdot T + (1 - T) \cdot A \right) \right\|_1$ (22)
The L1 norm is specifically chosen over alternative metrics (e.g., L2) for its superior characteristics in preserving fine-grained details and robustness to outliers—properties crucial for high-fidelity image restoration. This constraint enforces the decomposed components to adhere to the physical degradation process defined by the IFM, guaranteeing physically consistent outputs from the decoders $(D_b, D_t, D_a)$.
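In code, this reconstruction constraint reduces to a single L1 term over the recomposed image, a direct transcription of Equation (22):

```python
import torch

def ifm_loss(U, B, T, A):
    """L1 self-reconstruction: the decomposed components must recompose the input
    through the simplified formation model U ~ B * T + (1 - T) * A."""
    recon = B * T + (1.0 - T) * A      # T broadcasts over the RGB channels
    return torch.mean(torch.abs(U - recon))
```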

3.3.3. Adversarial Learning Function

The adversarial learning framework optimizes decoder D b to synthesize clean background images B that accurately emulate the characteristics of natural underwater scenes. This is achieved through an adversarial discriminator Dis that learns to differentiate between authentic and synthesized backgrounds. The corresponding adversarial loss is formulated as:
$\mathcal{L}_{adv}(D_b, \mathrm{Dis}) = \mathbb{E}_{B \sim P_{bg}(B)}\left[\log \mathrm{Dis}(B)\right] + \mathbb{E}_{U \sim P_{uw}(U)}\left[\log\left(1 - \mathrm{Dis}\left(D_b(U)\right)\right)\right]$ (23)
where $P_{bg}(B)$ characterizes the probability distribution of authentic background images in the target domain and $P_{uw}(U)$ represents the distribution of underwater observations in the source domain. Through iterative optimization, the decoder $D_b$ learns to minimize this objective, thereby generating perceptually realistic and physically plausible background images.
For transmission and ambient light components, we extend this adversarial learning strategy with specialized formulations $\mathcal{L}_{adv}(D_t, \mathrm{Dis})$ and $\mathcal{L}_{adv}(D_a, \mathrm{Dis})$, respectively. Specifically, the transmission decoder $D_t$ employs local attention mechanisms to capture spatially varying degradation patterns, as the transmission map exhibits pixel-wise variations in projection rates. Conversely, the ambient light decoder $D_a$ adopts a global representation approach, reflecting the scene-level nature of ambient illumination. For initialization, the transmission map $T$ utilizes a second-order Laplacian filter to extract edge-based degradation cues, while the ambient light $A$ leverages the dark channel prior.
Hence, the whole adversarial learning framework is formulated as:
$\mathcal{L}_{adv} = \mathcal{L}_{adv}(D_b, \mathrm{Dis}) + \mathcal{L}_{adv}(D_t, \mathrm{Dis}) + \mathcal{L}_{adv}(D_a, \mathrm{Dis})$ (24)
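The adversarial terms follow the standard GAN recipe; a sketch for the background branch using the numerically stable logits form is given below. The discriminator architecture and the use of BCE-with-logits are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def d_loss(dis, real_bg, fake_bg):
    """Discriminator objective: real high-quality backgrounds vs. backgrounds
    decoded from underwater inputs (detached so only Dis is updated)."""
    real_logits = dis(real_bg)
    fake_logits = dis(fake_bg.detach())
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def g_loss(dis, fake_bg):
    """Generator (decoder D_b) objective: make decoded backgrounds look authentic."""
    fake_logits = dis(fake_bg)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

# Analogous terms for D_t and D_a complete L_adv as in Equation (24).
```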

4. Experiments

4.1. Implementation Details

Network structure: The feature extraction backbone for $\mathcal{L}_{\mathrm{spaceCon}}$ (the cross-space contrastive loss) is specifically engineered to optimize inter-component feature discrimination through a hierarchical representation learning scheme. The framework implements a classification-based architecture similar to PatchGAN [45], incorporating a discriminator–generator structure (illustrated in Figure 1b). This design enables the cross-space loss to effectively capture discriminative semantic features necessary for component classification.
For the $\mathcal{L}_{\mathrm{detailCon}}$ and $\mathcal{L}_{\mathrm{factorCon}}$ objectives, the feature extraction mechanism emphasizes local patch-level consistency between decomposed components and their corresponding observations. The framework employs ResNet [46] as the backbone feature extractor, followed by a two-layer multilayer perceptron (MLP) with 256 hidden units for feature embedding (detailed in Figure 1c). The sampling parameters are empirically set as $N = 256$ for negative samples, $N_X = 8$ for positive samples, and $N_Y = 256$ for cross-space negative samples.
Training details: Our method employs a three-phase training protocol to ensure stable convergence and accurate component decomposition: (1) background initialization: the background decoder is optimized using $\mathcal{L}_{\mathrm{detailCon}}$, $\mathcal{L}_{adv}$, and $\mathcal{L}_{\mathrm{IFM}}$ for 10 epochs while keeping other components fixed; (2) degradation component learning: the transmission and ambient light decoders are jointly trained for 20 epochs using $\mathcal{L}_{\mathrm{factorCon}}$ and auxiliary losses, with the background decoder parameters frozen; (3) joint optimization: all components are collectively refined for 170 epochs using $\mathcal{L}_{\mathrm{spaceCon}}$ and the complete loss function set, enabling comprehensive feature interaction. The optimization framework adopts the momentum contrast mechanism from MoCo [40], with momentum coefficient 0.99 and temperature parameter 0.77. The implementation utilizes the PyTorch 1.9.0 framework [47] on NVIDIA RTX 4090 GPUs. The Adam optimizer [48] is employed with an initial learning rate of $1 \times 10^{-4}$, which is decreased by a factor of 0.1 after 100 epochs. Training proceeds for 200 epochs with a mini-batch size of 4 and an input resolution of $256 \times 256$. The loss weights are configured as $\lambda_{adv} = 1$, $\lambda_{IFM} = 0.5$, and $\lambda_{CL} = 0.1$.
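The optimization schedule above translates to roughly the following PyTorch setup. The model and the individual loss terms are stand-ins so that the schedule itself runs; only the stated hyperparameters (learning rate, decay, epochs, batch size, loss weights) come from the text.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)            # placeholder for the full CFD-Net
lambda_adv, lambda_ifm, lambda_cl = 1.0, 0.5, 0.1

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)  # x0.1 after 100 epochs

for epoch in range(200):
    for _ in range(2):                            # placeholder for the unpaired data loader
        u = torch.rand(4, 3, 256, 256)            # mini-batch of 4 at 256x256
        out = model(u)
        l_ifm = (u - out).abs().mean()            # stand-ins for the three loss groups
        l_cl = out.new_zeros(())
        l_adv = out.new_zeros(())
        loss = lambda_cl * l_cl + lambda_ifm * l_ifm + lambda_adv * l_adv
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```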

4.2. Datasets

The experimental dataset comprises 2000 unpaired unlabeled samples and 2000 paired images with referenced clean labels, ensuring comprehensive coverage of diverse underwater scenarios. The unlabeled images are randomly sampled from the EUVP [1]. Additionally, 2000 labeled images are selected from the LSUI dataset [49], which ensures diversity in real underwater scenes, water types, lighting conditions, and target categories. Pseudo labels (i.e., estimated T′ and A′) are generated for training by processing all labeled and unlabeled high-quality underwater images using widely adopted underwater transmission and ambient light estimation methods [50]; the details can be seen on the right side of Figure 1a. The testing set consists of four commonly used datasets, UIEB [51], EUVP [1], RUIE [52], and SUIM [17], frequently employed in underwater image enhancement (UIE) research.

4.3. Qualitative and Quantitative Comparison

The performance evaluation of the proposed framework encompasses both quantitative metrics and qualitative visual evaluation, benchmarking against state-of-the-art underwater image enhancement methods. The comparative analysis includes two traditional methods (GDCP [13] and MMLE [53]) and five learning-based approaches (WaterNet [51], FUINE [1], CWR [19], SEMUIR [18], and HUPE [54]). For fair comparison, all learning-based methods are retrained using the proposed training dataset.
Visual quality assessment: The qualitative evaluation examines enhancement performance across four diverse datasets, encompassing various underwater scenes and degradation patterns. As illustrated in Figure 2, existing methods often exhibit inconsistent enhancement results, resulting in over-smoothing artifacts and color distortions at both global and local scales. For instance, WaterNet [51] introduces artifacts along high-frequency edges, while FUINE [1] and CWR [19] demonstrate limitations in color fidelity and detail preservation under challenging conditions.
The challenge of maintaining color fidelity while correcting hue distortions is particularly significant in unpaired learning frameworks. SEMUIR [18] addresses this through using estimated illumination maps as auxiliary inputs, showing improved color correction particularly for foreground objects (e.g., red and blue fish in the 11th and 12th rows of Figure 2). The proposed framework achieves superior color restoration by integrating physical underwater image formation principles into the network architecture, thereby resulting in more natural and vibrant color distributions compared to methods relying solely on unpaired training data.
Quantitative performance analysis: The quantitative evaluation employs four widely used non-reference metrics, each targeting different dimensions of underwater image degradation. UIQM [55] evaluates three key aspects, colorfulness, sharpness, and contrast, addressing common underwater image issues. UCIQE [56] quantifies color casts and low contrast, which are typical challenges in underwater imaging. MUSIQ [57] uses a multi-scale transformer architecture to assess structural fidelity and texture preservation across frequency bands. And NIQE [58] models natural scene statistics (NSSs) in the wavelet domain to detect artificial artifacts, with lower scores indicating higher naturalness. Although these metrics are heuristic and domain-specific, they cover important facets of underwater image quality: UIQM/UCIQE focus on the unique challenges of color distortions and turbidity, while MUSIQ/NIQE contribute to texture preservation and naturalness. As evidenced in Table 1, our method achieves optimal balance across these orthogonal dimensions.
Our CFD-Net consistently outperforms existing methods in UIQM and NIQE metrics across all datasets while maintaining competitive performance in UCIQE and MUSIQ. While GDCP [13] demonstrates superior performance in UCIQE and SEMUIR [18] achieves high scores in MUSIQ, our method exhibits balanced performance across all metrics, effectively improving both contrast enhancement and semantic feature preservation. This comprehensive enhancement capability is attributed to the synergistic integration of decomposed generative branches and multi-scale contrastive learning mechanisms.
Figure 3a–d present the channel-wise color distribution analysis of the compared methods across different datasets. The results demonstrate that the proposed method consistently achieves more balanced and natural color distributions, which is particularly evident in the red channel, where underwater degradation is most severe. Additionally, Figure 4 provides a statistical analysis of color characteristics across all methods on the EUVP dataset, utilizing mean, standard deviation, median, and color range metrics to comprehensively evaluate enhancement quality. Our method achieves balanced mean values ($\mu$ = 0.497, 0.642, 0.590) with stable standard deviations ($\sigma$ = 0.207, 0.195, 0.196) across the RGB channels, showing a notable 66.1% improvement in red channel mean intensity compared to the original images ($\mu$ = 0.299, 0.592, 0.555) while maintaining more consistent color distributions than competing methods such as GDCP ($\sigma$ = 0.296, 0.301, 0.292). These quantitative results validate the effectiveness of our method in addressing the characteristic color distortion in underwater imagery while preserving natural color appearances.
Comprehensive performance evaluation: To evaluate computational efficiency, we compare the FLOPs and inference time of various methods. All experiments are performed on an NVIDIA RTX 4090 GPU using 512 × 512 input images and a batch size of 1. Inference time is averaged over 100 forward passes to ensure consistency.
As shown in Table 1, our method achieves an inference time of 19.3 ms, making it the 2nd fastest among all methods. While FUINE achieves the fastest time (2.9 ms), it sacrifices quality, particularly in NIQE (5.106 vs. 3.922). Compared to CWR (20.8 ms) and WaterNet (40.6 ms), our model offers superior speed without compromising enhancement performance. Regarding computational cost, our method operates at 147.5 G FLOPs, significantly lower than WaterNet (571.8 G) and CWR (338.9 G). Although our approach has a higher FLOP count than FUINE (81.91 G) and HUPE (87.5 G), it strikes a favorable balance between efficiency and quality. This makes it particularly suitable for real-time underwater vision applications, especially in resource-constrained environments.

4.4. Ablation Study

To comprehensively evaluate the contributions of different components, we conduct ablation experiments on the UIEB dataset using four complementary quality metrics. We categorize the losses into three groups: (1) physics-consistency losses ($\mathcal{L}_{\mathrm{IFM}}$), (2) feature contrastive losses ($\mathcal{L}_{\mathrm{detailCon}}$, $\mathcal{L}_{\mathrm{spaceCon}}$, $\mathcal{L}_{\mathrm{factorCon}}$), and (3) adversarial loss ($\mathcal{L}_{adv}$). The relative performance drop, denoted as $\Delta$, is computed as follows:
$\Delta = \frac{M_{\mathrm{full}} - M_{\mathrm{ablated}}}{M_{\mathrm{full}}} \times 100\%$ (25)
where $M_{\mathrm{full}}$ denotes the metric value of the full model and $M_{\mathrm{ablated}}$ represents the value when specific components are removed.
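For clarity, the relative drop is just the following arithmetic; the numbers in the comment are hypothetical and only illustrate the formula.

```python
def relative_drop(m_full, m_ablated):
    """Relative performance drop (in percent) of an ablated model versus the full model."""
    return (m_full - m_ablated) / m_full * 100.0

# Hypothetical example: a full-model UIQM of 2.8 falling to 1.667 is a drop of about 40%.
print(round(relative_drop(2.8, 1.667), 1))  # 40.5
```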
As shown in Table 2, the physics-only baseline achieves moderate enhancement quality (UIQM = 1.667), confirming the necessity of physical modeling. However, removing contrastive learning drastically reduces performance (e.g., UIQM ↓ 40.3%), indicating that feature-level constraints are critical for enhancement quality. Among the contrastive losses, $\mathcal{L}_{\mathrm{detailCon}}$ contributes the most to texture enhancement, improving UIQM by 50.3% (1.667 → 2.506), $\mathcal{L}_{\mathrm{spaceCon}}$ mainly enhances color balance, increasing UCIQE by 7.2%, and $\mathcal{L}_{\mathrm{factorCon}}$ improves perceptual quality and denoising, reducing NIQE from 4.107 to 3.980. The adversarial loss alone ($\mathcal{L}_{adv}$) achieves reasonable enhancement, but its full effect emerges when combined with physical and contrastive constraints. Notably, the full model achieves the best balance between all metrics, demonstrating the complementary interactions between loss categories. These results confirm that feature contrastive learning is the most crucial factor for underwater image enhancement, while adversarial learning provides additional refinement, particularly in perceptual quality (MUSIQ ↑ 41.412 vs. 40.213).
Figure 5 illustrates the impact of individual components on visual results. Initially, the $\mathcal{L}_{\mathrm{IFM}}$ function restores the contour information but results in some color and contrast loss. The addition of $\mathcal{L}_{\mathrm{detailCon}}$ enhances fine-grained details, notably the printed text. When $\mathcal{L}_{\mathrm{spaceCon}}$ and $\mathcal{L}_{\mathrm{factorCon}}$ are introduced, color vibrancy improves, but some noise artifacts are also present. The adversarial loss ($\mathcal{L}_{adv}$) enhances color fidelity and texture preservation, especially in low-light scenarios. The integration of all components yields the best overall performance in terms of clarity, detail retention, and contrast.

4.5. Evaluation on Other Applications

To demonstrate the practical utility of the proposed framework, we evaluate its performance on downstream high-level computer vision tasks. Specifically, we assess the impact of image enhancement on underwater semantic segmentation performance using the SUIM dataset [17]. The evaluation compares the segmentation accuracy using original underwater images, images enhanced by competing methods, and those processed by our framework.
The segmentation results are illustrated in Figure 6, with quantitative evaluation metrics presented in Table 3. The proposed framework achieves superior performance across all evaluation metrics, attaining a PA of 0.830, mPA of 0.319, and mIoU of 0.284. These results represent significant improvements compared to both traditional methods (GDCP [13], MMLE [53]) and learning-based approaches (WaterNet [51], FUINE [1], CWR [19], SEMUIR [18], and HUPE [54]). The enhanced performance on semantic segmentation demonstrates the capability of our framework to recover discriminative features that benefit high-level computer vision tasks.

5. Conclusions

Our CFD-Net integrates decomposed feature representation with physical priors, which combines underwater image enhancement techniques with unsupervised learning algorithms, effectively addressing the challenges posed by degradation processes and refined feature representations. Through extensive experiments, we demonstrate that CFD-Net consistently outperforms state-of-the-art methods in both image enhancement and high-level downstream tasks, such as semantic segmentation. The integration of physical priors and innovative learning mechanisms allows our framework to recover vibrant, natural colors and significantly improve segmentation accuracy. Future work will explore the extension of this approach to multi-frame sequences and videos, where temporal information and multi-view perspectives could provide additional cues for enhanced underwater image restoration.

Author Contributions

Conceptualization, F.L.; Data curation, L.W. (Li Wan); Methodology, F.L.; Supervision, J.Z.; Validation, L.W. (Li Wan); Writing—original draft, F.L.; Writing—review and editing, L.W. (Lu Wang) and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was sponsored by the National Natural Science Foundation of China under Grant (62301397, 62402370), in part by the Aero Science Foundation of China under Grant 20240058081001 and the Fundamental Research Funds for the Central Universities under Grant XJSJ23034.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Islam, M.J.; Xia, Y.; Sattar, J. Fast underwater image enhancement for improved visual perception. IEEE Robot. Autom. Lett. 2020, 5, 3227–3234. [Google Scholar] [CrossRef]
  2. Song, Y.; Nakath, D.; She, M.; Köser, K. Optical imaging and image restoration techniques for deep ocean mapping: A comprehensive survey. PFG-Photogramm. Remote Sens. Geoinf. Sci. 2022, 90, 243–267. [Google Scholar] [CrossRef]
  3. Grimaldi, M.; Nakath, D.; She, M.; Köser, K. Investigation of the Challenges of Underwater-Visual-Monocular-SLAM. arXiv 2023, arXiv:2306.08738. [Google Scholar] [CrossRef]
  4. Long, H.; Shen, L.; Wang, Z.; Chen, J. Underwater Forward-Looking Sonar Images Target Detection via Speckle Reduction and Scene Prior. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  5. Cao, X.; Ren, L.; Sun, C. Dynamic target tracking control of autonomous underwater vehicle based on trajectory prediction. IEEE Trans. Cybern. 2022, 53, 1968–1981. [Google Scholar] [CrossRef]
  6. Wang, Y.; Guo, J.; He, W.; Gao, H.; Yue, H.; Zhang, Z.; Li, C. Is Underwater Image Enhancement All Object Detectors Need? IEEE J. Ocean. Eng. 2024, 49, 606–621. [Google Scholar] [CrossRef]
  7. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
  8. Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941. [Google Scholar] [CrossRef] [PubMed]
  9. Wu, Y.; Pan, C.; Wang, G.; Yang, Y.; Wei, J.; Li, C.; Shen, H.T. Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1662–1671. [Google Scholar]
  10. Zheng, S.; Gupta, G. Semantic-guided zero-shot learning for low-light image/video enhancement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 581–590. [Google Scholar]
  11. Chen, L.; Jiang, Z.; Tong, L.; Liu, Z.; Zhao, A.; Zhang, Q.; Dong, J.; Zhou, H. Perceptual underwater image enhancement with deep learning and physical priors. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 3078–3092. [Google Scholar] [CrossRef]
  12. Li, C.; Anwar, S.; Hou, J.; Cong, R.; Guo, C.; Ren, W. Underwater image enhancement via medium transmission-guided multi-color space embedding. IEEE Trans. Image Process. 2021, 30, 4985–5000. [Google Scholar] [CrossRef]
  13. Peng, Y.T.; Cao, K.; Cosman, P.C. Generalization of the dark channel prior for single image restoration. IEEE Trans. Image Process. 2018, 27, 2856–2868. [Google Scholar] [CrossRef] [PubMed]
  14. Rahman, Z.u.; Jobson, D.J.; Woodell, G.A. Multi-scale retinex for color image enhancement. In Proceedings of the 3rd IEEE International Conference on Image Processing, Lausanne, Switzerland, 16–19 September 1996; Volume 3, pp. 1003–1006. [Google Scholar]
  15. Ren, X.; Yang, W.; Cheng, W.H.; Liu, J. LR3M: Robust low-light enhancement via low-rank regularized retinex model. IEEE Trans. Image Process. 2020, 29, 5862–5876. [Google Scholar] [CrossRef]
  16. Fu, C.; Liu, R.; Fan, X.; Chen, P.; Fu, H.; Yuan, W.; Zhu, M.; Luo, Z. Rethinking general underwater object detection: Datasets, challenges, and solutions. Neurocomputing 2023, 517, 243–256. [Google Scholar] [CrossRef]
  17. Islam, M.J.; Edge, C.; Xiao, Y.; Luo, P.; Mehtaz, M.; Morse, C.; Enan, S.S.; Sattar, J. Semantic segmentation of underwater imagery: Dataset and benchmark. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 1769–1776. [Google Scholar]
  18. Huang, S.; Wang, K.; Liu, H.; Chen, J.; Li, Y. Contrastive semi-supervised learning for underwater image restoration via reliable bank. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18145–18155. [Google Scholar]
  19. Han, J.; Shoeiby, M.; Malthus, T.; Botha, E.; Anstee, J.; Anwar, S.; Wei, R.; Armin, M.A.; Li, H.; Petersson, L. Underwater image restoration via contrastive learning and a real-world dataset. Remote Sens. 2022, 14, 4297. [Google Scholar] [CrossRef]
  20. Liu, X.; Gao, Z.; Chen, B.M. IPMGAN: Integrating physical model and generative adversarial network for underwater image enhancement. Neurocomputing 2021, 453, 538–551. [Google Scholar] [CrossRef]
  21. Qi, H.; Dong, X. Physics-aware semi-supervised underwater image enhancement. arXiv 2023, arXiv:2307.11470. [Google Scholar]
  22. Zhou, Y.; Yan, K.; Li, X. Underwater image enhancement via physical-feedback adversarial transfer learning. IEEE J. Ocean. Eng. 2021, 47, 76–87. [Google Scholar] [CrossRef]
  23. Liu, Y.C.; Chan, W.H.; Chen, Y.Q. Automatic white balance for digital still camera. IEEE Trans. Consum. Electron. 1995, 41, 460–466. [Google Scholar]
  24. Pizer, S.M.; Johnston, R.E.; Ericksen, J.P.; Yankaskas, B.C.; Muller, K.E. Contrast-limited adaptive histogram equalization: Speed and effectiveness. In Proceedings of the First Conference on Visualization in Biomedical Computing, Atlanta, GA, USA, 22–25 May 1990; Volume 337, p. 2. [Google Scholar]
  25. Ancuti, C.; Ancuti, C.O.; Haber, T.; Bekaert, P. Enhancing underwater images and videos by fusion. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 81–88. [Google Scholar]
  26. Hitam, M.S.; Awalludin, E.A.; Yussof, W.N.J.H.W.; Bachok, Z. Mixture contrast limited adaptive histogram equalization for underwater image enhancement. In Proceedings of the 2013 International Conference on Computer Applications Technology (ICCAT), Sousse, Tunisia, 20–22 January 2013; pp. 1–5. [Google Scholar]
  27. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353. [Google Scholar] [PubMed]
  28. Drews, P.; Nascimento, E.; Moraes, F.; Botelho, S.; Campos, M. Transmission estimation in underwater single images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 825–830. [Google Scholar]
  29. Galdran, A.; Pardo, D.; Picón, A.; Alvarez-Gila, A. Automatic red-channel underwater image restoration. J. Vis. Commun. Image Represent. 2015, 26, 132–145. [Google Scholar] [CrossRef]
  30. Akkaynak, D.; Treibitz, T. Sea-thru: A method for removing water from underwater images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1682–1691. [Google Scholar]
  31. Liang, Z.; Zhang, W.; Ruan, R.; Zhuang, P.; Xie, X.; Li, C. Underwater image quality improvement via color, detail, and contrast restoration. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 1726–1742. [Google Scholar] [CrossRef]
  32. Wang, Y.; Zhang, J.; Cao, Y.; Wang, Z. A deep CNN method for underwater image enhancement. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 1382–1386. [Google Scholar]
  33. Li, C.; Anwar, S.; Porikli, F. Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 2020, 98, 107038. [Google Scholar] [CrossRef]
  34. Li, J.; Skinner, K.A.; Eustice, R.M.; Johnson-Roberson, M. WaterGAN: Unsupervised generative network to enable real-time color correction of monocular underwater images. IEEE Robot. Autom. Lett. 2017, 3, 387–394. [Google Scholar] [CrossRef]
  35. Li, C.; Guo, J.; Guo, C. Emerging from water: Underwater image color correction based on weakly supervised color transfer. IEEE Signal Process. Lett. 2018, 25, 323–327. [Google Scholar] [CrossRef]
  36. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  37. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  38. Chen, T.; Kornblith, S.; Swersky, K.; Norouzi, M.; Hinton, G.E. Big self-supervised models are strong semi-supervised learners. Adv. Neural Inf. Process. Syst. 2020, 33, 22243–22255. [Google Scholar]
  39. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297. [Google Scholar]
  40. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  41. Wu, H.; Qu, Y.; Lin, S.; Zhou, J.; Qiao, R.; Zhang, Z.; Xie, Y.; Ma, L. Contrastive learning for compact single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10551–10560. [Google Scholar]
  42. Liu, R.; Jiang, Z.; Yang, S.; Fan, X. Twin adversarial contrastive learning for underwater image enhancement and beyond. IEEE Trans. Image Process. 2022, 31, 4922–4936. [Google Scholar] [CrossRef] [PubMed]
  43. Yang, Y.; Wang, C.; Guo, X.; Tao, D. Robust Unpaired Image Dehazing via Density and Depth Decomposition. Int. J. Comput. Vis. 2024, 132, 1557–1577. [Google Scholar] [CrossRef]
  44. Chang, Y.; Guo, Y.; Ye, Y.; Yu, C.; Zhu, L.; Zhao, X.; Yan, L.; Tian, Y. Unsupervised Deraining: Where Asymmetric Contrastive Learning Meets Self-similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2638–2657. [Google Scholar] [CrossRef]
  45. Demir, U.; Unal, G. Patch-based image inpainting with generative adversarial networks. arXiv 2018, arXiv:1803.07422. [Google Scholar]
  46. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14; Springer: Cham, Switzerland, 2016; pp. 694–711. [Google Scholar]
  47. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  48. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  49. Peng, L.; Zhu, C.; Bian, L. U-shape transformer for underwater image enhancement. IEEE Trans. Image Process. 2023, 32, 3066–3079. [Google Scholar] [CrossRef] [PubMed]
  50. Ehsan, S.M.; Imran, M.; Ullah, A.; Elbasi, E. A single image dehazing technique using the dual transmission maps strategy and gradient-domain guided image filtering. IEEE Access 2021, 9, 89055–89063. [Google Scholar] [CrossRef]
  51. Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S.; Tao, D. An underwater image enhancement benchmark dataset and beyond. IEEE Trans. Image Process. 2019, 29, 4376–4389. [Google Scholar] [CrossRef] [PubMed]
  52. Liu, R.; Fan, X.; Zhu, M.; Hou, M.; Luo, Z. Real-world underwater enhancement: Challenges, benchmarks, and solutions under natural light. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4861–4875. [Google Scholar] [CrossRef]
  53. Zhang, W.; Zhuang, P.; Sun, H.H.; Li, G.; Kwong, S.; Li, C. Underwater image enhancement via minimal color loss and locally adaptive contrast enhancement. IEEE Trans. Image Process. 2022, 31, 3997–4010. [Google Scholar] [CrossRef]
  54. Zhang, Z.; Jiang, Z.; Ma, L.; Liu, J.; Fan, X.; Liu, R. HUPE: Heuristic Underwater Perceptual Enhancement with Semantic Collaborative Learning. Int. J. Comput. Vis. 2025, 1–19. [Google Scholar] [CrossRef]
  55. Panetta, K.; Gao, C.; Agaian, S. Human-visual-system-inspired underwater image quality measures. IEEE J. Ocean. Eng. 2015, 41, 541–551. [Google Scholar] [CrossRef]
  56. Yang, M.; Sowmya, A. An underwater color image quality evaluation metric. IEEE Trans. Image Process. 2015, 24, 6062–6071. [Google Scholar] [CrossRef]
57. Ke, J.; Wang, Q.; Wang, Y.; Milanfar, P.; Yang, F. MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 5148–5157. [Google Scholar]
  58. Mittal, A.; Soundararajan, R.; Bovik, A. Making a “Completely Blind” Image Quality Analyzer. IEEE Signal Process. Lett. 2013, 20, 209–212. [Google Scholar] [CrossRef]
Figure 1. The proposed CFD-Net framework. (a) An overview of our framework consisting of three parallel decoders: the transmission decoder D_t for estimating T, the background decoder D_b for reconstructing B, and the ambient light decoder D_a for predicting A. (b) The cross-space contrastive loss L_space^Con for feature alignment and (c) the content contrastive loss L_detail^Con and the factor contrastive loss L_factor^Con for preserving details.
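To make the roles of the three decoders in Figure 1 concrete, the sketch below shows how the disentangled factors could be recombined under the simplified underwater image formation model I = B·T + A·(1 − T), the standard formulation that physics-based decompositions of this kind build on. This is an illustrative PyTorch snippet with placeholder tensor shapes, not the released implementation; whether the L_IFM term uses exactly this composition is an assumption here.

```python
import torch

def compose_underwater(background: torch.Tensor,
                       transmission: torch.Tensor,
                       ambient: torch.Tensor) -> torch.Tensor:
    """Recombine disentangled factors with the simplified image formation
    model I = B * T + A * (1 - T).

    background:   (N, 3, H, W) clean scene radiance B predicted by D_b
    transmission: (N, 1, H, W) transmission map T predicted by D_t
    ambient:      (N, 3, 1, 1) global ambient light A predicted by D_a
    """
    return background * transmission + ambient * (1.0 - transmission)

# Toy usage with random tensors standing in for decoder outputs.
b = torch.rand(1, 3, 256, 256)   # background / enhanced estimate
t = torch.rand(1, 1, 256, 256)   # per-pixel transmission in [0, 1]
a = torch.rand(1, 3, 1, 1)       # global ambient (veiling) light
i_reconstructed = compose_underwater(b, t, a)
```

Comparing such a re-composed image against the degraded input is the usual way a physics-based reconstruction constraint is enforced in this family of methods.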
Figure 2. Visual comparison of underwater image enhancement results across multiple datasets. From top to bottom: RUIE (blue-tinted underwater scenes), SUIM (shallow underwater images), EUVP (varied visibility conditions), and UIEB (diverse underwater environments). Note the improved visibility of underwater object details and the more natural color rendition in our results. (a) Original. (b) GDCP [13]. (c) MMLE [53]. (d) WaterNet [51]. (e) FUnIE [1]. (f) CWR [19]. (g) SEMUIR [18]. (h) HUPE [54]. (i) CFD-Net.
Figure 3. Channel-wise color distribution comparison of different underwater image enhancement methods on multiple datasets. Density curves illustrate distribution of pixel values in red (left), green (middle), and blue (right) channels. Different line styles and colors represent different enhancement methods.
Figure 4. Statistical analysis of color characteristics for different enhancement methods. Heatmaps display (a) mean values, (b) standard deviation, (c) median values, and (d) color range for each RGB channel. Values are normalized to range [0, 1].
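Figures 3 and 4 are produced by straightforward per-channel color statistics. As a minimal NumPy sketch (function names and the 64-bin histogram setting are illustrative choices, not the analysis script used for the figures), the density curves and the mean/standard deviation/median/range heatmaps could be computed as follows:

```python
import numpy as np

def channel_statistics(image: np.ndarray) -> dict:
    """Per-channel statistics of an RGB image (H, W, 3) with values in [0, 255].

    Returns mean, standard deviation, median, and range for each channel,
    normalized to [0, 1] as in the Figure 4 heatmaps.
    """
    img = image.astype(np.float64) / 255.0
    stats = {}
    for idx, name in enumerate(("red", "green", "blue")):
        ch = img[..., idx].ravel()
        stats[name] = {
            "mean": float(ch.mean()),
            "std": float(ch.std()),
            "median": float(np.median(ch)),
            "range": float(ch.max() - ch.min()),
        }
    return stats

def channel_density(image: np.ndarray, bins: int = 64):
    """Normalized per-channel histograms, i.e. the density curves of Figure 3."""
    img = image.astype(np.float64) / 255.0
    return [np.histogram(img[..., c].ravel(), bins=bins,
                         range=(0.0, 1.0), density=True) for c in range(3)]
```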
Figure 5. Visual comparison of ablation study results. From left to right: the input image, the ablation results with the IFM term (L_IFM), with detail preservation (L_detail^Con), with the cross-space constraint (L_space^Con), with the factor constraint (L_factor^Con), with the adversarial loss (L_adv), and with the full model.
Figure 6. Evaluation of underwater semantic segmentation on SUIM dataset. The 1st and 2nd columns denote the input underwater images and their corresponding label masks. The remaining columns show the segmentation results of different methods.
Table 1. Comprehensive evaluation of the compared methods (quality metrics and computational efficiency) in terms of UIQM [55], UCIQE [56], MUSIQ [57], and NIQE [58] across the RUIE [52], SUIM [17], EUVP [1], and UIEB [51] datasets. The ↑ indicates higher values are better, while ↓ indicates lower values are better. Best results are in bold and second-best results are underlined.
| Method | UIQM ↑ (RUIE / SUIM / EUVP / UIEB) | UCIQE ↑ (RUIE / SUIM / EUVP / UIEB) | MUSIQ ↑ (RUIE / SUIM / EUVP / UIEB) | NIQE ↓ (RUIE / SUIM / EUVP / UIEB) | FLOPs (G) | Time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| Original | 2.294 / 2.027 / 2.254 / 1.677 | 0.523 / 0.597 / 0.543 / 0.533 | 33.874 / 60.855 / 43.678 / 41.697 | 5.062 / 3.957 / 4.976 / 6.905 | - | - |
| GDCP [13] | 2.738 / 1.970 / 2.337 / 1.899 | 0.608 / 0.678 / 0.638 / 0.624 | 32.671 / 58.897 / 40.886 / 51.002 | 4.738 / 3.885 / 4.697 / 5.190 | - | 173.8 |
| MMLE [53] | 2.871 / 2.137 / 2.337 / 1.953 | 0.567 / 0.612 / 0.638 / 0.580 | 36.510 / 62.704 / 40.886 / 40.345 | 4.859 / 4.045 / 4.697 / 4.845 | - | 91.7 |
| WaterNet [51] | 3.168 / 2.644 / 3.077 / 2.317 | 0.568 / 0.607 / 0.597 / 0.581 | 30.289 / 60.280 / 42.402 / 40.006 | 4.755 / 4.007 / 4.420 / 5.702 | 571.8 | 40.6 |
| FUnIE [1] | 3.145 / 2.557 / 2.944 / 2.867 | 0.538 / 0.610 / 0.572 / 0.552 | 28.726 / 60.158 / 38.110 / 46.827 | 4.906 / 3.611 / 5.106 / 5.299 | 81.9 | 12.9 |
| CWR [19] | 3.154 / 2.847 / 3.008 / 2.459 | 0.583 / 0.637 / 0.618 / 0.607 | 25.310 / 58.915 / 37.854 / 30.131 | 4.730 / 4.058 / 4.744 / 5.338 | 338.9 | 20.8 |
| SEMUIR [18] | 3.063 / 2.502 / 2.957 / 2.164 | 0.554 / 0.636 / 0.599 / 0.570 | 32.446 / 62.272 / 47.882 / 42.460 | 4.633 / 3.439 / 4.352 / 5.697 | 105.6 | 43.4 |
| HUPE [54] | 3.000 / 2.481 / 2.779 / 2.198 | 0.550 / 0.637 / 0.602 / 0.582 | 29.766 / 54.57 / 38.817 / 35.559 | 4.598 / 4.148 / 5.136 / 7.771 | 87.5 | 50.2 |
| Ours | 3.227 / 3.117 / 3.116 / 2.793 | 0.558 / 0.620 / 0.600 / 0.569 | 34.398 / 61.348 / 47.175 / 39.412 | 3.922 / 2.863 / 3.639 / 3.938 | 147.5 | 19.3 |
FLOPs: floating-point operations; Time: inference latency measured on an RTX 4090 GPU; "-": non-deep-learning methods.
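A note on the efficiency columns: the table does not spell out the timing protocol, so the following PyTorch sketch only illustrates how per-image GPU latency is commonly measured (warm-up iterations, CUDA synchronization, averaged wall-clock time). The input resolution, warm-up count, and run count below are placeholder assumptions rather than the settings used for Table 1.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model: torch.nn.Module,
                    input_size=(1, 3, 256, 256),
                    warmup: int = 10,
                    runs: int = 100) -> float:
    """Average forward-pass latency in milliseconds."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)

    for _ in range(warmup):            # warm up kernels and caches
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()       # wait for all queued kernels to finish
    return (time.perf_counter() - start) / runs * 1000.0
```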
Table 2. Component-wise ablation analysis on the UIEB dataset. The ↑ indicates higher values are better, while ↓ indicates lower values are better. Best results are in bold.
| Category | Components | UIQM ↑ | ΔUIQM | UCIQE ↑ | ΔUCIQE | MUSIQ ↑ | ΔMUSIQ | NIQE ↓ | ΔNIQE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Physical only | L_IFM | 1.667 | −40.3% | 0.572 | −1.2% | 39.476 | −4.7% | 4.107 | −9.9% |
| Feature contrastive | + L_detail^Con | 2.506 | −10.3% | 0.522 | −8.3% | 38.912 | −6.0% | 4.246 | −13.6% |
| | + L_space^Con | 2.301 | −17.6% | 0.551 | −3.2% | 39.102 | −5.6% | 4.102 | −9.7% |
| | + L_factor^Con | 2.410 | −13.7% | 0.563 | −1.1% | 39.875 | −3.7% | 3.980 | −6.5% |
| Adversarial | + L_adv | 2.155 | −22.8% | 0.548 | −3.7% | 40.213 | −2.9% | 3.890 | −4.1% |
| Full | All | 2.793 | - | 0.569 | - | 41.412 | - | 3.738 | - |

Δ: relative drop versus the full model.
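The Δ columns are plain relative changes against the full model; for example, the UIQM drop of the physics-only variant is (1.667 − 2.793)/2.793 ≈ −40.3%. For NIQE, where lower is better, the reported signs suggest the convention that a negative Δ always denotes a degradation. A tiny illustrative helper (not part of the paper's code) reproduces the higher-is-better entries:

```python
def relative_change(ablated: float, full: float) -> float:
    """Percentage change of an ablated variant versus the full model."""
    return (ablated - full) / full * 100.0

# UIQM, physics-only (L_IFM) variant: (1.667 - 2.793) / 2.793 -> -40.3%
print(f"{relative_change(1.667, 2.793):+.1f}%")
# MUSIQ, same variant: (39.476 - 41.412) / 41.412 -> -4.7%
print(f"{relative_change(39.476, 41.412):+.1f}%")
```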
Table 3. Pixel Accuracy (PA), Mean Pixel Accuracy (mPA), and Mean Intersection over Union (mIoU) comparison on SUIM dataset. The ↑ indicates higher values are better. Best results are in bold and second-best results are underlined.
| Metric | GDCP | MMLE | WaterNet | FUnIE | CWR | SEMUIR | HUPE | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PA ↑ | 0.812 | 0.805 | 0.655 | 0.806 | 0.800 | 0.816 | 0.826 | 0.830 |
| mPA ↑ | 0.301 | 0.304 | 0.261 | 0.297 | 0.300 | 0.304 | 0.312 | 0.319 |
| mIoU ↑ | 0.267 | 0.268 | 0.200 | 0.264 | 0.268 | 0.273 | 0.284 | 0.284 |
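PA, mPA, and mIoU in Table 3 are standard segmentation metrics derived from a class confusion matrix. The NumPy sketch below shows one common way to compute them; the eight-class setting follows the SUIM benchmark, while the function names and the absence of ignore-label handling are assumptions rather than the exact evaluation script used here.

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """num_classes x num_classes matrix; rows = ground truth, cols = prediction."""
    mask = (gt >= 0) & (gt < num_classes)
    return np.bincount(num_classes * gt[mask].astype(int) + pred[mask].astype(int),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(cm: np.ndarray):
    """Pixel Accuracy, mean Pixel Accuracy, and mean IoU from a confusion matrix."""
    eps = 1e-12
    pa = np.diag(cm).sum() / (cm.sum() + eps)
    per_class_acc = np.diag(cm) / (cm.sum(axis=1) + eps)
    mpa = per_class_acc.mean()
    iou = np.diag(cm) / (cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm) + eps)
    miou = iou.mean()
    return pa, mpa, miou

# Toy usage with random label maps for 8 classes (as in the SUIM benchmark).
rng = np.random.default_rng(0)
gt = rng.integers(0, 8, size=(240, 320))
pred = rng.integers(0, 8, size=(240, 320))
print(segmentation_metrics(confusion_matrix(pred, gt, num_classes=8)))
```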
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

