Article

Semi-Supervised Underwater Image Enhancement Method Using Multimodal Features and Dynamic Quality Repository

1 South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences/Key Laboratory for Sustainable Utilization of Open-Sea Fishery, Ministry of Agriculture and Rural Affairs, Guangzhou 510300, China
2 College of Engineering Science and Technology, Shanghai Ocean University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(6), 1195; https://doi.org/10.3390/jmse13061195
Submission received: 27 May 2025 / Revised: 15 June 2025 / Accepted: 17 June 2025 / Published: 19 June 2025

Abstract

Clear underwater images are crucial for smart aquaculture, so degraded underwater images must be restored. Although underwater image restoration techniques have achieved remarkable results in recent years, the scarcity of labeled data poses a significant challenge to continued advancement. Semi-supervised learning is well suited to exploiting unlabeled data. In this study, we proposed a semi-supervised underwater image enhancement method, MCR-UIE, which utilized multimodal contrastive learning and a dynamic quality reliability repository to leverage unlabeled data during training. The approach applied multimodal feature contrast regularization to prevent overfitting to incorrect labels, and introduced a dynamic quality reliability repository that retained the teacher's best outputs as pseudo ground truth, thereby improving the robustness and generalization of the model in pseudo-label generation and unlabeled data learning. Extensive experiments conducted on the UIEB and LSUI datasets demonstrated that the proposed method consistently outperformed existing traditional and deep learning-based approaches in both quantitative and qualitative evaluations. Furthermore, its successful application to images captured from deep-sea cage aquaculture environments validated its practical value. These results indicated that MCR-UIE held strong potential for real-world deployment in intelligent monitoring and visual perception tasks in complex underwater scenarios.

1. Introduction

Driven by the expansion of marine resource exploitation and the advancement of oceanographic research, underwater images have progressively gained importance in fields such as marine surveillance, underwater archeology, and aquaculture. Nevertheless, the unique optical properties of underwater environments, such as light absorption, scattering, and refraction, often lead to severe image degradation, including color distortion, low contrast, and detail blurring. These degradations significantly impair both the visual perception and the effectiveness of subsequent image analysis. Underwater image enhancement seeks to address these challenges by improving visual quality, thereby facilitating human interpretation and automated processing [1,2]. Through enhancement processing, the sharpness, contrast, and color fidelity of images can be increased, which plays a vital role in tasks encompassing underwater target detection, recognition, and tracking.
Researchers have explored various enhancement techniques, each with its own advantages and limitations. Currently, underwater image enhancement approaches are typically classified into three main categories: physical model-based approaches [3,4], non-physical model-based approaches [5,6], and deep learning-based approaches [7,8,9].
Physical model-based: These methods describe the degradation process of underwater images through mathematical deduction, which mainly involves establishing an underwater optical imaging model, estimating the model parameters, and restoring the underwater image by inverting the model. The physical parameters are obtained in two ways: extraction with polarization imaging equipment or derivation from prior knowledge. However, because the parameter estimation process can hardly account for all water environments and shooting conditions, these algorithms lack generality.
Non-physical model-based: These approaches directly adjust the pixel values of the entire image by constructing functions that modify color and contrast, thereby enhancing subjective visual perception. They are relatively simple, have low computational complexity, and are easy to implement and apply. However, because they ignore the physical imaging process and optical properties, they adapt poorly to changing underwater environments and scenes; noise and color deviation are easily introduced during processing, and problems such as vignetting and oversaturation can degrade image quality and clarity.
Underwater image enhancement methods grounded in physical models must be evaluated and tuned for specific conditions in practical applications, which limits their robustness. Approaches grounded in non-physical models likewise suffer from poor robustness, limited processing effects, and the introduction of artifacts and noise, and are best regarded as auxiliary methods. With the continuous development of image processing technology, deep learning-based approaches have gradually addressed some of the shortcomings of traditional methods.
Despite their respective advantages, both physical and non-physical methods suffer from generalization issues when applied to real-world underwater environments. Recent advances in computer vision have spurred the development of deep learning-based methods [10,11,12], which leverage powerful neural architectures, including convolutional neural networks (CNNs), generative adversarial networks (GANs), and autoencoders, to learn mappings from degraded to enhanced images. These approaches have shown promising results, especially when trained on large annotated datasets. However, their performance is highly dependent on the availability of labeled data, which is costly and labor-intensive to obtain in underwater domains.
In contrast, while unlabeled underwater images are relatively easy to collect, the primary challenge lies in how to utilize them effectively to train robust and generalizable models [13]. Existing supervised underwater image enhancement (UIE) methods heavily depend on paired data or high-quality reference images, which are difficult to obtain in real-world underwater environments. Moreover, conventional semi-supervised learning methods, although promising, often fall short in underwater applications due to their reliance on static pseudo-labels and heuristic confidence mechanisms that cannot adapt to sample-wise quality variations or unknown degradations.
To address these limitations, we introduce a semi-supervised learning framework specifically tailored for underwater image enhancement, aiming to improve the model’s generalization to diverse and unseen real-world underwater scenarios. Our method is built upon the mean teacher paradigm [14,15], which leverages an exponential moving average (EMA) of the student model to form the teacher network. The teacher provides pseudo-labels for the unlabeled data, and a consistency loss is used to guide the student’s training, enabling the model to benefit from both labeled and unlabeled samples.
However, directly applying the mean teacher method to underwater image enhancement poses several critical challenges. (1) The teacher model, especially in the early training stages, is not guaranteed to outperform the student, leading to unreliable pseudo-labels that can misguide the student and hinder convergence. (2) The use of a conventional pixel-wise consistency loss (typically L1 loss) can be overly strict, causing the model to overfit noisy pseudo-labels and suffer from confirmation bias. These issues highlight a scientific gap: understanding how to integrate pseudo-label selection with image quality awareness in a dynamically evolving underwater learning scenario.
To this end, we propose a novel dynamic quality reliability repository (DQR), which continuously tracks and stores high-quality outputs from the teacher model using an NR-IQA metric (MUSIQ). This allows the student to be guided by only the most reliable pseudo-labels, effectively filtering out noisy supervision and stabilizing the semi-supervised process. Furthermore, to alleviate overfitting and enforce a more flexible learning objective, we introduce a multimodal contrastive loss that leverages complementary modality cues, such as VGG features, edge information, color distributions, and local texture regions, to provide gradient-level supervision. This auxiliary contrastive regularization improves representation robustness and is especially beneficial when working with unlabeled, degraded underwater images.
Taken together, our proposed approach directly addresses the shortcomings of previous semi-supervised UIE methods by integrating dynamic reliability assessment and multimodal regularization. It offers a principled strategy to fully exploit large-scale unlabeled underwater data, filling an important methodological gap and improving model generalization across a wide range of underwater environments.
The primary contributions of this work are summarized as follows: (1) We proposed MCR-UIE, a semi-supervised underwater image enhancement framework that leveraged multimodal loss and a dynamic quality reliability repository to effectively utilize unlabeled data, thereby improving the generalization capability of the trained model on real-world images. (2) To ensure the reliability of the pseudo-labels, we constructed a dynamic quality reliability repository that archived the best outputs produced by the teacher model. (3) We adopted multimodal contrastive loss as a regularization technique to alleviate confirmation bias during training. (4) Extensive experimental results demonstrated the effectiveness and robustness of the proposed approach.
The remainder of this manuscript is organized as follows: Section 2 reviews the related work. In Section 3, we introduce the proposed semi-supervised underwater image enhancement method, which incorporates multimodal contrastive loss and the dynamic quality reliability repository. Section 4 presents the enhanced experimental and analytical results. Finally, the conclusions are summarized in Section 5.

2. Related Works

2.1. Underwater Image Enhancement Methods

Traditional underwater image enhancement approaches are generally categorized into physical and non-physical model-based methods. Physical model-based approaches [3,4] aim to describe the image degradation process by estimating unknown parameters in underwater imaging models. These parameters typically include the transmission map and ambient light, which are derived using handcrafted priors and assumptions based on optical principles. In contrast, non-physical model-based approaches directly improve image quality by adjusting pixel intensities or contrast through algorithmic design. Typical techniques include CLAHE [6], Retinex [5], Fusion [16], and MMLE [17]. Although these traditional methods have achieved reasonable performance in relatively simple underwater scenes, they often struggle to handle complex and dynamic real-world environments. Their limitations become evident when facing the varying lighting conditions, turbidity levels, and color distortions that are common in practical underwater applications.
Early deep learning-based underwater image enhancement approaches [18,19] commonly relied on physical imaging models. These methods typically trained neural networks to estimate parameters such as transmission maps and ambient light. However, the challenge of accurately estimating these parameters often led to suboptimal restoration performance, particularly in complex underwater environments.
To overcome these limitations, recent research has shifted towards purely data-driven approaches that dispense with explicit imaging models. These approaches aim to learn a direct mapping from degraded to enhanced images using supervised learning on paired datasets. For example, some frameworks employ feature fusion strategies that integrate outputs from multiple traditional enhancement methods to guide the restoration process [20]. Others incorporate prior-inspired modules, such as spatial encoders and transmission-guided decoders, to refine structural and color representations [21]. Additionally, GAN-based architectures have also been introduced to achieve efficient, real-time image enhancement [22].

2.2. Semi-Supervised Approaches

In recent times, semi-supervised learning has become an effective strategy in computer vision by enabling the joint utilization of labeled and unlabeled data. Several representative approaches have been proposed, including mean teacher [14], virtual adversarial learning [23], and Fixmatch [24]. Among these approaches, the mean teacher method, grounded in consistency regularization, has shown remarkable effectiveness in image classification tasks. Its effectiveness has subsequently inspired its application in other areas such as semantic segmentation [25] and image restoration [26].
Despite the increasing popularity of semi-supervised learning in various vision-related tasks, its potential in underwater image restoration remains largely unexplored. A preliminary study [27] attempted to apply a semi-supervised strategy by jointly optimizing supervised and unsupervised losses within a single network. Building upon this idea, our work proposes a more systematic framework that incorporates several key components tailored for underwater scenarios. Specifically, we adopt the mean teacher mechanism and further enhance it with a dynamic quality reliability repository to filter pseudo-labels, as well as a multimodal contrastive loss that promotes better feature representation and mitigates confirmation bias.

2.3. Contrastive Learning

Contrastive learning represents a powerful paradigm in self-supervised representation learning [28]. It facilitates the acquisition of meaningful visual features by enforcing similarity between semantically related samples while pushing apart dissimilar ones. In the domain of image restoration, previous studies primarily focus on constructing contrastive pairs and designing appropriate feature projections. For instance, some approaches [29,30] consider clean images as positive samples and degraded ones as negatives, projecting them into a learned embedding space using networks such as VGG [31]. However, these implementations typically rely on paired ground truth and apply the contrastive loss in a supervised manner, limiting their applicability to unlabeled data.
To date, contrastive learning has seen limited use in underwater image restoration. A prior work [32] incorporated contrastive loss as a regularization term to improve performance within a supervised learning framework, but it still depended on labeled data. In contrast, this study presents a systematic approach to utilizing unlabeled data through multimodal contrastive learning. By designing contrastive objectives that leverage information from multiple modalities, our method enables the network to learn more robust features without requiring ground truth, thereby enhancing its generalization to complex real-world underwater scenes.

3. Methods

3.1. The Network Structure of MCR-UIE

Semi-supervised learning is intended to leverage both labeled and unlabeled data to improve model generalization and learning efficiency. In the context of underwater image restoration, we formally define the problem as follows: Let the labeled dataset be denoted as $D_L = \{(x_i^l, y_i^l) \mid x_i^l \in I_l^{LQ},\, y_i^l \in I_l^{HQ}\}_{i=1}^{N}$, where $x_i^l$ and $y_i^l$ represent the degraded underwater image and its corresponding clean ground truth, respectively, sampled from the low-quality set $I_l^{LQ}$ and the high-quality set $I_l^{HQ}$. Similarly, the unlabeled dataset is defined as $D_U = \{x_i^u \mid x_i^u \in I_u^{LQ}\}_{i=1}^{M}$, where each $x_i^u$ is an underwater image from the unlabeled degraded set $I_u^{LQ}$. It is important to note that the labeled and unlabeled images are disjoint, i.e., $D_L \cap D_U = \emptyset$. The overall objective is to learn a restoration mapping function over the combined dataset $D = D_L \cup D_U$, such that any degraded underwater image $x$ can be effectively transformed into its clean version $y$.
Our semi-supervised learning framework adopts the standard architecture commonly used in semi-supervised settings [14,24], as presented in Figure 1. Specifically, the proposed MCR-UIE consists of two networks with identical architectures, referred to as the teacher and student networks. The key distinction between them lies in their parameter update strategy: the student network is trained via gradient descent, while the teacher network is updated using the exponential moving average of the student's weights during training.
The teacher network's parameters, denoted as $\theta_t$, are refined using the EMA of the student network's parameters $\theta_s$, following the update rule:
$\theta_t = \lambda \theta_t + (1 - \lambda)\theta_s$
where $\lambda \in (0, 1)$ is a momentum coefficient that controls the update speed. This strategy enables the teacher model to accumulate knowledge from the student network over time, effectively aggregating its parameters after each training step. As highlighted in [33], such temporal weight averaging not only helps to stabilize the training process but also leads to better generalization performance compared to standard gradient descent.
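As a concrete illustration, the EMA update takes only a few lines of PyTorch. The following is a minimal sketch under the assumption of a generic `nn.Module` enhancement network; it is not the authors' released code, and the momentum value of 0.999 is an illustrative default.

```python
import copy
import torch
import torch.nn as nn

def make_teacher(student: nn.Module) -> nn.Module:
    """The teacher starts as a frozen copy of the student."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def update_teacher(teacher: nn.Module, student: nn.Module, momentum: float = 0.999):
    """EMA update after each training step: theta_t <- m * theta_t + (1 - m) * theta_s."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.copy_(b_s)  # copy BatchNorm running statistics and other buffers
```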
The student network’s parameters are refined via gradient descent. Typically, the optimization objective of the student model is defined by minimizing the following loss function:
$L_{total} = L_{sup} + \lambda L_{unsup}$ (2)
where $L_{sup} = \sum_{i=0}^{N} \left\| f_{\theta_s}(x_i^l) - y_i^l \right\|_1$ denotes the supervised loss, and $L_{unsup} = \sum_{i=0}^{M} \left\| f_{\theta_s}(\phi_s(x_i^u)) - f_{\theta_t}(\phi_t(x_i^u)) \right\|_1$ represents the unsupervised student–teacher consistency loss. Here, $\|\cdot\|_1$ denotes the L1 distance, and $\phi_s$ and $\phi_t$ refer to the data augmentation functions applied to the student and teacher inputs, respectively.
In principle, as the teacher network generally yields superior performance, the unsupervised loss $L_{unsup}$ provides effective guidance for training the student model on unlabeled samples. Accordingly, the teacher's output $\hat{y}_i^u = f_{\theta_t}(\phi_t(x_i^u))$ is referred to as a pseudo-label. However, it is important to note that the teacher's predictions are not always more accurate than those of the student. Inaccurate pseudo-labels may introduce noise and negatively impact the learning process of the student network.
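For reference, the baseline mean-teacher objective in Equation (2) can be sketched as follows; the network, augmentation functions, and weighting are placeholders rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def mean_teacher_step(student, teacher, x_l, y_l, x_u, aug_weak, aug_strong, lam=0.2):
    """L_total = L_sup + lambda * L_unsup, both measured with the L1 distance."""
    # Supervised term on labeled pairs.
    l_sup = F.l1_loss(student(x_l), y_l)
    # Consistency term: the teacher's output on a weakly augmented view
    # acts as the pseudo-label for the student's strongly augmented view.
    with torch.no_grad():
        pseudo_label = teacher(aug_weak(x_u))
    l_unsup = F.l1_loss(student(aug_strong(x_u)), pseudo_label)
    return l_sup + lam * l_unsup
```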

3.2. Dynamic Quality and Reliability Repository

To tackle the aforementioned problem, we utilize the most confident outputs from the teacher network as pseudo-labels. Similar approaches have been employed in image classification and semantic segmentation tasks [25]; output reliability is typically assessed based on prediction entropy or confidence scores. However, directly extending these approaches to image restoration tasks is non-trivial due to several unique challenges. Specifically, as a regression problem, underwater image restoration requires the accurate recovery of fine textures and the effective removal of color casts, which cannot be reliably evaluated using classification-based confidence metrics.
To mitigate the issue of unreliable pseudo-labels, we design a reliable repository that dynamically stores the most trustworthy outputs generated by the teacher network during training. Initially, the repository $B_U$ is empty. At each iteration, we evaluate the current output of the teacher model by comparing it with both the corresponding output of the student model and the existing pseudo-label in the repository. If the teacher's current prediction exhibits better quality, it replaces the previous one in $B_U$. As training proceeds, the repository gradually accumulates a set of reliable pseudo-labels, denoted as $B_U = \{y_i^b\}_{i=1}^{M}$. In this way, we construct the updated pseudo-labeled dataset $D = D_U \cup B_U = \{x_i^u, y_i^b\}_{i=1}^{M}$, where each unlabeled image is associated with its most reliable pseudo-label. This mechanism ensures that the unsupervised consistency loss $L_{unsup}$ is computed using high-quality targets, thereby reducing the adverse effects of noisy supervision. The revised loss function can be formulated as follows:
$L_{unsup} = \sum_{i=0}^{M} \left\| f_{\theta_s}(\phi_s(x_i^u)) - y_i^b \right\|_1$
Intuitively, one might consider using non-reference image quality assessment (NR-IQA) metrics. However, as highlighted in [3,20], widely used metrics such as UCIQE [34] and UIQM [35] do not reliably capture the quality of restored underwater images. Consequently, relying on these metrics to construct our dynamic quality and reliability repository could lead to suboptimal results. To address this issue, we perform an empirical evaluation of multiple NR-IQA metrics to identify the most suitable one for assessing the quality of underwater images. We observe that the deep learning-based MUSIQ [36] metric best aligns with the monotonicity law. To justify the use of MUSIQ as the reliability criterion in our dynamic quality repository, we conduct a comparative study on seven commonly used NR-IQA methods over the EUVP benchmark, which covers a wide range of underwater scenarios. As shown in Figure 2, the evaluation highlights that deep learning-based metrics, particularly MUSIQ, exhibit better monotonicity and alignment with visual quality as compared to traditional handcrafted metrics such as BRISQUE and NIQE. MUSIQ consistently provides more stable and perceptually meaningful scores across varying underwater degradations. Based on this observation, we adopt MUSIQ to estimate the reliability of the network outputs, guiding the update of pseudo-labels in our dynamic quality reliability repository.
The construction steps of the dynamic quality reliability repository are detailed in Algorithm 1, with corresponding explanations provided for each step. (1) Obtain the predictions of the teacher and the student: we compute the predictions of the teacher and student models for unlabeled samples; the teacher's predictions generate candidate pseudo-labels, while the student's predictions are used for comparative judgment. (2) Segment local areas: we divide each prediction into multiple small blocks so that image quality is evaluated more finely, avoiding cases where a poor local region skews the overall score. (3) Calculate local quality scores and entropy metrics: we compute the NR-IQA score of each local area, use an entropy metric to measure prediction uncertainty, and then combine the local scores into a weighted global score; since smaller entropy indicates a more confident prediction, the entropy term is given a negative weight. (4) Update the reliable sample repository: if the teacher model's quality score is higher than both the student's score and that of the existing pseudo-label in the repository, the pseudo-label is considered more reliable and replaces the stored entry.
We adopt an online update mechanism for the dynamic quality reliability repository, where the repository is dynamically refreshed during each training iteration. For every unlabeled input image, we generate enhancement results from both the teacher and student branches and assess their quality using a composite reliability score that integrates the MUSIQ score and entropy-based confidence. A new prediction from the teacher branch is allowed to replace an existing sample in the repository only when it achieves a higher quality score than both the corresponding student output and the current repository entry. This selective replacement ensures that only more reliable and higher quality pseudo-labels are preserved, allowing the DQR to evolve towards greater consistency and trustworthiness over time.
To ensure stable convergence in this feedback-based design, we apply a progressive warm-up strategy to the consistency regularization coefficient ρ , which gradually increases its influence throughout training. In the early stages, when pseudo-labels may still be noisy, this scheduling helps prevent unstable updates. As the network matures, fewer replacements occur, and the repository contents stabilize, thereby promoting convergence. This interplay between dynamic updating and progressive regularization contributes to the robustness and effectiveness of our semi-supervised enhancement framework.
Algorithm 1: Update of dynamic quality reliability repository
Require: NR-IQA method $\Psi(\cdot)$, entropy metric $H(\cdot)$, local region split function $split(\cdot)$;
Initialize: $B_U = \emptyset$;
Sample a batch of unlabeled images $\{x_i^u\}_{i=1}^{b}$ from $D_U$;
for each $x_i^u$ do
  Get teacher's prediction: $\hat{y}_i^u = f_{\theta_t}(\phi_t(x_i^u))$;
  Get student's prediction: $\tilde{y}_i^u = f_{\theta_s}(\phi_s(x_i^u))$;
  Compute enhanced quality scores for $\hat{y}_i^u$, $\tilde{y}_i^u$, and $y_i^b \in B_U$;
  Split each prediction into local regions $\{r_k(\hat{y}_i^u)\}$, $\{r_k(\tilde{y}_i^u)\}$, $\{r_k(y_i^b)\}$ using $split(\cdot)$;
  Compute NR-IQA scores of each region for the teacher prediction: $z_t^k = \Psi(r_k(\hat{y}_i^u))$;
  Compute NR-IQA scores of each region for the student prediction: $z_s^k = \Psi(r_k(\tilde{y}_i^u))$;
  Compute NR-IQA scores of each region for the existing repository sample: $z_b^k = \Psi(r_k(y_i^b))$;
  Aggregate regional scores with a weighted mean to obtain global scores:
    $z_t = 0.8 \times \mathrm{mean}(\{z_t^k\}) - 0.2 \times H(\hat{y}_i^u)$
    $z_s = 0.8 \times \mathrm{mean}(\{z_s^k\}) - 0.2 \times H(\tilde{y}_i^u)$
    $z_b = 0.8 \times \mathrm{mean}(\{z_b^k\}) - 0.2 \times H(y_i^b)$
  if $z_t > z_s$ and $z_t > z_b$ then
    Replace the $y_i^b$ in $B_U$ with $\hat{y}_i^u$;
  end if
end for
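A compact Python sketch of the repository update in Algorithm 1 is given below. The 0.8/0.2 weighting, the entropy penalty, and the replacement rule follow the algorithm; the histogram-based entropy estimate, the 64 × 64 patch split, the dictionary repository, and the `musiq_score` callable (standing in for $\Psi$, e.g., a pretrained MUSIQ model) are illustrative assumptions.

```python
import torch

def entropy(img: torch.Tensor, bins: int = 64) -> float:
    """Shannon entropy of the gray-level histogram, used as an uncertainty proxy."""
    gray = img.mean(dim=0).flatten()
    hist = torch.histc(gray, bins=bins, min=0.0, max=1.0)
    p = hist / hist.sum().clamp_min(1e-8)
    return float(-(p * (p + 1e-8).log()).sum())

def reliability(img: torch.Tensor, musiq_score, patch: int = 64) -> float:
    """0.8 * mean regional NR-IQA score - 0.2 * entropy, as in Algorithm 1 (img is CHW in [0, 1])."""
    _, h, w = img.shape
    regions = [img[:, i:i + patch, j:j + patch]
               for i in range(0, h, patch) for j in range(0, w, patch)]
    local = sum(musiq_score(r) for r in regions) / len(regions)
    return 0.8 * local - 0.2 * entropy(img)

def update_repository(bank: dict, idx: int, y_teacher, y_student, musiq_score):
    """Replace the stored pseudo-label only if the teacher's output beats both
    the student output and the current repository entry."""
    z_t = reliability(y_teacher, musiq_score)
    z_s = reliability(y_student, musiq_score)
    z_b = reliability(bank[idx], musiq_score) if idx in bank else float("-inf")
    if z_t > z_s and z_t > z_b:
        bank[idx] = y_teacher.detach().cpu()
```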

3.3. Multimodal Contrastive Loss

In most cases, numerous mean teacher-based approaches utilize the L1 distance as the consistency loss, as illustrated in Equation (2). The simple consistency loss may cause the student model to overfit incorrect predictions, thereby introducing confirmation bias. To alleviate this problem, we propose the integration of contrastive loss during training. Contrastive learning, a prominent approach in self-supervised learning [28], drives the model to differentiate between positive and negative sample pairs by pulling together their representations in the former case and pushing them apart in the latter. In our context, the positive samples correspond to the pseudo-labels, while the negative samples are the corresponding degraded underwater images. However, traditional contrastive learning typically performs a global comparison of features, which may be insufficient for capturing the complex characteristics of underwater images. To address this limitation, we aim to extend it into a comprehensive contrastive loss that integrates multimodal information, multi-scale feature representations, and adaptive feature selection. This enhancement is designed to significantly improve the robustness and generalization ability of the model, particularly in the generation of pseudo-labels and the effective utilization of unlabeled data.
To optimize the robustness and generalization of the student model during unsupervised learning, we introduce a multimodal contrastive loss that integrates VGG features, edge cues, color information, and local region features. This multimodal contrastive strategy enables the model to distinguish subtle variations in the image content, thereby improving pseudo-label reliability.

3.3.1. VGG Feature Contrastive Loss

Let $a_{vgg}^i$, $p_{vgg}^i$, and $n_{vgg}^i$ represent the anchor, positive, and negative VGG features at the $i$-th layer, respectively. The distances are then computed as follows: anchor–positive distance: $d_{ap}^i = \|a_{vgg}^i - p_{vgg}^i\|_1$, and anchor–negative distance: $d_{an}^i = \|a_{vgg}^i - n_{vgg}^i\|_1$. The contrastive loss at layer $i$ is:
$L_{vgg}^i = \dfrac{d_{ap}^i}{d_{an}^i + \epsilon}$
If the negative sample is harder (i.e., $d_{an}^i < d_{ap}^i$), a hard sample weight $w_{hard}$ is applied:
$L_{vgg}^i = L_{vgg}^i \times w_{hard}$
To emphasize discriminative layers, we introduce static weights $w^i$ and dynamically computed complexity-aware weights $w_{dynamic}^i$. The final VGG-based contrastive loss is as follows:
$L_{vgg} = \sum_i w^i \times L_{vgg}^i \times w_{dynamic}^i$

3.3.2. Edge Feature Contrastive Loss

Let $a_{edge}$, $p_{edge}$, and $n_{edge}$ denote the edge features of the anchor, positive, and negative images, respectively. The loss is defined as $L_{edge} = \dfrac{\|a_{edge} - p_{edge}\|_1}{\|a_{edge} - n_{edge}\|_1 + \epsilon}$. If a hard negative is detected ($d_{an}^{edge} < d_{ap}^{edge}$), then apply $L_{edge} = L_{edge} \times w_{hard}$, and finally:
$L_{edge} = L_{edge} \times w_{dynamic}^{edge}$

3.3.3. Color Feature Contrastive Loss

Let $a_{color}$, $p_{color}$, and $n_{color}$ be the color features of the anchor, positive, and negative samples. We define $L_{color} = \dfrac{\|a_{color} - p_{color}\|_1}{\|a_{color} - n_{color}\|_1 + \epsilon}$. With hard sample adjustment ($d_{an}^{color} < d_{ap}^{color}$): $L_{color} = L_{color} \times w_{hard}$, and finally:
$L_{color} = L_{color} \times w_{dynamic}^{color}$

3.3.4. Local Region Contrastive Loss

The image is divided into four local regions. For each region $j$, the anchor, positive, and negative local features are $a_{local}^j$, $p_{local}^j$, and $n_{local}^j$. The region-wise contrastive loss is $L_{local}^j = \dfrac{\|a_{local}^j - p_{local}^j\|_1}{\|a_{local}^j - n_{local}^j\|_1 + \epsilon}$. If $d_{an}^{local,j} < d_{ap}^{local,j}$, apply $L_{local}^j = L_{local}^j \times w_{hard}$. The overall local contrastive loss is averaged:
$L_{local} = \sum_{j=1}^{4} 0.25 \times L_{local}^j$
The total loss combines all components:
$L_{mcr} = L_{vgg} + L_{edge} + L_{color} + L_{local}$
This comprehensive loss function enables robust and fine-grained representation learning from unlabeled underwater images, effectively suppressing confirmation bias and improving pseudo-label quality.
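The sketch below illustrates the structure of this loss. The ratio-form contrastive term, the hard-negative weighting, and the four modalities mirror the formulation above, while the Sobel edge operator, the channel-mean color statistics, the quadrant split, and the equal modality weights (the full method additionally applies the static and variance-based dynamic weights described next) are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_term(anchor, positive, negative, w_hard=2.0, eps=1e-7):
    """Ratio-form term: pull the output toward the pseudo-label (positive)
    and away from the degraded input (negative); up-weight hard negatives."""
    d_ap = F.l1_loss(anchor, positive)
    d_an = F.l1_loss(anchor, negative)
    loss = d_ap / (d_an + eps)
    if d_an < d_ap:  # hard negative: the degraded input is still closer than the target
        loss = loss * w_hard
    return loss

def sobel_edges(x):
    """Simple per-channel Sobel gradients as a stand-in edge descriptor."""
    k = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=x.device)
    kx = k.view(1, 1, 3, 3).repeat(x.size(1), 1, 1, 1)
    ky = k.t().view(1, 1, 3, 3).repeat(x.size(1), 1, 1, 1)
    gx = F.conv2d(x, kx, padding=1, groups=x.size(1))
    gy = F.conv2d(x, ky, padding=1, groups=x.size(1))
    return gx.abs() + gy.abs()

def quadrants(x):
    """Split a batch of images (N, C, H, W) into four local regions."""
    h, w = x.shape[-2] // 2, x.shape[-1] // 2
    return [x[..., :h, :w], x[..., :h, w:], x[..., h:, :w], x[..., h:, w:]]

def multimodal_contrastive(output, pseudo_label, degraded, vgg):
    """L_mcr = L_vgg + L_edge + L_color + L_local."""
    # VGG modality: per-layer terms (the paper also applies static/dynamic layer weights).
    fa, fp, fn = vgg(output), vgg(pseudo_label), vgg(degraded)  # lists of feature maps
    l_vgg = sum(contrastive_term(a, p, n) for a, p, n in zip(fa, fp, fn)) / len(fa)
    # Edge modality.
    l_edge = contrastive_term(sobel_edges(output), sobel_edges(pseudo_label),
                              sobel_edges(degraded))
    # Color modality: per-channel means as a coarse color descriptor.
    l_color = contrastive_term(output.mean(dim=(2, 3)), pseudo_label.mean(dim=(2, 3)),
                               degraded.mean(dim=(2, 3)))
    # Local modality: average over the four image quadrants.
    l_local = sum(contrastive_term(a, p, n) for a, p, n in
                  zip(quadrants(output), quadrants(pseudo_label), quadrants(degraded))) / 4
    return l_vgg + l_edge + l_color + l_local
```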
To enhance the adaptiveness of our contrastive loss to varying feature complexities across different layers and modalities, we introduce a dynamic weighting scheme based on feature variance. The intuition is that feature maps with higher variance often contain richer structural or semantic information, and should contribute more significantly to the contrastive learning process.
For the $i$-th feature layer or modality, we define the dynamic weight $w_{dynamic}^i$ as in Equation (9):
$w_{dynamic}^i = \dfrac{\mathrm{Var}(f^i)}{\sum_j \mathrm{Var}(f^j)}, \quad \mathrm{Var}(f^i) = \dfrac{1}{N}\sum_{k=1}^{N}\left(f_k^i - \bar{f}^i\right)^2$ (9)
Here, $f^i$ denotes the feature map of the $i$-th layer, $f_k^i$ is the $k$-th pixel (or feature vector), $\bar{f}^i$ is the mean feature value, and $N$ is the total number of pixels in the feature map. This normalization ensures that the weights across all layers or modalities sum to 1, stabilizing training. By integrating this variance-based dynamic weighting, the model can prioritize feature levels that contain more discriminative information, leading to more effective pseudo-label learning and better generalization in complex underwater environments.
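Under the same assumptions as the previous sketch, the variance-based weighting of Equation (9) reduces to a normalization over per-layer feature variances:

```python
import torch

def dynamic_weights(features):
    """w_dynamic_i = Var(f_i) / sum_j Var(f_j): higher-variance (richer) feature
    maps receive larger weights in the contrastive objective."""
    variances = torch.stack([f.var() for f in features])
    return variances / variances.sum().clamp_min(1e-8)

# Usage with the VGG branch of the previous sketch:
# w = dynamic_weights(fa)
# l_vgg = sum(w_i * contrastive_term(a, p, n) for w_i, a, p, n in zip(w, fa, fp, fn))
```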
Building on Equation (2), the overall training objective is composed of a supervised loss and a refined unsupervised loss, as detailed below.
To optimize the effectiveness of the supervised loss beyond the standard L1 formulation presented in Equation (2), we adopt a more comprehensive objective inspired by [18], which incorporates not only the pixel-wise L1 loss but also a perceptual loss $L_{per}$ and a gradient penalty term $L_{grad}$, thereby encouraging the restoration of finer textures and more accurate structural details.
$L_{sup}' = L_{sup} + \alpha_1 L_{per} + \alpha_2 L_{grad}$
For the unsupervised component, we refine the original $L_{unsup}$ by formulating it as a combination of the proposed reliable teacher–student consistency loss and the multimodal contrastive loss, aiming to enhance the stability of pseudo-label learning and improve the model's generalization on unlabeled data.
$L_{unsup}' = L_{unsup} + \beta L_{mcr}$
Finally, the overall optimization objective is reformulated as follows, consistent with the structure of Equation (2):
$L_{overall} = L_{sup}' + \rho L_{unsup}'$
where $L_{sup}'$ denotes the enhanced supervised loss incorporating perceptual and gradient components, and $L_{unsup}'$ represents the improved unsupervised loss that combines reliable consistency and contrastive constraints.
We adopt an initial learning rate of $2 \times 10^{-4}$ and train the model for 200 epochs. The learning rate is decayed by a factor of 0.1 after 100 epochs to facilitate stable convergence. During training, all images are uniformly cropped to a 256 × 256 resolution. For data augmentation, we apply standard geometric transformations (resizing, random cropping, and rotation) to the labeled data. Regarding the unlabeled data, the teacher branch receives weakly augmented inputs (resizing only), while the student branch is exposed to strong augmentations, including resizing, color jittering, Gaussian blur, and grayscale conversion, to encourage consistency under perturbations. The loss function comprises several components, whose weights are empirically set as follows: $\alpha_1 = 0.3$, $\alpha_2 = 0.1$, and $\beta = 1$. Additionally, the consistency regularization coefficient $\rho$ is gradually increased during training using an exponential schedule [37]: $\rho_t = 0.2 \times e^{-5(1 - t/200)^2}$, where $t$ denotes the training epoch. This progressive scheduling helps stabilize training in the early stages by controlling the influence of the unsupervised loss component.
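For illustration, the ramp-up of $\rho$ amounts to the following one-line schedule; the constants are taken from the text, while the epoch indexing is an assumption.

```python
import math

def consistency_weight(epoch: int, total_epochs: int = 200, rho_max: float = 0.2) -> float:
    """rho_t = 0.2 * exp(-5 * (1 - t / total)^2): close to zero early, reaching 0.2 at the end."""
    return rho_max * math.exp(-5.0 * (1.0 - epoch / total_epochs) ** 2)

# consistency_weight(0) ~ 0.0013, consistency_weight(100) ~ 0.057, consistency_weight(200) = 0.2
```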

4. Experimental Results

4.1. Datasets and Settings

4.1.1. Software Configuration

The proposed approach was developed using the PyTorch (1.11.0 + cu113) framework and executed on an NVIDIA RTX 4090 D GPU. To accelerate convergence and minimize training duration, the AdamP optimizer [38] was employed, owing to its efficiency in reaching optimal solutions.

4.1.2. Introduction to Dataset

The training dataset comprises 1600 labeled image pairs and 1600 unlabeled images. The labeled pairs are randomly selected in an equal proportion from the dataset proposed in [39] and the UIEB dataset [20]. Specifically, ref. [39] provides a collection of synthetically generated underwater images captured in indoor scenes, while UIEB [20] contains 890 real-world underwater images accompanied by corresponding ground truth references. The unpaired subset of the EUVP benchmark [22] serves as the source of unlabeled images, which includes diverse underwater scenes with varying water types and illumination conditions. To evaluate performance, the test set incorporates both full-reference and no-reference benchmark datasets, comprising 89 images from the UIEB and 500 images from the LSUI [40].

4.1.3. Evaluation Metrics

To evaluate model performance, we utilize a set of commonly adopted image quality assessment metrics. In particular, peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and root mean square error (RMSE) are selected as full-reference metrics. A higher PSNR signifies superior image fidelity, whereas SSIM values approaching 1 denote stronger structural resemblance to the ground truth. Conversely, a lower RMSE implies a reduced restoration error. Additionally, two no-reference evaluation metrics—underwater image quality measure (UIQM) and underwater color image quality evaluation (UCIQE)—are applied to assess the perceptual quality of the enhanced underwater images. Elevated UIQM and UCIQE scores reflect improved visual appeal and color restoration accuracy.
The calculation formulas for the evaluation metrics are presented as follows: The definitions of PSNR, SSIM, and RMSE are provided in Equations (13), (14), and (15), respectively.
$\mathrm{PSNR} = 10 \log_{10}\!\left(\dfrac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$ (13)
$\mathrm{SSIM} = \dfrac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$ (14)
$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\dfrac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left[y_e(i,j) - y(i,j)\right]^2}$ (15)
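These full-reference metrics follow directly from the definitions above; the NumPy sketch below evaluates them for illustration. The SSIM here is computed globally with the commonly used constants $c_1 = (0.01\,\mathrm{MAX})^2$ and $c_2 = (0.03\,\mathrm{MAX})^2$ (an assumption, since the text does not specify them), whereas standard SSIM averages the same expression over local windows.

```python
import numpy as np

def rmse(y_enh: np.ndarray, y_ref: np.ndarray) -> float:
    """Equation (15): root of the mean squared error over all pixels."""
    diff = y_enh.astype(np.float64) - y_ref.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(y_enh: np.ndarray, y_ref: np.ndarray, max_val: float = 255.0) -> float:
    """Equation (13): 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((y_enh.astype(np.float64) - y_ref.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ssim_global(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    """Equation (14) evaluated over the whole image."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return float(((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) /
                 ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))
```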
UIQM is a no-reference metric designed to evaluate the perceptual quality of underwater images by integrating three aspects: colorfulness, sharpness, and contrast. The overall UIQM is calculated as a weighted combination of these components, as shown in Equation (16):
$\mathrm{UIQM} = c_1 \times \mathrm{UICM} + c_2 \times \mathrm{UISM} + c_3 \times \mathrm{UIConM}$ (16)
Among them, the commonly used weight coefficients are $c_1 = 0.0282$, $c_2 = 0.2953$, and $c_3 = 3.5753$. UCIQE is a no-reference metric used to assess the perceptual quality of underwater images. It primarily considers three visual attributes: colorfulness, contrast, and saturation. The UCIQE score is computed as a weighted linear combination of the standard deviation of chroma ($\omega_c$), the contrast of luminance ($con_l$), and the mean of saturation ($\mu_s$), as defined in Equation (17):
$\mathrm{UCIQE} = c_1 \times \omega_c + c_2 \times con_l + c_3 \times \mu_s$ (17)
where $c_1 = 0.4680$, $c_2 = 0.2745$, and $c_3 = 0.2576$ are the empirically determined weighting coefficients.
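Given the component measures, both no-reference scores reduce to the weighted sums above; the sketch below assumes that UICM, UISM, UIConM and the chroma/luminance/saturation statistics are provided by existing routines.

```python
def uiqm(uicm: float, uism: float, uiconm: float) -> float:
    """Equation (16) with the commonly used weights."""
    return 0.0282 * uicm + 0.2953 * uism + 3.5753 * uiconm

def uciqe(chroma_std: float, luminance_contrast: float, saturation_mean: float) -> float:
    """Equation (17) with the empirically determined weights."""
    return 0.4680 * chroma_std + 0.2745 * luminance_contrast + 0.2576 * saturation_mean
```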

4.2. Enhanced Experiments on Public Datasets

We first conduct enhancement experiments on the UIEB test set, which contains 89 real-world underwater images, each resized to 256 × 256 pixels for evaluation. To validate the effectiveness of our approach, we compare it against several representative underwater image enhancement approaches, including both traditional and deep learning-based methods: NLD [41], CLAHE [6], DCP [42], UDCP [43], UNet [44], UWNet [45], CycleGAN [46], and FUnIE-GAN [22]. A selection of enhanced images is shown in Figure 3 for visual comparison. Specifically, (a) is the input image; (b–j) are the results of NLD, CLAHE, DCP, UDCP, UNet, UWNet, CycleGAN, FUnIE-GAN, and our MCR-UIE, respectively; and (k) is the ground truth.
As shown in Figure 3, deep learning-based approaches clearly outperform traditional enhancement approaches in terms of visual quality. This superiority can be attributed to the ability of deep learning models to automatically learn complex representations from data. These models effectively capture subtle textures, edges, and color distributions in underwater scenes. In contrast, traditional approaches typically rely on hand-crafted features and heuristic adjustments, such as enhancing brightness or contrast, which are insufficient to fully recover the degraded content and intricate details present in underwater images.
To further demonstrate the efficacy of our proposed approach, additional enhancement experiments were performed on the LSUI dataset. The LSUI test set contains 500 underwater images, each resized to 256 × 256 pixels for evaluation. A selection of enhanced images is shown in Figure 4 for visual comparison. Specifically, (a) is the input image; and (b–j) are the outputs of NLD, CLAHE, DCP, UDCP, UNet, UWNet, CycleGAN, FUnIE-GAN, and our MCR-UIE, respectively.
To better visualize the enhancement effects, a representative region was selected and zoomed in on, presented in the lower right corner of Figure 4. Upon closer inspection, it is evident that the LSUI test set more closely resembles real underwater scenes in terms of visual characteristics. As with previous results, deep learning-based approaches continue to outperform traditional approaches. Notably, the CycleGAN method produces images with relatively high color saturation, which may lead to unrealistic results. In contrast, our proposed approach achieves superior visual performance, preserving natural color tones and fine details more effectively than the other compared methods. While subjective visual comparisons provide an intuitive understanding of enhancement quality, they are insufficient for a comprehensive evaluation. Therefore, we further assess the performance using objective evaluation metrics, which include both full-reference and no-reference metrics. The final reported numbers represent the average values across the entire test set. Table 1 presents the full-reference evaluation results on the UIEB and LSUI datasets after image enhancement, where the best and second-best results are marked in red and blue, respectively.
As presented in Table 1, the proposed MCR-UIE approach significantly outperforms traditional underwater image enhancement techniques, including NLD, CLAHE, DCP, and UDCP. On the UIEB dataset, MCR-UIE improves the PSNR by 44.3% and SSIM by 20.2% compared to the best-performing traditional method (UDCP). Similarly, on the LSUI dataset, MCR-UIE achieves a 46.3% increase in PSNR and a 14.4% improvement in SSIM over UDCP. Additionally, MCR-UIE reduces the RMSE by 46.0% on UIEB and 55.4% on LSUI, demonstrating its superior ability to recover image details and reduce distortion.
Compared to deep learning-based methods which encompass UNet, UWNet, CycleGAN, and FUnIE-GAN, MCR-UIE also achieves consistent performance gains. On the UIEB dataset, MCR-UIE surpasses CycleGAN in PSNR and SSIM by 9.1% and 7.0%, respectively, while reducing RMSE by 20.3%. On the LSUI dataset, MCR-UIE improves PSNR by 11.0%, SSIM by 10.3%, and lowers RMSE by 21.9% compared to CycleGAN. These results confirm that our approach not only effectively restores underwater image quality but also exhibits a strong generalization capability across different datasets.
To provide a more intuitive comparison of performance differences among the methods, Figure 5 presents the box plots of the PSNR and RMSE metrics corresponding to the results in Table 1. These plots illustrate the distribution, central tendency, and variability of each method’s performance across the test set, enabling a clearer visual interpretation of their relative effectiveness.
Table 2 reports the results of no-reference quality evaluation metrics, including UIQM and UCIQE, on the UIEB and the LSUI datasets. These metrics are designed to assess image quality in the absence of ground truth by evaluating attributes such as colorfulness, contrast, and sharpness. From the results, we observe that deep learning-based approaches generally outperform traditional enhancement approaches. Notably, UNet achieves the highest UIQM score (3.075) on the LSUI dataset, indicating its strong ability to enhance image contrast and sharpness in certain scenes. However, its UCIQE performance remains relatively modest (0.532), suggesting potential issues in maintaining consistent color balance.
Our proposed MCR-UIE method obtains a UIQM of 2.881 on UIEB and 3.000 on LSUI, ranking among the top-performing methods across both datasets. Although its UIQM is slightly lower than that of UNet and FUnIE-GAN, MCR-UIE shows a more stable and balanced performance, with UCIQE scores of 0.606 (UIEB) and 0.572 (LSUI) that are consistently high across datasets. In contrast, some methods (e.g., CycleGAN and UWNet) exhibit strong UIQM but relatively poor UCIQE, indicating possible color over-enhancement or inconsistency.
In summary, MCR-UIE achieves a strong trade-off between sharpness, color fidelity, and contrast, producing visually pleasing results while maintaining generalization ability across different underwater environments. This is also consistent with the qualitative results shown in Figure 4. To provide a more intuitive comparison of the results presented in Table 2, Figure 6 shows a bar chart to highlight the differences in image quality enhancement across different metrics.
To comprehensively evaluate the performance of different underwater image enhancement methods, we analyze both full-reference metrics (Table 1) and no-reference metrics (Table 2) across the UIEB and LSUI datasets. From the perspective of full-reference metrics including PSNR, SSIM, and RMSE, our proposed MCR-UIE achieves the best overall performance.
In terms of no-reference evaluation metrics, MCR-UIE also performs competitively. However, it is worth noting that although FUnIE-GAN performs well on no-reference metrics, its performance on full-reference metrics (e.g., PSNR = 19.524 dB on UIEB) is relatively low, suggesting that its visual enhancement may not be structurally accurate. In contrast, MCR-UIE achieves a strong balance between full-reference and no-reference metrics, with consistently high UIQM (2.881/3.000) and UCIQE (0.606/0.572) values across both datasets. This indicates that MCR-UIE not only preserves structural fidelity but also enhances visual perception quality effectively.
Overall, the proposed MCR-UIE framework exhibits excellent generalization, stable enhancement quality, and balanced performance from both subjective and objective evaluation perspectives, surpassing traditional and recent deep learning-based underwater image enhancement methods. On widely recognized public benchmarks such as UIEB and LSUI, our method achieves notable improvements in visual clarity, color fidelity, and contrast restoration, validating its effectiveness under diverse underwater conditions and degradation types. These results confirm that the proposed multimodal contrastive regularization and dynamic quality reliability strategy are highly effective in guiding the network toward producing perceptually pleasing and semantically faithful outputs.
Given its strong performance on benchmark datasets and its robustness in real-world scenarios, MCR-UIE holds great promise for practical deployment in underwater visual applications, especially in aquaculture environments. Specifically, as detailed in Section 4.3, we conduct underwater image enhancement experiments using images collected from deep-sea aquaculture cages. The model demonstrates a strong adaptability and enhancement capability in complex oceanic environments, characterized by low visibility, high turbidity, and dynamic illumination. These experiments not only confirm the real-world utility of MCR-UIE but also pave the way for its integration into intelligent monitoring systems in aquaculture, such as fish detection, behavior analysis, and health condition assessment.

4.3. Enhanced Experiments on Deep-Sea Cage Dataset

To further validate the effectiveness and generalization ability of the proposed MCR-UIE method in real-world scenarios, we deployed our model in the context of deep-sea cage aquaculture, where underwater images are typically degraded by low illumination, high turbidity, and non-uniform color distortion caused by complex and dynamic oceanic environments.
We conducted two sets of underwater image enhancement experiments using images captured from the same large-scale deep-sea aquaculture cage on 16 April and 17 April 2025, respectively. These images were taken during the routine monitoring of Trachinotus ovatus in deep-sea cages and present significant visual challenges that conventional enhancement methods struggle to address.
The first test set, collected on 16 April, consisted of 350 underwater images mainly characterized by a bluish-green color cast. Four representative images from this test set were selected to visually compare the enhancement results before and after applying MCR-UIE, as illustrated in Figure 7. The second test set, collected on 17 April, included 380 images with a yellowish-green background. Similarly, four enhanced image samples were chosen from this set to show the improvements achieved by our model, as demonstrated in Figure 8. These color differences reflect environmental variations such as water turbidity, light penetration, and biological activity.
We employed two widely used no-reference image quality metrics, UIQM and UCIQE, to assess the visual quality of the degraded input images. The first test set yielded an average UIQM score of 1.810 and a UCIQE score of 0.456. The second test set achieved slightly higher values, with an average UIQM of 1.905 and UCIQE of 0.485.
Interestingly, while the second test set yielded higher objective scores, subjective evaluation suggested that the first test set produced more visually pleasing results, with clearer textures, more natural color reproduction, and better overall visual appeal. This discrepancy suggests that current NR-IQA metrics may not fully capture perceptual quality in complex underwater scenes. For instance, the yellowish-green cast in the second set may have led to artificially higher metric scores due to increased global contrast or saturation, even though human observers preferred the more naturally enhanced results from the first set.
These findings highlight the limitations of relying solely on objective metrics for underwater image evaluation. They also emphasize the need to integrate both quantitative and qualitative assessments when validating enhancement models in practical deployments.
In summary, the results from these two real-world test sets demonstrate that MCR-UIE can significantly improve the visual quality of underwater images captured under varying environmental conditions in deep-sea aquaculture. The model effectively suppresses color casts, enhances texture details, and boosts the overall image contrast. This confirms its practical potential for deployment in intelligent aquaculture monitoring systems, supporting downstream tasks such as fish detection, length measurement, and behavioral analysis.
However, we acknowledge that the current evaluation in practical deep-sea aquaculture scenarios primarily relies on qualitative comparisons and NR-IQA metrics, due to the absence of paired ground-truth images. Acquiring such reference data in real-world underwater environments is inherently challenging because of dynamic lighting conditions, uncontrollable turbidity, and the non-rigid nature of underwater scenes.
Moving forward, potential strategies may include the use of synthetic datasets that simulate the degradation characteristics of deep-sea cages, or indirect validation through downstream tasks such as fish detection, tracking, and body length estimation. Additionally, expert visual assessments can provide a complementary perspective when reference data are unavailable. These approaches could help mitigate the limitations of no-reference evaluation and enhance the robustness of model validation in real-world aquaculture applications.

4.4. Ablation Experiments

To evaluate the effectiveness of MCR-UIE, we perform a series of ablation experiments to investigate the contributions of its key components. The following variants are examined: (a) Semi-base: a baseline semi-supervised framework employing the consistency loss. (b) Semi-base + DQR: extends the Semi-base model by incorporating the dynamic quality reliability repository, while excluding the multimodal contrastive loss. (c) Semi-base + MCL1: adds the VGG feature contrastive loss to the Semi-base model, without utilizing the DQR repository. (d) Semi-base + MCL2: adds the edge feature contrastive loss to the Semi-base model, without utilizing the DQR repository. (e) Semi-base + MCL3: adds the color feature contrastive loss to the Semi-base model, without utilizing the DQR repository. (f) Semi-base + MCL4: adds the local region contrastive loss to the Semi-base model, without utilizing the DQR repository. (g) Semi-base + MCL: adds the full multimodal contrastive loss to the Semi-base model, without utilizing the DQR repository. (h) MCR-UIE: the complete proposed method, integrating both the DQR repository and the multimodal contrastive loss.
The qualitative comparisons are illustrated in Figure 9, with particular emphasis on the results of Semi-base + MCL and Semi-base + DQR. In addition, the quantitative results are given in Table 3.
(1) For Semi-base + MCL, due to the absence of reliable positive samples, the contrastive loss compels the network to differentiate excessively from the negative samples (i.e., the input images), which unfortunately leads to over-enhancement artifacts.
(2) To further dissect the impact of each modality within the multimodal contrastive loss, we additionally conduct experiments using individual contrastive branches, including the VGG feature contrastive loss (MCL1), edge feature contrastive loss (MCL2), color feature contrastive loss (MCL3), and local region contrastive loss (MCL4). The results show that among the single-modality variants, MCL3 achieves the most competitive performance, particularly on the LSUI dataset, suggesting its strong contribution to real-world color restoration. Conversely, MCL4 leads to relatively poor performance, possibly due to its sensitivity to local distortions and instability in degraded underwater scenes. When combining all four modalities into a unified MCL framework, we observe consistent improvements across both datasets. The full MCL configuration, which integrates all modalities, achieves better results than any single component, confirming the complementary benefits of multimodal representations.
(3) In contrast, Semi-base + DQR lacks the contrastive regularization mechanism, and although it benefits from the dynamic quality reliability repository, the restored images still exhibit noticeable color distortions and remain visually similar to the degraded inputs.
These observations validate the complementary effectiveness of both the dynamic quality reliability repository and multimodal contrastive regularization in improving restoration quality.

4.5. Deployment Feasibility

To assess the practical deployment potential of the proposed MCR-UIE model, we evaluate its inference speed on two underwater image enhancement benchmarks used in this experiment. The model contains approximately 1.68 million parameters (1,675,281) and is designed to be lightweight and computationally efficient.
We first conduct inference tests on the UIEB dataset, which comprises 89 underwater images with a resolution of 256 × 256 pixels. The total inference time on an NVIDIA RTX 4090D GPU is 2.827 s, yielding an average inference time of 31.76 ms per image, corresponding to approximately 31.48 frames per second (FPS). Additionally, on the LSUI test set, which includes 500 images of the same resolution, the total inference time is 13.106 s, resulting in an average of 26.20 ms per image, or about 38.15 FPS.
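The reported latencies are simply the total inference time divided by the image count (e.g., 2.827 s / 89 ≈ 31.76 ms ≈ 31.48 FPS); a minimal timing sketch, assuming a hypothetical `model` and `loader` and a CUDA device, is shown below.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, loader, device: str = "cuda"):
    """Return (average milliseconds per image, frames per second) over a test loader."""
    model.eval().to(device)
    torch.cuda.synchronize()
    start, count = time.time(), 0
    for batch in loader:          # batch: (N, 3, 256, 256) tensors
        model(batch.to(device))
        count += batch.size(0)
    torch.cuda.synchronize()
    total = time.time() - start
    return total / count * 1000.0, count / total
```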
These results demonstrate that the proposed method achieves near real-time performance, indicating its strong potential for real-world applications that require timely underwater image enhancement. In particular, such efficiency suggests its practical applicability in aquaculture monitoring systems, where high-throughput and low-latency visual processing are essential.

4.6. Analysis of Limitations

To provide a more comprehensive understanding of the limitations of the proposed MCR-UIE method, we present three representative cases, as shown in Figure 10. These examples illustrate how the model may underperform or introduce perceptual distortions under specific conditions.
Figure 10a: This image features a high-resolution underwater scene with a clear blue background and a large number of fish. After enhancement, the output shows negligible changes. This suggests that the model is overly conservative in scenes that deviate from the typical degradation patterns present in the training data. The low degree of visible enhancement may result from the pseudo-label filtering mechanism and the consistency constraint, which prevent aggressive adjustments when degradation is not apparent. However, in practical applications, such scenes may still benefit from subtle contrast or clarity improvements, indicating a gap between perceptual enhancement and model behavior.
Figure 10b: The input contains an object with alternating green and red bands—an uncommon color pattern in typical underwater datasets. After enhancement, these colors shift to blue and orange. This semantic color distortion reflects the model's difficulty in preserving object colors that fall outside its learned color priors. Since the model is trained primarily on degraded natural underwater scenes without explicit object or semantic supervision, it may misinterpret such color combinations as artifacts and apply misguided corrections, leading to visually unrealistic results.
Figure 10c: In this case, the input image includes a bright yellow rubber glove, which is transformed to have an orange hue after enhancement. The color shift suggests that the model mistakenly treats vivid artificial objects as being affected by underwater color distortion. Without an understanding of object semantics or color constancy, the enhancement process adjusts the hue toward what it believes is a more "natural" underwater tone, resulting in a loss of color fidelity for human-made objects.
These cases highlight two major limitations: a lack of semantic understanding, which causes the model to misinterpret unusual colors as degradations and incorrectly adjust them, and limited adaptability to high-quality or complex scenes, where the model’s enhancement behavior becomes conservative or misaligned with perceptual needs.
To address these issues, future work could consider integrating semantic segmentation priors, human-object-aware loss functions, or expanding the training dataset to include a greater diversity of object colors and scene types. In particular, increasing the diversity of the unlabeled dataset, which currently consists solely of EUVP samples, would mitigate potential domain shift effects and allow the model to better adapt to underwater environments not represented in the original benchmark. This expanded dataset could incorporate varied water types, lighting conditions, and object appearances, thereby helping the model distinguish between actual degradation and semantically meaningful color variations. Together, these enhancements would improve the model’s robustness and generalization for real-world deployment scenarios such as deep-sea aquaculture monitoring.

5. Conclusions

In this study, we propose a novel semi-supervised underwater image enhancement approach, termed MCR-UIE, which integrates multimodal contrastive learning (MCL) and a dynamic quality reliability repository (DQR) to fully exploit both labeled and unlabeled data. The MCL imposes contrastive constraints across multiple modalities, such as perceptual, edge, color, and local-region features, to enhance feature discrimination and robustness. Meanwhile, the DQR maintains high-quality pseudo-labels by dynamically selecting the best teacher outputs according to a learned reliability criterion, thereby mitigating confirmation bias during training. Extensive experiments on multiple underwater image enhancement benchmarks demonstrate that MCR-UIE consistently outperforms existing advanced methods, achieving notable improvements in both full-reference and no-reference evaluation metrics. In future work, we plan to extend this semi-supervised framework to other low-level vision tasks and explore more efficient memory-aware training strategies to further improve scalability and performance.
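As a purely illustrative sketch of the multimodal contrastive idea, and not the exact losses used in MCR-UIE, the snippet below contrasts the enhanced output against the pseudo-label (positive) and the degraded input (negative) in three hand-crafted feature spaces: edge, color statistics, and a local-region crop. A perceptual branch based on pretrained VGG features would be added in the same way but is omitted to keep the example self-contained; all function names and weights here are assumptions.

```python
import torch
import torch.nn.functional as F


def edge_feat(x: torch.Tensor) -> torch.Tensor:
    """Edge modality: per-channel Sobel gradient magnitude."""
    k = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=x.device)
    kx = k.view(1, 1, 3, 3).repeat(x.size(1), 1, 1, 1)
    ky = kx.transpose(-1, -2)
    gx = F.conv2d(x, kx, padding=1, groups=x.size(1))
    gy = F.conv2d(x, ky, padding=1, groups=x.size(1))
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)


def color_feat(x: torch.Tensor) -> torch.Tensor:
    """Color modality: per-channel mean and standard deviation."""
    return torch.cat([x.mean(dim=(2, 3)), x.std(dim=(2, 3))], dim=1)


def local_feat(x: torch.Tensor, size: int = 64) -> torch.Tensor:
    """Local-region modality: a fixed central crop."""
    h, w = x.shape[-2:]
    top, left = (h - size) // 2, (w - size) // 2
    return x[..., top:top + size, left:left + size]


def contrastive_term(anchor, positive, negative, eps=1e-7):
    """Ratio form: small when the anchor is close to the positive and far from the negative."""
    return F.l1_loss(anchor, positive) / (F.l1_loss(anchor, negative) + eps)


def mcl_loss(output, pseudo_label, degraded, weights=(1.0, 1.0, 1.0)):
    """Sum the contrastive terms over the illustrative modalities."""
    feats = (edge_feat, color_feat, local_feat)
    return sum(w * contrastive_term(f(output), f(pseudo_label), f(degraded))
               for w, f in zip(weights, feats))
```

In a semi-supervised setting such a regularizer could be combined with a pixel-wise term on the pseudo-label, e.g., `F.l1_loss(output, pseudo_label) + 0.1 * mcl_loss(output, pseudo_label, degraded)`, where the 0.1 weight is again only a placeholder.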

Author Contributions

Conceptualization, M.D. and G.L.; methodology, M.D.; software, M.D.; validation, G.L. and H.L.; formal analysis, Q.H.; writing—original draft preparation, M.D.; writing—review and editing, M.D. and X.H.; visualization, Y.H.; supervision, Q.H.; project administration, Y.H. and X.H.; funding acquisition, M.D., G.L., H.L. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the earmarked fund for CARS-47, the Central Public-interest Scientific Institution Basal Research Fund, CAFS (No. 2023TD97), and the Central Public-interest Scientific Institution Basal Research Fund, South China Sea Fisheries Research Institute, CAFS (No. 2023RC01, No. 2022TS06, No. 2024TS07 and No. 2024TS08).

Data Availability Statement

The datasets used in this study are available from the corresponding author upon reasonable request. They are not publicly released in order to prevent potential misuse.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Illustration of our MCR-UIE framework. (a) The network structure of the proposed MCR-UIE; (b) the Asymmetric Illumination-Aware Multi-Scale Network.
Figure 2. Comparison of seven NR-IQA metrics on the EUVP benchmark. MUSIQ shows the best consistency with visual quality perception, validating its suitability for underwater pseudo-label selection.
Figure 3. Visual comparison of enhancement effects on UIEB dataset. (a) Input; (b) NLD; (c) CLAHE; (d) DCP; (e) UDCP; (f) UNet; (g) UWNet; (h) CycleGAN; (i) FUnIE-GAN; (j) MCR-UIE; (k) ground truth.
Figure 4. Visual comparison of enhancement effects on LSUI dataset. (a) Input; (b) NLD; (c) CLAHE; (d) DCP; (e) UDCP; (f) UNet; (g) UWNet; (h) CycleGAN; (i) FUnIE-GAN; (j) MCR-UIE.
Figure 5. Box plots of performance metrics for different enhancement methods. (a) PSNR of UIEB, (b) RMSE of UIEB, (c) PSNR of LSUI, (d) RMSE of LSUI.
Figure 6. Bar charts of performance metrics for different enhancement methods. (a) UIQM of UIEB, (b) UIQM of LSUI, (c) UCIQE of UIEB, (d) UCIQE of LSUI.
Figure 7. Example 1 of using MCR-UIE to enhance underwater images of deep-sea cage aquaculture. (a–d) Raw image captured in aquaculture environment; (e–h) enhanced output by our method.
Figure 8. Example 2 of using MCR-UIE to enhance underwater images of deep-sea cage aquaculture. (a–d) Raw image captured in aquaculture environment; (e–h) enhanced output by our method.
Figure 9. Visual comparison of ablation results using representative images from the UIEB and LSUI datasets. (a) Input; (b) Semi-base; (c) Semi-base + DQR; (d) Semi-base + MCL1; (e) Semi-base + MCL2; (f) Semi-base + MCL3; (g) Semi-base + MCL4; (h) Semi-base + MCL; (i) MCR-UIE.
Figure 10. Challenging cases that reflect the limitations of MCR-UIE. (a) Example 1; (b) example 2; (c) example 3.
Table 1. Performance metrics using different enhancement approaches (PSNR, SSIM, RMSE).

Method | UIEB PSNR ↑ (dB) | UIEB SSIM ↑ | UIEB RMSE ↓ | LSUI PSNR ↑ (dB) | LSUI SSIM ↑ | LSUI RMSE ↓
NLD | 16.416 | 0.708 | 41.261 | 14.629 | 0.694 | 49.862
CLAHE | 16.812 | 0.751 | 39.182 | 14.713 | 0.744 | 48.154
DCP | 16.526 | 0.713 | 41.558 | 14.025 | 0.694 | 52.976
UDCP | 17.478 | 0.752 | 36.284 | 15.613 | 0.756 | 43.976
UNet | 14.668 | 0.706 | 50.550 | 16.851 | 0.772 | 38.738
UWNet | 17.771 | 0.759 | 36.146 | 18.782 | 0.783 | 31.139
CycleGAN | 21.723 | 0.795 | 22.694 | 20.570 | 0.784 | 25.124
FUnIE-GAN | 19.524 | 0.784 | 27.584 | 17.948 | 0.777 | 32.951
MCR-UIE | 23.698 | 0.851 | 18.089 | 22.835 | 0.865 | 19.612
Table 2. Performance metrics using different enhancement methods (UIQM, UCIQE). The best and second-best results are marked in red and blue, respectively.

Method | UIQM ↑ (UIEB) | UIQM ↑ (LSUI) | UCIQE ↑ (UIEB) | UCIQE ↑ (LSUI)
NLD | 2.518 | 2.540 | 0.600 | 0.571
CLAHE | 2.665 | 2.515 | 0.562 | 0.523
DCP | 2.386 | 2.410 | 0.602 | 0.558
UDCP | 2.829 | 2.821 | 0.601 | 0.559
UNet | 2.810 | 3.075 | 0.573 | 0.532
UWNet | 2.849 | 2.905 | 0.531 | 0.498
CycleGAN | 2.850 | 2.997 | 0.604 | 0.508
FUnIE-GAN | 3.033 | 3.069 | 0.614 | 0.586
MCR-UIE | 2.881 | 3.000 | 0.606 | 0.572
Table 3. Ablation studies on UIEB and LSUI datasets with PSNR, SSIM, and RMSE. Bold metrics represent the best results.

Method | UIEB PSNR ↑ | UIEB SSIM ↑ | UIEB RMSE ↓ | LSUI PSNR ↑ | LSUI SSIM ↑ | LSUI RMSE ↓
Semi-base | 22.985 | 0.847 | 19.503 | 21.982 | 0.850 | 22.003
Semi-base + DQR | 23.201 | 0.848 | 19.013 | 22.285 | 0.861 | 20.892
Semi-base + MCL1 | 21.902 | 0.837 | 22.205 | 21.165 | 0.785 | 26.121
Semi-base + MCL2 | 22.282 | 0.840 | 21.145 | 21.784 | 0.845 | 22.309
Semi-base + MCL3 | 22.562 | 0.844 | 20.355 | 22.030 | 0.853 | 21.817
Semi-base + MCL4 | 21.898 | 0.836 | 22.670 | 19.826 | 0.835 | 30.048
Semi-base + MCL | 22.759 | 0.838 | 20.977 | 22.356 | 0.849 | 23.151
MCR-UIE | 23.698 | 0.851 | 18.089 | 22.835 | 0.865 | 19.612
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
