Article

A Multi-Domain Enhanced Network for Underwater Image Enhancement

College of Electronic Information, Qingdao University, Qingdao 260000, China
*
Author to whom correspondence should be addressed.
Information 2025, 16(8), 627; https://doi.org/10.3390/info16080627
Submission received: 8 June 2025 / Revised: 7 July 2025 / Accepted: 21 July 2025 / Published: 23 July 2025

Abstract

Owing to the intricate variability of underwater environments, underwater images suffer from degradation such as light absorption, scattering, and color distortion. U-Net architectures limit the use of global context because of their fixed-receptive-field convolutions, while traditional attention mechanisms incur quadratic complexity and fail to efficiently fuse spatial and frequency features. Unlike methods focused on local enhancement, the proposed hybrid multi-domain enhanced network (HMENet) integrates a transformer sub-network for long-range dependency modeling and a dual-domain attention mechanism for bidirectional spatial–frequency fusion. This design enlarges the receptive field while maintaining linear complexity. On the UIEB and EUVP datasets, HMENet achieves PSNR/SSIM of 25.96/0.946 and 27.92/0.927, surpassing HCLR-Net by 0.97 dB and 1.88 dB in PSNR, respectively.

1. Introduction

Traditional underwater image enhancement algorithms aim to address the optical distortion and scattering caused by dissolved substances in underwater environments, a topic that has garnered significant attention over the past decade. As a pragmatic yet complex undertaking in underwater vision, these algorithms typically involve color correction [1,2], contrast enhancement [3,4], detail enhancement [5,6], and denoising [7,8]. Underwater image enhancement algorithms based on convolutional neural networks (CNNs) [9] primarily focus on local enhancement. In contrast, our approach emphasizes global enhancement by capturing long-range dependencies within the network. We improve the network's ability to assimilate global information by combining convolutional neural networks with transformer sub-networks, enabling the network to adapt more effectively to intricate underwater environments. We further introduce a novel multi-scale mixed attention mechanism, which consists of spatial domain feature fusion modules and frequency domain feature fusion modules; this mechanism facilitates bidirectional attention across mixed domains, effectively improving the learning of representative features for multi-scale image restoration. However, integrating convolutional neural networks with transformer sub-networks raises issues of feature fusion and interaction, and the complementarity of feature fusion between the spatial and frequency domains in the mixed-domain attention mechanism is limited. Therefore, designing effective fusion mechanisms and interaction strategies, and optimizing the complementarity of spatial- and frequency-domain feature fusion, constitute the primary challenges addressed in this paper.
State-of-the-art underwater image enhancement approaches based on convolutional neural networks typically adopt the U-Net architecture [10], a U-shaped network consisting of encoder and decoder components in which low-level features are fused with high-level features through skip connections during decoding. However, this structure does not fully exploit global contextual information, resulting in suboptimal enhancement and susceptibility to performance bottlenecks. One reason is that the downsampling operations in the encoding pathway cause a significant loss of spatial information and detail in high-level feature representations. In the decoding pathway, although low-level features are merged with high-level features via skip connections, this approach transmits only a limited amount of local contextual information and fails to adequately capture the relationships between distant pixels. Furthermore, the constraints imposed by convolution and pooling operations confine the network to a fixed receptive field size, limiting its capacity to leverage a wider spectrum of contextual information.
To tackle the aforementioned challenges, this paper introduces a novel hybrid multi-domain Enhanced Network (HMENet). HMENet integrates a multi-scale hybrid attention mechanism (MHA) into the encoder–decoder architecture and incorporates transformer components within the CNN framework. This integration enables the convolutional neural network to introduce global information while processing images. Our novel network is inspired by two key points. Firstly, in contrast to conventional spatial domain-based image enhancement techniques, transforming underwater degraded images into the frequency domain allows for more effective separation and handling of imaging degradation issues caused by water refraction and scattering. Secondly, due to the successful application of transformer models in high-level visual tasks, these models have been introduced into low-level visual tasks in pursuit of long-range pixel interactions and state-of-the-art performance. Therefore, we seek to integrate the aforementioned two inspirations into a CNN-based underwater image enhancement algorithm. We enhance the frequency domain attention mechanism by incorporating a spatial domain attention mechanism, facilitating dynamic learning of attention weights. By combining these two attention mechanisms, we achieve dual-domain attention. This novel attention mechanism improves efficiency by decomposing attention into two directions while maintaining an implicitly large receptive field. Then, the transformer module is embedded into the CNN framework. Compared to the traditional CNN-based underwater image enhancement algorithms, our new approach effectively captures and leverages global contextual information, thereby enhancing the clarity, accuracy, and visual quality of the images more efficiently.
The main contributions of this paper can be summarized as follows:
1.
A novel hybrid backbone network is proposed, in which a vision transformer sub-net is embedded between the CNN-based encoder and decoder phases, enhancing the network’s utilization of global features.
2.
In both the encoder and decoder phases, a multi-scale hybrid attention (MHA) module is proposed to replace the original encoder–decoder components of the CNN architecture; it is placed immediately after the customized residual convolutional (RC) blocks. This design achieves bi-domain bidirectional attention, effectively enhancing representation learning for multi-scale image restoration.
3.
By integrating the spatial domain feature fusion (SDFF) module and the frequency domain feature fusion (FDFF) module, a dual-domain attention mechanism is proposed within the multi-scale hybrid attention module, making it dynamically learnable and greatly enhancing the adaptability of the network.

2. Related Works

2.1. Traditional Methods

Existing UIE (underwater image enhancement) methods can be broadly categorized into two types based on how they model the imaging process. The first type comprises physics-based methods, which primarily rely on prior knowledge to construct their models. For instance, Chiang et al. [11] proposed a depth map estimation method based on the dark channel prior that removes artificial light sources, compensates for light scattering and color attenuation, and estimates scene water depth to restore the color balance and clarity of underwater images. This method requires no additional optical devices, thereby enhancing underwater visual quality. Galdran et al. [12] introduced a variant of the dark channel prior [13] that leverages red channel information to estimate the depth maps of underwater images. This approach addresses visibility loss and color degradation, while also mitigating the color artifacts that can arise from erroneous depth estimations. Peng et al. [14] proposed a dark channel prior-based method that simultaneously leverages image blurriness and light absorption to estimate ambient light. Compared with other underwater image enhancement methods based on imaging formation models (IFM), this approach demonstrates superior recovery and enhancement performance across various underwater hues and lighting conditions. Wang et al. [15] proposed a novel adaptive attenuation-curve prior algorithm that utilizes the smoothness of light and light attenuation to estimate underwater lighting conditions. For underwater imagery, pixels belonging to the same cluster can exhibit a power function relationship in the RGB color space, facilitating a more precise estimation of the attenuation ratios among the color channels.

2.2. Deep Learning-Based Approaches

Deep learning-based approaches are capable of automatically extracting features and learning enhancement mappings from a large amount of paired or unpaired training data. Li et al. [16] proposed a generative adversarial network that combines the underwater imaging process under unsupervised conditions to generate high-resolution output images. Li et al. [17] presented a method for improving underwater images under weak supervision based on learned cross-domain relationships, reducing the need for paired underwater image training. This represents the first attempt to use weakly supervised learning for color correction of underwater images. Fu et al. [18] utilized a dual-branch network to address issues related to contrast reduction and color degradation. Qi et al. [19] introduced an underwater image collaborative enhancement network (UICoE-Net) utilizing a Siamese architecture with an encoder–decoder framework, incorporating feature-matching components across various layers of the Siamese architecture to facilitate correlation between branches. Zhou et al. [20] introduced a hybrid approach to contrastive learning that utilizes unpaired data to address negative samples, addressing the constraints of depending exclusively on paired data and improving the model’s capability for generalization.
Due to the excellent performance of attention mechanisms in advanced visual tasks, they have been incorporated into underwater image enhancement algorithms to selectively focus on important information. Liu, Ma, Shi, and Chen [21] developed a convolutional neural network that is fully trainable in an end-to-end manner, GridDehazeNet, which employs channel-wise attention to optimize the weights associated with various channels during feature fusion for the task of single image dehazing. Zamir et al. [22] introduced a supervised attention module for feature selection. Chen et al. [23] proposed feature attention (FA), which combines channel attention and pixel attention. This approach enhances adaptability in processing a range of data types by applying a differential treatment to various features and pixels.
Self-attention networks have gained increasing popularity in addressing various computer vision tasks, as they can be jointly trained with the model and yield exceptional results. Wang et al. [24] introduced a non-local module that connects arbitrary pairs of locations in an image, empowering each pixel to gather global information from the complete image. Lee et al. [25] developed KiT, which builds non-local interactions through pairwise local attention mechanisms, maintaining the inductive bias of locality while introducing non-local connections; its computational complexity grows linearly with the input spatial resolution. Chen et al. [26] developed a rectangular-window self-attention mechanism for image restoration that improves the aggregation of information across windows while maintaining computational efficiency. Li et al. [27] proposed anchored stripe self-attention, which effectively extends the modeling capability of self-attention beyond the local region. Zhou et al. [20] proposed HCLR-Net, which effectively extracts features and restores texture details through an adaptive mixed attention module and a detail restoration branch.
Based on the aforementioned insights, the mixed self-attention module can better handle various types of underwater degraded images. However, due to the quadratic complexity of self-attention calculations, these modules are typically employed in large-scale models. Unlike conventional self-attention modules, we propose augmenting the spatial attention module with a frequency domain attention module, while eliminating the traditional QKV generation in the spatial attention module. Furthermore, we generate attention weights through a lightweight branch, significantly reducing the complexity of self-attention.

3. Proposed Method

3.1. Method Overview

The overall architecture of HMENet employs an encoder–decoder structure similar to that of U-Net. The encoder is comprised of multiple residual convolutional blocks (RCBlocks) combined with multi-scale hybrid attention (MHA) modules, aimed at effectively extracting features. The decoder utilizes MHA as a core component to achieve efficient feature reconstruction. Between the encoder and decoder, a transformer sub-network composed of three consecutively stacked transformer blocks is integrated. The design of this subnet draws upon the transformer module from MobileViT v2 [28]. It serves to connect the encoder and decoder while enhancing the model’s capacity to learn global features. The structure of HMENet is illustrated in Figure 1, the architecture of the RCBlock is shown in Figure 2, and the configuration of the MHA is presented in Figure 3.
Specifically, HMENet consists of two components: the encoding and decoding phases. For the encoding phase, given a degraded image $I \in \mathbb{R}^{3 \times H \times W}$, we first extract shallow features from $I$ using a convolution block, where $C$ denotes the number of channels of the shallow features and $H \times W$ denotes the spatial dimensions. Simultaneously, three sets of RCBlocks and MHA modules perform deeper feature extraction on the degraded image. The extracted features are fused with the shallow features, during which the number of channels progressively increases to $4C$ while the spatial resolution gradually decreases to $H/4 \times W/4$, so the feature hierarchy becomes increasingly deeper. Finally, the obtained deep features are fed into the subsequent transformer sub-network for further extraction and integration of deeper global features. The downsampling in the encoding phase is implemented with strided convolutions. For the decoding phase, the features integrated by the transformer sub-network are upsampled by the decoder network, which primarily consists of three sets of RCBlocks and multi-scale hybrid attention (MHA) modules. During decoding, the decoded features are connected with the encoded features [29], and a 1 × 1 convolution is used to reduce the number of channels. Finally, a 3 × 3 convolution is applied to the high-resolution features obtained from the RCBlocks in the decoder network, yielding the learned residual image $R \in \mathbb{R}^{3 \times H \times W}$. The enhanced image is then obtained by adding the residual image to the original degraded image. The upsampling in the decoding phase is implemented using transposed convolution.
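To make the data flow concrete, the following PyTorch sketch reproduces only the channel and resolution bookkeeping described above. The RCBlock, MHA, and transformer modules are replaced by a generic residual-convolution placeholder, so this is a structural illustration under stated assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class Placeholder(nn.Module):
    """Stand-in for an RCBlock + MHA pair (a simple residual convolution)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.body(x)

class HMENetSkeleton(nn.Module):
    """Encoder -> transformer sub-network -> decoder flow of Section 3.1 (shapes only)."""
    def __init__(self, base_ch=32):
        super().__init__()
        C = base_ch
        self.shallow = nn.Conv2d(3, C, 3, padding=1)
        self.enc = nn.ModuleList([Placeholder(C), Placeholder(2 * C), Placeholder(4 * C)])
        self.down = nn.ModuleList([
            nn.Conv2d(C, 2 * C, 3, stride=2, padding=1),        # strided-conv downsampling
            nn.Conv2d(2 * C, 4 * C, 3, stride=2, padding=1),
        ])
        self.subnet = nn.Sequential(*[Placeholder(4 * C) for _ in range(3)])  # transformer sub-network slot
        self.up = nn.ModuleList([
            nn.ConvTranspose2d(4 * C, 2 * C, 2, stride=2),       # transposed-conv upsampling
            nn.ConvTranspose2d(2 * C, C, 2, stride=2),
        ])
        self.fuse = nn.ModuleList([nn.Conv2d(4 * C, 2 * C, 1),   # 1x1 conv after skip concatenation
                                   nn.Conv2d(2 * C, C, 1)])
        self.dec = nn.ModuleList([Placeholder(4 * C), Placeholder(2 * C), Placeholder(C)])
        self.out = nn.Conv2d(C, 3, 3, padding=1)                 # predicts the residual image R

    def forward(self, img):                                      # img: (B, 3, H, W), H and W divisible by 4
        f0 = self.enc[0](self.shallow(img))                      # (B, C,  H,   W)
        f1 = self.enc[1](self.down[0](f0))                       # (B, 2C, H/2, W/2)
        f2 = self.enc[2](self.down[1](f1))                       # (B, 4C, H/4, W/4)
        d2 = self.dec[0](self.subnet(f2))                        # deep global features
        d1 = self.dec[1](self.fuse[0](torch.cat([self.up[0](d2), f1], dim=1)))
        d0 = self.dec[2](self.fuse[1](torch.cat([self.up[1](d1), f0], dim=1)))
        return img + self.out(d0)                                # enhanced = degraded + residual
```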

3.2. Multi-Scale Hybrid Attention Mechanism

The attention mechanism is pivotal in a range of visual tasks, allowing networks to emphasize the most pertinent features. However, traditional attention mechanisms often focus on a single domain. For instance, SPANet [30] considers only spatial domain attention, neglecting the noise reduction and multi-scale feature fusion capabilities of frequency domain attention in underwater image enhancement networks. In complex underwater image enhancement tasks, such as addressing image degradation caused by the refraction and scattering of water, the frequency domain attention mechanism demonstrates superior performance.
To address the aforementioned requirements, we constructed a multi-scale hybrid attention mechanism (MHA), which comprises a spatial domain feature fusion (SDFF) module and a frequency domain feature fusion (FDFF) module for efficient information aggregation, as illustrated in Figure 3.
Specifically, the input features are first processed by a convolution followed by the GELU activation function. They then pass through three frequency domain feature fusion (FDFF) branches of varying scales, which enhance the information at different levels of the image by extracting features across diverse frequency scales. The outputs of the three frequency branches are refined and summed, and the resulting features are divided into three equal parts along the channel dimension using a split operation. These parts are then fed into three spatial domain feature fusion (SDFF) branches of varying scales for spatial modulation and feature integration. Finally, the integrated features undergo a convolution to produce the final output. The specific process can be expressed as follows:
$$X' = W_{1\times 1}\big(\mathrm{FDFF}_{K=7}(X) + \mathrm{FDFF}_{K=11}(X) + \mathrm{FDFF}_{\mathrm{global}}(X)\big),$$
$$X_1, X_2, X_3 = \mathrm{Split}(X'),$$
$$\hat{X} = \big[\mathrm{SDFF}_{K=7}(X_1),\ \mathrm{SDFF}_{K=11}(X_2),\ \mathrm{SDFF}_{\mathrm{global}}(X_3)\big].$$
In this context, $\mathrm{global}$ indicates that the strip length equals the full width $W$ (or height $H$) in the horizontal (or vertical) strip operation units, while $X'$ and $\hat{X}$ denote the intermediate and final outputs of the MHA, respectively. The scale division within the feature fusion module facilitates the simultaneous capture of local details and global contextual information. Smaller scales (e.g., $K = 7$) capture local details and higher-frequency information, whereas larger scales (e.g., global) encompass a broader spatial context and lower-frequency information.
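The following PyTorch sketch mirrors the three equations above. The FDFF and SDFF branch modules are passed in as arguments and are assumptions here (they default to identity so the sketch runs standalone); the pre-convolution with GELU, the 1 × 1 fusion convolution, and the three-way channel split follow the text.

```python
import torch
import torch.nn as nn

class MHASketch(nn.Module):
    """Flow of the multi-scale hybrid attention (MHA) module: three frequency
    branches -> 1x1 fusion -> channel split -> three spatial branches -> concat.

    fdff_branches / sdff_branches stand for the FDFF/SDFF units at strip sizes
    7, 11 and "global"; note that each SDFF branch receives ch // 3 channels.
    """
    def __init__(self, ch, fdff_branches=None, sdff_branches=None):
        super().__init__()
        assert ch % 3 == 0, "channels are split into three equal groups"
        self.pre = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.GELU())
        self.fdff = nn.ModuleList(fdff_branches or [nn.Identity() for _ in range(3)])
        self.mix = nn.Conv2d(ch, ch, 1)        # W_{1x1} fusing the summed frequency branches
        self.sdff = nn.ModuleList(sdff_branches or [nn.Identity() for _ in range(3)])
        self.out = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        x = self.pre(x)
        x_prime = self.mix(sum(f(x) for f in self.fdff))              # X'
        x1, x2, x3 = torch.chunk(x_prime, 3, dim=1)                   # Split(X')
        x_hat = torch.cat([s(xi) for s, xi in zip(self.sdff, (x1, x2, x3))], dim=1)
        return self.out(x_hat)                                        # final MHA output
```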

3.2.1. Spatial Domain Feature Fusion Module

The traditional spatial self-attention module typically assigns different weights to the value tensor (V) by computing the similarity between the query tensor (Q) and the key tensor (K). Given an input tensor $X \in \mathbb{R}^{C \times H \times W}$, self-attention can be expressed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}(QK^{T})\,V,$$
where $Q = XW_Q$, $K = XW_K$, and $V = XW_V$. The computational complexity of spatial self-attention is quadratic with respect to the number of spatial locations. To mitigate this cost and improve computational efficiency, we propose the spatial domain feature fusion (SDFF) module. This module generates attention weights through a lightweight branch, thereby circumventing the computation of Q, K, and V, while also enabling data sharing across the spatial and channel dimensions. This branch consists of adaptive average pooling (AAP), grouped convolutions, and a Sigmoid activation function. Additionally, the module sequentially employs horizontal and vertical strip operation units, which implicitly expand the receptive field of the network. We illustrate the concept using the horizontal strip operation unit as an example. A schematic diagram is shown in Figure 4.
Specifically, given an input tensor X R C × H × W , the process begins by utilizing adaptive average pooling to generate feature vectors. Subsequently, grouped convolutions and a Sigmoid activation function are applied to derive the attention weights. During this process, global spatial information aggregation is achieved through adaptive average pooling, wherein the spatial information of each channel is averaged into a single value, thereby facilitating data sharing across the spatial dimension to some extent. Moreover, grouped convolutions enable channels within each group to share the same convolutional kernel, thus achieving data sharing along the channel dimension. The generation process of the attention weights can be expressed as follows:
$$A = \mathrm{Sigmoid}\big(W_g(\mathrm{AP}(X))\big) \in \mathbb{R}^{K},$$
where $W_g$ denotes the grouped convolution, which applies a 1 × 1 convolution to each channel group, and $K$ ($K \le W$) specifies the length of the integrated horizontal strip. Concurrently, the input tensor undergoes reflective padding and is unfolded into local features. The attention weights are then applied to perform a weighted fusion of these local features, enabling refined extraction and representation of the features. The specific process can be expressed as follows:
$$\hat{X}_{c,h,w} = \sum_{k=1}^{K} A_{k}\, X_{c,\,h,\,w - \lfloor K/2 \rfloor + k}.$$
To achieve effective feature aggregation, the output of our SDFF module is obtained by sequentially employing horizontal and vertical strip operation units. In this process, the network leverages attention mechanisms in both directions to realize bidirectional feature aggregation, thereby effectively expanding the receptive field of the network. The specifics of our SDFF module can be articulated as follows:
$$X_K^{\mathrm{SDFF}} = \mathrm{SDFF}_K(X) = S_K^{W}\big(S_K^{H}(X)\big),$$
where $S_K^{H}(\cdot)$ and $S_K^{W}(\cdot)$ denote the horizontal and vertical strip operators, respectively. In summary, the complexity of our module is $\mathcal{O}(KCHW)$, which is much lower than the quadratic complexity $\mathcal{O}((HW)^{2})$.
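A minimal PyTorch sketch of the horizontal strip operation unit is given below. The exact shape of the attention weights is not fixed by the text; here they are generated per channel by a grouped 1 × 1 convolution after adaptive average pooling, and K is assumed odd (as with K = 7 and K = 11), so this is an illustrative interpretation rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HorizontalStripSDFF(nn.Module):
    """Horizontal strip operation unit of SDFF (the equations for A and X_hat).

    Assumptions: attention weights are produced per channel (C x K) by a grouped
    1x1 convolution, `channels` is divisible by `groups`, and `k` is odd.
    """
    def __init__(self, channels, k=7, groups=4):
        super().__init__()
        self.k = k
        self.pool = nn.AdaptiveAvgPool2d(1)                        # global spatial aggregation (AAP)
        self.weight_gen = nn.Conv2d(channels, channels * k, 1, groups=groups)

    def forward(self, x):
        b, c, h, w = x.shape
        a = torch.sigmoid(self.weight_gen(self.pool(x)))           # (B, C*K, 1, 1) attention logits
        a = a.view(b, c, 1, 1, self.k)                             # per-channel strip weights A
        pad = self.k // 2
        xp = F.pad(x, (pad, pad, 0, 0), mode="reflect")            # reflective padding along the width
        windows = xp.unfold(dimension=3, size=self.k, step=1)      # (B, C, H, W, K) horizontal strips
        return (windows * a).sum(dim=-1)                           # weighted fusion over each strip

# A vertical unit is obtained analogously along the height dimension;
# SDFF_K(X) = S_W(S_H(X)) applies the two units sequentially.
```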

3.2.2. Frequency Domain Feature Fusion Module

To effectively suppress noise in underwater images and reduce the distortion caused by optical scattering while preserving relevant information, we propose a frequency domain feature fusion (FDFF) module. This module employs frequency-selective attention and significantly simplifies the computation through strip average pooling. Additionally, we incorporate learnable attention parameters to adaptively adjust the attention distribution according to the importance of different frequency components, thereby enhancing the overall learning and prediction capabilities. To effectively extract both local and global information, the FDFF module sequentially employs horizontal and vertical strip operation units. We illustrate this process using the horizontal strip operation unit as an example; a schematic representation is provided in Figure 5. Specifically, given an input tensor $X \in \mathbb{R}^{C \times H \times W}$, we first extract the low-frequency component using strip average pooling over the strip region $S_{c,h,w} \in \mathbb{R}^{1 \times K}$ centered at $X_{c,h,w}$, where $S_{c,h,w}$ is a row vector and $K$ denotes the length of the strip region. We then obtain the high-frequency component by subtracting the low-frequency component from the input features. Next, attention parameters are applied to the low-frequency and high-frequency components, and the modulated information is summed to obtain the horizontal feature tensor $\hat{X} = F_K^{H}(X)$. This process can be expressed as follows:
$$X_{c,h,w}^{l} = \mathrm{AP}(S_{c,h,w}),$$
$$X_{c,h,w}^{h} = X_{c,h,w} - X_{c,h,w}^{l},$$
$$\hat{X} = W_l^{1} X^{l} + W_h^{1} X^{h},$$
where $W_l^{1}, W_h^{1} \in \mathbb{R}^{C}$ are the attention parameters used to modulate the low-frequency and high-frequency components, respectively.
The FDFF module’s final output is obtained by sequentially applying the horizontal and vertical strip operation units. The specific process can be expressed as follows:
$$X_K^{\mathrm{FDFF}} = \mathrm{FDFF}_K(X) = F_K^{W}\big(F_K^{H}(X)\big),$$
where $F_K^{H}(\cdot)$ and $F_K^{W}(\cdot)$ denote the horizontal and vertical strip operators, respectively, with corresponding attention parameters $\{W_l^{1}, W_h^{1}\}$ and $\{W_l^{2}, W_h^{2}\}$. The FDFF module does not use traditional convolutional layers; its attention parameters are learned and updated directly through backpropagation, resulting in a highly lightweight parameterization.
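The horizontal FDFF unit can be sketched as follows. The strip average pooling, the high-frequency residual, and the learnable per-channel parameters follow the equations above, and the zero initialization mirrors Section 4.3; this is an interpretation of the description, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HorizontalStripFDFF(nn.Module):
    """Horizontal strip operation unit of FDFF.

    The low-frequency component is the strip average around each pixel, the
    high-frequency component is the residual, and w_low / w_high are learnable
    per-channel parameters (initialized to 0 as in Sec. 4.3; set them to 1/0
    for an identity-like start when testing this unit in isolation).
    """
    def __init__(self, channels, k=7):
        super().__init__()
        self.k = k                                                  # for the "global" scale, k equals W
        self.w_low = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.w_high = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        pad = self.k // 2
        xp = F.pad(x, (pad, pad, 0, 0), mode="reflect")             # pad along the width only
        low = F.avg_pool2d(xp, kernel_size=(1, self.k), stride=1)   # strip average pooling (low frequency)
        high = x - low                                              # high-frequency residual
        return self.w_low * low + self.w_high * high                # frequency-selective reweighting
```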

3.3. Transformer Sub-Network

The core idea of integrating the transformer mechanism into the backbone architecture is to leverage the global feature extraction capabilities of the transformer to enhance the performance of the network. The transformer sub-network consists of three transformer blocks, which are inspired by the transformer module in MobileViT v2 [28]. The specific structure of each transformer block is illustrated in Figure 6.
Specifically, for a given input tensor $X \in \mathbb{R}^{C \times H \times W}$, the transformer block first applies a 1 × 1 convolution to project the channel dimension of $X$ from $C$ to $D$, producing the tensor $X_1 \in \mathbb{R}^{D \times H \times W}$. The tensor $X_1$ is then reshaped into $M$ non-overlapping patch tensors of size $\mathbb{R}^{2 \times 2 \times D}$, which are fed into the transformer block for encoding. This enables the network to model long-range dependencies across different image regions. The encoded segments are subsequently reshaped back into a feature tensor. The detailed procedure can be expressed as follows:
$$[B_1, \ldots, B_i, \ldots, B_M] = \mathrm{Unfold}(X_1),$$
$$X_2 = \mathrm{Fold}\big(\mathrm{Transformer}([B_1, \ldots, B_i, \ldots, B_M])\big),$$
where $B_i$ denotes the $i$-th patch tensor derived from the unfolded tensor $X_1$, $\mathrm{Fold}(\cdot)$ represents the folding operation, and $\mathrm{Unfold}(\cdot)$ denotes the unfolding operation. The transformer block then applies a convolution to the resulting tensor $X_2$, mapping its channel dimension from $D$ back to $C$ and yielding the tensor $X_3$. Finally, the original tensor $X$ is added to $X_3$ and passed through a standard convolutional layer with a kernel size of 3, producing the final output.
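A simplified PyTorch sketch of one transformer block is shown below. It treats each 2 × 2 × D patch as a single token and uses a standard nn.TransformerEncoderLayer in place of MobileViT v2's separable self-attention, so it only illustrates the unfold, encode, fold, project, and residual steps described above, not the exact attention used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlockSketch(nn.Module):
    """One transformer block of the sub-network (Section 3.3), simplified.

    Assumes even H and W; the patch size of 2x2 follows the text, while the
    token definition (one 4*D-dimensional token per patch) is an assumption.
    """
    def __init__(self, c, d, heads=4):
        super().__init__()
        self.proj_in = nn.Conv2d(c, d, 1)                  # project channels C -> D
        self.encoder = nn.TransformerEncoderLayer(d_model=4 * d, nhead=heads, batch_first=True)
        self.proj_out = nn.Conv2d(d, c, 1)                 # project channels D -> C
        self.conv_out = nn.Conv2d(c, c, 3, padding=1)      # final 3x3 convolution

    def forward(self, x):
        b, c, h, w = x.shape
        x1 = self.proj_in(x)                               # (B, D, H, W)
        tokens = F.unfold(x1, kernel_size=2, stride=2)     # (B, 4D, M): one 2x2xD patch per column
        tokens = self.encoder(tokens.transpose(1, 2))      # attention across the M patches
        x2 = F.fold(tokens.transpose(1, 2), output_size=(h, w), kernel_size=2, stride=2)
        x3 = self.proj_out(x2)
        return self.conv_out(x + x3)                       # residual add, then 3x3 conv
```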

3.4. Loss Function

In this study, we employ a dual-domain $L_1$ loss for training. By integrating loss terms from both the spatial and frequency domains, the network can produce enhanced underwater images that are not only visually similar to the reference images but also retain favorable characteristics in the frequency domain. The loss is defined as follows:
$$\mathcal{L}_{s} = \tfrac{1}{E}\,\lVert \hat{I} - G \rVert_{1},$$
$$\mathcal{L}_{f} = \tfrac{1}{E}\,\lVert \mathcal{F}(\hat{I}) - \mathcal{F}(G) \rVert_{1},$$
where $\hat{I}$ and $G$ denote the restored image and the ground truth, respectively, $E$ is the total number of elements used for normalization, and $\mathcal{F}$ denotes the fast Fourier transform. The final loss function is $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{s} + \lambda \mathcal{L}_{f}$, where $\lambda = 0.1$.
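A direct reading of the loss can be written as the sketch below. The paper does not state whether the frequency term compares complex FFT values or their magnitudes, so taking the modulus of the complex difference is an assumption here.

```python
import torch

def dual_domain_l1_loss(pred, target, lam=0.1):
    """Sketch of L_total = L_s + lambda * L_f from Section 3.4."""
    l_s = (pred - target).abs().mean()                      # spatial-domain L1, normalized by element count
    fft_pred = torch.fft.fft2(pred, dim=(-2, -1))
    fft_target = torch.fft.fft2(target, dim=(-2, -1))
    l_f = (fft_pred - fft_target).abs().mean()              # frequency-domain L1 (modulus of complex difference)
    return l_s + lam * l_f
```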

4. Experiments

4.1. Datasets

UIEB: In this study, we trained the HMENet network on the UIEB dataset [31]. The UIEB dataset covers a wide range of underwater image degradation scenarios and environmental conditions, making it suitable for training a robust underwater image enhancement network. It comprises 950 genuine underwater images, including 890 paired images and 60 unpaired images. We selected 90 paired images for testing, referred to as T-90, while the unpaired images are designated as C-60.
EUVP: To validate the generalizability of our model, we simultaneously trained the HMENet network on the EUVP dataset [32]. The EUVP dataset consists of 20,000 underwater images, both paired and unpaired, captured under various visibility conditions using seven different cameras.
UCCS: The UCCS dataset [33] consists of real underwater images captured near Zhangzi Island in the Yellow Sea, encompassing three distinct hues: blue, blue–green, and green. This dataset is used solely for testing purposes.
U45: The U45 dataset [34] comprises images exhibiting color distortion, reduced contrast, and hazy deterioration effects specific to underwater environments. This dataset is used exclusively for testing purposes.

4.2. Evaluation Metrics

To compare the proposed model with state-of-the-art underwater image enhancement methods, quantitative evaluation was conducted using the peak signal-to-noise ratio (PSNR) [35], structural similarity index (SSIM) [36,37], underwater image quality measure (UIQM) [38], and underwater color image quality evaluation (UCIQE) [39]. Among these, the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) serve as evaluation metrics for paired datasets by measuring the similarity between the input image and the reference image. In contrast, the underwater image quality measure (UIQM) and underwater color image quality evaluation (UCIQE) differ from reference-based metrics; they focus on evaluating color saturation, brightness, and contrast, independent of any reference image. These metrics rely entirely on the attributes of the underwater images being evaluated and are applicable to unpaired datasets.
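For the reference-based metrics, a typical evaluation sketch is given below (assuming scikit-image ≥ 0.19 and 8-bit RGB inputs); UIQM and UCIQE require separate no-reference implementations and are not shown.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reference_metrics(enhanced, reference):
    """PSNR / SSIM for paired evaluation, as used on UIEB and EUVP.

    Both inputs are H x W x 3 uint8 arrays; UIQM/UCIQE are no-reference
    metrics and are computed separately.
    """
    psnr = peak_signal_noise_ratio(reference, enhanced, data_range=255)
    ssim = structural_similarity(reference, enhanced, channel_axis=-1, data_range=255)
    return psnr, ssim
```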

4.3. Implementation Details

The experiments were carried out using the PyTorch 2.2.0 framework on two NVIDIA GeForce RTX 2080 Ti GPUs. The model employs a multilevel regularization strategy: transformer layers incorporate attention dropout (0.1) and path dropout (0.1), while convolutional layers use batch normalization and group normalization. Training also incorporates random scale augmentation (0.5×–1.25×), which, together with HMENet's inherent multi-scale feature fusion modules, improves the model's robustness to scale variations.
Parameters in the frequency domain attention module use a zero-centered initialization strategy: the parameters for the low-frequency and high-frequency components are initialized to 0, the scaling parameter gamma is initialized to 0, and the shift parameter beta is initialized to 1. Training uses a three-stage optimization strategy (a schedule sketch is given after this list):
(1)
Warm-up phase (first 500 steps): The learning rate increases linearly from 1% of the initial value to 100%.
(2)
Main training phase: This uses cosine annealing of the learning rate from $5 \times 10^{-4}$ to $1 \times 10^{-6}$.
(3)
Fine-tuning phase: This uses a fixed learning rate of $1 \times 10^{-6}$ for 50 epochs of fine-tuning, combined with automatic mixed precision (AMP) and gradient clipping (max_norm = 1.0).
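The three-stage schedule can be summarized by the following sketch. The total number of cosine-annealing steps is an assumed placeholder, since the paper specifies epochs rather than steps, and the commented loop shows how AMP and gradient clipping are typically combined in PyTorch.

```python
import math
import torch

def lr_at(step, warmup_steps=500, total_steps=100_000, lr_max=5e-4, lr_min=1e-6):
    """Learning rate at a given step: linear warm-up from 1% of lr_max,
    cosine annealing to lr_min, then a fixed lr_min fine-tuning phase.
    total_steps is an assumed value, not taken from the paper."""
    if step < warmup_steps:
        return lr_max * (0.01 + 0.99 * step / warmup_steps)   # warm-up: 1% -> 100% of lr_max
    if step >= total_steps:
        return lr_min                                         # fine-tuning phase
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# Inside the training loop, AMP and gradient clipping are combined roughly as:
# scaler = torch.cuda.amp.GradScaler()
# with torch.cuda.amp.autocast():
#     loss = dual_domain_l1_loss(model(x), y)   # the loss sketched in Section 3.4
# scaler.scale(loss).backward()
# scaler.unscale_(optimizer)
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# scaler.step(optimizer); scaler.update(); optimizer.zero_grad()
```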

4.4. Quantitative Evaluation

We performed a quantitative evaluation of our model against several state-of-the-art (SOTA) methods: GDCP [40], FUInEGAN [32], UWCNN [31], WaterNet [41], UColor [1], MLLE [42], PUIE_MP [43], HAAM-GAN [44], and HCLR-Net [20]. The resulting data are presented in Table 1.
To ensure a fair comparison, Table 1 reports the PSNR, SSIM, UIQM, and UCIQE metrics for each method on the UIEB and EUVP datasets, and the UCIQE and UIQM metrics for each method on the unpaired UCCS and U45 datasets. Experimental results on the UIEB dataset demonstrate that our method outperforms all other comparative approaches, achieving PSNR/SSIM values of 25.96/0.946 and exceeding the second-best HCLR-Net by 0.97 dB in PSNR and 0.021 in SSIM. We observe an even larger improvement on the EUVP dataset, with PSNR/SSIM reaching 27.92/0.927, surpassing HCLR-Net by 1.88 dB in PSNR and 0.012 in SSIM. Furthermore, this study uses the parameters of the model trained on the UIEB dataset to perform direct testing on the UIEB C-60, UCCS, and U45 datasets, where our method demonstrates superior performance across the majority of the datasets.
For the U45 dataset, although the UIQM value of PUIE_MP is slightly higher than ours, the visual results reveal that the images produced by PUIE_MP still exhibit a slight yellow color cast and lower brightness, as demonstrated in Figure 7. In contrast, our approach not only resolves the color cast problem but also performs better in terms of overall luminance. Furthermore, our approach produces enhancements that are closest to the true colors of the images.
To rigorously evaluate HMENet’s robustness against training stochasticity, we conducted five independent training trials with distinct random seeds. Table 2 reports the standard deviations (stds) of all performance metrics across these trials for both HMENet and competing methods on five benchmark datasets. The significantly lower standard deviations of our approach quantitatively confirm its superior stability and reproducibility compared to state-of-the-art methods under varying initialization conditions. Table 2 is an extension of Table 1.
To comprehensively assess HMENet’s practical deployment efficiency, we conducted a systematic comparison of computational metrics against state-of-the-art methods. Table 3 reports the trainable parameters (#Params), multiply–accumulate operations (MACs), GPU memory consumption, and inference time across nine competing approaches and our proposed HMENet. The significantly lower complexity profile of HMENet (3.5 M Params/16.7 G MACs) quantitatively confirms its superior hardware adaptability while maintaining competitive enhancement performance. This efficiency–performance co-optimization demonstrates HMENet’s viability for resource-constrained underwater platforms. Table 3 is an extension of Table 1.
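For reference, the parameter counts in Table 3 correspond to the standard PyTorch tally shown below; MACs and memory footprints are typically measured with third-party profilers and are not reproduced here.

```python
def count_trainable_params(model) -> float:
    """Number of trainable parameters, reported in millions as in Table 3."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```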

4.5. Qualitative Evaluation

In this section, we present a comparative visual analysis across different test datasets to assess the strengths and weaknesses of the aforementioned models. The result images for the UIEB dataset are shown in Figure 7, those for the EUVP dataset in Figure 8, the U45 results in Figure 9, and the UCCS results in Figure 10.
We carefully selected different scenes from the paired datasets for evaluation. GDCP [40] employs an adaptive color correction IFM to restore degraded images; however, this may amplify color cast under extreme conditions. FUInEGAN [32] is an efficient lightweight GAN model that produces elevated color saturation; however, it fails to preserve pertinent image details when enhancing complex degraded images, introducing considerable artifacts. The UWCNN [31] network architecture is overly simplistic, resulting in limited robustness and slightly blurred enhanced images. WaterNet [41] leverages traditional preprocessing techniques and achieves significant improvements in color correction and contrast enhancement; however, the overall brightness is lower than that of the original images. UColor [1] demonstrates good contrast; however, its performance can be suboptimal in certain scenarios due to the influence of the GDCP prior. MLLE [42] enhances images by employing the principle of minimal color loss and a maximum attenuation map-guided fusion strategy, effectively enhancing image details; however, owing to limitations in its color restoration, the resulting images do not reliably reflect the actual colors of the objects. PUIE_MP [43] introduces MC likelihood estimation and MP estimation to predict the final results, but its enhancement exhibits a slight color bias compared to our method. HAAM-GAN [44] successfully addresses the color bias issue, but the enhanced images often suffer from excessive saturation, leading to an unrealistic appearance. HCLR-Net [20] employs a hybrid contrastive learning approach to obtain feature distributions similar to those of the reference images; however, its enhancement results lack certain image details when compared to our method. In contrast, our approach not only enhances the details of the original images but also effectively restores the true color information, resulting in improved visual quality.
In our experimental analysis employing the unpaired dataset UCCS, we enhanced images of green, blue, and blue–green hues to assess and compare the color-correction performance of different methodologies, as illustrated in Figure 10. The magnified sections highlight each method’s ability to capture image details. However, it is noteworthy that GDCP [40], FUInEGAN [32], and UWCNN [31] fail to completely eliminate color bias, resulting in images that appear unnatural and slightly blurred. WaterNet [41], UColor [1], and PUIE_MP [43] achieve better contrast; however, their outputs exhibit a slight color bias compared to our method. MLLE [42] and HAAM-GAN [44] address the color bias issue, but the enhanced images suffer from local over-enhancement, resulting in unrealistic colors. HCLR-Net [20] produces enhanced images that are accurate in color representation and free of bias; yet, when compared to our method, it shows slightly inferior saturation and contrast. To assess the efficacy of our proposed method, we conducted a comparative analysis with multiple methods on the U45 dataset, as illustrated in Figure 9.
Our approach effectively eliminates various color biases while demonstrating superior performance in terms of image brightness, contrast, and detail. Additionally, the generated results visually resemble real images more closely. The comprehensive comparisons indicate that our method exhibits enhanced performance overall.

4.6. Ablation Studies

In this section, we perform ablation studies to validate the effectiveness of the proposed HMENet. These experiments are divided into four parts: the combination mode of the bidirectional strip operation units in the multi-scale hybrid attention mechanism (MHA), the selection of different strip sizes, the design of the MHA module, and the selection of the number of transformer sub-network layers. All experiments are trained for 500 epochs on the UIEB dataset and tested on the UIEB C-60 dataset.

4.6.1. The Design Selection of FDFF and SDFF

First, we developed a baseline model to facilitate the ablation experiments, which includes the transformer sub-network but excludes the multi-scale hybrid attention mechanism (MHA) module. Subsequently, we incorporated various modules into the baseline network, as outlined in the accompanying Table 4. The baseline model achieved a PSNR of 23.06 dB on the UIEB dataset. With a negligible increase in complexity, the proposed spatial domain feature fusion module (SDFF) yielded performance gains of 2.19 dB for horizontal and 2.09 dB for vertical strip units, respectively. Subsequently, we investigated the compatibility of the horizontal and vertical strip operation units, with their combination patterns defined as follows: SDFF-W-H, SDFF-H-W, and SDFF-Parallel. All combination patterns achieved higher scores compared to the use of spatial strip operation units alone, with the sequence of using horizontal and vertical strip operation units demonstrating the best performance. This finding confirms the compatibility of horizontal and vertical strip operation units. We further conducted a similar investigation on the frequency domain feature fusion module (FDFF). Compared to the use of the SDFF module, using frequency horizontal and vertical strip operation units either individually or in parallel did not yield performance gains, whereas the sequential use of horizontal and vertical strip operation units resulted in the best outcome, providing a performance increase of 0.18 dB. Ultimately, we selected the version specified in No.11 as the default configuration for our model.

4.6.2. Selection of Different Strip Sizes

Taking FDFF as an example, we further investigated the impact of different strip sizes, as shown in Table 5. Compared to the use of the SDFF module alone, the FDFF module employing a mixture of different strip sizes achieved higher scores. The incorporation of multi-scale strip sizes (K = 7, K = 11, and K = Global) facilitates the simultaneous capture of local details and global contextual information in the image. Smaller scales (e.g., K = 7) are effective in capturing fine features and high-frequency signals, while larger scales (e.g., K = Global) cover a broader receptive field and low-frequency signals. This confirms the effectiveness of the multi-size mixed design.

4.6.3. Design Choice of MHA

Under the established configurations of the FDFF and SDFF modules, we further investigated various design choices for the MHA module, as illustrated in Table 6. Firstly, we validated the necessity of each module by stacking the same module twice. Subsequently, we explored the optimal design of the MHA module by altering the deployment order of the two modules. Although repeating the FDFF module (FDFF+FDFF) achieved a gain of 0.16 dB over the SDFF-then-FDFF order and 0.26 dB over the parallel arrangement, applying the FDFF module followed by the SDFF module (FDFF+SDFF) yielded a further improvement of 0.49 dB over FDFF+FDFF. This clearly demonstrates the effectiveness of the MHA design.

4.6.4. Design Selection of the Transformer Sub-Network

With the MHA module fixed, we further investigated the proposed transformer sub-network, as shown in Table 7. The transformer sub-network consists of transformer blocks based on MobileViT v2. We validated the necessity and optimal design of the transformer sub-network by varying the number of transformer layers, denoted as N. When the transformer sub-network was not used, i.e., N = 0, the PSNR was 23.76 dB, a drop of 2.2 dB compared to the best result achieved with N = 6. This underscores the necessity of the transformer sub-network. Subsequently, by setting N to 4, 5, 6, 10, 12, and 16, we assessed the influence of the number of transformer layers on the overall network performance, further validating the efficacy of the transformer sub-network.
To rigorously determine the optimal architectural placement of the transformer module, we evaluated four distinct transformer configurations: (1) a serial MHA before the encoder, (2) a serial MHA between the encoder–decoder (proposed), (3) a serial MHA after the decoder, and (4) a parallel MHA. Table 8 reports the quantitative performance metrics (PSNR, SSIM, and inference time) for these configurations on the UIEB benchmark. The significantly superior enhancement capability of our encoder–decoder intermediate placement (achieving peak 25.96 dB PSNR) quantitatively confirms its effectiveness for cross-level feature fusion and computational efficiency compared to alternative integration approaches.

5. Conclusions

Underwater images frequently suffer from severe degradation such as light absorption, scattering, and color distortion due to the complex and variable nature of underwater environments. Existing convolutional neural network (CNN) models face significant limitations, particularly their constrained receptive fields, which hinder their ability to learn diverse image features and achieve robust generalization. To address these challenges, this study introduced the hybrid multi-domain enhanced network (HMENet), which integrates a transformer sub-network within a CNN-based encoder–decoder architecture to expand the receptive field and capture global dependencies, along with a multi-scale hybrid attention mechanism (MHA) combining spatial (SDFF) and frequency domain (FDFF) modules for dynamic feature fusion. Experimental results demonstrate that HMENet achieves state-of-the-art performance with PSNR/SSIM scores of 25.96/0.946 on UIEB and 27.92/0.927 on EUVP, outperforming HCLR-Net by 0.97 dB and 1.88 dB in PSNR, respectively, while also showing strong generalization on unpaired datasets (UCCS, U45).

6. Future Work

While the HMENet model demonstrates superior performance over state-of-the-art methods in both quantitative and qualitative evaluations, its real-time computational efficiency when processing high-resolution images requires optimization. Furthermore, its performance in extremely challenging scenarios (e.g., highly turbid waters) could be further improved. Future work will explore lightweight transformer variants to achieve faster inference, integrate physical scattering models to enhance robustness under extreme conditions, and extend to multimodal inputs (e.g., sonar or depth data) for broader underwater vision tasks.

Author Contributions

Conceptualization, T.S. and J.H.; methodology, T.S. and Y.Z.; software, T.S.; validation, H.C., J.H. and T.S.; formal analysis, T.S. and Y.Z.; investigation, H.C.; resources, T.Y.; data curation, Y.Z.; writing—original draft preparation, T.S.; writing—review and editing, T.Y.; visualization, T.Y.; supervision, T.Y.; project administration, T.Y.; funding acquisition, T.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

In this manuscript, we employ specific mathematical notations to ensure clarity and consistency. The following conventions are used throughout: $\mathbb{R}^{H \times W \times C}$ signifies the dimensionality of a tensor. Tensors are denoted by uppercase boldface, e.g., X. Superscripts and subscripts of X denote its indices; for example, the dimensionality of the tensor $X_2^1$ can be expressed as $X_2^1 \in \mathbb{R}^{H \times W \times C}$.

References

  1. Li, C.; Anwar, S.; Hou, J.; Cong, R.; Guo, C.; Ren, W. Underwater Image Enhancement via Medium Transmission-Guided Multi-Color Space Embedding. IEEE Trans. Image Process. 2021, 30, 4985–5000. [Google Scholar] [CrossRef]
  2. Wang, Y.; Yan, Y.; Ding, X.; Fu, X. Underwater Image Enhancement via L2 based Laplacian Pyramid Fusion. In Proceedings of the OCEANS 2019 MTS/IEEE SEATTLE, Seattle, WA, USA, 27–31 October 2019; pp. 1–4. [Google Scholar]
  3. Dai, C.; Lin, M. Adaptive contrast enhancement for underwater image using imaging model guided variational framework. Multimed. Tools Appl. 2024, 83, 83311–83338. [Google Scholar] [CrossRef]
  4. Zhang, W.; Pan, X.; Xie, X.; Li, L.; Wang, Z.; Han, C. Color correction and adaptive contrast enhancement for underwater image enhancement. Comput. Electr. Eng. 2021, 91, 106981. [Google Scholar] [CrossRef]
  5. Zhang, W.; Dong, L.; Xu, W. Retinex-inspired color correction and detail preserved fusion for underwater image enhancement. Comput. Electron. Agric. 2022, 192, 106585. [Google Scholar] [CrossRef]
  6. Zhang, D.; Zhou, J.; Guo, C.; Zhang, W.; Li, C. Synergistic Multiscale Detail Refinement via Intrinsic Supervision for Underwater Image Enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 7033–7041. [Google Scholar]
  7. Guan, M.; Xu, H.; Jiang, G.; Yu, M.; Chen, Y.; Luo, T.; Zhang, X. DiffWater: Underwater Image Enhancement Based on Conditional Denoising Diffusion Probabilistic Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 2319–2335. [Google Scholar] [CrossRef]
  8. Lu, S.; Guan, F.; Zhang, H.; Lai, H. Underwater image enhancement method based on denoising diffusion probabilistic model. J. Vis. Commun. Image Represent. 2023, 96, 103926. [Google Scholar] [CrossRef]
  9. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  10. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Volume 9351. [Google Scholar]
  11. Chiang, J.Y.; Chen, Y.C. Underwater Image Enhancement by Wavelength Compensation and Dehazing. IEEE Trans. Image Process. 2012, 21, 1756–1769. [Google Scholar] [CrossRef]
  12. Galdran, A.; Pardo, D.; Picón, A.; Alvarez-Gila, A. Automatic Red-Channel Underwater Image Restoration. J. Vis. Commun. Image Represent. 2015, 26, 132–145. [Google Scholar] [CrossRef]
  13. He, K.; Sun, J.; Tang, X. Single Image Haze Removal Using Dark Channel Prior. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1956–1963. [Google Scholar]
  14. Peng, Y.T.; Cosman, P.C. Underwater Image Restoration Based on Image Blurriness and Light Absorption. IEEE Trans. Image Process. 2017, 26, 1579–1594. [Google Scholar] [CrossRef]
  15. Wang, Y.; Liu, H.; Chau, L.P. Single Underwater Image Restoration Using Adaptive Attenuation-Curve Prior. IEEE Trans. Circuits Syst. Video Technol. 2018, 65, 992–1002. [Google Scholar] [CrossRef]
  16. Li, J.; Skinner, K.A.; Eustice, R.M.; Johnson-Roberson, M. WaterGAN: Unsupervised Generative Network to Enable Real-Time Color Correction of Monocular Underwater Images. IEEE Robot. Autom. Lett. 2017, 2, 1944–1950. [Google Scholar] [CrossRef]
  17. Li, C.; Guo, J.; Guo, C. Emerging From Water: Underwater Image Color Correction Based on Weakly Supervised Color Transfer. IEEE Signal Process. Lett. 2018, 25, 323–327. [Google Scholar] [CrossRef]
  18. Fu, X.; Cao, X. Underwater Image Enhancement with Global–Local Networks and Compressed-Histogram Equalization. Signal Process. Image Commun. 2020, 86, 115892. [Google Scholar] [CrossRef]
  19. Qi, Q.; Zhang, Y.; Tian, F.; Wu, Q.J.; Li, K.; Luan, X.; Song, D. Underwater Image Co-Enhancement With Correlation Feature Matching and Joint Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 1133–1147. [Google Scholar] [CrossRef]
  20. Zhou, J.; Sun, J.; Li, C.; Jiang, Q.; Zhou, M.; Lam, K.M.; Zhang, W.; Fu, X. HCLR-Net: Hybrid Contrastive Learning Regularization with Locally Randomized Perturbation for Underwater Image Enhancement. Int. J. Comput. Vis. 2024, 132, 4132–4156. [Google Scholar] [CrossRef]
  21. Liu, X.; Ma, Y.; Shi, Z.; Chen, J. GridDehazeNet: Attention-Based Multi-Scale Network for Image Dehazing. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7313–7322. [Google Scholar]
  22. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Multi-Stage Progressive Image Restoration. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14816–14826. [Google Scholar]
  23. Chen, X.; Fan, Z.; Li, P.; Dai, L.; Kong, C.; Zheng, Z.; Huang, Y.; Li, Y. Unpaired Deep Image Dehazing Using Contrastive Disentanglement Learning. arXiv 2022, arXiv:2203.07677. [Google Scholar]
  24. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  25. Lee, H.; Choi, H.; Sohn, K.; Min, D. KNN Local Attention for Image Restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2129–2139. [Google Scholar]
  26. Chen, Z.; Zhang, Y.; Gu, J.; Zhang, Y.; Kong, L.; Yuan, X. Cross Aggregation Transformer for Image Restoration. arXiv 2023, arXiv:2211.13654. [Google Scholar]
  27. Li, Y.; Fan, Y.; Xiang, X.; Demandolx, D.; Ranjan, R.; Timofte, R.; Van Gool, L. Efficient and Explicit Modelling of Image Hierarchies for Image Restoration. arXiv 2023, arXiv:2303.00748. [Google Scholar]
  28. Mehta, S.; Rastegari, M. Separable Self-attention for Mobile Vision Transformers. arXiv 2022, arXiv:2206.02680. [Google Scholar]
  29. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient Transformer for High-Resolution Image Restoration. arXiv 2022, arXiv:2111.09881. [Google Scholar]
  30. Wang, T.; Yang, X.; Xu, K.; Chen, S.; Zhang, Q.; Lau, R.W.H. Spatial Attentive Single-Image Deraining With a High Quality Real Rain Dataset. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12262–12271. [Google Scholar]
  31. Li, C.; Anwar, S.; Porikli, F. Underwater Scene Prior Inspired Deep Underwater Image and Video Enhancement. Pattern Recognit. 2020, 98, 107038. [Google Scholar] [CrossRef]
  32. Islam, M.J.; Xia, Y.; Sattar, J. Fast Underwater Image Enhancement for Improved Visual Perception. IEEE Robot. Autom. Lett. 2020, 5, 3227–3234. [Google Scholar] [CrossRef]
  33. Liu, R.; Fan, X.; Zhu, M.; Hou, M.; Luo, Z. Real-World Underwater Enhancement: Challenges, Benchmarks, and Solutions Under Natural Light. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4861–4875. [Google Scholar] [CrossRef]
  34. Li, H.; Li, J.; Wang, W. A Fusion Adversarial Underwater Image Enhancement Network with a Public Test Dataset. arXiv 2019, arXiv:1906.06819. [Google Scholar]
  35. Korhonen, J.; You, J. Peak Signal-to-Noise Ratio Revisited: Is Simple Beautiful? In Proceedings of the 2012 Fourth International Workshop on Quality of Multimedia Experience, Melbourne, Australia, 5–7 July 2012; pp. 37–38. [Google Scholar]
  36. Liu, Y.; Zhai, G.; Gu, K.; Liu, X.; Zhao, D.; Gao, W. Reduced-Reference Image Quality Assessment in Free-Energy Principle and Sparse Representation. IEEE Trans. Multimed. 2018, 20, 379–391. [Google Scholar] [CrossRef]
  37. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  38. Panetta, K.; Gao, C.; Agaian, S. Human-Visual-System-Inspired Underwater Image Quality Measures. IEEE J. Ocean. Eng. 2016, 41, 541–551. [Google Scholar] [CrossRef]
  39. Yang, M.; Sowmya, A. An Underwater Color Image Quality Evaluation Metric. IEEE Trans. Image Process. 2015, 24, 6062–6071. [Google Scholar] [CrossRef]
  40. Peng, Y.T.; Cao, K.; Cosman, P.C. Generalization of the Dark Channel Prior for Single Image Restoration. IEEE Trans. Image Process. 2018, 27, 2856–2868. [Google Scholar] [CrossRef]
  41. Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S.; Tao, D. An Underwater Image Enhancement Benchmark Dataset and Beyond. IEEE Trans. Image Process. 2020, 29, 4376–4389. [Google Scholar] [CrossRef]
  42. Zhang, W.; Zhuang, P.; Sun, H.H.; Li, G.; Kwong, S.; Li, C. Underwater Image Enhancement via Minimal Color Loss and Locally Adaptive Contrast Enhancement. IEEE Trans. Image Process. 2022, 31, 3997–4010. [Google Scholar] [CrossRef]
  43. Fu, Z.; Wang, W.; Huang, Y.; Ding, X.; Ma, K.K. Uncertainty Inspired Underwater Image Enhancement. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Volume 13678, pp. 465–482. [Google Scholar]
  44. Zhang, D.; Wu, C.; Zhou, J.; Zhang, W.; Li, C.; Lin, Z. Hierarchical attention aggregation with multi-resolution feature learning for GAN-based underwater image enhancement. Eng. Appl. Artif. Intell. 2023, 125, 106743. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the hybrid multi-domain enhanced network (HMENet). The upper half represents the encoding phase, which includes RCBlocks, MHA, and the transformer sub-network. The lower section illustrates the decoding process, where features extracted by the encoder are combined with the decoded features to improve the network’s performance.
Figure 2. Internal structure of the residual convolution block (RCBlock).
Figure 3. Internal structure of the multi-scale hybrid attention mechanism (MHA) module.
Figure 4. Spatial domain feature fusion module (SDFF) structure.
Figure 5. Frequency domain feature fusion module (FDFF) structure.
Figure 6. Internal structure of the transformer block.
Figure 7. Visual comparison of different methods on the UIEB dataset.
Figure 8. Visual comparison of different methods on the EUVP dataset.
Figure 9. Visual comparison of different methods on the U45 dataset.
Figure 10. Visual comparison of different methods on the UCCS dataset.
Table 1. A quantitative comparison of different methods on the UIEB, EUVP, UCCS, and U45 datasets.
Methods          | UIEB T-90      | EUVP           | UIEB C-60      | UCCS           | U45
                 | PSNR↑   SSIM↑  | PSNR↑   SSIM↑  | UIQM↑   UCIQE↑ | UIQM↑   UCIQE↑ | UIQM↑   UCIQE↑
GDCP [40]        | 13.38   0.747  | 16.38   0.644  | 2.250   0.569  | 2.699   0.564  | 2.248   0.594
FUInEGAN [32]    | 17.11   0.701  | 21.92   0.887  | 2.220   0.508  | 3.095   0.529  | 2.473   0.519
UWCNN [31]       | 17.95   0.847  | 17.73   0.704  | 2.546   0.520  | 3.025   0.498  | 3.079   0.546
WaterNet [41]    | 17.35   0.813  | 20.14   0.681  | 2.382   0.597  | 3.039   0.569  | 2.993   0.599
UColor [1]       | 21.90   0.872  | 21.89   0.795  | 2.482   0.553  | 3.019   0.550  | 3.159   0.573
MLLE [42]        | 19.08   0.825  | 15.04   0.632  | 2.250   0.569  | 2.868   0.570  | 2.485   0.595
PUIE_MP [43]     | 21.52   0.854  | 22.60   0.814  | 2.521   0.558  | 3.003   0.536  | 3.199   0.578
HAAM-GAN [44]    | 22.95   0.889  | 22.55   0.746  | 2.876   0.570  | 3.029   0.556  | 3.029   0.606
HCLR-Net [20]    | 24.99   0.925  | 26.04   0.915  | 2.695   0.586  | 3.045   0.579  | 3.103   0.610
Ours             | 25.96   0.946  | 27.92   0.927  | 2.635   0.586  | 3.107   0.574  | 3.067   0.611
Table 2. Quantitative comparison of the standard deviations of metrics across different methods and datasets.
Methods          | UIEB T-90         | EUVP              | UIEB C-60         | UCCS              | U45
                 | PSNRstd  SSIMstd  | PSNRstd  SSIMstd  | UIQMstd  UCIQEstd | UIQMstd  UCIQEstd | UIQMstd  UCIQEstd
GDCP [40]        | 0.21     0.012    | 0.25     0.011    | 0.035    0.012    | 0.045    0.014    | 0.041    0.014
FUInEGAN [32]    | 0.28     0.015    | 0.31     0.013    | 0.033    0.011    | 0.040    0.009    | 0.036    0.010
UWCNN [31]       | 0.26     0.014    | 0.29     0.012    | 0.037    0.010    | 0.041    0.011    | 0.044    0.011
WaterNet [41]    | 0.24     0.013    | 0.27     0.011    | 0.034    0.010    | 0.043    0.012    | 0.042    0.010
UColor [1]       | 0.32     0.015    | 0.30     0.014    | 0.036    0.010    | 0.044    0.011    | 0.046    0.010
MLLE [42]        | 0.25     0.014    | 0.22     0.010    | 0.033    0.010    | 0.040    0.011    | 0.034    0.009
PUIE_MP [43]     | 0.31     0.015    | 0.033    0.014    | 0.037    0.010    | 0.043    0.011    | 0.047    0.010
HAAM-GAN [44]    | 0.34     0.016    | 0.32     0.013    | 0.042    0.011    | 0.044    0.012    | 0.044    0.011
HCLR-Net [20]    | 0.20     0.011    | 0.38     0.016    | 0.032    0.006    | 0.041    0.011    | 0.038    0.008
Ours             | 0.12     0.004    | 0.15     0.005    | 0.020    0.005    | 0.025    0.006    | 0.022    0.005
Table 3. Quantitative comparison of the efficiency metrics for different methods on various datasets.
Methods          | #Params (M) | MACs (G) | GPU Mem (GB) | Inference Time (ms)
GDCP [40]        | 0.02        | 0.15     | 0.8          | 35.2
FUInEGAN [32]    | 2.3         | 5.7      | 1.5          | 22.5
UWCNN [31]       | 0.12        | 1.8      | 1.2          | 18.7
WaterNet [41]    | 0.34        | 4.2      | 1.8          | 25.4
UColor [1]       | 3.1         | 15.3     | 2.7          | 32.1
MLLE [42]        | 0.08        | 0.9      | 1.1          | 15.8
PUIE_MP [43]     | 4.7         | 21.6     | 3.4          | 41.3
HAAM-GAN [44]    | 5.2         | 24.8     | 3.8          | 45.6
HCLR-Net [20]    | 3.8         | 18.3     | 3.1          | 38.2
Ours             | 3.5         | 16.7     | 2.9          | 29.3
Table 4. Ablation results of different SDFF and FDFF designs.
No. | Method              | PSNR↑  | SSIM↑
1   | Base                | 23.06  | 0.891
2   | SDFF-W              | 25.25  | 0.931
3   | SDFF-H              | 25.15  | 0.924
4   | SDFF-Parallel       | 25.13  | 0.918
5   | SDFF-W-H            | 25.35  | 0.926
6   | SDFF-H-W            | 25.78  | 0.928
7   | FDFF-W+SDFF         | 25.42  | 0.925
8   | FDFF-H+SDFF         | 25.53  | 0.929
9   | FDFF-Parallel+SDFF  | 25.61  | 0.927
10  | FDFF-W-H+SDFF       | 25.45  | 0.924
11  | FDFF-H-W+SDFF       | 25.96  | 0.946
Table 5. Ablation results of different strip size designs.
Method                    | PSNR↑  | SSIM↑
SDFF                      | 25.78  | 0.928
SDFF+FDFF-K7              | 25.35  | 0.927
SDFF+FDFF-K11             | 25.55  | 0.928
SDFF+FDFF-K7-K11          | 25.13  | 0.923
SDFF+FDFF-K7-K11-Global   | 25.96  | 0.946
Table 6. Ablation results of different MHA designs.
Method               | PSNR↑  | SSIM↑
SDFF+SDFF            | 25.38  | 0.866
FDFF+FDFF            | 25.47  | 0.926
SDFF+FDFF-Parallel   | 25.21  | 0.918
SDFF+FDFF            | 25.31  | 0.827
FDFF+SDFF            | 25.96  | 0.946
Table 7. Ablation results of different transformer layer configurations.
Method   | N = 0  | N = 4  | N = 5  | N = 6  | N = 10 | N = 12 | N = 16
PSNR↑    | 23.76  | 25.64  | 25.78  | 25.96  | 25.75  | 25.14  | 24.92
Table 8. Ablation study on transformer placement configurations.
Placement Configuration          | PSNR↑  | SSIM↑
MHA before Encoder (Serial)      | 25.42  | 0.913
MHA between Enc–Dec (Serial)     | 25.96  | 0.932
MHA after Decoder (Serial)       | 25.18  | 0.908
MHA in Parallel                  | 25.67  | 0.925
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
