Article

Enhancing Underwater Images with LITM: A Dual-Domain Lightweight Transformer Framework

1 College of Engineering Science and Technology, Shanghai Ocean University, Shanghai 201306, China
2 College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
3 Qingdao Conson Oceantec Valley Development Co., Ltd., Qingdao 266237, China
4 School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
5 Shanghai Engineering Research Center of Marine Renewable Energy, Shanghai 201306, China
* Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(8), 1403; https://doi.org/10.3390/jmse13081403
Submission received: 16 June 2025 / Revised: 20 July 2025 / Accepted: 21 July 2025 / Published: 23 July 2025
(This article belongs to the Section Ocean Engineering)

Abstract

Underwater image enhancement (UIE) technology plays a vital role in marine resource exploration, environmental monitoring, and underwater archaeology. However, due to the absorption and scattering of light in underwater environments, images often suffer from blurred details, color distortion, and low contrast, which seriously affect the usability of underwater images. To address the above limitations, a lightweight transformer-based model (LITM) is proposed for improving underwater degraded images. Firstly, our proposed method utilizes a lightweight RGB transformer enhancer (LRTE) that uses efficient channel attention blocks to capture local detail features in the RGB domain. Subsequently, a lightweight HSV transformer encoder (LHTE) is utilized to extract global brightness, color, and saturation from the hue–saturation–value (HSV) domain. Finally, we propose a multi-modal integration block (MMIB) to effectively fuse enhanced information from the RGB and HSV pathways, as well as the input image. Our proposed LITM method significantly outperforms state-of-the-art methods, achieving a peak signal-to-noise ratio (PSNR) of 26.70 and a structural similarity index (SSIM) of 0.9405 on the LSUI dataset. Furthermore, the designed method also exhibits good generality and adaptability on unpaired datasets.

1. Introduction

As vital carriers of marine information, underwater images are widely used in marine biology detection, underwater exploration, archaeology, and the navigation of remotely operated vehicles [1,2,3]. Underwater imagery can also benefit aquaculture factory environments. However, light absorption and scattering, low brightness in underwater environments, and noise interference can cause underwater image quality degradation [4]. Therefore, efficient UIE techniques are urgently needed to meet the image quality requirements of marine remote sensing missions. Although recent approaches such as diffusion models [5,6] and advanced generative adversarial networks (GANs) [7] have significantly improved image quality, these methods still suffer from high computational complexity, slow inference speed, and poor robustness in complex and noisy underwater environments.
Numerous UIE methods have been proposed to improve degraded underwater images. These methods can be mainly categorized into physics-based methods [8,9,10,11], non-physical methods [12,13,14,15], and deep learning-based methods [16,17,18,19,20,21,22,23]. Physics-based methods usually establish a mathematical model of underwater imaging to invert key parameters in the image formation process (such as scene radiance, background light, and transmittance). Although physics-based methods perform well in specific controlled scenarios, they are highly dependent on the assumption of a homogeneous water body and on accurate parameter estimation [24]. In practical applications, these assumptions are often difficult to meet in complex environments, owing to variations in water turbidity, illumination dynamics, and depth differences, thus limiting the effectiveness and robustness of such methods [25]. In addition, methods based on non-physical models utilize image processing techniques to directly enhance images (such as histogram equalization, Retinex theory, wavelet transform, and multi-scale fusion). Non-physical methods have high computational efficiency and good adaptability. However, they often suffer from problems such as over-enhancement, loss of key details, and sensitivity to noise interference, which seriously limit their application in complex and noisy underwater environments [26].
Recently, deep learning-based methods have been widely used in the UIE field due to their powerful feature extraction capabilities and end-to-end learning mechanisms [27]. Compared with traditional physical-model and non-physical-model methods, deep learning-based methods can automatically learn complex feature representations from large amounts of data, reducing the reliance on accurate physical parameter estimation and manual parameter adjustment. Current deep learning-based methods primarily include convolutional neural networks (CNN), generative adversarial networks (GAN), and diffusion models. Early CNN-based methods had the advantages of a simple structure, stable training, and efficient implementation, but they performed poorly in enhancing image details and generating realistic images. GAN-based methods can generate more realistic and detailed images; however, their training process is complex, and mode collapse or training instability often occurs. Diffusion model-based methods have demonstrated strong image generation capabilities in recent years, effectively capturing the global structure and fine details of images, but as generative models they incur a high computational cost and a relatively slow inference process. Despite recent advancements in UIE, several critical challenges remain unresolved [28]. These challenges primarily involve the ability to suppress noise while preserving fine image details, to correct severe color distortions without compromising the natural appearance of the scene, and to efficiently fuse global contextual cues with local features to achieve real-time semantic consistency and visual fidelity [29]. Therefore, developing more stable and efficient UIE methods is crucial to meet the needs of complex underwater environments.
To address these problems, we propose a novel UIE method named LITM, which outperforms other methods on various performance indicators. Inspired by the success of UIEC2-Net [23] in UIE tasks, we introduce a lightweight transformer block to address the noise interference and color distortion problems in underwater images and to extract global feature information in the HSV domain. Unlike the classical multi-head self-attention transformer [30], our design adopts a lightweight structure with channel-wise attention, which significantly reduces computational cost while preserving the ability to model long-range dependencies. First, we design an LRTE to process the input image, which can effectively address the degradation of the initial image. The LRTE can mitigate noise interference and color distortion issues in underwater images while also enhancing local details in the RGB domain. Second, we propose an LHTE to extract global feature information in the HSV domain. This module can effectively capture global semantic features such as brightness and color distribution and enhance the model's ability to perceive the overall image. In our design, we explicitly leverage the complementary nature of the RGB and HSV spaces by extracting local details from RGB and global semantic cues from HSV, thereby enabling more effective and perceptually consistent enhancement. Finally, we utilize an MMIB to effectively integrate the enhancement information from the RGB and HSV domains with the input image. The MMIB fuses these multiple sources of information to achieve more natural image enhancement effects. We present a comprehensive comparison of LITM with other UIE methods on different metrics. Our proposed LITM effectively improves degraded underwater images and achieves excellent results on all metrics.
In summary, our main contributions are as follows:
  • We propose a novel UIE method, LITM, which can effectively improve underwater images. The LITM utilizes a lightweight transformer to address noise interference and color distortion issues in underwater images, as well as to extract global feature information in the HSV domain.
  • We design an LRTE to process the input image, effectively addressing initial degradation issues. In parallel, we propose an LHTE to extract global feature information from the HSV domain.
  • We introduce an MMIB to effectively fuse enhancement information from the RGB and HSV domains along with the input image, enabling more natural and visually consistent image enhancement.
  • We compare the LITM method with other UIE methods on paired and unpaired UIE datasets. The final experiments demonstrate that our proposed LITM method significantly outperforms the state-of-the-art methods.
The remainder of this paper is organized as follows: In Section 2, we discuss related work on UIE. Then, we present the proposed methodology in Section 3. Section 4 presents the experiments. Finally, we conclude our work in Section 5.

2. Related Work

In this section, we review existing underwater image enhancement methods: physics-based methods, non-physical methods, and deep learning-based methods. The physics-based methods construct an underwater imaging propagation model to restore the actual color and structure of underwater images. The non-physical methods enhance the image by adjusting contrast, correcting color, and other means. Deep learning-based methods automatically learn the mapping relationship between image degradation and restoration, enabling end-to-end enhancement effects.

2.1. Physics-Based and Non-Physical Methods

Physics-based UIE methods establish optical transmission models to represent the physical degradation mechanisms involved in underwater imaging. These degradation mechanisms primarily include light absorption and scattering effects. Physics-based UIE methods rely on prior knowledge from modeling. These methods invert the key parameters of scene radiance, background light, and transmittance from information such as image brightness, color, and scene depth. Finally, the inverse process of image degradation is achieved, and a clear image is reconstructed. Drews et al. [8] proposed the UDCP method, which estimates the transmittance of underwater images by applying the dark channel prior (DCP) only in the green and blue channels. This method effectively improves the applicability and performance of the standard DCP in underwater image enhancement. Li et al. [9] proposed a single underwater image restoration method based on blue-green channel dehazing and red channel correction, which effectively improved the image clarity and contrast. Akkaynak et al. [10] proposed the Sea-Thru method, which achieves high-fidelity underwater image color restoration based on physical models and depth maps. Zhang et al. [11] proposed a variational scene restoration model (VERI) for simultaneous visual enhancement and resolution improvement, which effectively removes scattered light interference and compression artifacts in images. This method achieves high-quality image restoration under various complex imaging conditions.
Non-physical model methods are typically based on the statistical characteristics of natural images, enhancing the visual quality of images by mathematically modeling and optimizing the color, contrast, and other image features. Such methods do not rely on physical modeling in the imaging process but improve the perception of degraded images through image color correction, contrast enhancement, and other means. Wang et al. [12] proposed an underwater image enhancement method based on the fusion of image enhancement and color correction, which achieved image restoration effects with natural colors and precise details. Li et al. [13] proposed an underwater image enhancement network, Water-Net, based on multi-scale feature fusion, which achieved effective color correction and detail enhancement of underwater images under different water quality conditions. Kang et al. [14] proposed a perception-driven structured patch decomposition and fusion framework (SPDF) by fusing contrast-enhanced and detail-enhanced images in a perceptually consistent and conceptually independent image space. Zhang et al. [15] proposed a weighted wavelet perceptual fusion method (WWPF), which effectively improved the color realism and detail clarity of underwater images through color correction, global and local contrast enhancement, and multi-scale wavelet fusion.
Physics-based methods can invert the degradation process of underwater images through imaging models. These methods have strong physical interpretability and perform well in specific scenarios. Non-physical methods rely on image statistical characteristics or visual perception mechanisms and exhibit high processing efficiency as well as good subjective enhancement effects. However, physics-based methods are limited by the accuracy of parameter estimation and the capabilities of environmental modeling, and non-physical methods struggle to cope with complex and variable degradation factors. Therefore, there is an urgent need for an underwater image enhancement method that combines perceptual consistency, robustness, and environmental adaptability to enhance the quality and stability of real-world applications.

2.2. Deep Learning-Based Methods

Deep learning-based methods utilize a large amount of data to automatically learn the mapping relationship between degraded images and their corresponding clear images, thereby significantly improving the performance and adaptability of image enhancement. Deep learning methods can achieve good enhancement effects without complex physical modeling. Current methods mainly include CNN, GAN, and diffusion model architectures. Li et al. [17] proposed a convolutional neural network model based on underwater scene priors to achieve efficient enhancement and accurate color reconstruction of underwater images and videos. This model demonstrated excellent enhancement effects and good generalization capabilities in both synthetic and real-world scenes. Peng et al. [18] proposed a U-shape transformer method for underwater image enhancement, which combines channel and spatial attention mechanisms (CMSFFT and SGFMT) and multi-color space loss functions. They achieved an enhancement effect that was significantly better than existing methods on the constructed large-scale real-world dataset LSUI. Jiang et al. [16] proposed an ultra-lightweight real-time underwater image enhancement network called FA+Net, which efficiently handles color deviation and detail degradation through strong prior decomposition and fine-grained enhancement modules. Huang et al. [19] proposed an underwater image enhancement network based on depth perception. This network significantly improved the visual quality and task adaptability across multiple datasets by introducing depth estimation, region-aware fusion, and dual-branch feature learning. Tang et al. [20] proposed a transformer-based conditional diffusion model for underwater image enhancement. This method combines a lightweight denoising network with a non-uniform skip sampling strategy, ensuring inference efficiency while achieving enhanced performance on datasets such as UIEB and LSUI. Zhao et al. [22] proposed an underwater image enhancement framework named PA-Diff that integrates physical priors and diffusion models. PA-Diff introduces physical prior generation, implicit neural reconstruction, and a physics-aware diffusion transformer, which effectively improve the model's ability to handle complex underwater degradation distributions and its image restoration performance.
Deep learning-based methods have powerful feature expression capabilities and the advantages of end-to-end learning, and they can achieve significant enhancement effects in complex scenes. However, these methods are highly data-dependent, have slow inference speeds, and show limited performance when facing complex degradation types and multi-scale detail modeling. Therefore, a more stable and efficient UIE method with consistent perceptual and visual quality is needed.

3. Methodology

In this section, we present a detailed overview of the proposed LITM method. We first describe the overall framework of LITM. Then, we comprehensively introduce lightweight transformer-based networks. Finally, we describe the core components of the LITM architecture.

3.1. Overview of the LITM

As shown in Figure 1, the proposed LITM framework consists of three main modules: (a) the lightweight RGB transformer enhancer, (b) the lightweight HSV transformer encoder, and (c) the multi-modal integration block. Specifically, the raw image is first processed by the LRTE, which includes Convolution InstanceNorm LeakyReLU (CIL) blocks, lightweight transformer blocks (LTB), and a convolution sigmoid (CS) layer. The LTBs extract hierarchical detail features, and out_1 is then generated through the CS layer. Subsequently, out_1 is transformed into the HSV domain and processed by the LHTE module. The LHTE begins with a CIL block combined with MaxPooling (MP), followed by three LTB+MP stages and two additional LTBs. The global semantic features of this branch are aggregated through global average pooling (GAP), and out_2 is then generated through an RGB conversion. The MMIB consists of six stacked CIL modules and a final CS layer, and it effectively integrates the complementary RGB and HSV domain features to generate the final enhanced underwater image.
During the training phase, we optimize our model using the following loss function. In order to maintain the pixel-level similarity between the enhanced image and the original reference image, we adopt the $\mathcal{L}_1$ loss as the reconstruction constraint, which is widely used in image enhancement and restoration tasks due to its robustness to outliers and ability to preserve detail [31,32]. The loss is defined as follows:
$$\mathcal{L}_1 = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|$$
where $\hat{y}_i$ and $y_i$ denote the $i$-th pixel of the predicted and ground truth images, respectively, and $N$ is the total number of pixels.
To improve the perceptual quality of enhanced images in high-level semantic space, we introduce perceptual loss as an auxiliary supervision term. Specifically, perceptual loss extracts intermediate features through a pre-trained network and compares the output image with the reference image in the feature space. It is defined as follows:
$$\mathcal{L}_p = \sum_{l \in L} \frac{1}{C_l H_l W_l}\left\lVert \phi_l(\hat{y}) - \phi_l(y)\right\rVert_2^2$$
where $\phi_l(\cdot)$ represents the feature extraction function of the $l$-th layer of the network; $C_l$, $H_l$, and $W_l$ are the number of channels, height, and width of the feature map of this layer, respectively; and $L$ is the selected set of feature layers.
The final training objective combines $\mathcal{L}_1$ and $\mathcal{L}_p$ to balance low-level pixel fidelity and high-level perceptual quality. The total loss is defined as
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_1 + \lambda_p \cdot \mathcal{L}_p$$
where $\lambda_p$ is a weighting factor that controls the contribution of the perceptual loss. In our experiments, we set $\lambda_p = 0.2$.
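For concreteness, the training objective can be sketched in PyTorch as follows. The paper does not specify which pre-trained network or feature layers are used for the perceptual term, so the VGG16 backbone, the relu1_2/relu2_2/relu3_3 layers, and the omission of ImageNet input normalization below are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(nn.Module):
    """L_p: squared L2 distance between features of a frozen pre-trained network.
    The VGG16 backbone and the layer indices (relu1_2, relu2_2, relu3_3) are
    illustrative assumptions; ImageNet input normalization is omitted for brevity."""
    def __init__(self, layer_ids=(3, 8, 15)):
        super().__init__()
        features = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
        for p in features.parameters():
            p.requires_grad_(False)
        self.features, self.layer_ids = features, set(layer_ids)

    def forward(self, pred, target):
        loss, x, y = 0.0, pred, target
        for idx, layer in enumerate(self.features):
            x, y = layer(x), layer(y)
            if idx in self.layer_ids:
                # mean squared feature difference, i.e. the (1 / C_l H_l W_l) * ||.||_2^2 term
                loss = loss + torch.mean((x - y) ** 2)
            if idx >= max(self.layer_ids):
                break
        return loss

def total_loss(pred, target, perceptual, lambda_p=0.2):
    """L_total = L_1 + lambda_p * L_p, with lambda_p = 0.2 as in the paper."""
    l1 = torch.mean(torch.abs(pred - target))  # pixel-wise L1 reconstruction term
    return l1 + lambda_p * perceptual(pred, target)
```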
The pseudocode of the training process of LITM is described in detail in Algorithm 1. The input underwater image x is initially processed through a convolutional layer (Conv3×3), followed by instance normalization (IN) and LeakyReLU (LReLU) activation, creating an initial representation h. This representation undergoes six successive lightweight transformer blocks (TransformBlock), progressively extracting hierarchical, detailed features. The output is then passed through a 1 × 1 convolutional layer with a sigmoid activation, producing an enhanced RGB representation out_1. This result is subsequently transformed into the HSV domain as hsv_input for further processing.
Algorithm 1 The training process of LITM.
Require: Input image x ∈ ℝ^{B×3×H×W}
Ensure: Enhanced image ŷ
1: Step 1:
2: h ← LReLU(IN(Conv3×3(x)))
3: for i = 1 to 6 do
4:     h ← TransformBlock(h)
5: end for
6: out_1 ← σ(Conv1×1(h))
7: hsv_input ← HSV(out_1)
8: Step 2:
9: x_1 ← ReLU(Conv3×3(hsv_input))
10: x_1 ← MaxPool(x_1)
11: for i = 2 to 4 do
12:     x_i ← TransformBlock(x_{i−1})
13:     x_i ← MaxPool(x_i)
14: end for
15: x_r ← TransformBlock(x_4)
16: x_r ← AdaptiveAvgPool(x_r)
17: out_2 ← RGB(x_r)
18: Step 3:
19: z ← Concat(x, out_1, out_2)
20: for i = 1 to 6 do
21:     z ← LReLU(IN(Conv3×3(z)))
22: end for
23: C ← σ(Conv1×1(z))
24: C_1, C_2 ← Split(C)
25: ŷ ← λ_1 · C_1 · out_1 + λ_2 · C_2 · out_2
26: return ŷ
In the HSV domain, hsv_input is first processed through a 3 × 3 convolution, followed by ReLU activation and MaxPooling, yielding the initial feature x_1. The features are then processed through three successive lightweight transformer blocks, interleaved with MaxPooling operations, to extract global semantic information. Afterward, two additional lightweight transformer blocks further refine this global semantic representation x_r, which is then aggregated via adaptive average pooling. Since the output of AdaptiveAvgPool is set to (1, 1) in our implementation, it is reasonable to refer to it as global average pooling (GAP) throughout this paper. The resultant feature is transformed back into the RGB domain, generating out_2.
Finally, the original input image x and the enhanced outputs out_1 and out_2 are concatenated into a unified feature z. This unified feature undergoes further refinement through six convolutional layers, incorporating instance normalization and LReLU activation. The refined features pass through a final 1 × 1 convolutional layer with sigmoid activation, producing two confidence maps, C_1 and C_2. These confidence maps weight and fuse out_1 and out_2, resulting in the final enhanced underwater image ŷ.
Unlike previous HSV+RGB fusion methods such as UIEC2-Net, which apply image-level pixel-wise fusion using simple attention maps, our proposed LITM performs feature-level multi-modal integration. In particular, we design distinct RGB and HSV pathways to extract local and global features separately, and leverage a lightweight transformer block in the HSV branch to model long-range semantic dependencies. Furthermore, our MMIB integrates multi-scale features more adaptively than the heuristic curve fusion used in UIEC2-Net. These improvements allow LITM to achieve better generalization and perceptual consistency in challenging underwater scenes.

3.2. Lightweight Transformer-Based Network

Unlike structures based on the vision transformer, the lightweight transformer-based network does not require dividing the image into patches to calculate spatial attention weights. We use lightweight efficient channel attention (ECA) to model channel dependencies. This structure reduces computational complexity and facilitates the model's ability to learn relevant color information, thereby more effectively correcting color distortion in low-quality images.
To obtain a robust feature representation, we introduce a lightweight transformer block (LTB) for extracting feature information, which reduces the model size and the number of parameters. In Figure 2, we present the structure of the LTB, which includes layer normalization, efficient channel attention, and a feed-forward network. The input feature $I$ first passes through LayerNorm and ECA. The output of ECA is added to the original input to obtain the intermediate feature $I_1$. Subsequently, $I_1$ is normalized and fed into the feed-forward network to enhance its nonlinear expression ability and local features. The feed-forward output is fused residually with $I_1$ to produce the enhanced feature representation. The LTB is formulated as follows:
$$M_n = \mathrm{Ln}(I)$$
$$I_1 = \mathrm{ECA}(M_n) + I$$
$$\tilde{O} = \mathrm{FFw}(\mathrm{Ln}(I_1)) + I_1$$
where $M_n$ denotes the input feature $I$ after layer normalization, $I_1$ is the sum of the ECA output and the original input, and $\tilde{O}$ is the final feature map. $\mathrm{Ln}(\cdot)$ represents LayerNorm, while $\mathrm{ECA}(\cdot)$ denotes the efficient channel attention, whose specific structure is illustrated in Figure 2b.
Given an input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, we divide it equally along the channel dimension into two heads:
$$X = [X^{(1)}, X^{(2)}], \quad X^{(i)} \in \mathbb{R}^{B \times \frac{C}{2} \times H \times W}, \quad i = 1, 2$$
Each sub-head $X^{(i)}$ is then processed through the following steps. First, we apply global average pooling to extract channel descriptors:
$$z^{(i)} = \mathrm{AvgPool}(X^{(i)}) \in \mathbb{R}^{B \times \frac{C}{2} \times 1 \times 1}$$
Then, to enable 1D convolution along the channel dimension, the descriptor $z^{(i)}$ is reshaped into a 1D sequence:
$$\tilde{z}^{(i)} = \mathrm{Reshape}(z^{(i)}) \in \mathbb{R}^{B \times 1 \times \frac{C}{2}}$$
Next, a 1D convolution is applied to model local inter-channel dependencies, followed by a sigmoid activation:
$$a^{(i)} = \sigma\big(\mathrm{Conv1D}(\tilde{z}^{(i)})\big) \in \mathbb{R}^{B \times 1 \times \frac{C}{2}}$$
where $\sigma(\cdot)$ denotes the sigmoid activation function.
The attention weights are used to reweight the original features:
$$\hat{X}^{(i)} = X^{(i)} \cdot a^{(i)}$$
Finally, the outputs of the two sub-heads are concatenated to obtain the attention-modulated output:
$$\hat{X} = [\hat{X}^{(1)}, \hat{X}^{(2)}] \in \mathbb{R}^{B \times C \times H \times W}$$
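The two-head ECA and the surrounding LTB can be sketched in PyTorch as below. The feed-forward design (two 1 × 1 convolutions with a GELU), the 1D kernel size, the sharing of a single 1D convolution across the two heads, the channel-wise placement of LayerNorm, and an even channel count are all assumptions, since the paper does not give these details explicitly.

```python
import torch
import torch.nn as nn

class TwoHeadECA(nn.Module):
    """Efficient channel attention applied to two channel-wise heads (equations above).
    A single shared 1D convolution with an assumed kernel size of 3 is used."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def _attend(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                 # global average pooling: (B, C/2)
        a = torch.sigmoid(self.conv(z.unsqueeze(1)))   # 1D conv over channels
        return x * a.view(b, c, 1, 1)          # reweight the original features

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)      # split channels into two equal heads
        return torch.cat([self._attend(x1), self._attend(x2)], dim=1)

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of a (B, C, H, W) tensor."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class LightweightTransformerBlock(nn.Module):
    """LTB: I_1 = ECA(Ln(I)) + I, then O = FFw(Ln(I_1)) + I_1."""
    def __init__(self, channels, expansion=2):  # expansion ratio is an assumption
        super().__init__()
        self.norm1, self.norm2 = ChannelLayerNorm(channels), ChannelLayerNorm(channels)
        self.eca = TwoHeadECA()
        self.ffn = nn.Sequential(
            nn.Conv2d(channels, channels * expansion, 1),
            nn.GELU(),
            nn.Conv2d(channels * expansion, channels, 1),
        )

    def forward(self, x):
        x = self.eca(self.norm1(x)) + x        # channel attention with residual
        return self.ffn(self.norm2(x)) + x     # feed-forward with residual
```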

3.3. Core Components of LITM Architecture

In this paper, we use LRTE and LHTE to process the initially degraded underwater images. Finally, we propose MMIB to effectively fuse the information of each module.
  • Lightweight RGB Transformer Enhancer (LRTE).
The LRTE is shown in Figure 1a. The original image is input into a feature extraction module (CIL), which consists of a 3 × 3 convolution, instance normalization, and a LeakyReLU activation function, to extract low-level structural information. Subsequently, the feature map passes through six lightweight transformer blocks (LTB) in sequence. Finally, the feature passes through a 1 × 1 convolution layer and is normalized by the sigmoid activation function to output the initial enhanced result $out_1$. The entire process can be formulated as follows:
$$h_0 = \phi\big(\mathrm{IN}(\mathrm{Conv}_{3\times3}(x))\big)$$
$$h_i = \mathrm{LTB}(h_{i-1}), \quad i = 1, 2, \ldots, 6$$
$$\hat{I} = \sigma\big(\mathrm{Conv}_{1\times1}(h_6)\big)$$
where $\phi(\cdot)$ denotes the LeakyReLU function, $\mathrm{IN}(\cdot)$ represents instance normalization, and $\sigma(\cdot)$ is the sigmoid activation function. The stack of LTBs allows the network to effectively enhance the structure and texture details within the RGB domain.
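A minimal PyTorch sketch of this branch is given below. The internal channel width (64) is an assumption, and the transformer block is expected to be the LightweightTransformerBlock sketched in Section 3.2; an identity stand-in keeps the snippet self-contained and runnable on its own.

```python
import torch
import torch.nn as nn

class LRTE(nn.Module):
    """RGB branch: CIL -> 6 lightweight transformer blocks -> Conv1x1 + sigmoid.
    `block` should be a factory such as the LightweightTransformerBlock sketched
    in Section 3.2; the channel width of 64 is an assumed hyperparameter."""
    def __init__(self, channels=64, num_blocks=6, block=None):
        super().__init__()
        self.cil = nn.Sequential(                      # Conv + InstanceNorm + LeakyReLU
            nn.Conv2d(3, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.LeakyReLU(0.2, inplace=True),
        )
        make = block if block is not None else (lambda c: nn.Identity())
        self.blocks = nn.Sequential(*[make(channels) for _ in range(num_blocks)])
        self.head = nn.Sequential(nn.Conv2d(channels, 3, 1), nn.Sigmoid())

    def forward(self, x):
        h = self.cil(x)            # h_0
        h = self.blocks(h)         # h_1 ... h_6
        return self.head(h)        # out_1 = sigmoid(Conv1x1(h_6)), values in [0, 1]
```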
  • Lightweight HSV Transformer Encoder (LHTE).
We present the structure of the LHTE in Figure 1b, which includes CIL, LTB, and MP modules and their forward propagation. We first convert $out_1$ to the HSV color space. The feature information is then refined through convolutional layers, LTBs, and max pooling operations, and feature aggregation is completed using global average pooling (GAP). Finally, the global feature vector $f_g$ is decoded into HSV enhancement parameters, which are then transformed into the enhanced RGB image $out_2 = \mathrm{RGB}(f_g)$. The LHTE process is formulated as follows:
$$x_1 = \phi\big(\mathrm{IN}(\mathrm{Conv}_{3\times3}(x_0))\big), \quad x_1 \leftarrow \mathrm{MP}(x_1)$$
$$x_i = \mathrm{LTB}(x_{i-1}), \quad x_i \leftarrow \mathrm{MP}(x_i), \quad i = 2, 3, 4$$
$$x_5 = \mathrm{LTB}(\mathrm{LTB}(x_4))$$
$$f_g = \mathrm{GAP}(x_5)$$
$$out_2 = \mathrm{RGB}(f_g)$$
where $\phi(\cdot)$ denotes the LeakyReLU activation function, $\mathrm{IN}(\cdot)$ represents instance normalization, and $\mathrm{MP}(\cdot)$ denotes a 2 × 2 MaxPooling operation for downsampling. $\mathrm{LTB}(\cdot)$ refers to the lightweight transformer block used for hierarchical feature extraction. $\mathrm{GAP}(\cdot)$ stands for global average pooling, which aggregates spatial information to form the global feature vector $f_g$. This hierarchical encoding process effectively captures multi-scale semantic information from the HSV domain.
Extracting global semantic features from the HSV domain offers several advantages over the RGB domain. In HSV space, hue, saturation, and brightness are explicitly separated, allowing the model to capture the global distributions of luminance and color more easily. This separation facilitates robust modeling of illumination and color distortion, which are common in underwater scenes. In contrast, RGB channels are entangled and more sensitive to local changes, making them more suitable for capturing local details rather than global patterns. Thus, the combination of RGB-local and HSV-global features leads to a complementary enhancement strategy.
  • Multi-Modal Integration Block (MMIB).
To fuse the complementary information of the two modalities, we designed a multi-modal fusion module (Figure 1c). This module takes $out_1$, $out_2$, and the raw image as inputs and generates an enhanced image through channel-by-channel enhancement operations. This design fully combines the advantages of the RGB branch in preserving structural details and the HSV branch in color restoration, effectively improving the quality and visual consistency of the enhanced image. This process can be denoted as follows:
$$z_0 = \phi\big(\mathrm{IN}(\mathrm{Conv}_{3\times3}([x, out_1, out_2]))\big)$$
$$z_i = \phi\big(\mathrm{IN}(\mathrm{Conv}_{3\times3}(z_{i-1}))\big), \quad i = 1, 2, \ldots, 5$$
$$C = \sigma\big(\mathrm{Conv}_{1\times1}(z_5)\big)$$
$$\hat{y} = \lambda_1 \, C_1 \cdot out_1 + \lambda_2 \, C_2 \cdot out_2$$
where $\phi(\cdot)$ denotes the LeakyReLU activation function, $\mathrm{IN}(\cdot)$ represents instance normalization, and $\sigma(\cdot)$ is the sigmoid function. The input of this module is the concatenation of the original image $x$, the RGB enhanced result $out_1$, and the HSV enhanced result $out_2$. After a series of modules, we perform weighted fusion to obtain the final enhanced image. Here, $\lambda_1$ and $\lambda_2$ are balancing factors empirically set to 0.5.
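A minimal sketch of this fusion step is given below. The internal channel width (64) and the choice of producing the two 3-channel confidence maps from a single 6-channel head are assumptions consistent with the element-wise weighting in the equation above, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class MMIB(nn.Module):
    """Multi-modal integration block: fuse x, out_1 (RGB branch) and out_2 (HSV branch)
    via 6 x (Conv3x3 + IN + LReLU) -> Conv1x1 + sigmoid -> two confidence maps."""
    def __init__(self, channels=64, lambda1=0.5, lambda2=0.5):
        super().__init__()
        layers, in_ch = [], 9                     # concatenation of three 3-channel images
        for _ in range(6):
            layers += [nn.Conv2d(in_ch, channels, 3, padding=1),
                       nn.InstanceNorm2d(channels),
                       nn.LeakyReLU(0.2, inplace=True)]
            in_ch = channels
        self.body = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Conv2d(channels, 6, 1), nn.Sigmoid())
        self.lambda1, self.lambda2 = lambda1, lambda2

    def forward(self, x, out1, out2):
        z = torch.cat([x, out1, out2], dim=1)     # (B, 9, H, W)
        c = self.head(self.body(z))               # (B, 6, H, W) confidence maps
        c1, c2 = torch.chunk(c, 2, dim=1)         # C_1, C_2, each (B, 3, H, W)
        return self.lambda1 * c1 * out1 + self.lambda2 * c2 * out2
```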

4. Experiments

4.1. Datasets and Evaluation Metrics

To comprehensively evaluate and compare with other UIE methods, several well-known datasets are utilized. The Underwater Image Enhancement Benchmark (UIEB) and Large-Scale Underwater Image (LSUI) datasets are widely used paired datasets for training and testing our model. The UIEB and LSUI datasets have paired degraded underwater images and ground truth maps, which can be well used for supervised training and quantitative evaluation of the model. Additionally, we utilize the unpaired datasets U45 and C60 to verify the model’s generality and adaptability. These underwater image datasets comprise a diverse range of real underwater scenes and are well-suited for evaluation and comparison with other methods.
UIEB [33]. The UIEB dataset consists of manually collected underwater images and the corresponding GT images selected by manual voting. The UIEB dataset includes 890 paired underwater images. We randomly divide the dataset into 800 and 90 paired images (U_90) for training and validation, respectively.
LSUI [18]. The LSUI dataset currently contains 4279 degraded images and their corresponding reference images. This dataset comprises a substantial collection of underwater images, featuring diverse subjects such as deep-sea creatures, deep-sea rocks, and cave formations. We randomly select 3800 and 479 paired underwater images (L_479) for training and validation, respectively.
C60 [33]. The UIEB dataset contains 60 challenging underwater images (C60) without suitable reference images. In this paper, we use C60 for verification.
U45 [34]. The U45 dataset consists of 45 typical underwater images. This dataset contains common color attenuation, lack of contrast, and fogging phenomena in underwater environments.
Evaluation Metrics. Peak signal-to-noise ratio (PSNR) [35] and structural similarity index (SSIM) [36] are classic indicators for measuring image reconstruction quality. PSNR quantifies the enhancement algorithm's performance at the pixel level by comparing pixel intensity differences with those of a reference image, while SSIM evaluates brightness, contrast, and structural similarity to better reflect structural-level enhancement. Fréchet Inception Distance (FID) [37] and Learned Perceptual Image Patch Similarity (LPIPS) [38] measure the difference in feature-space distribution and the perceptual similarity between generated images and real images, respectively. FID evaluates the difference between the distributions of generated and real images. LPIPS is a similarity metric that evaluates perceptual differences based on deep learning models, quantifying the similarity of two images by analyzing differences in their perceptual features. On the unpaired C60 and U45 datasets, we use two no-reference metrics for evaluation: Underwater Color Image Quality Evaluation (UCIQE) [39] and Underwater Image Quality Measurement (UIQM) [40]. UCIQE and UIQM comprehensively assess underwater image quality based on image characteristics, including color distribution, contrast, and clarity.
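The full-reference scores can be reproduced roughly as follows, using scikit-image for PSNR/SSIM and the lpips package (AlexNet backbone assumed) for LPIPS; FID is computed over whole image sets rather than per pair (e.g., with a tool such as pytorch-fid) and is therefore omitted here. The authors' exact evaluation code is not given, so this is only an illustration of the metric definitions.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # assumed backbone for the LPIPS network

def evaluate_pair(pred, gt):
    """pred, gt: uint8 RGB arrays of shape (H, W, 3); returns (PSNR, SSIM, LPIPS)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    # LPIPS expects float tensors in [-1, 1] with shape (1, 3, H, W)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.integers(0, 256, (256, 256, 3), dtype=np.uint8)
    b = rng.integers(0, 256, (256, 256, 3), dtype=np.uint8)
    print(evaluate_pair(a, b))
```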

4.2. Implementation Details

This study implemented the LITM model using PyTorch 2.0.1, and all experiments were conducted on a personal server equipped with an Intel(R) Core™ i9-13900K CPU, 64 GB of memory, and an NVIDIA GeForce RTX 4090 GPU. The size of both training and validation images is 256 × 256, the batch size is set to 4, and the num_workers parameter is 4. The initial learning rate is set to 0.001, and the number of epochs is set to 200.
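The reported settings translate into a training skeleton along the following lines. The optimizer choice (Adam) is an assumption, as is the use of an L1-only loss here for brevity; the stand-in model and random image pairs only keep the snippet runnable and would be replaced by LITM, the full objective from Section 3.1, and the paired UIEB/LSUI loaders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins keep this skeleton runnable as-is: a trivial 3->3 convolution in place
# of LITM, and random 256x256 image pairs in place of the paired datasets.
if __name__ == "__main__":
    model = nn.Conv2d(3, 3, 3, padding=1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # lr = 0.001
    pairs = TensorDataset(torch.rand(16, 3, 256, 256), torch.rand(16, 3, 256, 256))
    loader = DataLoader(pairs, batch_size=4, shuffle=True, num_workers=4)

    for epoch in range(200):                                           # 200 epochs
        for degraded, reference in loader:
            enhanced = model(degraded)
            loss = torch.mean(torch.abs(enhanced - reference))         # L1 term only
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```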

4.3. Comparison with Other UIE Methods

We compare the performance of various UIE methods on the U_90 and L_479 datasets. The compared methods include physics-based methods, non-physical methods, and deep learning-based methods. Among them, UDCP [8] is a physics-based method that mainly relies on the physical laws of imaging to enhance the image. WWPE [15] is a non-physical method, which typically relies on designed prior rules or image statistical characteristics. The remaining methods are all deep learning-based methods, including UWCNN [17], UIEC2-Net [23], U-Shape [18], FA+Net [16], UVZ [19], DM [20], and DATDM [21], which use deep neural networks to learn mapping relationships from large-scale data to improve image quality. To ensure reproducibility and fairness in comparison, we used the official MATLAB implementations (MATLAB R2023a) for UDCP and WWPE, applying them directly to the input images without any further modification. For the deep learning-based methods, including UWCNN, UIEC2-Net, U-Shape, FA+Net, UVZ, DM, and DATDM, we adopted the authors' publicly available PyTorch-based implementations. All methods were evaluated using their default settings, and we selected the best-performing pre-trained weights for validation, as provided by the original authors or trained according to their recommended protocols. We use 256 × 256 images for training and validation for all methods.
Table 1 presents a quantitative comparison of various UIE methods on the paired datasets U_90 and L_479. Our proposed method, LITM, achieves state-of-the-art performance across all metrics on both datasets. On the U_90 dataset, LITM achieves the lowest FID (28.68) and LPIPS (0.0989) while also obtaining the highest SSIM (0.9044) and the top PSNR score (23.73). Similarly, on the L_479 dataset, LITM again achieves the best results in all four metrics: FID (29.02), LPIPS (0.1092), PSNR (26.71), and SSIM (0.9410). These results demonstrate the superior perceptual and structural enhancement capabilities of LITM. While the diffusion-based DATDM method achieves competitive performance, LITM consistently outperforms it on all metrics. Traditional physics-based methods (e.g., UDCP) and non-physical methods (e.g., WWPE) suffer from significant performance degradation due to their limited generalization capabilities in diverse underwater scenes. Among the deep learning-based methods, FA+Net and UVZ achieved relatively good performance on structural metrics (PSNR and SSIM), but their perceptual metrics (e.g., FID and LPIPS) performed poorly. UIEC2-Net also demonstrates strong overall performance, achieving the second-best SSIM and PSNR on both datasets, as well as competitive FID and LPIPS scores. UWCNN falls short in all metrics and performs poorly in terms of perceptual and structural quality. U-Shape achieves competitive PSNR on the L_479 dataset, but its FID and LPIPS scores are relatively poor, indicating poor perceptual restoration. DM performs worse than other deep learning methods on most metrics.
Additionally, we compared the visual effects of different UIE methods on the paired datasets U_90 and L_479. As shown in Figure 3, the images generated by physics-based and non-physical methods (such as UDCP and WWPE) generally exhibit severe color distortion. The enhancement effect of UWCNN is generally average, and U-Shape performs slightly better than UWCNN. FA+Net and UVZ have positive effects on some images, but they struggle to adapt to all underwater scenes, and their generalization ability is limited. The images generated by the DM method are generally blue and have an unbalanced color palette. Although the enhancement effect of DATDM is satisfactory, it falls below our proposed LITM method in all performance indicators. We also use red boxes to highlight key areas and zoom in on them for enhanced visibility.
To further validate the statistical significance and robustness of the performance improvements, we calculate 95% confidence intervals (CI) for PSNR and SSIM scores on the U_90 and L_479 datasets, as shown in Table 2. The results demonstrate that our proposed LITM achieves the best mean performance with consistently tight confidence bounds across both datasets. For instance, on the L_479 dataset, our method achieves a PSNR CI of [26.30, 27.11] and an SSIM CI of [0.9356, 0.9454], which not only exceed those of all baseline methods but also show lower variance. These intervals indicate that LITM produces more stable and reliable enhancement outcomes, further supporting the statistical significance of the reported improvements.
To verify the generality and adaptability of LITM, we conducted further validation experiments on the unpaired datasets C60 and U45. Table 3 shows a quantitative comparison of various UIE methods, which were evaluated using the UCIQE and UIQM metrics. The top three results for each metric are marked in red, blue, and green, respectively. On the C60 dataset, our proposed method achieves the second-highest UCIQE score (0.5914) and the third-highest UIQM score (0.5845), indicating excellent performance in both colorimetry and perceptual quality. While DATDM slightly surpasses our method in UIQM (0.6072), its UCIQE score (0.5700) is inferior. On the U45 dataset, our method performs well in both metrics, achieving the highest UCIQE (0.6148) and the third highest UIQM (0.8060). These results confirm the robustness and excellent perceptual quality of our method under various underwater conditions.
Although WWPE achieves promising results on UCIQE (0.5948 and 0.5999) and UIQM (0.9597 and 1.2948), the abnormally high UIQM values suggest potential over-enhancement or color distortion. This reflects a limitation of UIQM, which may favor high saturation and contrast even when the perceptual quality is degraded. In Figure 4, the images generated by WWPE tend to exhibit over-saturated or unnatural tones. Other deep learning-based methods, such as UWCNN and U-Shape, delivered mediocre results in terms of metrics and visual quality. As shown in Figure 4, UWCNN often produces blurry textures and washed-out colors, while U-Shape only slightly improves contrast. FA+Net achieved the third-best UCIQE score (0.5774) on the C60 dataset but performed poorly on UIQM. UIEC2-Net also shows competitive performance, achieving the third-highest UCIQE score (0.5898) on the C60 dataset and the third-highest (0.6009) on U45. DATDM performs reasonably well in both metrics and visual appearance, producing relatively natural color restoration. However, our proposed LITM generates clearer textures and more balanced tones in various scenes, outperforming DATDM in both quantitative scores and visual fidelity. Meanwhile, its performance indicators on the paired datasets are also superior to those of DATDM. We also use red boxes to highlight key areas and zoom in on them for enhanced visibility.
In summary, comprehensive experiments on both paired (U_90 and L_479) and unpaired (C60 and U45) datasets demonstrate the strong overall performance and robustness of the proposed LITM method. On the paired dataset, LITM achieves the best results on all key metrics, including FID, LPIPS, PSNR, and SSIM, significantly outperforming existing physical-based, non-physical, and deep learning-based UIE methods. On the unpaired dataset, although LITM does not achieve the highest scores on all UCIQE and UIQM metrics, it consistently ranks among the top, demonstrating highly competitive performance and balanced enhancement quality. LITM produces sharper textures, natural colors, and better perceptual consistency in a variety of underwater scenes. Compared to recent diffusion-based methods, such as DM and DATDM, LITM exhibits a stronger generalization ability and superior visual quality. These results confirm the generality and adaptability of LITM for underwater image enhancement tasks.

4.4. Computational Complexity Comparison

To comprehensively evaluate the computational efficiency of the proposed LITM framework, we compare its model complexity and runtime with those of several representative UIE methods in terms of the number of parameters (Params), inference time (Infertime), and floating-point operations (FLOPs), as shown in Table 4.
Our LITM model achieves a good balance between lightweight design and computational performance. Specifically, it contains only 0.42 M parameters, which is significantly smaller than U-Shape (31.62 M), UVZ (5.39 M), DM (10.2 M), and DATDM (18.34 M). Despite being more compact than UIEC2-Net (0.54 M), our method maintains a strong enhancement capability while reducing inference time to 0.0317 s, making it more suitable for real-time or low-latency applications. In terms of FLOPs, LITM reaches 28.04 GFLOPs, which is notably lower than UVZ (124.88 G), DM (62.3 G), and DATDM (66.89 G), and comparable to UIEC2-Net (26.06 G) and U-Shape (26.09 G). Although UWCNN (2.61 G) and FA+Net (0.5852 G) exhibit very low computational complexity, their enhancement performance and perceptual quality remain limited. These results demonstrate that LITM not only reduces the computational burden but also offers superior inference efficiency. Overall, compared to both traditional deep models and recent diffusion-based methods, LITM offers an efficient solution with minimal computational cost, making it well-suited for deployment in resource-constrained environments.

4.5. Methodological Comparison and Analysis of Existing UIE Methods

In Table 5, we present a methodological analysis of various representative UIE methods and summarize their main theoretical foundations as well as their advantages and disadvantages. Traditional physics-based methods (e.g., UDCP) offer good interpretability and are easy to implement. However, they often exhibit poor generalization and limited enhancement performance in complex underwater environments. Non-physical methods, such as WWPE, typically enable fast inference and effectively enhance image contrast, but they often suffer from over-enhancement and noticeable color distortion.
Deep learning-based methods, including UWCNN, U-Shape, FA+Net, and UVZ, generally offer improved structural reconstruction and computational efficiency. However, they exhibit notable differences in perceptual quality and generalization ability. Specifically, UIEC2-Net demonstrates strong generalization with balanced structural and color preservation, though there remains room for improvement in perceptual realism. Meanwhile, recent diffusion-based approaches (DM and DATDM) possess strong theoretical foundations and promising generative capabilities. However, their computational complexity and inference time are significantly higher, restricting their practical application scenarios.
In comparison, our proposed LITM method leverages a transformer-based architecture combined with RGB-HSV fusion, achieving an optimal balance between perceptual quality, structural fidelity, and computational efficiency. Although LITM has slightly higher computational overhead than simpler CNN-based methods, it still demonstrates excellent overall performance, adaptability, and generalization capabilities under a variety of underwater imaging conditions. This comprehensive methodological analysis further validates the advantages and novelty of our proposed approach.

4.6. Ablation Studies

To demonstrate the effectiveness of our proposed LRTE, LHTE, and MMIB modules, we design ablation experiments to compare LITM with a model containing only LRTE (denoted as Model-A) and a model containing both LRTE and LHTE (denoted as Model-B). As shown in Table 6, Model-B outperforms Model-A in FID, PSNR, SSIM, and UCIQE, indicating that the introduction of the LHTE module enhances structural restoration and color balance. The LPIPS scores of both models are nearly identical (0.1167 vs. 0.1168), suggesting comparable perceptual similarity. However, the lack of effective integration of the global features introduced by LHTE resulted in a slight decrease in the UIQM score of Model-B compared with Model-A. We introduce the MMIB module into the LITM model, which effectively addresses this limitation and achieves significant improvements in all indicators. LITM achieves the best FID (28.68), LPIPS (0.0989), PSNR (23.70), SSIM (0.9048), UCIQE (0.6227), and UIQM (0.8250). These results verify the effectiveness of each module and confirm the advantages of their joint design within the LITM framework.
As shown in Figure 5, we present the original input and the enhanced results generated by Model-A (LRTE only), Model-B (LRTE and LHTE), and the full LITM model, together with visual comparisons against the ground truth (GT). In the first row of results, the enhanced results generated by Model-A have relatively weak color restoration and noticeable artifacts in the background. Model-B improves color saturation and detail preservation, but its enhanced images show slight overexposure and imbalance in some areas. In contrast, the enhancement results generated by the proposed LITM are significantly closer to the real images. In the second and third rows, LITM not only restores clearer object textures but also eliminates the residual color distortion present in Model-A and Model-B. These qualitative results further demonstrate the effectiveness of integrating the LRTE, LHTE, and MMIB modules in enhancing underwater images.
We conducted ablation experiments on the loss function to demonstrate the individual contributions of the perceptual loss ($\mathcal{L}_p$) and the reconstruction loss ($\mathcal{L}_1$). As shown in Table 7, using only $\mathcal{L}_1$ or only $\mathcal{L}_p$ results in substantially poorer performance on most metrics (FID, LPIPS, PSNR, and SSIM) compared to the combined loss function $\mathcal{L}_{\mathrm{total}}$. Notably, employing either loss alone severely reduces the perceptual quality and structural fidelity of enhanced underwater images, leading to significantly lower PSNR and SSIM values and notably higher FID and LPIPS scores. Although the other image quality metrics degrade significantly, the UCIQE and UIQM scores of the $\mathcal{L}_1$-only and $\mathcal{L}_p$-only variants are unexpectedly high. This contradictory result occurs because UCIQE and UIQM overemphasize contrast and color enhancement while ignoring perceptual realism and structural integrity. Therefore, severe distortion and unnatural enhancement not only damage visual realism but can also inadvertently inflate UCIQE and UIQM scores.
Overall, these results demonstrate that the combined $\mathcal{L}_{\mathrm{total}}$ achieves significantly superior performance and visually plausible enhancements compared to the individual loss components.

5. Conclusions

In this paper, we propose a lightweight transformer-based underwater image enhancement framework, LITM, to address the challenges of underwater image degradation. The LITM utilizes a lightweight transformer to address noise interference and color distortion issues in underwater images, as well as to extract global feature information in the HSV domain. We introduce an MMIB to effectively fuse enhancement information from the RGB and HSV domains along with the input image, enabling more natural and visually consistent image enhancement. Extensive experiments demonstrate that LITM outperforms state-of-the-art methods on both the UIEB and LSUI datasets. Specifically, LITM achieves the highest PSNR scores of 23.70 and 26.70 and SSIM scores of 0.9048 and 0.9405 on the UIEB and LSUI datasets, respectively. In addition, LITM exhibits excellent perceptual quality, achieving the lowest FID scores of 28.68 and 29.02, as well as LPIPS scores of 0.0989 and 0.1092 on the two datasets. Moreover, LITM shows strong generalization and adaptability on the unpaired datasets.
Although our proposed LITM method has achieved good performance on the UIE task, there is still room for further improvement. The information fused by the current model comes from the original underwater images. In future work, we plan to introduce a color correction module to preprocess the original images, thereby improving the fusion effect. Additionally, we will explore more efficient transformer structures to further enhance model performance and computational efficiency.

Author Contributions

Conceptualization, W.H.; methodology, W.H. and J.X.; software, W.H.; validation, W.H. and J.X.; formal analysis, Z.L., L.Z. (Liping Zhou), L.Z. (Lu Zhang) and Z.C.; investigation, W.H. and J.X.; resources, L.Z. (Lijun Zhang) and J.X.; writing—original draft preparation, W.H.; writing—review and editing, J.X.; visualization, Z.R.; supervision, J.X.; funding acquisition, J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Key R&D Program of China (No. 2022YFD2401100) and special fund for Promoting High Quality Development of Marine and Fisheries Industry in Fujian Province, China (FJHYF-L-2023-16).

Data Availability Statement

Data will be made available on request.

Acknowledgments

The authors would like to express their gratitude for the support of Fishery Engineering and Equipment Innovation Team of Shanghai High level Local University.

Conflicts of Interest

Author Lu Zhang was employed by the company Qingdao Conson Oceantec Valley Development Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UIE: underwater image enhancement
LITM: lightweight transformer-based model
LRTE: lightweight RGB transformer enhancer
LHTE: lightweight HSV transformer encoder
MMIB: multimodal integration block
ROV: remotely operated vehicle
CNN: convolutional neural networks
HSV: hue–saturation–value
CIL: Convolution InstanceNorm LeakyReLU
CS: convolution sigmoid
MP: MaxPooling
GAP: global average pooling
IN: instance normalization
CI: confidence intervals
GAN: generative adversarial networks
DCP: dark channel prior
VERI: variational scene restoration model
SPDF: structured patch decomposition and fusion framework
WWPF: weighted wavelet perceptual fusion method
LTB: lightweight transformer block
ECA: efficient channel attention
UIEB: Underwater Image Enhancement Benchmark
LSUI: Large-Scale Underwater Image
PSNR: peak signal-to-noise ratio
SSIM: structural similarity index
FID: Fréchet Inception Distance
LPIPS: Learned Perceptual Image Patch Similarity
UCIQE: Underwater Color Image Quality Evaluation
UIQM: Underwater Image Quality Measurement
GT: ground truth

References

  1. Zhang, L.; Fan, J.; Qiu, Y.; Jiang, Z.; Hu, Q.; Xing, B.; Xu, J. Marine zoobenthos recognition algorithm based on improved lightweight YOLOv5. Ecol. Inf. 2024, 80, 102467. [Google Scholar] [CrossRef]
  2. Zhang, L.; Qiu, Y.; Fan, J.; Li, S.; Hu, Q.; Xing, B.; Xu, J. Underwater fish detection and counting using image segmentation. Aquac. Int. 2024, 32, 4799–4817. [Google Scholar] [CrossRef]
  3. Diamanti, E.; Ødegård, Ø. Visual sensing on marine robotics for the 3D documentation of Underwater Cultural Heritage: A review. J. Archaeol. Sci. 2024, 166, 105985. [Google Scholar] [CrossRef]
  4. Jian, M.; Liu, X.; Luo, H.; Lu, X.; Yu, H.; Dong, J. Underwater image processing and analysis: A review. Signal Process. Image Commun. 2021, 91, 116088. [Google Scholar] [CrossRef]
  5. Chen, H.; Xiang, Q.; Hu, J.; Ye, M.; Yu, C.; Cheng, H.; Zhang, L. Comprehensive exploration of diffusion models in image generation: A survey. Artif. Intell. Rev. 2025, 58, 99. [Google Scholar] [CrossRef]
  6. Huang, Y.; Huang, J.; Liu, Y.; Yan, M.; Lv, J.; Liu, J.; Xiong, W.; Zhang, H.; Cao, L.; Chen, S. Diffusion model-based image editing: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4409–4437. [Google Scholar] [CrossRef] [PubMed]
  7. Rastogi, R.; Rawat, V.; Kaushal, S. Advancements in Image Restoration Techniques: A Comprehensive Review and Analysis through GAN. In Generative Artificial Intelligence and Ethics: Standards, Guidelines, and Best Practices; IGI Global: Hershey, PA, USA, 2025; pp. 53–90. [Google Scholar] [CrossRef]
  8. Drews, P.; Nascimento, E.; Moraes, F.; Botelho, S.; Campos, M. Transmission estimation in underwater single images. In Proceedings of the IEEE International Conference on Computer Vision Workshops 2013, Sydney, Australia, 2–3 October 2013; pp. 825–830. [Google Scholar] [CrossRef]
  9. Li, C.; Quo, J.; Pang, Y.; Chen, S.; Wang, J. Single underwater image restoration by blue-green channels dehazing and red channel correction. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 1731–1735. [Google Scholar] [CrossRef]
  10. Akkaynak, D.; Treibitz, T. Sea-thru: A method for removing water from underwater images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 1682–1691. [Google Scholar] [CrossRef]
  11. Zhang, H.; Qi, T.; Zeng, T. Scene recovery: Combining visual enhancement and resolution improvement. Pattern Recognit. 2024, 153, 110529. [Google Scholar] [CrossRef]
  12. Wang, Y.; Ding, X.; Wang, R.; Zhang, J.; Fu, X. Fusion-based underwater image enhancement by wavelet decomposition. In Proceedings of the 2017 IEEE International Conference on Industrial Technology (ICIT), Toronto, ON, Canada, 22–25 March 2017; pp. 1013–1018. [Google Scholar] [CrossRef]
  13. Song, W.; Wang, Y.; Huang, D.; Liotta, A.; Perra, C. Enhancement of underwater images with statistical model of background light and optimization of transmission map. IEEE Trans. Broadcast. 2020, 66, 153–169. [Google Scholar] [CrossRef]
  14. Kang, Y.; Jiang, Q.; Li, C.; Ren, W.; Liu, H.; Wang, P. A perception-aware decomposition and fusion framework for underwater image enhancement. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 988–1002. [Google Scholar] [CrossRef]
  15. Zhang, W.; Zhou, L.; Zhuang, P.; Li, G.; Pan, X.; Zhao, W.; Li, C. Underwater Image Enhancement via Weighted Wavelet Visual Perception Fusion. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 2469–2483. [Google Scholar] [CrossRef]
  16. Jiang, J.; Ye, T.; Chen, S.; Chen, E.; Liu, Y.; Jun, S.; Bai, J.; Chai, W. Five A+ Network: You Only Need 9K Parameters for Underwater Image Enhancement. In Proceedings of the 34th British Machine Vision Conference 2023, BMVC, Aberdeen, UK, 20–24 November 2023; Available online: https://papers.bmvc2023.org/0149.pdf (accessed on 20 July 2025).
  17. Li, C.; Anwar, S.; Porikli, F. Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 2020, 98, 107038. [Google Scholar] [CrossRef]
  18. Peng, L.; Zhu, C.; Bian, L. U-shape transformer for underwater image enhancement. IEEE Trans. Image Process. 2023, 32, 3066–3079. [Google Scholar] [CrossRef] [PubMed]
  19. Huang, Z.; Wang, X.; Xu, C.; Li, J.; Feng, L. Underwater variable zoom: Depth-guided perception network for underwater image enhancement. Expert Syst. Appl. 2025, 259, 125350. [Google Scholar] [CrossRef]
  20. Tang, Y.; Kawasaki, H.; Iwaguchi, T. Underwater image enhancement by transformer-based diffusion model with non-uniform sampling for skip strategy. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5419–5427. [Google Scholar] [CrossRef]
  21. Hu, W.; Chen, S.; Luo, T.; Zhang, L.; Zhang, H.; Liu, Z.; Zhang, S.; Xu, J. DATDM: Dynamic attention transformer diffusion model for underwater image enhancement. Alex. Eng. J. 2025, 126, 591–604. [Google Scholar] [CrossRef]
  22. Zhao, C.; Dong, C.; Cai, W. Learning a physical-aware diffusion model based on transformer for underwater image enhancement. arXiv 2024, arXiv:2403.01497. [Google Scholar]
  23. Wang, Y.; Guo, J.; Gao, H.; Yue, H. UIEC2-Net: CNN-based underwater image enhancement using two color space. Signal Process. Image Commun. 2021, 96, 116250. [Google Scholar] [CrossRef]
  24. Sulis, M.; Meyerhoff, S.B.; Paniconi, C.; Maxwell, R.M.; Putti, M.; Kollet, S.J. A comparison of two physics-based numerical models for simulating surface water–groundwater interactions. Adv. Water Res. 2010, 33, 456–467. [Google Scholar] [CrossRef]
  25. Matos, T.; Martins, M.; Henriques, R.; Goncalves, L. A review of methods and instruments to monitor turbidity and suspended sediment concentration. J. Water Process Eng. 2024, 64, 105624. [Google Scholar] [CrossRef]
  26. Martins, G.S.; Santos, L.; Dias, J. User-adaptive interaction in social robots: A survey focusing on non-physical interaction. Int. J. Soc. Robot. 2019, 11, 185–205. [Google Scholar] [CrossRef]
  27. Verma, G.; Kumar, M.; Raikwar, S. F2UIE: Feature transfer-based underwater image enhancement using multi-stackcnn. Multimed. Tools Appl. 2024, 83, 50111–50132. [Google Scholar] [CrossRef]
  28. Zhou, J.; Yang, T.; Zhang, W. Underwater vision enhancement technologies: A comprehensive review, challenges, and recent trends. Appl. Intell. 2023, 53, 3594–3621. [Google Scholar] [CrossRef]
  29. Zhang, R.; Liu, G.; Zhang, Q.; Lu, X.; Dian, R.; Yang, Y.; Xu, L. Detail-aware network for infrared image enhancement. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5000314. [Google Scholar] [CrossRef]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  31. Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 2016, 3, 47–57. [Google Scholar] [CrossRef]
  32. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 4681–4690. [Google Scholar] [CrossRef]
  33. Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S.; Tao, D. An underwater image enhancement benchmark dataset and beyond. IEEE Trans. Image Process. 2019, 29, 4376–4389. [Google Scholar] [CrossRef] [PubMed]
  34. Li, H.; Li, J.; Wang, W. A fusion adversarial underwater image enhancement network with a public test dataset. arXiv 2019, arXiv:1906.06819. [Google Scholar]
  35. Korhonen, J.; You, J. Peak signal-to-noise ratio revisited: Is simple beautiful? In Proceedings of the 2012 Fourth International Workshop on Quality of Multimedia Experience, Melbourne, Australia, 5–7 July 2012; pp. 37–38. [Google Scholar] [CrossRef]
  36. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  37. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 24–34. [Google Scholar] [CrossRef]
  38. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar] [CrossRef]
  39. Yang, M.; Sowmya, A. An underwater color image quality evaluation metric. IEEE Trans. Image Process. 2015, 24, 6062–6071. [Google Scholar] [CrossRef] [PubMed]
  40. Panetta, K.; Gao, C.; Agaian, S. Human-visual-system-inspired underwater image quality measures. IEEE J. Ocean. Eng. 2015, 41, 541–551. [Google Scholar] [CrossRef]
Figure 1. The overall framework of the proposed LITM. (a) Lightweight RGB transformer enhancer (LRTE), composed of Convolution-InstanceNorm-LeakyReLU (CIL) blocks, lightweight transformer blocks (LTB), and a convolution-sigmoid (CS) head. (b) Lightweight HSV transformer encoder (LHTE), which extracts global brightness, color, and saturation cues from the HSV domain and contains CIL, LTB, MaxPooling (MP), and global average pooling (GAP); ×2 denotes two stacked LTBs. (c) Multi-modal integration block (MMIB), which fuses the enhanced information from the RGB and HSV pathways with the input image. The symbol ⊙ denotes element-wise multiplication, and ⊕ represents channel-wise concatenation.
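To make the data flow in Figure 1 concrete, the following minimal PyTorch sketch mirrors the three components at a block-diagram level. The channel widths, the stand-ins for the LTB stacks, and the exact MMIB wiring are illustrative assumptions rather than the paper's implementation, and the RGB-to-HSV conversion is assumed to happen upstream.

import torch
import torch.nn as nn

def cil(in_ch, out_ch):
    # Convolution + InstanceNorm + LeakyReLU (the CIL block in Figure 1)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class LRTE(nn.Module):
    # RGB branch: CIL -> LTB stack (replaced by a CIL stand-in here) -> Conv + Sigmoid (CS)
    def __init__(self, ch=16):
        super().__init__()
        self.head = cil(3, ch)
        self.body = cil(ch, ch)  # stand-in for the lightweight transformer blocks
        self.tail = nn.Sequential(nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid())
    def forward(self, rgb):
        return self.tail(self.body(self.head(rgb)))

class LHTE(nn.Module):
    # HSV branch: CIL -> MaxPooling -> LTBs (stand-in) -> GAP to a global brightness/color code
    def __init__(self, ch=16):
        super().__init__()
        self.encode = nn.Sequential(cil(3, ch), nn.MaxPool2d(2), cil(ch, ch))
        self.gap = nn.AdaptiveAvgPool2d(1)
    def forward(self, hsv):
        return self.gap(self.encode(hsv))  # shape (B, ch, 1, 1)

class MMIB(nn.Module):
    # Fuse the enhanced RGB map, the HSV global code, and the raw input image
    def __init__(self, ch=16):
        super().__init__()
        self.modulate = nn.Conv2d(ch, 3, 1)        # project the HSV code onto RGB channels
        self.fuse = nn.Conv2d(9, 3, 3, padding=1)  # concat(input, RGB output, modulated map)
    def forward(self, x_in, rgb_out, hsv_code):
        gate = torch.sigmoid(self.modulate(hsv_code))                   # (B, 3, 1, 1)
        modulated = rgb_out * gate                                      # element-wise multiplication (⊙)
        return self.fuse(torch.cat([x_in, rgb_out, modulated], dim=1))  # channel-wise concatenation (⊕)

x_rgb = torch.rand(1, 3, 256, 256)
x_hsv = torch.rand(1, 3, 256, 256)  # HSV tensor assumed to be computed from x_rgb upstream
out = MMIB()(x_rgb, LRTE()(x_rgb), LHTE()(x_hsv))
print(out.shape)  # torch.Size([1, 3, 256, 256])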
Figure 2. The detailed structure of the lightweight transformer-based network. (a) Lightweight transformer block (LTB). (b) Efficient channel attention (ECA), which applies global average pooling (GAP), a 1-D convolution (Conv 1d), and a Sigmoid to process information from different heads. The symbol ⊙ denotes element-wise multiplication, and ⊕ represents channel-wise concatenation.
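The ECA block in Figure 2b follows the familiar efficient-channel-attention pattern: GAP produces a channel descriptor, a 1-D convolution models local cross-channel interaction, and a Sigmoid gates the input. The sketch below shows that pattern in PyTorch; the kernel size and the block's exact placement inside each LTB head are assumptions.

import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
    def forward(self, x):                                   # x: (B, C, H, W)
        y = self.gap(x)                                     # (B, C, 1, 1) channel descriptor
        y = y.squeeze(-1).transpose(1, 2)                   # (B, 1, C) for the 1-D conv over channels
        y = self.conv(y)                                    # local cross-channel interaction
        y = torch.sigmoid(y).transpose(1, 2).unsqueeze(-1)  # back to (B, C, 1, 1)
        return x * y                                        # element-wise channel re-weighting (⊙)

feat = torch.rand(2, 16, 64, 64)
print(ECA()(feat).shape)  # torch.Size([2, 16, 64, 64])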
Figure 3. Visual comparison of different UIE methods on the paired datasets U_90 and L_479.
Figure 4. Visual comparison of different UIE methods on the unpaired datasets C60 and U45.
Figure 5. Visual comparison of LITM, Model-A, and Model-B.
Table 1. Quantitative comparison of different UIE methods on paired datasets U_90 and L_479. The top two results are indicated by bold and underlined values, respectively.
Methods               | U_90: FID↓ | LPIPS↓ | PSNR↑ | SSIM↑  | L_479: FID↓ | LPIPS↓ | PSNR↑ | SSIM↑
UDCP (13’) [8]        | 46.60      | 0.2720 | 13.74 | 0.6699 | 73.34       | 0.3742 | 12.51 | 0.6474
WWPE (24’) [15]       | 55.02      | 0.2091 | 18.00 | 0.7749 | 55.68       | 0.2671 | 17.89 | 0.7485
UWCNN (20’) [17]      | 50.57      | 0.1960 | 18.21 | 0.8418 | 54.90       | 0.2519 | 19.18 | 0.8594
UIEC2-Net (21’) [23]  | 31.93      | 0.1093 | 23.19 | 0.9003 | 31.69       | 0.1154 | 25.89 | 0.9336
U-Shape (23’) [18]    | 54.89      | 0.2090 | 19.54 | 0.7830 | 39.63       | 0.1479 | 24.67 | 0.8842
FA+Net (23’) [16]     | 31.53      | 0.1251 | 21.25 | 0.8890 | 48.86       | 0.1994 | 22.83 | 0.9129
UVZ (25’) [19]        | 39.20      | 0.1326 | 21.11 | 0.8943 | 50.54       | 0.1556 | 22.27 | 0.9232
DM (23’) [20]         | 41.63      | 0.2065 | 15.94 | 0.8397 | 53.44       | 0.2568 | 15.17 | 0.8397
DATDM (25’) [21]      | 28.95      | 0.1079 | 22.21 | 0.8952 | 30.58       | 0.1250 | 23.54 | 0.9301
Ours                  | 28.68      | 0.0989 | 23.70 | 0.9048 | 29.02       | 0.1092 | 26.70 | 0.9405
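For reproducibility, the full-reference scores in Table 1 are computed per image pair and then averaged over the dataset. The sketch below uses scikit-image for PSNR and SSIM; the paper's own evaluation script and any resizing or color-space conventions it applies are not shown here and should be treated as assumptions.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(pred: np.ndarray, gt: np.ndarray):
    # pred, gt: uint8 RGB images of identical size (H, W, 3)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)  # skimage >= 0.19
    return psnr, ssim

# Dataset-level PSNR/SSIM are the means over all pairs, e.g.:
# scores = [full_reference_scores(p, g) for p, g in pairs]
# print(np.mean([s[0] for s in scores]), np.mean([s[1] for s in scores]))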
Table 2. 95% confidence intervals (CI) of PSNR and SSIM for different methods on the U_90 and L_479 datasets.
Method                | U_90: PSNR (CI) | U_90: SSIM (CI)  | L_479: PSNR (CI) | L_479: SSIM (CI)
UDCP (13’) [8]        | [13.04, 14.44]  | [0.6433, 0.6967] | [12.15, 12.86]   | [0.6344, 0.6603]
WWPE (24’) [15]       | [17.39, 18.62]  | [0.7580, 0.7918] | [17.57, 18.21]   | [0.7394, 0.7577]
UWCNN (20’) [17]      | [17.61, 18.81]  | [0.8202, 0.8635] | [18.90, 19.46]   | [0.8514, 0.8674]
UIEC2-Net (21’) [23]  | [22.32, 24.06]  | [0.8879, 0.9128] | [25.46, 26.31]   | [0.9283, 0.9389]
U-Shape (23’) [18]    | [18.91, 20.18]  | [0.7623, 0.8036] | [24.34, 24.99]   | [0.8753, 0.8931]
FA+Net (23’) [16]     | [20.51, 21.99]  | [0.8749, 0.9031] | [22.56, 23.09]   | [0.9081, 0.9176]
UVZ (25’) [19]        | [20.43, 21.80]  | [0.8774, 0.9112] | [21.95, 22.58]   | [0.9178, 0.9286]
DM (23’) [20]         | [15.35, 16.54]  | [0.8216, 0.8579] | [14.81, 15.53]   | [0.8301, 0.8493]
DATDM (25’) [21]      | [21.42, 23.02]  | [0.8785, 0.9118] | [23.18, 23.91]   | [0.9242, 0.9359]
Ours                  | [22.72, 24.69]  | [0.8910, 0.9187] | [26.30, 27.11]   | [0.9356, 0.9454]
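The intervals in Table 2 summarize the spread of per-image scores. A common way to obtain such 95% confidence intervals is a t-interval on the mean, as sketched below; whether the paper used this or a bootstrap procedure is not stated in this section, so the exact method is an assumption.

import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    scores = np.asarray(scores, dtype=float)
    sem = stats.sem(scores)                                    # standard error of the mean
    half = sem * stats.t.ppf((1 + confidence) / 2, df=scores.size - 1)
    return scores.mean() - half, scores.mean() + half

psnr_per_image = np.random.normal(26.7, 4.0, size=479)         # stand-in for real per-image PSNR values
print(mean_ci(psnr_per_image))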
Table 3. Quantitative comparison of different UIE methods on unpaired C60 and U45 datasets. The values marked in red, blue, and green represent the three best results, respectively.
Methods               | C60: UCIQE↑ | C60: UIQM↑ | U45: UCIQE↑ | U45: UIQM↑
UDCP (13’) [8]        | 0.5622      | −0.0109    | 0.5952      | 0.3429
WWPE (24’) [15]       | 0.5948      | 0.9597     | 0.5999      | 1.2948
UWCNN (20’) [17]      | 0.5051      | 0.1262     | 0.5256      | 0.5135
UIEC2-Net (21’) [23]  | 0.5898      | 0.5017     | 0.6009      | 0.7212
U-Shape (23’) [18]    | 0.5374      | 0.3568     | 0.5542      | 0.5891
FA+Net (23’) [16]     | 0.5774      | 0.4411     | 0.5859      | 0.6832
UVZ (25’) [19]        | 0.5657      | 0.5645     | 0.5864      | 0.7813
DM (23’) [20]         | 0.5640      | 0.5829     | 0.5872      | 0.7844
DATDM (25’) [21]      | 0.5700      | 0.6072     | 0.6064      | 0.8490
Ours                  | 0.5914      | 0.5845     | 0.6148      | 0.8060
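The UCIQE scores in Table 3 are no-reference: the metric is a weighted sum of the chroma standard deviation, the luminance contrast, and the mean saturation [39]. The sketch below uses the coefficients from the original paper, but the normalization choices and the use of the HSV S channel for saturation vary across public implementations, so it should be read as one plausible variant rather than the evaluation code used here.

import numpy as np
import cv2

def uciqe(bgr: np.ndarray) -> float:
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    L = lab[..., 0] / 255.0
    a, b = lab[..., 1] - 128.0, lab[..., 2] - 128.0
    chroma = np.sqrt(a ** 2 + b ** 2) / 128.0
    sigma_c = chroma.std()                                    # chroma standard deviation
    L_sorted = np.sort(L.ravel())
    n = max(1, int(0.01 * L_sorted.size))
    con_l = L_sorted[-n:].mean() - L_sorted[:n].mean()        # top-1% minus bottom-1% luminance
    mu_s = (cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)[..., 1] / 255.0).mean()  # mean saturation
    return 0.4680 * sigma_c + 0.2745 * con_l + 0.2576 * mu_s  # weights from [39]

img = cv2.imread("underwater.png")  # hypothetical input image
if img is not None:
    print(uciqe(img))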
Table 4. Comparative evaluation of model complexity and computational efficiency. The × indicates that the method has no corresponding parameters or FLOPs.
Method                | Params (M) | Inference time (s) | FLOPs (G)
UDCP (13’) [8]        | ×          | 1.1985             | ×
WWPE (24’) [15]       | ×          | 0.1676             | ×
UWCNN (20’) [17]      | 0.04       | 0.0135             | 2.61
UIEC2-Net (21’) [23]  | 0.54       | 0.0993             | 26.06
U-Shape (23’) [18]    | 31.62      | 0.1008             | 26.09
FA+Net (23’) [16]     | 0.009      | 0.0064             | 0.5852
UVZ (25’) [19]        | 5.39       | 0.1768             | 124.88
DM (23’) [20]         | 10.2       | 0.1272             | 62.3
DATDM (25’) [21]      | 18.34      | 0.2087             | 66.89
Ours                  | 0.42       | 0.0317             | 28.04
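Figures like those in Table 4 are typically obtained as follows: the parameter count comes directly from the module, and the inference time is an average over repeated forward passes on a fixed-size input. The sketch below illustrates this in PyTorch; the input size, device, and number of runs are assumptions, and FLOP counting (which needs an external profiler such as thop) is omitted.

import time
import torch

def complexity_report(model: torch.nn.Module, size=(1, 3, 256, 256), runs: int = 50):
    model.eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6   # parameters in millions
    x = torch.rand(*size)
    with torch.no_grad():
        model(x)                                                  # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        infer_s = (time.perf_counter() - start) / runs            # average seconds per image
    return params_m, infer_s

# Example with a stand-in network (replace with the real LITM model):
net = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.Conv2d(16, 3, 3, padding=1))
print(complexity_report(net))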
Table 5. Comparative methodological analysis of existing UIE methods.
Method                | Type              | Advantages | Disadvantages
UDCP (13’) [8]        | Physics-based     | Interpretable, physically grounded model, simple | Poor generalization, low perceptual quality, unstable under complex underwater conditions
WWPE (24’) [15]       | Non-physical      | Efficient inference speed, enhances contrast rapidly | Severe over-enhancement, significant color distortions
UWCNN (20’) [17]      | CNN-based         | Moderate computational cost, relatively simple structure | Poor overall enhancement quality, limited perceptual realism and structural fidelity
UIEC2-Net (21’) [23]  | CNN-based         | Good balance between structure and color preservation, strong generalization | Still room for improvement in perceptual quality metrics such as LPIPS and FID
U-Shape (23’) [18]    | CNN-based         | Competitive PSNR scores, relatively good structural reconstruction | Moderate perceptual quality, lower color restoration accuracy, higher FID and LPIPS scores
FA+Net (23’) [16]     | CNN-based         | Good structural metrics (SSIM, PSNR), good balance between speed and performance | Limited perceptual realism, high LPIPS and FID, relatively unstable color correction
UVZ (25’) [19]        | CNN-based         | Good structural quality, relatively balanced results across metrics | Limited performance in perceptual quality (high LPIPS), moderate generalization to unseen scenarios
DM (23’) [20]         | Diffusion-based   | Strong theoretical foundation, promising generative capabilities | High computational complexity, lower structural metrics (low PSNR and SSIM), relatively poor realism
DATDM (25’) [21]      | Diffusion-based   | Strong perceptual realism and detail restoration, competitive quantitative performance | High computational cost and complexity, relatively slow inference compared to lightweight methods
Ours                  | Transformer-based | Best balance between perceptual and structural quality, high generalization and adaptability, competitive computational efficiency | Slightly higher computational cost compared to the simplest CNN-based methods
Table 6. Quantitative comparison of LITM, Model-A, and Model-B. Red indicates the best results.
Method  | FID↓  | LPIPS↓ | PSNR↑ | SSIM↑  | UCIQE↑ | UIQM↑
Model-A | 33.21 | 0.1167 | 21.92 | 0.8943 | 0.6190 | 0.8196
Model-B | 31.37 | 0.1168 | 22.29 | 0.8986 | 0.6221 | 0.7684
LITM    | 28.68 | 0.0989 | 23.70 | 0.9048 | 0.6227 | 0.8250
Table 7. Quantitative comparison of Only $L_1$, Only $L_p$, and $L_{total}$. Red indicates the best results.
Method       | FID↓   | LPIPS↓ | PSNR↑ | SSIM↑  | UCIQE↑ | UIQM↑
Only $L_1$   | 276.32 | 0.7725 | 7.77  | 0.0674 | 0.6662 | 1.2067
Only $L_p$   | 293.22 | 0.9346 | 7.43  | 0.0571 | 0.6712 | 1.2663
$L_{total}$  | 28.68  | 0.0989 | 23.70 | 0.9048 | 0.6227 | 0.8250
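The ablation in Table 7 contrasts the $L_1$ reconstruction term, the perceptual term $L_p$, and their combination $L_{total}$. The sketch below shows one standard way to build such a combined objective with a VGG-16 feature loss in the spirit of [31,32]; the weighting factor, the VGG layer cut-off, and any additional terms in the paper's $L_{total}$ are assumptions.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    # L1 distance between VGG-16 features of the prediction and the reference [31,32]
    def __init__(self, cut: int = 16):
        super().__init__()
        self.features = vgg16(weights="IMAGENET1K_V1").features[:cut].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.l1 = nn.L1Loss()
    def forward(self, pred, target):
        # pred/target expected in [0, 1]; ImageNet normalization is omitted here for brevity
        return self.l1(self.features(pred), self.features(target))

def total_loss(pred, target, perceptual: PerceptualLoss, lam: float = 0.1):
    # lam is a placeholder weight, not the value used in the paper
    return nn.functional.l1_loss(pred, target) + lam * perceptual(pred, target)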
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
