Article

UCA-Net: A Transformer-Based U-Shaped Underwater Enhancement Network with a Compound Attention Mechanism

1 State Key Laboratory of Advanced Technology for Materials Synthesis and Processing, Wuhan University of Technology, Wuhan 430070, China
2 Hubei Key Laboratory of Broadband Wireless Communication and Sensor Networks, Wuhan University of Technology, Wuhan 430070, China
3 School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China
4 National Deep Sea Center, Qingdao 266237, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 318; https://doi.org/10.3390/electronics15020318
Submission received: 27 November 2025 / Revised: 7 January 2026 / Accepted: 8 January 2026 / Published: 11 January 2026

Abstract

Images captured underwater frequently suffer from color casts, blurring, and distortion, which are mainly attributable to the unique optical characteristics of water. Although conventional underwater image enhancement (UIE) methods rooted in physics are available, their effectiveness is often constrained, particularly in challenging aquatic and illumination conditions. More recently, deep learning has become a leading paradigm for UIE, recognized for its superior performance and operational efficiency. This paper proposes UCA-Net, a lightweight CNN-Transformer hybrid network. It incorporates multiple attention mechanisms and utilizes composite attention to effectively enhance textures, reduce blur, and correct color. A novel adaptive sparse self-attention module is introduced to jointly restore global color consistency and fine local details. The model employs a U-shaped encoder–decoder architecture with three-stage up- and down-sampling, facilitating multi-scale feature extraction and global context fusion for high-quality enhancement. Experimental results on multiple public datasets demonstrate UCA-Net’s superior performance, achieving a PSNR of 24.75 dB and an SSIM of 0.89 on the UIEB dataset, while maintaining an extremely low computational cost with only 1.44M parameters. Its effectiveness is further validated by improvements in various downstream image tasks. UCA-Net achieves an optimal balance between performance and efficiency, offering a robust and practical solution for underwater vision applications.

1. Introduction

The marine environment holds abundant mineral and biological resources. In recent years, global demand for ocean exploration has grown rapidly. However, this progress is hindered by low-quality underwater images. The significant differences in physical properties between water and air cause severe degradation in underwater images, such as color casts, reduced contrast, blur, and noise [1]. Increasing depth exacerbates light scattering and absorption, while insufficient ambient light leads to extreme darkness. These issues limit the usability of underwater imagery. Thus, enhancing underwater images is crucial for marine exploration and benefits tasks such as underwater detection [2], vehicle navigation [3], and marine biology research [4]. Therefore, developing efficient and robust Underwater Image Enhancement (UIE) technology to restore the true color and clear details of images is a core problem urgently needing to be addressed in the field of marine vision. The UCA-Net proposed in this paper is designed to tackle this challenge, aiming to provide a high-performance, lightweight solution.
Initial approaches to underwater image enhancement were based on physical models and statistical techniques. Optical models and attenuation estimation were used to restore images, with brightness and color correction reducing scattering effects. Some employed additive noise models or underwater transmission models for inverse restoration [5]. Histogram equalization adjusted luminance to enhance contrast while avoiding over-enhancement [6]. However, underwater complexity limits the accuracy of such models, and prior assumptions often fail under varying conditions. Consequently, these methods remain prone to noise, artifacts, and residual distortions, especially in challenging environments.
Lately, progress in UIE has been significantly propelled by deep learning. Data-driven models can learn feature representations automatically, improving image quality significantly. Common architectures include CNNs [7], U-Net, and ResNet, while GANs [8] show strong performance in generating visually realistic results. A significant innovation is the Vision Transformer (ViT) [9], which substitutes convolution with self-attention and excels at modeling long-range dependencies. This enables ViT to handle global context and fine details, addressing issues like color distortion and contrast loss. However, ViTs lack CNNs’ local inductive bias, limiting their ability to capture fine local features, and their high computational cost increases inference time.
Therefore, this paper introduces UCA-Net, a lightweight underwater image enhancement network based on a hybrid CNN-Transformer architecture. By jointly optimizing global color correction and local detail enhancement, UCA-Net achieves excellent visual quality with low parameter complexity. Specifically, we first design the Depthwise-Separable Convolutional Residual Attention Composite Block (DCRAC), which integrates multiple attention types and residual connections to enhance textures and reduce blurring and noise in degraded regions. Next, we propose the Deformable Convolution-Transformer Block (DCTB), where the deformable convolution layer adapts to underwater geometric distortions. The Frequency–Domain Feature Fusion Module (FDFM) then fuses the two output feature streams so that their complementary information is fully exploited. Meanwhile, a dual-path channel attention transformer learns the global color distribution and illumination conditions, correcting color shifts and low contrast. These modules are embedded in a U-shaped encoder–decoder framework: the encoder gradually extracts multi-scale features through DCRAC, DCTB fuses global context at the bottleneck layer, and the decoder uses skip connections to reconstruct detailed enhanced images, completing the UIE task. The main contributions are summarized below:
  • We designed UCA-Net, an innovative U-shaped underwater image enhancement network that combines CNN and Transformer components. The network produces strong enhancement results, restores global and local colors and details effectively, and has a lightweight architecture with a relatively low parameter count.
  • We proposed the Transformer module DCTB (Deformable Convolution-Transformer Block), which consists of a prestructure built from deformable convolution and a dual-path channel attention Transformer module with adaptive learnable parameters. It enhances local details while simultaneously attending to global information.
  • We designed a feature fusion module in the frequency domain that establishes a balance between frequency-selective enhancement and information retention, which is particularly beneficial for scenarios such as underwater image enhancement.
  • We conducted extensive comparative experiments on multiple datasets to demonstrate that UCA-Net outperforms existing state-of-the-art methods while maintaining a smaller parameter count and lower model complexity.
The remainder of this paper is organized as follows: Section 2 reviews related work on underwater image enhancement, including traditional and deep learning methods. Section 3 details our proposed UCA-Net architecture and its core modules, including PTCHM, DCRAC, DCTB, and FDFM, and elaborates on the loss function design. Section 4 presents the experimental setup, quantitative and qualitative results, and ablation studies on the model’s components, with added quantitative analysis for downstream tasks. Section 5 concludes the paper and outlines directions for future work.

2. Related Works

Recently, techniques for underwater image restoration and enhancement have been primarily classified into two categories: traditional and deep learning-based approaches. Traditional methods can further be categorized into statistical-based and physical model-based approaches.
Statistical methods enhance underwater images using heuristic pixel operations rather than modeling light propagation. For instance, Hitam et al. [10] employed contrast-limited adaptive histogram equalization to improve contrast and suppress overexposure. Zhang et al. [11] applied segmented color correction and dual-prior optimization to enhance detail and restore natural colors. Ancuti et al. [12] fused multiple exposures to optimize color and visibility. Zhang et al. [13] combined multi-scale Retinex with a physical underwater model for comprehensive restoration.
Physical model-based methods enhance underwater images by reversing light propagation, requiring accurate environmental parameters and priors like the dark channel [14] and red channel [15]. Li et al. [5] proposed a defogging framework combining a physical model with histogram priors to preserve detail and restore color. Peng et al. [16] used a joint model to estimate transmittance, correct color, and iteratively restore radiance considering blur and absorption. Galdran et al. [15] introduced adaptive red channel recovery based on light attenuation. However, these methods are sensitive to environmental changes and heavily rely on accurate priors, limiting their robustness in complex conditions.
Deep learning has greatly advanced underwater image enhancement (UIE), mainly through CNNs and GANs. CNN-based methods are predominant. Li et al. [17] proposed UWCNN, a lightweight CNN using underwater scene priors, and trained on synthetic data representing 10 water types to model wavelength-specific absorption for color and visibility correction. Naik et al. [18] introduced Shallow-UWNet, a compact network for effective enhancement. Sharma et al. [19] designed a dynamic, attention-guided multi-branch model to improve image quality. Qi et al. [20] proposed UICoE-Net, a collaborative framework using feature matching and joint learning to enhance contrast and color consistency.
Generative Adversarial Networks (GANs) are neural networks that optimize image generation through an adversarial process between a generator and a discriminator, and have been widely applied in image generation and style transfer. Yang et al. [21] proposed a Conditional GAN (CGAN) using U-Net as the generator and PatchGAN as the discriminator. By inputting degraded underwater images, their model progressively produces clearer, more realistic outputs through adversarial training. Islam et al. [22] introduced a fast underwater image enhancement method incorporating depthwise-separable convolution and channel attention, significantly reducing parameters and computational complexity. Li et al. [23] developed an unsupervised generative network, WaterGAN, which combines underwater optical scattering models and color transmission characteristics to generate high-quality enhanced images. Chen et al. [8] presented PUIE-Net, a perception-driven enhancement network that balances visual quality and physical authenticity by embedding deep learning with physical priors.
Architectures based on Transformers have recently demonstrated robust performance in computer vision tasks, owing to their capacity for global modeling. Swin Transformer [24] reduces computation via local windows and shifted windowing, improving cross-region interaction. UIEFormer by Qu et al. [25] uses a lightweight hierarchical Transformer with cross-stage fusion and adaptive color correction, enhancing contrast and fidelity. URSCT, proposed by Ren et al. [26], combines dual convolutions with attention for better detail restoration. BCTA-Net by Liang et al. [27] introduces a two-level color correction scheme, merging global color stats with local pixel refinement to avoid over-correction and detail loss.

3. Proposed Method

In this section, we introduce the structure of UCA-Net in detail. The overall architecture of the proposed UCA-Net is illustrated in Figure 1. UCA-Net adopts the classic U-Net architecture as its backbone: an encoder–decoder network with skip connections, following a hierarchical U-shaped structure designed to effectively capture multi-scale features. The encoder path consists of five stages, where each stage integrates the Depthwise-Separable Convolutional Residual Attention Composite Block (DCRAC) and the Adaptive Dual-path Self-Attention Block (ADSB) to jointly process local textures and global dependencies. As the input image passes through the encoder, it is progressively down-sampled to extract high-level semantic representations. At the bottleneck, the Deformable Convolution-Transformer Block (DCTB) is employed to model long-range context and compensate for underwater geometric distortions. The decoder path mirrors the encoder, utilizing skip connections to fuse fine-grained spatial details from the encoder with the up-sampled semantic features. Finally, the Frequency–Domain Feature Fusion Module (FDFM) is applied to ensure the organic integration of these features, leading to the high-quality reconstruction of the enhanced underwater image. The initial input is a degraded underwater image. After the image passes through a convolutional layer, the network generates low-level features $F_0 \in \mathbb{R}^{H \times W \times C}$ (where $C$ is the number of channels), which are then fed into the subsequent encoder–decoder network for deeper processing and feature fusion. This initial feature map is first fed into the joint Parallel Transformer-CNN Hybrid Module (PTCHM).
In the subsequent ADSB module, the input features are re-weighted along the channel dimension using a multi-head self-attention mechanism to further capture global contextual representations $F_G \in \mathbb{R}^{H \times W \times C}$. Meanwhile, DCRAC extracts and processes the local features $F_L \in \mathbb{R}^{H \times W \times C}$ through multiple convolutional layers with identity connections, together with a composite attention module. During this process, each encoder stage connects its output to the corresponding peer decoder stage, fusing features between the encoder and the decoder. The overall network is divided into five layers, each composed of a combined DCRAC-ADSB module together with FDFM. Finally, after passing through another convolutional layer, we obtain the restored output image $I_0$. While we acknowledge that UCA-Net appears complex in terms of the number of modules, we emphasize that its lightweight nature is defined by its parameter efficiency and computational efficiency.
We extensively utilize Depthwise-Separable Convolutions and Sparse Self-Attention Mechanisms, which result in a total number of parameters and FLOPs that is significantly lower than many SOTA models that use standard convolutions and full attention mechanisms. We have explicitly compared UCA-Net’s parameters and FLOPs with other SOTA methods in the table in Section 4.7 to substantiate the lightweight claim. Despite the number of modules, they are highly modular, with each module focusing on solving a specific aspect of underwater image degradation (e.g., DCRAC for local details, DCTB for global color and geometric distortion). While this design increases initial implementation complexity, it offers high flexibility and interpretability. Next, we will explain the design and mechanism of each module and illustrate its corresponding functions and contributions in the entire enhancement task.

3.1. PTCHM

A central element of UCA-Net is the Parallel Transformer-CNN Hybrid Module (PTCHM). As depicted in Figure 1, this module contains a primary unit known as the Transformer-CNN Unit (TCU), which integrates CNN and Transformer branches in a parallel configuration. The PTCHM leverages the strengths of both CNNs and Transformers: the Transformer branch allocates weights to global features and captures long-range dependencies, while the CNN branch is effective in extracting local features and fine details. Combining these allows PTCHM to adjust global image properties (e.g., color, contrast, brightness) while preserving local textures and details, resulting in superior performance for underwater image enhancement tasks. Unlike typical hybrid modules used for tasks such as image dehazing or super-resolution, PTCHM integrates DCRAC (for enhancing local details and texture) and DCTB (for handling global color distortion and geometric deformation), forming a dual optimization mechanism. More importantly, PTCHM does not rely on simple spatial-domain feature fusion; instead, the FDFM fuses features in the frequency domain, allowing the network to perform frequency-selective enhancement of high-frequency details (texture) and low-frequency information (color, illumination), a capability absent from common spatial-domain fusion methods, and thereby providing a more refined and targeted approach to underwater image restoration.
The TCU is composed of three primary sub-networks: (1) the Depthwise-Separable Convolutional Residual Attention Composite Block (DCRAC), which employs depthwise-separable convolutions, residual connections, and a Composite Attention Module (CAM) to minimize computational load while maintaining feature richness; (2) the Deformable Convolution-Transformer Block (DCTB), featuring deformable convolution layers and an Adaptive Dual-path Self-Attention Block (ADSB); and (3) the FDFM, which decomposes spatial features into frequency components via the discrete cosine transform (DCT) to enable more sophisticated feature integration compared to conventional element-wise operations. This hybrid design allows the model to adapt to geometric variations while capturing long-range dependencies, enhancing its ability to process complex underwater scenes.

3.2. DCRAC

The effective application of underwater images frequently necessitates the enhancement of fine details and texture features. However, due to various degradation factors, underwater images often suffer from detail loss and blurring. To improve the network’s capability for reconstructing and restoring local fine-grained features, we designed the DCRAC module. Specifically, DCRAC is composed of three residual convolutional modules constructed using depthwise-separable convolutions, combined with a Composite Attention Module (CAM). In each depthwise-separable convolution, we apply a channel reordering technique to address the information barrier issue caused by independent channel computations in depthwise-separable convolutions. This enhancement improves the feature representation capability of the convolutional layers. The Composite Attention Module (CAM) we designed is composed of parallel spatial and channel attention branches, followed by a pixel-level attention mechanism. The entire CAM operates in a coarse-to-fine manner: first, the input features are processed through the channel and spatial attention modules, where the respective channel attention weights and spatial attention weights are computed sequentially. These weights are then used to recalibrate the features, allowing the network to adaptively emphasize the importance of different regions within the feature map. The two parallel attention modules independently process the input features, assigning unequal weights to different channels and pixels, thereby enhancing the overall feature representation for image enhancement. Following this, a pixel attention module is employed to fully integrate the outputs from the previous two attention branches. This enables pixel-wise weight allocation, which further refines the regional representations.
Through this multi-level refinement process, the network progressively enhances and strengthens feature representations across different levels of abstraction, ultimately achieving more precise and discriminative feature expression. As shown in Figure 2 and Figure 3, the CAM further fine-tunes the processing of fine-grained details, thereby enhancing the block’s overall performance in local feature enhancement. When the input is given as X R H × W × C , the feature map is first passed through both the spatial attention and channel attention branches, which generate their respective outputs. This dual-branch mechanism enables the module to decouple spatial contextual dependencies from channel-wise semantics, allowing for a more structured refinement of feature activations. By attending to where and what to focus on, the module achieves a synergistic enhancement of salient patterns in underwater imagery.
$$SA(X) = \mathrm{Conv}_{7\times 7}(\mathrm{concat}(X_{avg}, X_{max}))$$
$$CA(X) = \mathrm{Conv}_{7\times 7}(\mathrm{ReLU}(X_{GAP}))$$
The outputs SA(X) and CA(X), after feature concatenation, are fed into the pixel attention module:
$$PA_1 = SA(X) + CA(X)$$
$$X_C = \mathrm{concat}(X, PA_1)$$
$$PA(X) = \sigma\left(\mathrm{Conv}_{7\times 7}(X_C)\right)$$
The output obtained is
$$CAM(X) = PA(SA(X) + CA(X))$$
After the input feature map X R H × W × C passes through the first 3 × 3 depthwise-separable convolutional layer, it is followed by a Dropout layer, which randomly deactivates a portion of the feature units to suppress overfitting during training. Subsequently, a ReLU activation function is applied to introduce non-linearity into the network, enabling it to better model complex patterns within the data. The attention-weighted feature map obtained from the CAM is subsequently passed into a depthwise-separable convolutional layer, where it undergoes further transformation. This is followed by a Dropout operation. The output is then passed through a ReLU activation function, introducing non-linearity and enabling the network to learn more expressive features. To facilitate more efficient training, the output feature is then fused with the original input via a residual connection, which helps preserve low-level information and accelerates model convergence. The integration of Dropout and residual learning establishes a balanced mechanism between regularization and information preservation, which is particularly beneficial in scenarios like underwater image enhancement, where features may be sparse or degraded. The final output is as follows:
$$X_1 = \mathrm{ReLU}(\mathrm{dropout}(\mathrm{Conv}_{3\times 3}(X)))$$
$$X_2 = CAM(X + X_1)$$
$$X_3 = \mathrm{ReLU}(\mathrm{dropout}(\mathrm{Conv}_{3\times 3}(X_2)))$$
$$Output = \mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(X_3))$$
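As a sanity check of the dataflow, the CAM equations above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the trained module: the kernels `k_avg`, `k_max`, `k_pa` and the matrix `w_ca` stand in for the learned convolutions, channels share a single spatial kernel for brevity, and applying a sigmoid to the spatial and channel maps (and reading SA(X), CA(X) as recalibrated features) is one plausible interpretation of the recalibration described in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, k):
    """Naive single-channel 2D convolution with 'same' zero padding."""
    H, W = x.shape
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def cam(X, k_avg, k_max, w_ca, k_pa):
    """Composite Attention Module sketch for X of shape (C, H, W)."""
    # Spatial attention map from channel-avg and channel-max summaries
    sa_map = sigmoid(conv2d_same(X.mean(axis=0), k_avg)
                     + conv2d_same(X.max(axis=0), k_max))
    # Channel attention weights from global average pooling (GAP)
    gap = X.mean(axis=(1, 2))                       # (C,)
    ca_vec = sigmoid(w_ca @ np.maximum(gap, 0.0))   # hypothetical linear layer
    # PA_1 = SA(X) + CA(X): sum of the two recalibrated feature maps
    pa1 = sa_map[None] * X + ca_vec[:, None, None] * X
    # Pixel attention: per-pixel sigmoid gate over the fused features
    pa_map = sigmoid(conv2d_same(pa1.mean(axis=0), k_pa))
    return pa_map[None] * pa1                       # CAM(X) = PA(SA(X) + CA(X))
```

Because every gate passes through a sigmoid, the module only re-weights features (attenuating or preserving them) rather than amplifying them without bound, which matches the coarse-to-fine recalibration role described above.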

3.3. DCTB

  • Deformable convolution
In UIE tasks, previous methods typically used standard or depthwise-separable convolutions to extract local features from images or sequences. These approaches generally perform well in normal conditions by capturing basic structures and patterns. However, they have limitations in more complex environments. Underwater images and biological structures often feature irregularities, blurriness, and distorted object boundaries, such as drift or fragmentation. In these cases, traditional convolutions, which rely on fixed receptive fields and static sampling patterns, struggle to capture non-rigid targets or locally distorted features. This results in reduced receptive fields, distorted context, and weaker semantic coherence.
To address these issues, we introduced Deformable Convolution into the UIE framework to improve the network’s ability to model complex spatial structures. Unlike standard convolutions, deformable convolution uses learnable offset parameters, allowing the sampling locations to adapt based on the feature distribution. This flexibility overcomes the limitations of fixed grid sampling, improving the model’s ability to handle irregular boundaries, non-rigid shapes, and complex lighting conditions found in underwater environments. Deformable Convolution, first proposed by Dai et al. [28] in “Deformable Convolutional Networks”, was designed to overcome the limitations of conventional CNNs in handling geometric transformations and spatial deformations in visual data. We are the first to integrate deformable convolution into the UIE task, aiming to leverage its adaptive spatial sampling capability to handle the irregular and non-rigid structures typical of underwater imagery.
The core idea of deformable convolution is to add learnable offsets to the sampling positions of standard convolution, enabling the sampling points to dynamically adjust their positions according to the input content, thereby achieving an adaptive receptive field, as shown in Figure 4. Specifically, for a traditional convolution operation, its output at position p 0 can be expressed as follows:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$$
Among them, R is the set of sampling points (such as the 9 positions of a 3 × 3 convolution), and w is the convolution weight. In deformable convolution, the sampling position becomes
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$
The offset term $\Delta p_n$ for each sampling location in the deformable convolution is learned from the input feature map using an additional convolutional layer. As illustrated in Figure 4, this mechanism allows the shape of the convolutional kernel to adapt dynamically—from a fixed rectangular grid to a flexible, data-driven configuration. This adaptability significantly enhances the model’s ability to handle the complex, cluttered, and occluded nature of underwater environments. Crucially, this deformation of the sampling grid introduces no extra kernel parameters. Instead, it enables the convolution to expand its effective receptive field without increasing computational complexity or model size. This makes the operation both computationally efficient and capable of capturing non-rigid patterns and spatial variations, which are prevalent in underwater scenes.
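The deformable sampling rule can be illustrated directly from the equation above. The sketch below is an assumption-laden toy: offsets are supplied by hand rather than predicted by the extra convolutional layer, and only one output position is evaluated; bilinear interpolation reads features at the fractional locations the offsets produce.

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly sample x at fractional location (py, px); zero outside."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    wy, wx = py - y0, px - x0
    def at(i, j):
        return x[i, j] if 0 <= i < H and 0 <= j < W else 0.0
    return ((1 - wy) * (1 - wx) * at(y0, x0) + (1 - wy) * wx * at(y0, x0 + 1)
            + wy * (1 - wx) * at(y0 + 1, x0) + wy * wx * at(y0 + 1, x0 + 1))

def deform_conv_at(x, w, p0, offsets):
    """y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n) for a k x k kernel.

    offsets: (k*k, 2) array of (delta_y, delta_x) per sampling point;
    in a real layer these are predicted by an additional convolution.
    """
    k = w.shape[0]
    r = k // 2
    y, n = 0.0, 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            oy, ox = offsets[n]
            n += 1
            y += w[dy + r, dx + r] * bilinear(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return y
```

With all offsets set to zero, the result reduces exactly to the standard convolution sum over the fixed grid $R$, which makes the relationship between the two equations above concrete.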
We design the DCTB (Figure 5) by combining deformable convolution and self-attention to capture both spatial adaptability and contextual dependencies. The module begins with a residual block based on deformable convolution. Unlike traditional convolutions, deformable convolution uses learnable offsets to dynamically shift sampling locations, allowing the receptive field to adapt to complex image geometries. This adaptability is particularly effective for capturing key features in irregular or deformable objects—common in underwater organisms and terrains with complex textures and edges. To address such challenges, deformable convolution enhances the model’s ability to extract robust features across diverse scales, orientations, and shapes. By enabling the kernel to change shape and position based on input features, it significantly improves flexibility in capturing salient cues from non-rigid regions. This capability is further strengthened by the Adaptive Dual-path Self-Attention Block (ADSB), which models long-range dependencies and emphasizes globally relevant features. Together, deformable convolution and self-attention allow the network to more effectively recognize and represent complex structures in scenes with occlusion, irregularity, or poor visibility.
We acknowledge that Deformable Convolutions typically involve higher memory access costs and inference latency. However, to maintain the lightweight nature of UCA-Net, we adopted a lightweight Deformable Convolution strategy within the DCTB. The Deformable Convolution is only applied to the critical feature extraction layers of the DCTB module, not the entire network. We constructed the feature extraction part of the Deformable Convolution using Depthwise-Separable Convolutions, which significantly reduces the number of parameters and computational load (FLOPs).
Despite a slight overhead in memory access, the geometric adaptability provided by Deformable Convolutions is essential for detail restoration in underwater images. Experimental results confirm the success of this strategy: the performance gain from DCTB (especially in detail recovery) far outweighs the minor increase in latency, and the overall inference speed of UCA-Net remains superior to most SOTA models, successfully achieving a balance between performance and efficiency.
Underwater images often suffer from global color distortion, inconsistent tones, low contrast, and uneven illumination, all of which impair effective feature representation. To address these issues, we propose the Adaptive Dual-path Self-Attention Block (ADSB), designed to regulate global color and lighting, thereby enhancing the refinement of global features. Unlike standard Transformer-based attention mechanisms, which compute dense interactions across all tokens and channels—often introducing redundant computation and noise—the ADSB adopts a more efficient strategy. Specifically, it uses a dual-branch sparse-dense attention mechanism with adaptive scaling, selectively allocating computational resources based on the importance and spatial distribution of features. This design reduces computational load while preserving essential semantic and structural information, allowing the model to focus on visually relevant areas in underwater scenes. The ADSB includes a sparse attention branch guided by a sparsity operator ρ 1 , which filters out low- or negatively correlated query-key-value features. This helps the model focus on high-response regions, enhancing texture and edge information. To ensure that important global cues are not missed, a dense attention branch, guided by a complementary operator ρ 0 , is introduced to capture long-range dependencies and maintain contextual continuity. This branch helps correct global illumination imbalance and color bias. To integrate both branches, we employ an adaptive fusion strategy using learnable weights, allowing the model to dynamically balance the contributions from sparse and dense attention depending on the content and enhancement demands. This mechanism ensures task-specific optimization for complex underwater image enhancement.
In the ADSB module, the input feature tensor $X \in \mathbb{R}^{H \times W \times C}$ is first processed via layer normalization to obtain the normalized representation $X_0$. This normalized tensor then passes through a 1 × 1 point-wise convolution followed by a 3 × 3 depth-wise convolution and flattening operation, producing the query ($Q \in \mathbb{R}^{C \times HW}$), key ($K \in \mathbb{R}^{C \times HW}$), and value ($V \in \mathbb{R}^{C \times HW}$) representations, where
$$Q, K, V = \mathrm{Flatten}(\mathrm{DConv}_{3\times 3}(\mathrm{PConv}(X_0)))$$
Then our dense attention branch obtains the global attention score through $\rho_0(\cdot) = \mathrm{Softmax}(\cdot)$:
$$DPA = \rho_0\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
Similarly, the sparse attention branch will also obtain the corresponding sparse attention score after normalization operations:
$$\rho_1(x) = \begin{cases} x^{2}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
$$SPA = \rho_1\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
Directly applying attention outputs from the two branches may lead to information loss and imbalanced weighting, which can impair the overall enhancement effect. To mitigate this, we introduce adaptive learnable parameters w 0 , w 1 to fuse the outputs of the sparse and dense branches. These parameters are normalized weights, enabling the model to dynamically balance the contributions of each branch based on feature relevance. This design ensures more stable and effective integration, maintaining the complementarity between localized saliency and global context.
$$\mathrm{Output} = w_0 \cdot \mathrm{DPA} + w_1 \cdot \mathrm{SPA}$$
This design enables the model to more effectively suppress irrelevant regions and assign greater attention to high-frequency feature areas, thereby improving the focus on salient structures. By introducing adaptive attention fusion, the model can dynamically adjust its reliance on sparse and dense attention branches during training. Such flexibility allows the network to automatically learn optimal attention patterns tailored to different tasks and datasets, ultimately achieving better generalization and performance across diverse conditions. This task-adaptive balance not only mitigates overfitting to redundant regions but also ensures robust feature enhancement in complex scenarios such as underwater imaging or low-visibility environments.
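The dual-branch computation described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the actual module applies attention over the $C \times HW$ projections with learnable fusion weights, whereas here we use plain token-style attention with fixed weights, and all function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relu_squared(x):
    # Sparsity operator rho_1: keep only positive correlations, squared.
    return np.where(x > 0, x * x, 0.0)

def adsb_attention(Q, K, V, w0=0.5, w1=0.5):
    """Dual-branch attention: dense Softmax branch (rho_0) plus a
    sparse squared-ReLU branch (rho_1), fused by weights w0, w1."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # scaled similarity map
    dpa = softmax(scores) @ V          # dense branch: global context
    spa = relu_squared(scores) @ V     # sparse branch: high-response regions
    return w0 * dpa + w1 * spa

# Toy example: 4 tokens with 8-dimensional features.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = adsb_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Setting `w1 = 0` recovers standard dense attention, while `w0 = 0` leaves only the sparse branch, which is how the adaptive weights trade off global context against localized saliency.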

3.4. FDFM

To address the challenge of effectively integrating complementary features from different convolutional modules in underwater image enhancement networks, we propose a novel Frequency-Domain Feature Fusion Module (FDFM). This module leverages discrete cosine transform (DCT) to decompose spatial features into frequency components, enabling more sophisticated feature integration compared to conventional element-wise addition or concatenation approaches.
  • DCT
The Discrete Cosine Transform (DCT) serves as the mathematical foundation of our frequency–domain fusion approach. DCT is an orthogonal linear transformation that converts spatial-domain signals into frequency–domain representations, where the basis functions are cosine waves of varying frequencies. Mathematically, for a 2D signal f(x,y) of size M × N, the 2D DCT is defined as follows:
$$F(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, \cos\!\left[\frac{\pi u (2x+1)}{2M}\right] \cos\!\left[\frac{\pi v (2y+1)}{2N}\right]$$
$$\alpha(u) = \begin{cases} \sqrt{1/M}, & u = 0 \\ \sqrt{2/M}, & u \neq 0 \end{cases} \qquad \alpha(v) = \begin{cases} \sqrt{1/N}, & v = 0 \\ \sqrt{2/N}, & v \neq 0 \end{cases}$$
The key advantage of DCT over other frequency transforms lies in its energy compaction property: for natural images, most of the signal energy is concentrated in the low-frequency coefficients located in the upper-left corner of the transformed matrix, while high-frequency components (typically representing noise and fine textures) are distributed in the lower-right region. This property makes DCT particularly suitable for image enhancement tasks, as it naturally separates structural information from noise and artifacts.
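The definition and the energy-compaction property can be checked with a short NumPy sketch (an orthonormal DCT-II built directly from the cosine basis above; this is an illustrative implementation, not the paper's PyTorch code):

```python
import numpy as np

def dct_matrix(n):
    """Rows C[k] are the DCT-II basis: alpha(k) * cos(pi*k*(2x+1) / (2n))."""
    k, x = np.arange(n)[:, None], np.arange(n)[None, :]
    C = np.cos(np.pi * k * (2 * x + 1) / (2 * n))
    C[0] *= np.sqrt(1.0 / n)      # alpha(0) = sqrt(1/n)
    C[1:] *= np.sqrt(2.0 / n)     # alpha(k) = sqrt(2/n) for k != 0
    return C

def dct2(f):
    """Separable 2D DCT-II: F = C_M f C_N^T."""
    Cm, Cn = dct_matrix(f.shape[0]), dct_matrix(f.shape[1])
    return Cm @ f @ Cn.T

# A smooth horizontal gradient: its energy should pile up in the
# low-frequency (upper-left) coefficients.
f = np.ones((8, 1)) * np.linspace(0.0, 1.0, 8)[None, :]
F = dct2(f)
ratio = np.abs(F[:2, :2]).sum() / np.abs(F).sum()
print(ratio > 0.9)  # True: energy compaction
```

For this smooth input, over 90% of the coefficient magnitude lands in the upper-left 2 × 2 block, which is exactly the property the FDFM exploits to separate structure from noise.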
In the context of our FDFM, DCT plays a crucial role in enabling frequency-selective feature fusion. By transforming both DCTB and DCRAC features into the frequency domain, we can decompose each feature map into four distinct frequency bands—low-frequency (LL), mid-low-frequency (LH), mid-high-frequency (HL), and high-frequency (HH) components—each capturing different aspects of image information. The low-frequency band primarily contains global structure, color, and illumination information, which is essential for maintaining the overall appearance and color fidelity in underwater image enhancement. The mid-frequency bands preserve edge details and texture patterns that define object boundaries and surface characteristics. The high-frequency band encodes fine textures and noise, which in underwater images often includes scattering artifacts and sensor noise that need to be selectively suppressed or enhanced. We selected DCT over alternatives like the Fast Fourier Transform (FFT) or Wavelet Transform, primarily based on its energy compaction property and the advantage of real-number operations. DCT concentrates most of the image information into a few low-frequency coefficients, allowing us to more effectively decouple and selectively enhance low-frequency information (corresponding to image color and illumination) and high-frequency information (corresponding to image details and texture) in the frequency domain. Compared to the complex number operations involved in FFT, the real-number nature of DCT makes it easier to implement and compute within a deep learning framework.
We employed a differentiable DCT implementation based on PyTorch, which is entirely constructed from standard tensor operations, avoiding complex custom operations. Consequently, its backpropagation is efficient and stable, integrating seamlessly into the network’s training pipeline without requiring extra computational overhead or special gradient handling, thus ensuring end-to-end optimization of the model training.
The frequency–domain representation allows our module to apply different fusion strategies to different frequency components, enabling more sophisticated feature integration than spatial-domain methods. Specifically, the adaptive attention mechanism can learn to emphasize low-frequency components when color correction is needed, enhance mid-frequency components for detail preservation, and suppress high-frequency components when noise reduction is required. This frequency-selective processing is particularly advantageous for underwater image enhancement, where different image regions may exhibit varying degrees of color distortion, contrast degradation, and noise contamination, requiring adaptive enhancement strategies that cannot be effectively achieved through uniform spatial-domain operations.
As shown in Figure 6, the FDFM architecture comprises four stages. First, both input feature maps from the two modules, $X_1 \in \mathbb{R}^{H \times W \times C}$ and $X_2 \in \mathbb{R}^{H \times W \times C}$, undergo a 2D DCT transformation and are decomposed into four frequency bands through spatial quadrant partitioning, with each band upsampled to the original dimensions via bilinear interpolation. Second, corresponding frequency bands from both inputs are element-wise added and processed through dedicated 1 × 1 convolutional filters, which serve as learnable frequency-domain filters that adaptively enhance or suppress specific components.
$$y_1 = \mathrm{Conv}_{1 \times 1}(\mathrm{DCT}(X_1) + \mathrm{DCT}(X_2))$$
Third, a frequency–domain attention mechanism generates adaptive weights: the fused features pass through adaptive average pooling, followed by two 1 × 1 convolutions with ReLU activation, and finally Softmax normalization to produce four attention weights corresponding to the frequency bands.
$$y_2 = \mathrm{Softmax}(\mathrm{ReLU}(\mathrm{Conv}_{1 \times 1}(\mathrm{Conv}_{1 \times 1}(\mathrm{Avg}(y_1)))))$$
Finally, the attention-weighted bands are concatenated and fused through a 1×1 convolution, batch normalization, and ReLU activation, then combined with the original features via a learnable residual connection. This residual connection preserves spatial-domain information, accelerates convergence, and enables adaptive balancing between frequency–domain and spatial-domain representations.
$$\mathrm{Output} = \mathrm{IDCT}\!\left(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(y_2))) + \frac{\mathrm{DCT}(X_1) + \mathrm{DCT}(X_2)}{2}\right)$$
This frequency–domain approach is particularly advantageous for underwater image enhancement, as it naturally separates noise (typically concentrated in high frequencies) from useful image content (distributed across low and mid frequencies), enabling selective enhancement while maintaining computational efficiency through fast, tensor-based DCT implementations.
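The quadrant-band fusion idea can be sketched in NumPy. This is a deliberately simplified stand-in: the per-band bilinear upsampling, the learnable 1 × 1 convolutions, batch normalization, and the residual connection are all omitted, and the band weights are supplied directly instead of being predicted by the attention sub-network.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix."""
    k, x = np.arange(n)[:, None], np.arange(n)[None, :]
    C = np.cos(np.pi * k * (2 * x + 1) / (2 * n))
    C[0] *= np.sqrt(1.0 / n)
    C[1:] *= np.sqrt(2.0 / n)
    return C

def dct2(f):
    C = dct_matrix(f.shape[0])
    return C @ f @ C.T            # square maps, for brevity

def idct2(F):
    C = dct_matrix(F.shape[0])
    return C.T @ F @ C            # orthonormal DCT: inverse = transpose

def fdfm_fuse(x1, x2, band_weights):
    """Add the DCT planes of two feature maps, scale each quadrant band
    (LL, LH, HL, HH) by an attention weight, then invert the transform."""
    F = dct2(x1) + dct2(x2)
    h, w = F.shape[0] // 2, F.shape[1] // 2
    wll, wlh, whl, whh = band_weights
    W = np.empty_like(F)
    W[:h, :w], W[:h, w:], W[h:, :w], W[h:, w:] = wll, wlh, whl, whh
    return idct2(W * F)

# Toy fusion: a smooth map plus a noisy variant; down-weighting the HH
# quadrant suppresses high-frequency noise in the fused result.
rng = np.random.default_rng(1)
x1 = np.outer(np.linspace(0.0, 1.0, 8), np.ones(8))
x2 = x1 + 0.1 * rng.standard_normal((8, 8))
weights = np.exp([1.0, 0.5, 0.5, -2.0])
weights /= weights.sum()          # softmax-normalized band weights
fused = fdfm_fuse(x1, x2, weights)
print(fused.shape)  # (8, 8)
```

With all band weights set to 1 the fusion degenerates to plain element-wise addition of the inputs, which makes explicit what the learned weighting adds over the conventional baseline.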

3.5. Loss Function

Due to the inherent complexity of underwater image degradation, relying on a single loss function is often inadequate for achieving optimal enhancement. Therefore, we adopt a composite loss function that combines L1 loss, perceptual loss, and SSIM loss, each assigned with a specific weight. This multi-objective design enables the model to simultaneously optimize for pixel-level accuracy, perceptual consistency, and structural similarity, ensuring balanced enhancement of both low-level details and high-level visual fidelity.

3.5.1. L1 Loss

The L1 loss computes the average pixel-wise absolute difference between the predicted image and the ground truth. It emphasizes overall similarity in color and brightness between the enhanced and reference images, promoting global consistency. The loss is formally defined as follows:
$$L_{l1} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$
Here, $y_i$ is the ground-truth value, $\hat{y}_i$ is the predicted value, and $N$ is the total number of pixels.
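As a worked example of the definition above (array values are arbitrary):

```python
import numpy as np

def l1_loss(y_true, y_pred):
    # Mean absolute pixel difference between ground truth and prediction.
    return np.abs(y_true - y_pred).mean()

y_gt   = np.array([[0.2, 0.4], [0.6, 0.8]])
y_pred = np.array([[0.3, 0.4], [0.5, 0.8]])
print(round(float(l1_loss(y_gt, y_pred)), 4))  # 0.05
```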

3.5.2. SSIM Loss

The Structural Similarity Index Measure (SSIM) loss compares the predicted and reference images in terms of luminance, contrast, and structural similarity within a sliding window. By mimicking the sensitivity of the human visual system to structural distortions, SSIM encourages the enhanced image to better preserve perceptual quality and local consistency. The SSIM loss is defined as follows:
$$L_{SSIM} = 1 - \frac{(2\mu_y \mu_{\hat{y}} + C_1)(2\sigma_{y\hat{y}} + C_2)}{(\mu_y^2 + \mu_{\hat{y}}^2 + C_1)(\sigma_y^2 + \sigma_{\hat{y}}^2 + C_2)}$$
Here, $\mu_y$ and $\mu_{\hat{y}}$ are the means over the local window (for luminance comparison), $\sigma_y$ and $\sigma_{\hat{y}}$ are the standard deviations (for contrast comparison), $\sigma_{y\hat{y}}$ is the covariance (for structural comparison), and $C_1$, $C_2$ are stabilizing constants.
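A minimal single-window version of this loss can be sketched as follows; the practical metric averages the same statistic over sliding local windows (typically Gaussian-weighted), which this sketch omits. The constants follow the common choice $C_1 = 0.01^2$, $C_2 = 0.03^2$ for unit-range images.

```python
import numpy as np

def ssim_loss(y, yh, c1=0.01**2, c2=0.03**2):
    """Global (single-window) SSIM loss, 1 - SSIM, over a whole image."""
    mu_y, mu_h = y.mean(), yh.mean()
    var_y, var_h = y.var(), yh.var()
    cov = ((y - mu_y) * (yh - mu_h)).mean()
    ssim = ((2 * mu_y * mu_h + c1) * (2 * cov + c2)) / (
        (mu_y**2 + mu_h**2 + c1) * (var_y + var_h + c2)
    )
    return 1.0 - ssim

img = np.linspace(0.0, 1.0, 64).reshape(8, 8)
print(ssim_loss(img, img))                 # 0.0: identical images
print(ssim_loss(img, 1.0 - img) > 1.0)     # True: inverted structure is penalized
```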

3.5.3. Perceptual Loss

The perceptual loss [28] measures high-level similarity between the enhanced image and the ground truth by comparing their feature representations extracted from a pre-trained network—specifically VGG-19 [29] in our case. Unlike pixel-based losses, it operates in the feature space, capturing semantic information such as texture and shape, and is more robust to spatial misalignments and color shifts. Formally, the loss is defined as follows:
$$L_{perc} = \frac{1}{C_j H_j W_j} \sum_{c=1}^{C_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \left| \phi_j(y)_{c,h,w} - \phi_j(\hat{y})_{c,h,w} \right|$$
Here, $\phi_j$ denotes the feature map extracted from the j-th layer of the pre-trained network. These feature maps encode hierarchical representations, ranging from low-level textures to high-level semantic structures, and serve as the basis for perceptual comparison in feature space.
The total loss is formulated as a weighted sum of the aforementioned three components—L1 loss, SSIM loss, and perceptual loss—with their respective weights denoted by $\alpha_1$, $\alpha_2$, and $\alpha_3$. These hyperparameters are empirically set to 0.2, 0.2, and 1.0, respectively, to balance the contributions of each term and ensure that all losses operate at a comparable scale. The total loss function is defined as follows:
$$L_{all} = \alpha_1 L_{l1} + \alpha_2 L_{SSIM} + \alpha_3 L_{perc}$$
These values were determined through a combination of limited grid search and empirical tuning. We initially fixed α 1 and α 2 to small values and then adjusted α 3 . We assigned a high weight to the perceptual loss based on the visual importance of underwater image enhancement. UIE requires not only pixel-level accuracy (L1 loss) but, more critically, the restoration of visually perceptible quality such as texture and structure. We found that a higher perceptual loss weight effectively guides the network to learn higher-level feature representations, resulting in images that are visually more natural and less prone to artifacts. Lower weights for perceptual loss led to images that appeared “smooth” but lacked realism, while the high weight better balances the enhancement effect and visual fidelity, which is crucial for subsequent visual analysis tasks.
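To make the weighting concrete, a small numerical example (the individual loss values here are hypothetical; only the weights come from the paper):

```python
# Hypothetical per-batch loss values, purely to illustrate the weighting.
l1_term, ssim_term, perc_term = 0.05, 0.12, 0.30
a1, a2, a3 = 0.2, 0.2, 1.0    # weights reported in the paper

total = a1 * l1_term + a2 * ssim_term + a3 * perc_term
print(round(total, 3))  # 0.334
```

Note how the perceptual term dominates the sum even when its raw value is comparable to the other terms, reflecting the visual-fidelity emphasis discussed above.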

4. Experiments

4.1. Dataset

We evaluate UCA-Net using four public datasets: UIEB [30] (800 pairs), EUVP [22] (1050 pairs), UFO [31] (1000 pairs), and LSUI [32] (1000 pairs).
(1)
UIEB contains 950 real-world underwater images, including 890 paired samples generated via 12 methods and 60 unpaired challenging cases.
(2)
EUVP offers large-scale paired and unpaired images from varied underwater scenes, supporting both supervised and unsupervised tasks.
(3)
LSUI includes 4212 public underwater images.
(4)
UFO provides 1200 paired samples and a 120-image unpaired subset (UFO-120) for benchmarking.
Seven test sets are used:
(1)
EUVP-T100: 100 paired images for reference-based testing.
(2)
EUVP-R100: 100 unpaired images for no-reference testing.
(3)
UIEB-T90: 90 paired samples for quantitative evaluation.
(4)
UIEB-R60: 60 unpaired images for perceptual testing.
(5)
LSUI-T100: 100 paired images for supervised testing.
(6)
UFO-120: 120 unpaired samples from the UFO dataset.
(7)
U45 [30]: 45 diverse unpaired images as a challenging no-reference set.
This comprehensive protocol ensures robust evaluation of UCA-Net across various paired/unpaired conditions and underwater scenes.

4.2. Experimental Environment

All experiments are conducted on a workstation equipped with an RTX 4070 GPU, a 3.00 GHz AMD Ryzen 9 7845HX CPU, and 16 GB RAM. The software environment includes CUDA 12.3, cuDNN 9.0, PyTorch 2.3.1, and Python 3.11. During training, we use 100 epochs with a batch size of 1. We avoid the use of traditional batch normalization (BN) layers in UCA-Net. Instead, we employ layer normalization, or in some modules remove the normalization layer altogether, relying only on the inherent stability of residual connections and depth-wise separable convolutions. For the Transformer module, we use standard layer normalization, which is independent of the batch size. Overall, we minimize the use of BN layers in the overall network architecture. The initial learning rate is set to 2 × 10−4, and all training images are randomly cropped to 256 × 256 resolution before being fed into the network.

4.3. Evaluation Metrics

We evaluate visual quality using five metrics: PSNR [33], SSIM [34], UIQM [35], UCIQE [36], and CCF [37]. PSNR and SSIM are full-reference metrics applied to paired sets (EUVP-T100, UIEB-T90, UFO-120, LSUI-T100). UIQM and UCIQE are no-reference metrics assessing contrast, color, and sharpness. CCF is a recent no-reference metric measuring colorfulness and contrast. Together, these metrics comprehensively assess accuracy, structure, and aesthetics under varied underwater conditions.
To quantitatively evaluate the performance of UCA-Net, we employ the Peak Signal-to-Noise Ratio (PSNR) as a primary metric for image fidelity. PSNR is defined via the Mean Squared Error (MSE) between the enhanced image $I_{en}$ and the ground-truth image $I_{gt}$. For an image of size M × N, the MSE and PSNR are calculated as follows:
$$\mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ I_{gt}(i,j) - I_{en}(i,j) \right]^2$$
$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$$
where M A X I represents the maximum possible pixel value of the image (e.g., 255 for 8-bit representations). A higher PSNR value indicates that the enhanced image is closer to the reference ground truth in terms of pixel-level accuracy.
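The two formulas compose directly; a short NumPy example with a known uniform error:

```python
import numpy as np

def psnr(i_gt, i_en, max_i=255.0):
    """PSNR in dB from the per-pixel MSE against the ground truth."""
    mse = np.mean((i_gt.astype(np.float64) - i_en.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")              # identical images
    return 10.0 * np.log10(max_i**2 / mse)

gt = np.full((4, 4), 100.0)
en = gt + 5.0                            # uniform error of 5 -> MSE = 25
print(round(psnr(gt, en), 2))            # 34.15 dB
```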
During testing, we use both full- and no-reference metrics. PSNR and SSIM evaluate pixel-level fidelity and structural similarity, applied to paired sets (EUVP-T100, UIEB-T90, UFO-120, LSUI-T100) as they require ground truth. UIQM and UCIQE, suitable for unpaired sets (U45, UIEB-R60, UFO-120), assess color, sharpness, and contrast without reference. UIQM combines colorfulness, sharpness, and contrast via weighted scoring; higher scores reflect better visual quality. UCIQE linearly integrates chroma, saturation, and contrast, capturing typical underwater degradations. CCF, a recent metric, measures colorfulness and spatial contrast, aligning with perceived visual appeal. Together, these metrics offer a balanced evaluation across fidelity and perceptual aspects.

4.4. Compared Methods

To benchmark the performance of our network, we compare it against seven representative methods, including one traditional enhancement algorithm and six deep learning-based models. Traditional method: DCP (Dark Channel Prior) [38]. Deep learning-based methods: UWNet [18], FUnIE-GAN [22], DeepWaveNet [19], LitenhanceNet [39], DCSS-Net [40], and HisMamba [41]. For the traditional method (DCP), we directly evaluate the authors’ official implementation on our test sets. For deep learning-based methods, we utilize the official source code and pretrained weights provided by the respective authors to ensure fair and reproducible comparisons.

4.5. Qualitative Comparison

We conduct a qualitative comparison of all selected methods across six representative test sets. These sets cover both paired and unpaired scenarios under diverse underwater conditions.
Figure 7 shows results on EUVP-T100. Visually, DCP suffers from severe color cast and distortion. UWNet shows low contrast. FUnIE-GAN restores color and contrast but introduces noise and blur. DeepWaveNet and LitenhanceNet show under-enhancement. In contrast, UCA-Net achieves the best color fidelity and structural detail, performing robustly under challenging underwater conditions.
Figure 8 shows qualitative results on EUVP-R50. DCP exhibits severe color cast with poor enhancement. UWNet and DeepWaveNet correct color moderately but cause blurring and texture loss. FUnIE-GAN introduces artifacts, degrading visual quality. LitenhanceNet under-enhances, with blur and detail loss. In contrast, UCA-Net restores natural colors while preserving textures and details, yielding the best perceptual quality.
Figure 9 illustrates results on UIEB-T90. DCP performs poorly in both color and detail. UWNet and FUnIE-GAN show yellow over-enhancement due to blue-green suppression. LitenhanceNet under-enhances. UCA-Net and DeepWaveNet produce the best results, though DeepWaveNet still loses saturation and detail.
Figure 10 shows results on UIEB-R60. DCP and UWNet fail in color correction, reducing brightness. FUnIE-GAN enhances better but over-amplifies red tones (e.g., first image). DeepWaveNet shows unnatural color balance in challenging scenes (e.g., third image). LitenhanceNet produces good results but suffers from reduced contrast and noise, causing blur. Overall, LitenhanceNet and UCA-Net perform best, offering balanced enhancement with natural colors and preserved details.
Figure 11 shows the results on UFO-T120. DCP introduces color cast and artifacts, degrading visual quality. DeepWaveNet and LitenhanceNet fail to correct blue-green deviations (e.g., third image). UWNet enhances well overall but lacks accurate white balance and detail preservation. FUnIE-GAN and UCA-Net deliver the best results, achieving superior color fidelity and texture retention.
Figure 12 shows the results on LSUI-T100. DCP yields poor enhancement, with major color and clarity deviation. UWNet and FUnIE-GAN improve some aspects but cause a yellow tint and color distortion. DeepWaveNet restores color better but blurs details. In contrast, UCA-Net and other top methods balance color correction and detail preservation well, suppressing artifacts. UCA-Net achieves the most natural and visually pleasing results.
Figure 13 shows the results on U45. Input images suffer from severe color distortion and blur. DCP fails to correct these, leaving color cast and artifacts. UWNet improves color slightly but does not reduce blur. FUnIE-GAN and DeepWaveNet suppress green poorly and over-enhance red (e.g., third image). LitenhanceNet improves color but retains blur and introduces noise, causing detail loss. In contrast, UCA-Net achieves the best color and sharpness restoration, effectively handling underwater degradation and preserving fine details. Figure 14 shows the results on different blue and green partial pictures of the UIEB test set, which shows the contrast results in the same color tones.

4.6. Quantitative Comparison

We conduct a quantitative comparison on four benchmark datasets: EUVP-T100, UIEB-T90, UFO-T100, and LSUI-T100. As shown in Table 1, we report the average PSNR and SSIM for each method. These datasets offer reliable ground-truth references for quantitative evaluation. Results show that traditional methods underperform due to limitations of handcrafted priors. Among deep learning methods, UCA-Net ranks first or second on all datasets, indicating balanced performance in fidelity and perceptual quality. These results align with the qualitative analysis in Figure 7, Figure 9, Figure 11, and Figure 12, confirming the strength of UCA-Net in color, texture, and detail enhancement.
We conduct no-reference evaluations on EUVP-R50, UIEB-R60, and U45 using UIQM, UCIQE, and CCF (Table 2), which assess color, contrast, and sharpness. Among eight methods, UCA-Net maintains balanced performance. Although it shows visual superiority in qualitative comparisons, it does not consistently lead in all metrics due to the heuristic nature of these scores, which may misalign with human perception in complex scenes. On EUVP-R50, UCA-Net scored 2.98 (UIQM) and 0.606 (UCIQE), indicating good quality. It achieves the top UIQM and UCIQE on UIEB-R60, confirming its effectiveness. Notably, UCA-Net’s UCIQE varies little across datasets, suggesting stable performance under diverse underwater conditions.

4.7. Detailed Functional Evaluation

In this section, we validate the effectiveness of UCA-Net from two perspectives: color fidelity and detail preservation.
As shown in Figure 15, we compare color histograms of reference and enhanced images from UIEB. UCA-Net shows the best color alignment, with distributions closely matching the ground truth, indicating effective natural color restoration under underwater distortions.
In Figure 16, we evaluate texture restoration by zooming into regions of enhanced UIEB images. DCP degrades texture during color adjustment. UWNet produces over-smoothed results lacking depth. FUnIE-GAN fails to maintain structure in complex regions. In contrast, UCA-Net preserves fine details and edges, achieving sharp, natural enhancements even in challenging conditions.

4.8. Ablation Research

To further verify the effectiveness of the proposed modules in UCA-Net, we conducted a comprehensive ablation study. All experiments were performed on the UIEB dataset, with UIEB-T90 used as the test set. The evaluation was based on two full-reference metrics: PSNR and SSIM. Table 3 presents the results, comparing performance across multiple model variants, each with one specific component removed or altered. The results validate the importance and contribution of each module to the network’s overall performance in underwater image enhancement. The specific experiments are as follows:
(1)
No. 1: “w/o DCRAC” indicates that DCRAC has been removed.
(2)
No. 2: “w/o DCTB” indicates that DCTB has been removed.
(3)
No. 3: “w/o ADSB” indicates that ADSB has been removed.
(4)
No. 4: “w/o CAM” indicates that CAM has been removed.
(5)
No. 5: “w/o DC” indicates that the deformable convolution in DCTB has been replaced by a regular convolution.
(6)
No. 6: “w/o FDFM” indicates that FDFM has been replaced by simple per-pixel addition.
(7)
No. 7: Loss Function Experiment.
As shown in Table 3, the full UCA-Net achieves PSNR 24.75 and SSIM 0.89 on UIEB-T90. Removing the DCRAC module drops PSNR by 1.13 and SSIM by 0.11, confirming its effectiveness in reducing noise and enhancing local details via residual convolution and composite attention. Removing ADSB lowers PSNR by 1.12 and SSIM by 0.05, highlighting its role in global color and illumination adjustment. Excluding CAM yields PSNR 24.53 and SSIM 0.81, showing its importance for texture and spatial consistency. Replacing deformable convolutions in DCTB causes performance degradation, proving their benefit for handling irregular underwater textures. Removing FDFM also weakens the results, validating its necessary role in feature fusion targeting global versus local features. These results validate the architectural design of UCA-Net, and each module is crucial to improving the quality.
In Table 4, ablation on the loss function shows that perceptual loss has the greatest impact when removed. L1 and SSIM loss show minor effects, especially under equal weighting. Optimal results occur when α1 = 0.2, α2 = 0.2, and α3 = 1, indicating perceptual loss should dominate.

4.9. Downstream Visual Applications

To verify the specific application effects of our method on image enhancement, we applied it to object detection and image segmentation tasks.
In the object detection task, we trained YOLOv5 [42] on the Aquarium dataset. As shown in Figure 17, test images with detection metrics allow visual comparison across seven methods. DCP suffers from strong color bias, missing small targets. UWNet and FUnIE-GAN over-enhance red, while LitenhanceNet introduces noise and degrades details, leading to missed detections. In contrast, our method detects small, blurred, and low-contrast targets more effectively. We have placed the corresponding quantitative comparison results in Table 5.
For semantic segmentation, we trained U-Net [43] on SUIM [44]. Figure 18 shows segmentation results with labeled objects. DCP produces a greenish bias, hindering edge detection. UWNet and DeepWaveNet cause color turbidity, impairing accuracy. Our method preserves color and edge features better, supporting more accurate segmentation.

4.10. Complexity Comparison

Model complexity is measured by parameter count (Params) and Multiply–Accumulate Operations (MACs). We compared these across methods (Table 6). Though our model has higher complexity than traditional CNNs, it outperforms other self-attention-based models in both metrics. This shows that UCA-Net achieves a lightweight yet effective design, balancing performance and efficiency.

5. Conclusions

In this paper, we presented UCA-Net, an innovative lightweight CNN-Transformer hybrid network designed to address the core degradation problems prevalent in underwater images, specifically color casts, low contrast, and blurred details. It features an Adaptive Dual-path Self-Attention Block (ADSB) with deformable convolution-based residuals to enhance global features for underwater color and light correction. A composite attention module combines three complementary attentions to refine fine details. The parallel dual-attention design jointly handles global and local restoration, addressing distortion and texture loss. The frequency-domain feature fusion mechanism fuses the outputs of the parallel attention branches in the frequency domain. Comparative and ablation studies confirm that UCA-Net consistently outperforms existing methods in both image quality and detail recovery. Experimental results on multiple public datasets demonstrate the superiority of UCA-Net. Notably, on the UIEB dataset, our model achieves a PSNR of 24.75 dB and an SSIM of 0.89, outperforming several state-of-the-art methods while maintaining an extremely low parameter count of only 1.44M.
Despite strong results, UCA-Net’s lightweight structure can be further optimized. For future work, our efforts will focus on further enhancing UCA-Net’s practicality and generalization capability. To make UCA-Net suitable for real-time applications or resource-constrained platforms, we will explore network compression strategies, such as employing Post-Training Quantization (PTQ) or Knowledge Distillation, to substantially reduce model size and inference latency without significantly degrading performance. Furthermore, to assess and improve UCA-Net’s generalization capability for other types of degraded images, we will investigate cross-domain adaptation strategies, such as fine-tuning the model using unsupervised domain adaptation techniques, enabling it to better adapt to image enhancement tasks in non-underwater scenarios (e.g., hazy or low-light conditions), thereby broadening the application scope of UCA-Net.

Author Contributions

Methodology, software, data curation, writing—original draft preparation, C.Y.; validation, J.Z., G.L. and L.W.; investigation, J.Z.; resources, Z.D.; writing—review and editing, C.Y. and J.Z.; supervision, J.Z. and L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in UIEB at https://doi.org/10.1109/TIP.2019.2955241, reference number [30]; EUVP at https://api.semanticscholar.org/CorpusID:85498726 accessed on 7 January 2026, reference number [22], UFO at https://doi.org/10.15607/RSS.2020.XVI.018, reference number [31], and LSUI at https://doi.org/10.1109/TIP.2023.3276332, reference number [32].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DCRAC: Depthwise-Separable Convolutional Residual Attention Composite Block
DCTB: Deformable Convolution-Transformer Block
FDFM: Frequency-Domain Feature Fusion Module
PTCHM: Parallel Transformer-CNN Hybrid Module
ADSB: Adaptive Dual-path Self-Attention Block

References

1. Zhou, J.C.; Zhang, D.H.; Zhang, W.S. Classical and state-of-the-art approaches for underwater image defogging: A comprehensive survey. Front. Inf. Technol. Electron. Eng. 2020, 21, 1745–1769.
2. Liu, J.; Li, S.; Zhou, C.; Cao, X.; Gao, Y.; Wang, B. SRAF-Net: A scene-relevant anchor-free object detection network in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5405914.
3. Huang, Z.; Wan, L.; Sheng, M.; Zou, J.; Song, J. An underwater image enhancement method for simultaneous localization and mapping of autonomous underwater vehicle. In Proceedings of the 3rd International Conference on Robotics, Automation and Sciences (ICORAS), Wuhan, China, 1–3 June 2019; pp. 137–142.
4. Marengo, M.; Durieux, E.D.H.; Marchand, B.; Francour, P. A review of biology, fisheries and population structure of Dentex dentex (Sparidae). Rev. Fish Biol. Fish. 2014, 24, 1065–1088.
5. Li, C.Y.; Guo, J.C.; Cong, R.M.; Pang, Y.W.; Wang, B. Underwater image enhancement by dehazing with minimum information loss and histogram distribution prior. IEEE Trans. Image Process. 2016, 25, 5664–5677.
6. Fazal, S.; Khan, D. Underwater image enhancement using bi-histogram equalization with fuzzy plateau limit. In Proceedings of the 2021 7th International Conference on Signal Processing and Communication (ICSC), Noida, India, 25–27 November 2021; pp. 261–266.
7. Ding, X.; Wang, Y.; Zhang, J.; Fu, X. Underwater image dehaze using scene depth estimation with adaptive color correction. In Proceedings of the OCEANS, Aberdeen, UK, 19–22 June 2017; pp. 1–5.
8. Chen, L.; Jiang, Z.; Tong, L.; Liu, Z.; Zhao, A.; Zhang, Q.; Dong, J.; Zhou, H. Perceptual underwater image enhancement with deep learning and physical priors. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 3078–3092.
9. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110.
10. Hitam, M.S.; Awalludin, E.A.; Yussof, W.N.J.H.W.; Bachok, Z. Mixture contrast limited adaptive histogram equalization for underwater image enhancement. In Proceedings of the International Conference on Computer Applications Technology (ICCAT), Sousse, Tunisia, 20–22 January 2013; pp. 1–5.
11. Zhang, W.; Jin, S.; Zhuang, P.; Liang, Z.; Li, C. Underwater image enhancement via piecewise color correction and dual prior optimized contrast enhancement. IEEE Signal Process. Lett. 2023, 30, 229–233.
12. Ancuti, C.; Ancuti, C.O.; Haber, T.; Bekaert, P. Enhancing underwater images and videos by fusion. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 81–88.
13. Zhang, S.; Wang, T.; Dong, J.; Yu, H. Underwater image enhancement via extended multi-scale Retinex. Neurocomputing 2017, 245, 1–9.
14. Drews, J.; Nascimento, E.; Moraes, F.; Botelho, S.; Campos, M. Transmission estimation in underwater single images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, Sydney, NSW, Australia, 2–8 December 2013; pp. 825–830.
15. Galdran, A.; Pardo, D.; Picon, A.; Alvarez-Gila, A. Automatic red-channel underwater image restoration. J. Vis. Commun. Image Represent. 2015, 26, 132–145.
16. Peng, Y.T.; Cosman, P.C. Underwater image restoration based on image blurriness and light absorption. IEEE Trans. Image Process. 2017, 26, 1579–1594.
17. Li, C.; Anwar, S.; Porikli, F. Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 2020, 98, 107038.
18. Naik, A.; Swarnakar, A.; Mittal, K. Shallow-uwnet: Compressed model for underwater image enhancement (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 15853–15854.
19. Sharma, P.; Bisht, I.; Sur, A. Wavelength-based attributed deep neural network for underwater image restoration. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–23.
20. Qi, Q.; Zhang, Y.; Tian, F.; Wu, Q.J.; Li, K.; Luan, X.; Song, D. Underwater image co-enhancement with correlation feature matching and joint learning. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1133–1147.
21. Yang, M.; Hu, K.; Du, Y.; Wei, Z.; Sheng, Z.; Hu, J. Underwater image enhancement based on conditional generative adversarial network. Signal Process. Image Commun. 2020, 81, 115723.
22. Islam, M.J.; Xia, Y.; Sattar, J. Fast underwater image enhancement for improved visual perception. IEEE Robot. Autom. Lett. 2020, 5, 3227–3234.
23. Li, C.; Guo, J. Underwater image enhancement by dehazing and color correction. J. Electron. Imaging 2015, 24, 033023.
24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
25. Qu, J.; Cao, X.; Jiang, S.; You, J.; Yu, Z. UIEFormer: Lightweight vision transformer for underwater image enhancement. IEEE J. Ocean. Eng. 2025, 50, 851–865.
  26. Ren, T.; Xu, H.; Jiang, G.; Yu, M.; Zhang, X.; Wang, B.; Luo, T. Reinforced Swin-Convs transformer for simultaneous underwater sensing scene image enhancement and super-resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4209616. [Google Scholar] [CrossRef]
  27. Liang, Y.; Li, L.; Zhou, Z.; Tian, L.; Xiao, X.; Zhang, H. Underwater image enhancement via adaptive bi-level color-based adjustment. IEEE Trans. Instrum. Meas. 2025, 74, 5018916. [Google Scholar] [CrossRef]
  28. Johnson, J.; Alahi, A.; Li, F.-F. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 694–711. [Google Scholar]
  29. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  30. Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S.; Tao, D. An underwater image enhancement benchmark dataset and beyond. IEEE Trans. Image Process. 2019, 29, 4376–4389. [Google Scholar] [CrossRef] [PubMed]
  31. Islam, M.J.; Luo, P.; Sattar, J. Simultaneous enhancement and super-resolution of underwater imagery for improved visual perception. In Proceedings of the 16th Robotics: Science and Systems (RSS), Online, 12–16 July 2020. [Google Scholar]
  32. Peng, L.; Zhu, C.; Bian, L. U-shape transformer for underwater image enhancement. IEEE Trans. Image Process. 2023, 32, 3066–3079. [Google Scholar] [CrossRef]
  33. Korhonen, J.; You, J. Peak signal-to-noise ratio revisited: Is simple beautiful? In Proceedings of the 2012 Fourth International Workshop on Quality of Multimedia Experience, Melbourne, Australia, 5–7 July 2012; pp. 37–38. [Google Scholar]
  34. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  35. Panetta, K.; Gao, C.; Agaian, S. Human-visual-system-inspired underwater image quality measures. IEEE J. Ocean. Eng. 2015, 41, 541–551. [Google Scholar] [CrossRef]
  36. Yang, M.; Sowmya, A. An underwater color image quality evaluation metric. IEEE Trans. Image Process. 2015, 24, 6062–6071. [Google Scholar] [CrossRef] [PubMed]
  37. Wang, Y.; Li, N.; Li, Z.; Gu, Z.; Zheng, H.; Zheng, B.; Sun, M. An imaging-inspired no-reference underwater color image quality assessment metric. Comput. Electr. Eng. 2018, 70, 904–913. [Google Scholar] [CrossRef]
  38. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2341–2353. [Google Scholar]
  39. Zhang, S.; Zhao, S.; An, D.; Li, D.; Zhao, R. Liteenhancenet: A lightweight network for real-time single underwater image enhancement. Expert Syst. Appl. 2024, 240, 122546. [Google Scholar] [CrossRef]
  40. Liu, Y.; Yao, F.; Wang, P.; Huang, F. DCSS-Net: Depth and color space synergy network for underwater image enhancement. Digit. Signal Process. 2026, 170, 105760. [Google Scholar] [CrossRef]
  41. Ma, H.; Zhou, J.; Kong, L.; Zhang, D.; Chen, G.; Jiang, Q. HisMamba: Positive noise guided structural–color collaborative modeling for underwater image enhancement. Neurocomputing 2026, 666, 132298. [Google Scholar] [CrossRef]
  42. Justus, D.; Brennan, J.; Bonner, S.; McGough, A.S. Predicting the computational cost of deep learning models. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 3873–3882. [Google Scholar]
  43. Islam, M.J.; Edge, C.; Xiao, Y.; Luo, P.; Mehtaz, M.; Morse, C. Semantic segmentation of underwater imagery: Dataset and benchmark. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 1769–1776. [Google Scholar]
  44. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16901–16911. [Google Scholar]
Figure 1. Compact overall architecture flowchart of UCA-Net.
Figure 2. Architecture of UCA-Net. (a) Overall structure, (b) PTCHM, (c) DSCRC, (d) DCTB.
Figure 3. Schematic diagram of the Composite Attention Module (CAM).
Figure 4. (a) Schematic diagram of the deformable convolution principle. (b) Structure diagram of deformable convolution.
Figure 5. Schematic diagram of the adaptive sparse self-attention block (ADSB).
Figure 6. Schematic diagram of the Frequency-Domain Feature Fusion Module (FDFM).
Figure 7. Visual comparison of different methods on the EUVP-T100 test set. (a) Input images, (b) Reference images, (c) DCP, (d) UWNet, (e) FUnIEGAN, (f) DeepWaveNet, (g) LiteEnhanceNet, (h) DCSS-Net, (i) HisMamba, (j) Ours.
Figure 8. Visual comparison of different methods on the EUVP-R60 test set. (a) Input images, (b) DCP, (c) UWNet, (d) FUnIEGAN, (e) DeepWaveNet, (f) LiteEnhanceNet, (g) DCSS-Net, (h) HisMamba, (i) Ours.
Figure 9. Visual comparison of different methods on the UIEB-T90 test set. (a) Input images, (b) Reference images, (c) DCP, (d) UWNet, (e) FUnIEGAN, (f) DeepWaveNet, (g) LiteEnhanceNet, (h) DCSS-Net, (i) HisMamba, (j) Ours.
Figure 10. Visual comparison of different methods on the UIEB-R60 test set. (a) Input images, (b) DCP, (c) UWNet, (d) FUnIEGAN, (e) DeepWaveNet, (f) LiteEnhanceNet, (g) DCSS-Net, (h) HisMamba, (i) Ours.
Figure 11. Visual comparison of different methods on the UFO-T120 test set. (a) Input images, (b) Reference images, (c) DCP, (d) UWNet, (e) FUnIEGAN, (f) DeepWaveNet, (g) LiteEnhanceNet, (h) DCSS-Net, (i) HisMamba, (j) Ours.
Figure 12. Visual comparison of different methods on the LSUI-T100 test set. (a) Input images, (b) Reference images, (c) DCP, (d) UWNet, (e) FUnIEGAN, (f) DeepWaveNet, (g) LiteEnhanceNet, (h) DCSS-Net, (i) HisMamba, (j) Ours.
Figure 13. Visual comparison of different methods on the U45 test set. (a) Input images, (b) DCP, (c) UWNet, (d) FUnIEGAN, (e) DeepWaveNet, (f) LiteEnhanceNet, (g) DCSS-Net, (h) HisMamba, (i) Ours.
Figure 14. Visual comparison of different methods on blue-toned and green-toned images from the UIEB test set. (a) Input images, (b) Reference images, (c) DCP, (d) UWNet, (e) FUnIEGAN, (f) DeepWaveNet, (g) LiteEnhanceNet, (h) DCSS-Net, (i) HisMamba, (j) Ours.
Figure 15. Visual comparison of the color histograms of images enhanced by the different methods; the red, green, and blue curves correspond to the histograms of the respective color channels. (a) Input images, (b) DCP, (c) UWNet, (d) FUnIEGAN, (e) DeepWaveNet, (f) LiteEnhanceNet, (g) DCSS-Net, (h) HisMamba, (i) Ours.
Figure 16. Magnified detail comparison of images enhanced by the different methods. (a) DCP, (b) UWNet, (c) FUnIEGAN, (d) DeepWaveNet, (e) LiteEnhanceNet, (f) DCSS-Net, (g) HisMamba, (h) Ours.
Figure 17. Underwater target detection results of YOLOv5 on images enhanced by different methods. (a) DCP, (b) UWNet, (c) FUnIEGAN, (d) LiteEnhanceNet, (e) DeepWaveNet, (f) Ours.
Figure 18. Segmentation results of DeepLabv3 on images enhanced by different methods. (a) DCP, (b) UWNet, (c) FUnIEGAN, (d) LiteEnhanceNet, (e) DeepWaveNet, (f) Ours.
Table 1. Quantitative comparison on all full-reference metrics over the four test sets EUVP-T100, UIEB-T90, UFO-T120, and LSUI-T100. The red and blue numbers denote the best and second-best results, respectively.

| Dataset | Metric | DCP | UWNet | FUnIEGAN | DeepWaveNet | LiteEnhanceNet | DCSS-Net | HisMamba | Ours |
|---|---|---|---|---|---|---|---|---|---|
| EUVP-T100 | PSNR | 13.85 | 25.82 | 25.96 | 25.42 | 20.69 | 27.03 | 26.91 | 27.41 |
| | SSIM | 0.48 | 0.74 | 0.78 | 0.81 | 0.74 | 0.81 | 0.82 | 0.86 |
| UIEB-T90 | PSNR | 14.69 | 18.03 | 19.59 | 22.89 | 23.03 | 23.83 | 24.51 | 24.75 |
| | SSIM | 0.67 | 0.71 | 0.72 | 0.82 | 0.87 | 0.86 | 0.91 | 0.89 |
| UFO-T120 | PSNR | 14.82 | 25.21 | 25.83 | 20.79 | 20.24 | 25.66 | 25.73 | 26.86 |
| | SSIM | 0.58 | 0.75 | 0.77 | 0.74 | 0.73 | 0.80 | 0.82 | 0.85 |
| LSUI-T100 | PSNR | 15.60 | 24.76 | 25.89 | 26.75 | 21.69 | 26.47 | 26.39 | 27.26 |
| | SSIM | 0.64 | 0.82 | 0.84 | 0.88 | 0.83 | 0.83 | 0.82 | 0.87 |
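The PSNR and SSIM scores above follow the standard definitions [33,34]. A minimal NumPy sketch of both metrics is given below; the helper names are illustrative, and `ssim_global` is a simplified single-window variant (the reported SSIM conventionally averages an 11×11 Gaussian-windowed version).

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    # Peak signal-to-noise ratio in dB (higher is better).
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(ref, test, max_val=255.0):
    # Global (whole-image) SSIM: luminance/contrast/structure terms
    # computed once over the full image instead of per window.
    x, y = ref.astype(np.float64), test.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

For example, two constant images differing by 10 gray levels give an MSE of 100 and hence a PSNR of 10·log10(255²/100) ≈ 28.13 dB.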
Table 2. Quantitative comparison on all no-reference metrics over the three test sets EUVP-R50, UIEB-R60, and U45. The red and blue numbers denote the best and second-best results, respectively; the green number denotes the third-best.

| Dataset | Metric | DCP | UWNet | FUnIEGAN | DeepWaveNet | LiteEnhanceNet | DCSS-Net | HisMamba | Ours |
|---|---|---|---|---|---|---|---|---|---|
| EUVP-R50 | UIQM | 1.54 | 3.03 | 2.92 | 3.12 | 3.03 | 3.01 | 3.07 | 2.98 |
| | UCIQE | 0.571 | 0.585 | 0.594 | 0.592 | 0.619 | 0.620 | 0.612 | 0.606 |
| | CCF | 0.088 | 0.989 | 0.839 | 0.997 | 0.824 | 0.823 | 0.983 | 0.895 |
| UIEB-R60 | UIQM | 1.40 | 2.63 | 3.02 | 2.78 | 2.98 | 3.02 | 3.09 | 3.12 |
| | UCIQE | 0.573 | 0.568 | 0.589 | 0.608 | 0.603 | 0.613 | 0.601 | 0.615 |
| | CCF | 0.206 | 1.087 | 1.081 | 0.856 | 0.859 | 0.877 | 0.913 | 0.929 |
| U45 | UIQM | 1.56 | 2.79 | 3.05 | 2.89 | 3.31 | 3.18 | 3.29 | 3.19 |
| | UCIQE | 0.539 | 0.528 | 0.559 | 0.533 | 0.583 | 0.622 | 0.632 | 0.604 |
| | CCF | 0.066 | 1.015 | 0.892 | 0.951 | 0.823 | 0.879 | 0.989 | 0.880 |
Table 3. Quantitative results of the network-structure ablation study, reported as average PSNR/SSIM on the UIEB dataset.

| Setting | Detail | PSNR/SSIM (UIEB-T90) |
|---|---|---|
| Full model | – | 24.75/0.89 |
| No. 1 | w/o DSCRC | 23.62/0.78 |
| No. 2 | w/o DCTB | 23.57/0.82 |
| No. 3 | w/o ADSB | 23.63/0.90 |
| No. 4 | w/o CAM | 24.53/0.81 |
| No. 5 | w/o DC | 24.60/0.85 |
| No. 6 | w/o FDFM | 24.32/0.86 |
Table 4. Quantitative results of the loss-function ablation study, reported as average PSNR/SSIM on the UIEB dataset.

| Loss | PSNR/SSIM (UIEB-T90) |
|---|---|
| L1 Loss | 22.63/0.81 |
| SSIM Loss | 22.82/0.80 |
| Perceptual Loss | 23.15/0.83 |
| L1 + SSIM | 23.48/0.86 |
| L1 + Perceptual | 24.61/0.79 |
| SSIM + Perceptual | 24.55/0.83 |
| L1 + SSIM + Perceptual | 24.42/0.84 |
| 0.5·L1 + 0.3·SSIM + Perceptual | 24.51/0.87 |
| 0.3·L1 + 0.1·SSIM + Perceptual | 24.58/0.89 |
| 0.2·L1 + 0.2·SSIM + Perceptual | 24.75/0.89 |
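The weighted combinations above can be written out directly. Below is a minimal NumPy sketch of the best-performing 0.2·L1 + 0.2·SSIM + perceptual setting, assuming the perceptual term [28] is an L2 distance between feature maps; the feature extractor is passed in as `feat_fn` so the sketch stays self-contained (the usual choice is a pretrained VGG), and the SSIM term uses a simplified global formula rather than the windowed average.

```python
import numpy as np

def l1_loss(pred, target):
    # Mean absolute error between enhanced and reference images.
    return np.mean(np.abs(pred - target))

def ssim_loss(pred, target, max_val=1.0):
    # 1 - global SSIM (simplified; training typically averages
    # Gaussian-windowed SSIM values instead).
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = pred.mean(), target.mean()
    vx, vy = pred.var(), target.var()
    cov = ((pred - mx) * (target - my)).mean()
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
    return 1.0 - ssim

def perceptual_loss(pred_feats, target_feats):
    # L2 distance between feature maps (VGG features in [28]).
    return np.mean((pred_feats - target_feats) ** 2)

def total_loss(pred, target, feat_fn, w_l1=0.2, w_ssim=0.2):
    # Weighted sum matching the best row of Table 4:
    # 0.2*L1 + 0.2*SSIM + Perceptual.
    return (w_l1 * l1_loss(pred, target)
            + w_ssim * ssim_loss(pred, target)
            + perceptual_loss(feat_fn(pred), feat_fn(target)))
```

All three terms vanish when prediction and reference coincide, so the total loss is zero at the optimum regardless of the chosen weights.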
Table 5. Quantitative comparison of the detection task. The best results are indicated in red; the second-best results are indicated in blue.

| Method | Precision (%) | mAP@50–95 (%) |
|---|---|---|
| DCP | 79.3 | 44.2 |
| UWNet | 82.9 | 43.5 |
| FUnIEGAN | 76.7 | 41.6 |
| LiteEnhanceNet | 83.5 | 44.3 |
| DeepWaveNet | 85.4 | 45.1 |
| UCA-Net | 88.3 | 45.5 |
Table 6. Comparison of the complexity of different methods.

| Method | #Params (M) | #MACs (G) |
|---|---|---|
| UWNet | 0.22 | 21.71 |
| FUnIEGAN | 7.02 | 10.76 |
| LiteEnhanceNet | 0.69 | 0.013 |
| LA-Net | 5.15 | 356.03 |
| DeepWaveNet | 0.27 | 18.18 |
| U-shape Transformer | 31.59 | 310.21 |
| UCA-Net | 1.44 | 19.26 |
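Parameter and MAC budgets like those above are typically accumulated layer by layer [42]. A minimal sketch for a single stride-1 convolution follows; the function name and interface are illustrative, and it assumes "same" padding so the output spatial size equals the input.

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, bias=True):
    """Parameter and multiply-accumulate counts for one k x k conv layer."""
    # Each output channel has c_in * k * k weights, plus one bias term.
    params = c_out * c_in * k * k + (c_out if bias else 0)
    # One MAC per weight per output pixel.
    macs = c_out * c_in * k * k * h_out * w_out
    return params, macs

# Example: a 3x3 conv mapping an RGB input to 64 channels at 256x256.
p, m = conv2d_cost(3, 64, 3, 256, 256)
# p = 1792 parameters; m = 113,246,208 MACs (~0.113 GMACs)
```

Summing these counts over every layer of a network reproduces the #Params and #MACs columns reported by profiling tools.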
Yu, C.; Zhou, J.; Wang, L.; Liu, G.; Ding, Z. UCA-Net: A Transformer-Based U-Shaped Underwater Enhancement Network with a Compound Attention Mechanism. Electronics 2026, 15, 318. https://doi.org/10.3390/electronics15020318