1. Introduction
Underwater image enhancement (UIE) is a key technical challenge in computer vision. Its core objective is to algorithmically alleviate the degradation caused by the absorption, scattering, and backscattering of light as it propagates through water, thereby accurately restoring the visual information of underwater scenes. In recent years, with the rapid development of remotely operated vehicles (ROVs) [1], autonomous underwater vehicles (AUVs) [2], and marine exploration systems, high-quality underwater images have become a prerequisite for applications such as marine resource monitoring, underwater target detection, and disaster response. For example, in marine ecological research, clear underwater images are crucial for assessing the health of coral reefs and analyzing fish behavior; underwater archaeology relies on high-resolution images to identify details of sunken ships and cultural relics; and in military reconnaissance, improving the ability to detect targets in turbid waters (such as submarines or mines) is particularly important. However, due to the unique optical characteristics of the underwater environment, images usually exhibit insufficient contrast, color distortion, and loss of detail [3], which pose significant difficulties for traditional image enhancement techniques. There is therefore an urgent need for innovative methods to address this challenge [4,5].
Underwater imaging technology has attracted sustained academic and industrial attention. Traditional methodologies fall into two distinct approaches: physics-based degradation modeling and non-physics-based image enhancement techniques [6]. The former employs mathematical formulations to simulate light propagation phenomena, including wavelength-dependent absorption, multipath scattering, and refractive distortions, and is effective in controlled scenarios with homogeneous media. Alternatively, the degradation of underwater images can be corrected through classical digital image processing methods such as fusion [7]. However, the performance of physics-based models deteriorates markedly in natural marine environments due to dynamic variations and heterogeneous optical properties. In contrast, non-physics-based enhancement methods bypass physical modeling by directly manipulating image attributes through contrast stretching, color constancy algorithms, and dehazing operations. While these techniques can improve visual perception under specific conditions, they intrinsically cannot distinguish between inherent object features and medium-induced distortions because they incorporate insufficient hydro-optical priors. These observations reveal the intrinsic constraints of underwater image enhancement: physics-based approaches struggle to parameterize environmental variability, while enhancement-only techniques lack principled physical constraints.
The latest advances in deep learning have driven significant progress in underwater image restoration and other downstream tasks [8,9,10], effectively addressing these challenges [11]. Researchers have developed innovative neural architectures that learn the mapping between degraded and clear images through large-scale data training [10]. Representative models such as UWCNN [12] adopt gated fusion networks with multi-scale feature extraction, in which local detail preservation through spatial decomposition works in synergy with global scene understanding achieved through contextual information fusion. These architectures have proven particularly effective for color restoration, contrast enhancement, and noise reduction. Furthermore, some researchers have applied reinforcement learning to underwater image enhancement and constructed vision-perception-driven frameworks [13,14], bringing new breakthroughs to underwater image enhancement research. Methods based on diffusion models [15,16] have also shown promising application prospects in this field. In addition to diffusion models operating in the image domain, diffusion models based on the latent space have effectively improved underwater image reconstruction: for example, a frequency-domain latent diffusion model uses a lightweight parameter estimation network and extracts high-frequency and low-frequency prior information [17], while other methods enhance underwater images with global feature priors to improve the stability of the model during inference [18]. These methods effectively alleviate the mode collapse and training instability common to Generative Adversarial Network (GAN) approaches, and they demonstrate clear advantages in complex scenarios such as poor lighting and dense suspended particles.
Although these technologies have made considerable progress, the inherent variability of the underwater environment means that effective neural network algorithms usually need to account for both global and local information in the image. Transformer-based methods perform well in establishing a precise correspondence between degraded and clear images, mainly because the self-attention mechanism excels at capturing long-range dependencies. However, self-attention has quadratic computational complexity, which limits the scalability of Transformer-based methods, especially when processing high-resolution underwater images.
To address this challenge, we propose UWMambaNet, a novel architecture that differs distinctly from existing Mamba-based approaches by integrating two key innovations. First, it introduces a W-shaped Mamba module, an efficient state-space sequence model (SSM) with linear time complexity, in place of Transformers, enabling scalable modeling of global dependencies. More importantly, whereas many existing Mamba models adopt single-pathway designs, our core contribution is a novel dual-branch structure. This architecture intentionally decouples the optimization objectives: one branch focuses on enhancing fine-grained structural details, while the other specializes in preserving perceptual color fidelity. This separation disentangles, and then recombines, two complementary aspects of image quality that are often conflated. The contributions of this article can be summarized as follows:
This paper proposes UWMambaNet, a novel dual-branch underwater image enhancement framework: the framework combines a Color Contrast Enhancement Branch and a Detail Enhancement Branch, allowing simultaneous improvement of color fidelity and detail preservation. The integration of these branches through a dedicated fusion strategy ensures that the network produces high-quality enhanced images with balanced color and detail.
A regionally adaptive, Mamba-based color enhancement module is proposed: within the Color Contrast Enhancement Branch, the network leverages both the RGB and Lab color spaces to enhance color fidelity and contrast. The use of a Mamba block for feature fusion captures long-range dependencies and spatial relationships, enabling more effective color enhancement. The module's ability to process and integrate features from multiple color spaces is a notable advance for underwater image processing.
This paper proposes a Mamba-based multi-scale detail enhancement branch: by integrating the Mamba block, the branch performs sequence modeling of spatial features and strengthens the network's ability to retain and sharpen fine details throughout the image. The design of this module ensures that the enhanced image highlights important details while maintaining a natural appearance.
3. Method
From the perspective of causal inference, image degradation can be explained as arising from two potentially independent mechanisms: structural noise (such as blurring or resolution loss) and perceptual distortion (such as color shift or contrast loss). By decomposing the enhancement task into two independent branches, these factors can be modeled conditionally and given their respective inductive biases, which facilitates decoupling and improves generalization. This separation enables each branch to develop specialized reasoning and recovery mechanisms, enhancing the interpretability and robustness of the model. The above analysis indicates that decomposing the causal mechanisms of image degradation allows enhancement models to be designed more effectively. The framework proposed in this paper is shown in Figure 1. It is a simple two-branch structure that processes the color and the details of the image separately. During processing, a combination of CNN and Mamba is used to capture global and local features, helping the network to further restore the degraded image. Based on this theoretical framework, the following subsections describe the network structure in detail and show how it realizes independent modeling and optimization of the two degradation mechanisms.
3.1. Mamba Block
Given a degraded underwater image $I \in \mathbb{R}^{H \times W \times 3}$, we formulate the enhancement task as learning a non-linear mapping $f_{\theta}: I \rightarrow \hat{I}$ using state-space modeling. The pixel sequence is obtained via raster scanning:
$$x = \big(x_1, x_2, \ldots, x_L\big) = \mathrm{Flatten}(I), \quad L = H \times W.$$
The degradation process is modeled as a linear time-invariant (LTI) system:
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t),$$
where $h(t) \in \mathbb{R}^{N}$ is the latent state, $A \in \mathbb{R}^{N \times N}$ governs the state transitions, and $B$, $C$, $D$ are projection matrices.
Because an image is inherently a discrete signal, the continuous-time system must be discretized, and zero-order hold (ZOH) is a common way to do so. ZOH assumes that the input signal remains constant within each sampling interval determined by the step size $\Delta$: the continuous signal is sampled at fixed intervals of length $\Delta$, and within each interval the signal value is held constant, producing a stepped approximation of the original continuous signal. This approximation is particularly useful when processing underwater images because it simplifies the signal processing while retaining the key information. The discretized form of the system is:
$$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k + D\,x_k,$$
where the discretized parameters are computed via:
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B.$$
To handle color distortion, we employ an independent SSM for each RGB channel:
$$y^{(c)} = \mathrm{SSM}_{c}\big(x^{(c)}\big), \quad c \in \{R, G, B\}.$$
This architecture enables globally coherent image restoration by leveraging its state-space model (SSM) to hierarchically integrate multi-scale visual semantics. At its core, a discretized recursive mechanism dynamically adjusts temporal decay rates through input-dependent parameterization, allowing selective retention of long-range dependencies as image patches are processed sequentially. This adaptive propagation of hidden states inherently captures global structural patterns while suppressing irrelevant local artifacts. In addition, the framework synergizes these globally optimized representations with high-frequency local features via residual connections, achieving structural consistency in degraded regions while preserving photorealistic texture details. Such a unified design ensures that the reconstructed pixels adhere to both the holistic scene context and localized visual continuity, balancing semantic fidelity with perceptual naturalness across diverse enhancement scenarios.
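To make the above concrete, the sketch below implements a minimal diagonal selective SSM in PyTorch: a ZOH-style discretization with an input-dependent step size, followed by a sequential scan over the raster-scanned pixel sequence. The class and variable names (e.g., TinySSM, state_size) are ours for illustration and are not the paper's; production Mamba implementations use fused selective-scan kernels rather than an explicit Python loop.

```python
# Minimal sketch of a discretized, input-dependent (selective) state-space recurrence.
import torch
import torch.nn as nn

class TinySSM(nn.Module):
    def __init__(self, dim, state_size=16):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(dim, state_size))   # A = -exp(A_log) < 0 (diagonal)
        self.D = nn.Parameter(torch.ones(dim))                    # skip/residual projection
        self.to_delta = nn.Linear(dim, dim)                       # input-dependent step size Δ
        self.to_BC = nn.Linear(dim, 2 * state_size)               # input-dependent B and C

    def forward(self, x):                                         # x: (batch, length, dim)
        b, L, d = x.shape
        A = -torch.exp(self.A_log)                                # (d, n)
        delta = nn.functional.softplus(self.to_delta(x))          # (b, L, d)
        B, C = self.to_BC(x).chunk(2, dim=-1)                     # (b, L, n) each
        A_bar = torch.exp(delta.unsqueeze(-1) * A)                # ZOH: exp(ΔA), (b, L, d, n)
        B_bar = delta.unsqueeze(-1) * B.unsqueeze(2)              # ≈ ΔB (simplified ZOH input term)
        h = x.new_zeros(b, d, A.shape[-1])                        # latent state h_k
        ys = []
        for k in range(L):                                        # sequential scan over the pixel sequence
            h = A_bar[:, k] * h + B_bar[:, k] * x[:, k].unsqueeze(-1)   # h_k = Ā h_{k-1} + B̄ x_k
            ys.append((h * C[:, k].unsqueeze(1)).sum(-1))               # y_k = C h_k
        return torch.stack(ys, dim=1) + x * self.D                      # plus skip term D x_k

# Example: flatten an RGB image into a raster-scan sequence and run one independent SSM per channel dim.
img = torch.rand(1, 3, 32, 32)
seq = img.flatten(2).transpose(1, 2)                              # (1, 1024, 3)
out = TinySSM(dim=3)(seq)
```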
3.2. Detail Enhancement Branch
The detail enhancement branch is dedicated to reconstructing and refining structural information that has been lost or degraded, such as texture, edges, and high-frequency detail. To achieve this, the branch combines local multi-scale convolutional encoding with long-range sequence modeling. Initially, the input image is processed through convolutional layers to extract base features:
$$F_{\mathrm{base}} = \mathrm{Conv}(I).$$
In parallel, to preserve locality, the branch employs two paths with distinct receptive fields. The first path uses 3 × 3 convolutions, while the second uses 5 × 5 convolutions, allowing the network to capture both fine and contextual details:
$$F_{3} = \mathrm{Conv}_{3 \times 3}(F_{\mathrm{base}}), \qquad F_{5} = \mathrm{Conv}_{5 \times 5}(F_{\mathrm{base}}).$$
To expand the receptive field and incorporate global context, the multi-scale features are concatenated, reshaped into a sequence, and processed by a Mamba block. The Mamba mechanism acts as a causal operator with dynamic memory, enabling it to model the long-range spatial dependencies that are crucial for detail enhancement:
$$F_{\mathrm{seq}} = \mathrm{Mamba}\big(\mathrm{Reshape}([F_{3}, F_{5}])\big).$$
The enhanced features are then fused through a 1 × 1 convolution, followed by a 3 × 3 convolution, to produce the final detail-enhanced output.
The 3 × 3 and 5 × 5 paths extract the mid- and low-frequency components, respectively, and the three features are fused:
$$F_{\mathrm{detail}} = \Phi\big(F_{\mathrm{base}}, F_{3}, F_{5}\big),$$
where $\Phi$ denotes a fusion function involving linear projections and non-linearities.
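As a rough illustration of the flow just described (base convolution, parallel 3 × 3 / 5 × 5 paths, Mamba-based sequence modeling, and 1 × 1 + 3 × 3 fusion), the sketch below wires these steps together in PyTorch. It reuses the TinySSM class from the previous sketch as a stand-in for the Mamba block; the channel widths and layer counts are illustrative assumptions, not the paper's exact configuration.

```python
# Simplified sketch of the detail-enhancement branch (assumes TinySSM from the previous sketch).
import torch
import torch.nn as nn

class DetailBranch(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.base = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.path3 = nn.Conv2d(ch, ch, 3, padding=1)          # fine details
        self.path5 = nn.Conv2d(ch, ch, 5, padding=2)          # contextual details
        self.ssm = TinySSM(dim=2 * ch)                        # long-range dependencies over the sequence
        self.fuse = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, x):
        f0 = self.base(x)
        f = torch.cat([self.path3(f0), self.path5(f0)], dim=1)     # multi-scale features
        b, c, h, w = f.shape
        seq = f.flatten(2).transpose(1, 2)                          # raster-scan sequence
        f = self.ssm(seq).transpose(1, 2).reshape(b, c, h, w)       # back to a feature map
        return self.fuse(f)                                         # detail-enhanced output

detail_out = DetailBranch()(torch.rand(1, 3, 32, 32))               # (1, 3, 32, 32)
```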
3.3. Color and Contrast Restoration Branch
The Color and Contrast Enhancement Branch is a critical component of our dual-branch network, designed to improve the color fidelity and contrast of input images. To align with human perceptual characteristics, this branch utilizes both the RGB and Lab color spaces. The Lab space separates luminance (L) from the chromatic components (a and b), which allows the model to better isolate and manipulate perceptual color attributes. Using separate encoders, features are extracted from both spaces:
$$F_{\mathrm{rgb}} = E_{\mathrm{rgb}}(I_{\mathrm{rgb}}), \qquad F_{\mathrm{lab}} = E_{\mathrm{lab}}(I_{\mathrm{lab}}).$$
These feature maps are concatenated and compressed,
$$F_{c} = \mathrm{Compress}\big([F_{\mathrm{rgb}}, F_{\mathrm{lab}}]\big),$$
and passed through a Mamba block to model contextual contrast and global color dependencies:
$$\tilde{F}_{c} = \mathrm{Mamba}(F_{c}).$$
The output is decoded into a restored color image:
$$\hat{I}_{\mathrm{color}} = D_{\mathrm{color}}\big(\tilde{F}_{c}\big).$$
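The sketch below mirrors this branch: parallel encoders over the RGB and Lab inputs, concatenation with 1 × 1 compression, Mamba-style sequence modeling (again using TinySSM as a stand-in), and a decoder back to an RGB image. The RGB-to-Lab conversion here assumes the kornia library and a crude channel rescaling; the paper does not specify how the conversion is implemented.

```python
# Simplified sketch of the color/contrast branch (assumes TinySSM from the earlier sketch and kornia).
import torch
import torch.nn as nn
import kornia

class ColorBranch(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.enc_rgb = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.enc_lab = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.compress = nn.Conv2d(2 * ch, ch, kernel_size=1)        # concat + 1×1 compression
        self.ssm = TinySSM(dim=ch)                                  # global color/contrast context
        self.dec = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x):                                           # x: RGB image in [0, 1]
        lab = kornia.color.rgb_to_lab(x)                            # L in [0, 100], a/b roughly [-128, 127]
        f = torch.cat([self.enc_rgb(x), self.enc_lab(lab / 100.0)], dim=1)   # crude Lab rescaling
        f = self.compress(f)
        b, c, h, w = f.shape
        seq = f.flatten(2).transpose(1, 2)
        f = self.ssm(seq).transpose(1, 2).reshape(b, c, h, w)
        return self.dec(f)                                          # restored color image

color_out = ColorBranch()(torch.rand(1, 3, 32, 32))
```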
3.4. Output Fusion and Residual Reconstruction
We integrate the outputs of the two branches to create a unified feature representation:
$$F_{\mathrm{fuse}} = \mathrm{Fuse}\big(\hat{I}_{\mathrm{detail}}, \hat{I}_{\mathrm{color}}\big).$$
This fused output contains both perceptual and structural corrections. The final enhanced image is then obtained via a residual connection, with a clamp function ensuring that all pixel values remain within a valid range:
$$\hat{I} = \mathrm{Clamp}\big(I + F_{\mathrm{fuse}},\ 0,\ 1\big).$$
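A few lines suffice to express this step. Since the text does not specify the fusion operator, a 1 × 1 convolution over the concatenated branch outputs is assumed here purely for illustration.

```python
# Sketch of output fusion, residual reconstruction, and range clamping (fusion operator assumed).
import torch
import torch.nn as nn

fuse = nn.Conv2d(6, 3, kernel_size=1)                         # hypothetical fusion of two 3-channel outputs

def reconstruct(x, detail_out, color_out):
    residual = fuse(torch.cat([detail_out, color_out], dim=1))
    return torch.clamp(x + residual, 0.0, 1.0)                # residual connection + valid-range clamp

enhanced = reconstruct(torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32))
```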
3.5. Loss Function
To supervise training, we use a composite loss that encourages both pixel-wise fidelity and perceptual realism. The total loss is defined as:
$$\mathcal{L}_{\mathrm{total}} = \lambda_{1}\,\mathcal{L}_{\mathrm{MSE}} + \lambda_{2}\,\mathcal{L}_{\mathrm{VGG}} + \lambda_{3}\,\mathcal{L}_{\mathrm{pix}}.$$
The total loss consists of three parts, each corresponding to a different optimization goal. The first term, $\mathcal{L}_{\mathrm{MSE}}$, measures the pixel-level difference between the generated image and the target image through the Mean Squared Error (MSE), ensuring that the generated image is as close as possible to the real image in its details and thereby retaining the structural information and clarity of the original image; the parameter $\lambda_{1}$ controls the weight of this term. The second term, $\mathcal{L}_{\mathrm{VGG}}$, is based on high-level features extracted by a pre-trained VGG network and captures the semantic information of the image, ensuring that the generated image is not only close to the target at the pixel level but also consistent in high-level visual features (such as texture and color distribution); this is particularly important when color distortion and contrast reduction are caused by the underwater environment, and $\lambda_{2}$ determines the importance of semantic feature matching. The third term, the pixel-level semantic loss $\mathcal{L}_{\mathrm{pix}}$, further strengthens the optimization at the pixel level; unlike $\mathcal{L}_{\mathrm{MSE}}$, it may incorporate additional constraints (such as edge information or region-specific weighting) to better adapt to the characteristics of underwater images, and $\lambda_{3}$ controls the influence of this term on the final result.
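A hedged sketch of this composite loss is given below. The VGG-16 backbone, the truncation depth, the L1 stand-in for the pixel-level term, and the lambda values are all assumptions for illustration; the paper does not state them.

```python
# Sketch of the composite loss: MSE + VGG-feature (perceptual) term + an extra pixel-level term.
import torch
import torch.nn as nn
import torchvision

# Frozen VGG-16 feature extractor (through relu3_3); ImageNet input normalization is omitted
# here for brevity but should be applied before the feature extractor in practice.
vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def total_loss(pred, target, lam1=1.0, lam2=0.1, lam3=0.5):       # placeholder weights
    l_mse = nn.functional.mse_loss(pred, target)                  # pixel-level fidelity
    l_vgg = nn.functional.l1_loss(vgg(pred), vgg(target))         # perceptual / semantic consistency
    l_pix = nn.functional.l1_loss(pred, target)                   # extra pixel-level term (assumed L1)
    return lam1 * l_mse + lam2 * l_vgg + lam3 * l_pix
```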
The UWMambaNet integrates two complementary mechanisms to address the dual nature of image degradation. On one hand, the detail enhancement branch emphasizes high-frequency recovery through convolution and global dependency modeling via Mamba, enabling the reconstruction of sharpness and edge continuity. On the other hand, the color and contrast restoration branch leverages color space disentanglement and contextual encoding to improve visual aesthetics and color consistency. Together, these branches cooperate through feature fusion and residual reconstruction to yield images that are both structurally accurate and perceptually pleasing.
4. Experimental Analysis and Discussion
In order to evaluate the performance of the proposed model fairly, experiments are conducted on two public underwater image enhancement datasets. The following subsections present the experimental settings, the datasets, the evaluation metrics, and the analysis of the experimental results.
4.1. Experimental Environment and Setup Details
UWMambaNet is implemented in PyTorch 1.13 on a system equipped with an NVIDIA V100 GPU with 32 GB of memory and an Intel Xeon W-2255 CPU. We use the Adam optimizer for training. The initial learning rate is set to 0.01 and decays by 30% every 30 epochs. Training uses a batch size of 4 and runs for a total of 100 epochs. The input images are uniformly scaled to a resolution of 512 × 512. During training, a model checkpoint is saved every 10 epochs.
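For reference, the schematic below reproduces this training setup (Adam, initial learning rate 0.01, 30% decay every 30 epochs, batch size 4, 100 epochs, 512 × 512 inputs, checkpoint every 10 epochs). The model and data are deliberate stand-ins (a single convolution and random tensors) so the loop runs as written; the real network, composite loss, and UIEB/EUVP loaders would be substituted in practice.

```python
# Schematic training loop matching the stated hyperparameters; model/data are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)          # stand-in for UWMambaNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.7)   # -30% every 30 epochs

pairs = TensorDataset(torch.rand(8, 3, 512, 512), torch.rand(8, 3, 512, 512))     # degraded, reference
loader = DataLoader(pairs, batch_size=4, shuffle=True)

for epoch in range(100):
    for degraded, reference in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(degraded), reference)   # composite loss in practice
        loss.backward()
        optimizer.step()
    scheduler.step()
    if (epoch + 1) % 10 == 0:
        torch.save(model.state_dict(), f"checkpoint_epoch_{epoch + 1}.pth")
```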
4.2. Experimental Datasets
In this experiment, we used the UIEB dataset [36] and the EUVP dataset [37] for training and testing. These two datasets cover various underwater environments, water quality conditions, and lighting conditions, which helps improve the generalization performance of the model. For testing purposes, we prepared two different datasets: one is a full-reference dataset, and the other is a no-reference dataset. In the training stage, we randomly selected 800 pairs of real-scene images from the UIEB dataset for training; this dataset contains a total of 890 reference images of real scenes. During the training process, all images were adjusted to a resolution of 640 × 640 pixels.
4.3. Evaluation Metrics
The complex optical characteristics of the underwater environment often lead to discrepancies between traditional general-purpose image quality metrics (such as PSNR and SSIM) and human subjective perception. To address this problem, researchers have developed evaluation metrics such as UIQM and UCIQE that are tailored to underwater conditions. These metrics establish a multi-dimensional assessment framework by integrating parameters that strongly influence underwater vision, such as chroma, sharpness, and contrast. Currently, no-reference assessment has become the main means of underwater image quality evaluation; commonly used no-reference metrics include UCIQE [38], UIQM [39], CCF [40], and FDUM [41]. These metrics compute various attributes of the image through specific algorithms, such as color, contrast, saturation, and fog density. UCIQE places particular emphasis on chroma, saturation, and contrast. In contrast, UIQM provides a comprehensive assessment of overall underwater image characteristics. CCF acts as a compound indicator that accounts for contrast, color, and fog density. FDUM gauges the restoration quality of underwater imagery by applying a weighted combination of color richness, contrast, and clarity. If an image trades increased brightness for color distortion, this can affect the scores given by UCIQE and FDUM and thus compromise the accuracy of the final evaluation.
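To illustrate how such a metric weights chroma, contrast, and saturation, the sketch below computes a simplified UCIQE following its commonly cited definition (weighted sum of the chroma standard deviation, the luminance contrast, and the mean saturation), with the coefficients published by Yang and Sowmya. Component definitions vary slightly between public implementations, so this is illustrative only; reported scores should come from the metric authors' reference code.

```python
# Illustrative (not reference) UCIQE computation.
import numpy as np
from skimage import color

def uciqe(rgb, c1=0.4680, c2=0.2745, c3=0.2576):
    """rgb: H×W×3 float array with values in [0, 1]."""
    lab = color.rgb2lab(rgb)
    L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
    chroma = np.sqrt(a ** 2 + b ** 2)
    sigma_c = chroma.std()                                          # standard deviation of chroma
    con_l = (np.percentile(L, 99) - np.percentile(L, 1)) / 100.0    # luminance contrast (L in [0, 100])
    mu_s = color.rgb2hsv(rgb)[..., 1].mean()                        # mean saturation (HSV S channel)
    return c1 * sigma_c + c2 * con_l + c3 * mu_s

score = uciqe(np.random.rand(64, 64, 3))
```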
4.4. Results Analysis
In this section, to assess the effectiveness of the UWMambaNet introduced in this paper, we carried out both quantitative and qualitative experiments. A variety of existing underwater image enhancement methods were chosen for comparison: Fusion [42], Shallow-UWnet [28], UColor [43], UWCNN [12], NU2Net [44], HCLR-Net [45], Semi-UIR [46], and PUIE-Net [47]. These methods cover widely recognized mainstream techniques for underwater image enhancement, along with algorithms based on advanced neural network architectures.
Our method utilizes a dual-branch approach and a network structure that combines global and local features, yielding high UIQM scores for vivid colors and high UCIQE scores for complex details; in some scenarios the results even exceed the restoration quality of the reference images. For evaluation, we selected three main problematic underwater capture scenarios: color cast scenes, hazy scenes, and low-light scenes, where the color cast scenes can be further divided into blue and green color casts. Taking the UIEB dataset [36] and the EUVP dataset [37] as the experimental objects, the results are shown in Table 1, Table 2 and Table 3.
In the UIEB dataset's visualization results (Figure 2), our method was only 0.004 lower than Semi-UIR in UIQM and only 0.003 lower in FDUM, while leading on the other indices. This supports the effectiveness of our dual-branch design. One branch focuses on restoring the true color and contrast of the image; this part can resolve underwater degradations such as color cast in complex scenes and, compared with general methods, better ensures the effectiveness of color restoration. Combining global and local features in this way allows the network to attend to detail restoration while preserving the complete color semantics. On indicators such as UCIQE and CCF, our method is significantly better than the other methods, and this is also confirmed in the visualization results. In the first row of the visualization results, our method better restores the illumination information when recovering the details of the coral reef and corrects the color cast at the same time, while methods such as Fusion suffer from incomplete color cast removal or unbalanced illumination. In the fog scene (second row), our method recovers most of the details and removes the influence of fog noise on the overall picture. In the low-light scene in the fourth row, our method restores the details of the dark regions without introducing other degradations such as overexposure. In addition, in the low-light scenes of the fourth and sixth rows, our method offers a larger visible range and richer dark details than the other methods (for example, the dark part of the reef behind the fish in the fourth row and the seabed around the shark in the sixth row), while avoiding overexposure, which shows that the proposed algorithm effectively improves the dynamic range of images. In the turbid environment of the fifth row, our method removes the turbidity and blur caused by scattering more thoroughly than the other algorithms; the result even exceeds that of the reference picture, which further demonstrates the strong generalization ability of our algorithm across a variety of degradation scenarios. Furthermore, we conducted tests against recently popular diffusion-model and Transformer-based enhancement models; the results are shown in detail in Figure 3.
It can be seen that our method performs excellently in the two mainstream scenarios of turbidity and color cast, without incorrect color correction or the introduction of additional noise. The images restored by our method have colors closer to the normal visual result and richer dark details. Moreover, the method proposed in this study demonstrates significant performance advantages on multiple metrics. It achieved the highest score (0.640) in UCIQE, fully reflecting its outstanding performance in image enhancement. Meanwhile, our method also obtained the highest score in the key color fidelity metric CCF, indicating a marked improvement in its color restoration ability. In addition, in the UIQM and FDUM metrics, our method lags behind the first place by only small margins (0.024 and 0.061), further verifying that this method is competitive with current state-of-the-art architectures. To verify the effectiveness of our method, we conducted a statistical significance analysis: a paired t-test against the Semi-UIR method (whose metric results are also superior to those of the other compared methods) yields a p-value < 0.05, indicating that our method shows a significant advantage in UCIQE and supporting the validity of the improvement in the evaluation metrics.
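For clarity, the test described above amounts to comparing per-image metric scores of the two methods on the same test images, as in the minimal sketch below; the score arrays are whatever per-image UCIQE values each method produces on the shared test set.

```python
# Minimal sketch of the paired significance test: p < 0.05 is read as a significant difference.
from scipy import stats

def paired_significance(scores_ours, scores_baseline):
    """Return (t statistic, p-value) for paired per-image metric scores of two methods."""
    result = stats.ttest_rel(scores_ours, scores_baseline)
    return result.statistic, result.pvalue
```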
In the EUVP dataset's visualization results (Figure 4), our method also achieves outstanding results. In terms of quantitative indicators, it outperforms the other methods on UCIQE, CCF, and FDUM, and is only 0.024 lower than Semi-UIR on UIQM. Further analysis of the visualization results shows that our method restores the colors in the image more accurately and significantly improves the clarity of image details. In contrast, some existing methods fail to completely eliminate the influence of the underwater environment during processing; the results in the second row indicate that these methods fall short in removing underwater blur or color shift. Our method overcomes this problem, not only effectively restoring the true colors of objects but also highlighting the vividness of the dominant colors, bringing the result closer to the visual appearance in the natural state. Whether for the color adjustment of objects in complex scenes or for distinguishing and processing background and foreground details, our method shows higher accuracy and reliability. For example, in the image in the third row, our method successfully restores the vivid color of the red coral while retaining its texture details. These results indicate that our method is not only competitive in quantitative indicators but also provides higher-quality enhancement in practical applications.
In addition, to further explore the real-time processing performance and efficiency of the model, we calculated the parameter counts and GFLOPS of the proposed model and of the other models; the results are shown in Table 4. On the key indicators of model efficiency, the proposed method is clearly superior to mainstream deep learning methods. Regarding the number of parameters (Param), although our method (1.01 M) is slightly higher than the smallest model, Shallow-UWnet (0.22 M), it is much lower than NU2Net (3.15 M, −68%), UColor (157.24 M), and HCLR-Net (4.87 M, −79%). In terms of computational complexity (GFLOPS), our method (132.13 G) requires only 2.3% of the computation of HCLR-Net (5651.99 G), and it is 68.8% and 70.2% lower than PUIE-Net (423.04 G) and UColor (443.84 G), respectively. Although Shallow-UWnet has the lowest computational cost (43.26 G), its performance is lower than that of our method. The proposed method therefore achieves a balance between accuracy and efficiency at a relatively low computational cost and offers strong deployment potential while maintaining high performance; for edge devices or low-compute platforms [48], the algorithm can be readily applied.
4.5. Ablation Experiments
In this section, we conduct ablation experiments to evaluate the contribution of each module of the proposed UWMambaNet framework. The framework consists of two core components, the detail enhancement module and the color correction module, both of which play significant roles in the overall performance. By removing each module in turn, we analyze its impact on the overall capability and effectiveness of the framework; the visualization results are shown in Figure 5. The experimental results show that the method proposed in this paper significantly improves the visual quality of underwater images. Specifically, under the complete framework, the processed underwater images exhibit excellent performance in color restoration and detail enhancement, which validates the effectiveness of the framework in addressing the low contrast and color distortion of underwater images. However, when the color correction module is removed, the color restoration ability of the images decreases significantly. This indicates that the color correction module plays a key role in the framework: its main function is to correct the color deviation caused by the absorption and scattering effects of water. For example, after removing the color correction module, the images in the second row show a clear color degradation problem, manifested as an obvious blue shift. The presence of the color correction module helps rebalance the color distribution in the image, thereby improving the consistency between the processed image and the real scene.
On the other hand, when the detail branch is removed, both the image’s detail performance and color quality are compromised. This occurs because the detail branch not only enhances texture and edge information but also plays a role in fine-tuning colors. In essence, there exists a synergistic relationship between the detail branch and the color branch; their collaboration is essential to achieving optimal image enhancement. For instance, in the first row of haze images, omitting the detail branch during processing results in incomplete removal of haze-like degradation and introduces a subtle red color cast, which detracts from the overall quality. This demonstrates the interdependence of the color and detail branches as described in this paper.
The above visual observations are reflected in the quantitative results in Table 5. When either the color or the detail branch is removed, all indicators are considerably lower than with the method proposed in this paper, which further confirms its effectiveness. In summary, the framework proposed in this paper successfully addresses multiple challenges in underwater image visualization through a reasonable division of labor and collaboration among its modules. The results of the ablation experiments demonstrate the importance of each module and its contribution to the overall performance, and they also provide a valuable reference for future research.