SARM: Scene-Aware Retinex Mamba for Underwater Image Enhancement

Fu, Zhanbo; Yang, Shuang; Sun, Aiguo; Xiong, Rongjun; Chen, Nengcheng

doi:10.3390/rs18101652

by Zhanbo Fu ¹, Shuang Yang ², Aiguo Sun ²,³, Rongjun Xiong ²,⁴ and Nengcheng Chen ¹,²,*

¹ School of Future Technology, China University of Geosciences, Wuhan 430074, China

² National Engineering Research Center for Geographic Information System, School of Geography and Information Engineering, China University of Geosciences, Wuhan 430074, China

³ Changjiang Waterway Institute of Planning and Design, Wuhan 430040, China

⁴ Changhang Testing Technology (Wuhan) Company, Wuhan 430040, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1652; https://doi.org/10.3390/rs18101652

Submission received: 13 March 2026 / Revised: 14 May 2026 / Accepted: 18 May 2026 / Published: 20 May 2026

(This article belongs to the Section AI Remote Sensing)

Highlights

What are the main findings?

SARM achieves the deep integration of Retinex physical priors and state space models (SSMs), providing a new perspective for self-supervised underwater image enhancement.
Scene-aware adaptation and global linear complexity modeling techniques yield significant image quality improvements across multiple underwater visual benchmarks.

What are the implications of the main findings?

The prior-guided mechanism provides an effective paradigm for tackling paired data scarcity and highly heterogeneous degradations in real waters, highlighting the importance of physical laws in deep feature modeling.
The framework proposes an efficient and highly generalizable enhancement strategy that can serve as a visual preprocessing front-end for marine edge devices, enabling stable performance gains in downstream underwater tasks such as feature matching and edge extraction.

Abstract

Underwater image enhancement is essential for marine visual perception tasks. However, the highly heterogeneous optical degradations in real-world waters, the scarcity of paired training data, and the difficulty existing models face in balancing long-range dependency modeling against computational overhead pose significant challenges. To address these issues, this paper proposes a prior-guided, self-supervised underwater image enhancement framework called Scene-Aware Retinex Mamba (SARM). The framework couples Retinex theoretical priors with state space models (SSMs) and operates without paired supervision by employing a prior-guided pseudo-labeling strategy to guide network optimization. Its core module integrates multi-color space features and leverages a 2D selective scan mechanism to achieve global context modeling with linear complexity $O(HW)$, effectively removing complex color casts and suppressing non-uniform scattering noise. To further overcome the generalization bottlenecks in cross-domain underwater testing, this paper introduces a Scene-Aware Adapter (SAA), which facilitates dynamic loss scheduling and adaptive feature gating by quantifying scene-specific degradation characteristics. Comprehensive evaluations on multiple benchmark datasets, including UIEB, EUVP, and UCCS, demonstrate that SARM achieves state-of-the-art subjective and objective enhancement quality (e.g., yielding a URanker score of 2.491 and a CCF score of 35.76), while maintaining an ultra-fast inference speed of 136.52 FPS on the UIEB dataset. Furthermore, extended experiments reveal that SARM can significantly boost the performance of downstream vision tasks, validating its potential as a robust preprocessing module for various practical marine vision applications.

Keywords:

underwater image enhancement; state space model; Retinex theory; self-supervised learning; scene-aware adaptation; marine visual perception

1. Introduction

As human exploration of the ocean advances, practical applications such as underwater robot navigation, subsea infrastructure inspection, and marine ecological monitoring place stringent demands on high-quality underwater visual perception. However, owing to the complex selective absorption and scattering effects of the water medium, raw underwater images typically suffer from severe low contrast, blurred details, and blue-green color casts. This highly heterogeneous optical degradation severely restricts the accuracy and robustness of downstream high-level vision tasks, such as object detection and image segmentation. Consequently, as a crucial preprocessing frontend for restoring clear geometric structures and natural colors, underwater image enhancement (UIE) technology has garnered extensive attention in the fields of computer vision and marine engineering.

Early UIE research primarily relies on two traditional paradigms: physical inversion and non-physical enhancement. Physical models attempt to inversely deduce the light propagation process; for example, Hsieh and Chang [1] combined depth and backscatter estimation to achieve attenuation restoration, while other methods explored adaptive color adjustment and multi-scale wavelet fusion strategies [2,3]. Such methods possess rigorous optical interpretability, but their generalization ability is often limited in dynamic real-world marine environments due to the failure of prior assumptions. To avoid complex parameter estimation, non-physical models perform corrections directly in the pixel domain. For instance, Liu et al. [4] constructed a Retinex model with dual illumination and structural constraints to suppress noise amplification, and other researchers have successively proposed variational frameworks based on illumination compensation or distribution remapping [5,6]. Nevertheless, when dealing with extreme color casts and strong scattering scenarios, the parameter robustness of traditional non-physical methods remains limited.

With the rise of the data-driven paradigm, Convolutional Neural Networks (CNNs) have significantly improved the reconstruction stability of local high-frequency features; for instance, Kong et al. [7] designed a lightweight frequency-domain network, MUFFNet. However, constrained by fixed local receptive fields, traditional CNNs struggle to capture the long-range contextual information required to correct global color casts. To compensate for the deficiencies of local modeling, Transformer architectures with global perception capabilities have been introduced into this field. Wu et al. [8] constructed a Multi-Scale Conv-Transformer (MSCT) to integrate local convolutions with global self-attention mechanisms, and subsequent studies have further explored its illumination adaptability across multiple color spaces [9].

Furthermore, generative architectures have demonstrated outstanding performance in texture synthesis and severe color distortion correction. Generative Adversarial Networks (GANs) [10,11] have enhanced the detail representation of unknown low-light waters, while diffusion models have been widely applied owing to their powerful distribution fitting capabilities. For example, SeaDiff, designed by Bi et al. [12], deeply embeds physical degradation priors into the diffusion denoising process to improve generation quality, and other variants combining generative priors have also shown certain potential [13]. In recent years, large vision models have increasingly been employed to extract zero-shot semantic features for guiding image enhancement [14]. For instance, the CLIP-UIE framework proposed by Liu et al. [15] leverages the vision-language semantic priors of large models to effectively guide color reconstruction and detail restoration in complex underwater scenes.

Despite the significant progress achieved by existing methods in underwater image enhancement, three critical challenges remain in practical applications. First, acquiring pixel-level aligned real-world underwater paired data is prohibitively expensive, whereas relying on synthetic data often introduces domain shift, thereby limiting the generalization capability of models in complex, real-world aquatic environments. Second, traditional lightweight CNNs struggle to capture the global context required to correct severe regional color casts. Conversely, methods based on Transformers or diffusion models, while possessing global modeling capabilities, suffer from prohibitively high inference latency, making it difficult to meet the real-time requirements of underwater platforms. Finally, the majority of existing mainstream methods are purely data-driven and lack explicit modeling of underwater optical degradation processes (such as non-uniform scattering and selective absorption). Consequently, these models lack adaptive regulation capabilities when confronted with dynamic marine environments.

Recently, state space models (e.g., Mamba) [16,17], which feature both a global receptive field and linear computational complexity, have provided a novel perspective for resolving the conflict between computational efficiency and global perception. However, research on applying the Mamba architecture to UIE is still in its infancy. Despite recent preliminary attempts such as D2Mamba [18,19], the engineering robustness of this architecture in complex marine environments requires further improvement. On the one hand, most existing methods focus on the sequential modeling of generic features, often failing to effectively integrate the Mamba mechanism with underwater-specific physical degradation priors (e.g., wavelength-dependent attenuation and non-uniform scattering). On the other hand, when confronted with extreme low-light conditions or severe color casts in deep-water regions, purely sequential scanning mechanisms still risk losing local high-frequency details while capturing global context. Therefore, exploring how to enhance the adaptive restoration capability for dynamic and heterogeneous water bodies through physical prior guidance, while maintaining the advantage of linear complexity, represents a highly promising research direction.

To address the aforementioned limitations, this paper proposes a prior-guided self-supervised underwater image enhancement framework, known as Scene-Aware Retinex Mamba (SARM). Under the premise of strictly controlling computational overhead, this framework integrates Retinex physical theory with the state space model, achieving the unification of physical prior constraints, long-range dependency modeling, and dynamic scene adaptation within a single network. The main contributions of this paper are summarized as follows:

A prior-guided self-supervised underwater image enhancement framework, SARM, is proposed. By innovatively embedding the Retinex physical prior into the Mamba architecture, this framework effectively balances the trade-off between long-range dependency modeling and computational overhead without the need for real paired training data. Consequently, it offers an efficient enhancement paradigm for underwater platforms with limited computational resources.
An illumination decoupling mechanism based on multi-color space analysis and adaptive residual modulation is proposed. To address the issue where the direct division operation in traditional Retinex models is prone to causing numerical instability and amplifying high-frequency noise in dark waters, we introduce a residual modulation strategy in the feature domain. By leveraging the complementary features of the RGB, LAB, and HSV color spaces, this mechanism achieves a more stable decoupling of water attenuation from the underlying reflectance, thereby providing a robust illumination prior for the denoising network.
A Retinex-Mamba global denoising architecture with linear computational complexity is constructed. To tackle the prohibitively high inference latency of existing global models, our method injects photometric priors into the state space model. Utilizing the 2D Selective Scan (SS2D) mechanism and a dual-track parallel encoding strategy, the network achieves global context modeling while maintaining $O(HW)$ computational complexity. This effectively removes underwater color casts and suppresses non-uniform scattering noise.
An unpaired data-driven Scene-Aware Adapter (SAA) and a dual adaptive routing mechanism are designed. To alleviate the challenges of model generalization in highly heterogeneous waters, we utilize high-quality physical pseudo-labels to provide self-supervised constraints during the training phase. Concurrently, by quantitatively assessing the degradation characteristics of the samples (e.g., color cast, low illumination, and blur), we implement dynamic loss scheduling and feature gating, significantly enhancing the model’s scene adaptability in unknown dynamic marine environments.
An effective balance between high inference speed and high-quality visual perception is achieved. Extensive experiments on benchmark datasets including UIEB, EUVP, and UCCS demonstrate that, under the unpaired data setting, SARM achieves superior subjective and objective image quality (e.g., CCF 35.76, URanker 2.491). Furthermore, the model achieves an ultra-fast inference speed of 136.52 FPS and substantially improves the performance of downstream underwater vision tasks, such as feature matching, demonstrating significant potential for real-world deployment.

The remainder of this paper is organized as follows: Section 2 systematically reviews related technologies in the field of underwater image enhancement. Section 3 elaborates on the overall architecture of the proposed SARM framework. Section 4 presents the experiments and result analysis of our method. Finally, Section 5 concludes the paper and discusses future prospects.

2. Related Work

Research on Underwater Image Enhancement (UIE) can be broadly categorized into three major paradigms: physics-based methods, traditional image signal processing-based methods, and deep learning-based end-to-end methods. In recent years, with the widespread application of underwater robots and autonomous underwater vehicles (AUVs), acquiring clear, color-authentic, and detail-rich images in complex and dynamically changing marine environments has emerged as a prominent research topic at the intersection of computer vision and marine engineering. The following sections systematically review and evaluate the relevant advancements along these three main threads.

2.1. Traditional Underwater Image Enhancement Methods

Depending on whether they rely on underwater imaging models, traditional methods are divided into physical and non-physical models. The core difference lies in whether they invert the degradation process through optical laws.

2.1.1. Physical Model-Based Methods

Physics-based methods aim to reconstruct clear images by inversely deducing the propagation laws of light in water, and their primary challenge lies in the accurate estimation of optical parameters. The early Jaffe–McGlamery [20] model incorporated water absorption and forward/backward scattering into the imaging equation, providing a foundational framework for such inversion algorithms. Subsequently, Duntley [21] explicitly introduced wavelength-dependent attenuation coefficients into this model to adapt to different water bodies, such as blue-green or turbid waters; however, this inevitably increased the dimensionality of parameter estimation and computational overhead.

To simplify the implementation process, Drews et al. [22] improved the Dark Channel Prior (DCP) by constraining it to the blue and green channels, proposing the Underwater DCP (UDCP) method. This strategy achieves favorable dehazing effects when processing conventional turbid waters but is prone to generating artifacts such as halos in non-uniform scattering scenarios. Furthermore, the Sea-thru model proposed by Akkaynak and Treibitz [23] utilizes the depth map of the scene to accurately separate backscatter and calculates wavelength-dependent light attenuation, thereby achieving highly realistic color recovery. Nevertheless, such operations based on strict physical inversion heavily rely on additional depth priors and are computationally time-consuming, making them difficult to deploy on real-time platforms. From an objective standpoint, although physics-based methods possess clear theoretical interpretability, their robustness in complex and ever-changing practical marine environments is rather limited due to their excessive sensitivity to dynamic optical parameters.

2.1.2. Non-Physical Model-Based Methods

Non-physical model-based methods typically perform visual feature adjustments directly in the image pixel domain. For instance, the color balance and multi-scale fusion framework proposed by Ancuti et al. [24] utilizes white balance and contrast enhancement to extract derived inputs for weighted fusion. Its advantage lies in the algorithm’s ability to improve visual quality without requiring complex physical parameters; however, when processing regions with extremely low contrast, it is prone to amplifying background noise due to weight stretching. To alleviate this problem, Zhang et al. [25] proposed WWPF, which separates low-frequency colors and high-frequency details in the wavelet domain and performs constrained enhancement. This approach maintains the naturalness of the image while suppressing noise. However, the selection of wavelet bases may still introduce certain ringing artifacts in texturally complex regions.

In recent years, some research has begun to shift towards combined strategies. For example, Zhao et al. [26] proposed a scheme based on adaptive white balance and multiple restoration image fusion, which first performs color correction and then fuses multiple enhancement features, achieving good quantitative performance in various complex degradation scenarios. Following this line of thought, Li et al. [27] further proposed an underwater image enhancement framework that integrates adaptive color restoration and dehazing techniques. This method effectively solves the serious color distortion and scattering blur problems in deep-water scenarios by finely compensating for the attenuation differences of different color channels. However, because such techniques essentially decouple from the physical constraints of underwater imaging laws, they often tend to cause edge blurring when facing highly turbid or strong scattering waters. Overall, although non-physical methods offer higher computational efficiency and are easier to deploy on embedded devices, their lack of modeling the underwater optical degradation process typically makes it difficult to achieve an ideal balance among dehazing, edge preservation, and noise suppression when processing heavily degraded images.

2.2. Deep Learning-Based Underwater Image Enhancement Methods

Deep learning technologies, which learn degradation mappings through data-driven approaches, now constitute the primary framework in the field of underwater image enhancement. The development of this field generally follows two main lines: the evolution of feature modeling approaches represented by CNNs, Transformers, and GANs; and the recently emerged cutting-edge architectures, such as Diffusion Models and Mamba, which focus on physical interpretability and sequence modeling efficiency.

In early explorations, Convolutional Neural Networks (CNNs) significantly improved feature extraction efficiency and local scene adaptability for underwater images. For example, Ucolor [28], proposed by Li et al., adopted a multi-color space encoder-decoder architecture, initially establishing the performance advantage of end-to-end networks over traditional physical methods. During the same period, although FUnIE-GAN [29] by Islam et al. introduced adversarial training, its generator essentially still relied on convolution operations, achieving an enhancement rate of 30 FPS with 7.02 M parameters, establishing an early benchmark for lightweight research. However, the inherent local receptive field characteristics of conventional CNNs make it difficult for them to establish effective long-range spatial dependencies when processing large-scale, non-uniform scattering scenes such as estuaries.

To overcome the limitations of local features, Transformer architectures have been introduced into this field due to their global perception capabilities. The U-shape Transformer constructed by Peng et al. [30] embedded window attention blocks into the encoder-decoder, significantly improving the Underwater Image Quality Measure (UIQM) on the LSUI dataset. Wang et al. developed Transformer-UNet [31], which integrates dual channel-spatial attention, achieving a Structural Similarity (SSIM) of 0.924 on the UFO-120 dataset. In addition, in response to the motion blur challenge commonly encountered in underwater dynamic shooting environments, Li et al. [32] designed a deblurring network based on a cascaded attention mechanism. Through deep aggregation of multi-scale features and adaptive allocation of spatial-channel weights, the edge sharpness and visual clarity of moving targets were significantly improved, further expanding the application boundaries of the attention mechanism in complex underwater degradation scenarios. Although global modeling brings significant improvements in image quality, the computational complexity of the self-attention mechanism becomes a severe bottleneck when processing high-resolution inputs. For example, the inference frame rate of the latter drops below 10 FPS on AUV embedded devices, making it difficult to meet the real-time operational requirements of underwater vehicles.

Considering the high acquisition cost of real underwater paired supervision data, some studies have shifted to unsupervised or semi-supervised mapping. Yan et al. [33] designed a model-driven UW-CycleGAN, which integrates physical imaging priors into an unsupervised cycle-generative network, improving restoration metrics without requiring real reference images. Zhang et al. constructed MSSCE-GAN [34] through residual dense blocks, balancing detail generation and inference speed at specific resolutions. Such adversarial generation-based strategies alleviate the data scarcity problem but are prone to over-smoothing image details or generating unrealistic texture variations in highly turbid waters.

In recent years, with the increasing demand for the interpretability of physical laws, Diffusion Models have gradually gained prominence. DiffWater [35] avoids the color cast distortions commonly generated by traditional diffusion models by introducing color channel compensation conditions. Song et al. [36] deeply embedded the light attenuation coefficient into the frequency-domain latent diffusion process, achieving a 35% increase in detail fidelity in estuary scenes, finding a better balance between data fitting and physical law constraints.

To better reconcile global receptive fields with computational efficiency, Mamba architectures, which build on structured State Space Models (SSMs) (e.g., S4 [37]), have shown immense potential. Such models achieve global sequence modeling at a cost that is linear in sequence length, in contrast to the quadratic cost of self-attention, through dynamic state transitions. VMamba [38], constructed by Liu et al., achieved accuracy comparable to Transformers on terrestrial dehazing benchmarks while drastically reducing memory consumption. Theoretically, Mamba aligns well with the computational limits of edge platforms. However, in complex marine scenes with extreme low light or rich particulate noise in the deep sea, whether its dynamic state parameters can converge stably has not yet been publicly verified, and the engineering robustness of the model on underwater datasets such as UIEB or EUVP remains an open question.
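To make the linear-complexity property concrete, the following is a minimal sketch of a discretized scalar state-space recurrence, $h_t = a\,h_{t-1} + b\,x_t$, $y_t = c\,h_t$, evaluated in one pass over the sequence. The fixed scalars `a`, `b`, and `c` are illustrative assumptions; Mamba makes such parameters input-dependent and relies on a hardware-aware parallel scan, neither of which is reproduced here.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    Each element is visited once, so the cost is O(len(xs)).
    The constants a, b, c are illustrative and not taken from any paper.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # state update: carries information from all earlier steps
        ys.append(c * h)    # readout of the current state
    return ys


# An impulse at t = 0 decays geometrically, so early inputs still
# influence later outputs without any pairwise attention.
impulse_response = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5)
```

Because the state `h` summarizes the whole prefix, every output depends on all earlier inputs, which is the sense in which such models offer a global receptive field at linear cost.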

Existing research trends indicate that while improving feature representation capabilities (e.g., global modeling, domain prior fusion), underwater image enhancement methods are also gradually adapting to the computational limits of edge devices (low latency, low memory footprint). The current main challenge lies in the fact that heavy generative architectures, represented by diffusion models, are constrained by high inference latency, whereas lightweight networks based on conventional operators exhibit weak generalization capabilities when facing extremely heterogeneous water bodies. A key open problem is therefore how to retain model interpretability while achieving real-time, linear-complexity inference in scenes with complex composite degradation.

3. Methodology

3.1. Overall Architecture

This paper proposes a prior-guided self-supervised underwater image enhancement framework, known as Scene-Aware Retinex Mamba (SARM). This framework deeply integrates the classical Retinex physical theory (i.e., the observed image $I$ can be decomposed into the product of reflectance $R$ and illumination $L$, $I = R \cdot L$) with the State Space Model (SSM), and its overall computational pipeline is illustrated in Figure 1. To address issues such as the scarcity of real-world paired data, the difficulty of modeling long-range dependencies, and the inability of a single network to adapt to dynamic environments, the proposed method comprises four core modules. Given an input degraded image $I_{in} \in \mathbb{R}^{B \times 3 \times H \times W}$, its data flow is as follows:

(1) Illumination Estimator (Section 3.2): This receives $I_{in}$, outputs the illumination compensation map $M_{comp}$ and illumination prior feature $F_{illu}$ through multi-color space analysis, and thereby generates the residual-guided input $I_{guided}$.

(2) Mamba Denoiser (Section 3.3): This receives $I_{guided}$ and $F_{illu}$, utilizes the 2D selective scan mechanism to accomplish global denoising reconstruction, and outputs the full-band enhanced feature $I_{out}^{MD}$.

(3) Scene-Aware Adapter (Section 3.4): This quantifies the degradation metrics of the input image, and predicts the dynamic gating weight $\alpha = [\alpha_{1}, \alpha_{2}, \alpha_{3}]^{T}$ during the inference stage to fuse outputs from different levels, obtaining the final enhanced image $I_{final}$.

(4) Pseudo-Label Generation Mechanism based on HFM Statistical Priors (Section 3.5): This generates the pseudo-label $\tilde{J}$ offline and, combined with the dynamic loss weights $[w_{c}, w_{l}, w_{b}]^{T}$ output by the adapter, provides stable self-supervised constraints for the network.

3.2. Illumination Estimator

In the classical Retinex framework, an observed image is typically decoupled into illumination and reflectance (i.e., $I = R \cdot L$). However, applying it directly to underwater scenarios faces two primary limitations: first, a single color space struggles to comprehensively represent the complex chromatic distortion caused by the selective absorption of water; second, traditional methods mostly rely on pixel-wise division ($R = I \oslash L$) when stripping the illumination. Since the illumination values in extremely dark waters often approach zero, forced division is prone to causing numerical instability, thereby amplifying the underlying strong scattering noise. To resolve the aforementioned issues, this paper constructs an Illumination Estimator based on multi-color space analysis and adaptive residual modulation. It extracts global color priors in the feature domain and provides robust photometric guidance for the subsequent denoising network through a residual gating mechanism.

(1): Multi-Color Space Feature Extraction

Underwater light field degradation is affected by both illuminance attenuation and spectral selective absorption, and a single color space often cannot fully represent this complex physical distortion. Therefore, this module combines three complementary color spaces (RGB, LAB, and HSV) for feature encoding. Specifically, the RGB space preserves the original channel energy distribution of the image, providing a physical benchmark for extracting the basic illuminance structure and local contrast. The LAB space decouples lightness ($L^{*}$) from chromaticity, enabling the network to independently extract underlying illuminance priors and perform high-intensity illumination compensation when processing extremely dark deep-sea regions, thus avoiding color distortion caused by directly brightening in the RGB domain. Furthermore, the hue and saturation dimensions of the HSV space directly reflect the spectral shift caused by the attenuation of specific wavelengths (such as red light) underwater. By independently extracting features in the HSV domain, the network can capture pure color shift direction signals, providing a reliable feature basis for subsequent prediction of typical color shift types.

Given an input image $I \in \mathbb{R}^{B \times 3 \times H \times W}$, we first calculate the pixel-wise average intensity $\mu_{c}(x, y) = \frac{1}{3} \sum_{c = 1}^{3} I_{c}(x, y)$ in the RGB branch, which serves as the spatial illumination prior map $\mu_{c} \in \mathbb{R}^{B \times 1 \times H \times W}$. Subsequently, it is concatenated with the original image along the channel dimension and fed into the feature extraction network $E_{rgb}$. Simultaneously, the input image is transformed into the LAB and HSV spaces, respectively, to extract features independently. The computational processes of the three parallel branches are expressed as follows:

$$F_{rgb} = E_{rgb}(\mathrm{Concat}(I, \mu_{c})) \in \mathbb{R}^{B \times C_{m} \times H \times W}, \tag{1}$$

$$F_{lab} = E_{lab}(\mathrm{RGB2LAB}(I)) \in \mathbb{R}^{B \times C_{m} \times H \times W}, \tag{2}$$

$$F_{hsv} = E_{hsv}(\mathrm{RGB2HSV}(I)) \in \mathbb{R}^{B \times C_{m} \times H \times W}. \tag{3}$$
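As a rough illustration of what each branch receives before its encoder, the sketch below builds the three per-pixel inputs for a single RGB pixel with values in $[0, 1]$. It covers only the input preparation of Equations (1)-(3), not the learned encoders $E_{rgb}$, $E_{lab}$, and $E_{hsv}$. The sRGB/D65 conversion in `rgb_to_lab` and the helper name `branch_inputs` are assumptions made for this example; the paper does not state which LAB conversion it uses.

```python
import colorsys


def rgb_to_lab(r, g, b):
    """sRGB in [0, 1] -> CIE LAB with a D65 white point (assumed convention)."""
    def linearize(c):
        # Undo the sRGB transfer curve.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear RGB -> XYZ, normalized by the D65 reference white.
    x = (0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl) / 0.95047
    y = (0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl) / 1.0
    z = (0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl) / 1.08883

    def f(t):
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29

    fx, fy, fz = f(x), f(y), f(z)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def branch_inputs(pixel):
    """Build the inputs of the RGB, LAB, and HSV branches for one pixel."""
    r, g, b = pixel
    mu = (r + g + b) / 3.0                  # mean intensity: the illumination prior
    rgb_in = (r, g, b, mu)                  # Concat(I, mu): four channels
    lab_in = rgb_to_lab(r, g, b)            # RGB2LAB(I)
    hsv_in = colorsys.rgb_to_hsv(r, g, b)   # RGB2HSV(I), all components in [0, 1]
    return rgb_in, lab_in, hsv_in


# A bluish-green pixel with a weak red channel (made-up values).
rgb_in, lab_in, hsv_in = branch_inputs((0.1, 0.5, 0.6))
```

For this pixel the hue lands between cyan and blue while $a^{*}$ and $b^{*}$ are both negative, which shows how the HSV and LAB views expose the direction of the color shift separately from brightness.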

(2): Adaptive Color Cast Correction and Prior Mapping

To address the highly heterogeneous color cast phenomena in real-world waters (e.g., deep-sea blue cast or near-shore green cast), this paper designs an adaptive correction mechanism during the feature fusion stage. After concatenating the aforementioned features into $F_{multi} \in \mathbb{R}^{B \times 3C_{m} \times H \times W}$, a classification network built upon a Multi-Layer Perceptron (MLP) combined with Global Average Pooling (GAP) is utilized to predict the posterior probability distribution $P_{bias}$ of six typical color cast types:

$$P_{bias} = \sigma(F_{classifier}(\mathrm{GAP}(F_{multi}))) \in \mathbb{R}^{B \times 6}. \tag{4}$$

Specifically, the six predefined typical color casts encompass the RGB primary colors and their complementary counterparts (i.e., red, green, blue, cyan, magenta, and yellow). Serving as orthogonal photometric anchors within the feature space, these baseline colors can comprehensively characterize the spectral shift patterns on the hue circle. For instance, the severe absorption of single-band red light by water typically manifests as a cyan-blue color cast, whereas the joint attenuation across multiple spectral bands corresponds to mixed color casts such as magenta.

In terms of specific implementation, this module employs a physics-driven implicit soft-classification mechanism. Rather than relying on pre-annotated color cast category labels, the network utilizes a Sigmoid activation function to generate multi-channel, non-mutually exclusive probability responses. This design enables the model to adaptively fit the highly complex, non-linear mixed color casts encountered in real-world waters through a linear combination of the aforementioned six baseline anchors. The weight optimization of the classifier is entirely driven in an end-to-end manner by gradients derived from the downstream color fidelity loss and physical pseudo-labels. Furthermore, to prevent color correction oscillations induced by random parameter initialization during the early training phase, the bias term of the classification layer is initialized to a negative constant (e.g., −2.0). This strategy ensures that the network’s initial correction intensity approaches zero, thereby guaranteeing the stability and progressiveness of the model’s optimization process.
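The effect of the Sigmoid output and the negative bias initialization can be checked numerically. The following is a minimal sketch, assuming the pre-bias logits are close to zero at initialization; the function name `cast_probabilities` and the example logits are hypothetical and only serve to illustrate the behavior described above.

```python
import math

CAST_ANCHORS = ("red", "green", "blue", "cyan", "magenta", "yellow")


def sigmoid(z):
    """Logistic function mapping a real logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def cast_probabilities(pre_bias_logits, bias=-2.0):
    """Per-anchor probabilities from independent sigmoids.

    Unlike softmax, the outputs need not sum to 1, so several casts
    can be active at once (e.g., a blue-green mixture).
    """
    return {name: sigmoid(z + bias)
            for name, z in zip(CAST_ANCHORS, pre_bias_logits)}


# At initialization the pre-bias logits are assumed to be near zero,
# so every anchor starts at sigmoid(-2.0), i.e., a weak correction.
p_init = cast_probabilities([0.0] * 6)

# Hypothetical response to a blue-green scene: two anchors fire together.
p_mixed = cast_probabilities([0.0, 5.0, 6.0, 1.0, -1.0, -3.0])
```

With a bias of −2.0 each initial probability is about 0.12, small but not exactly zero, and in the mixed case the green and blue responses both exceed 0.9, which a mutually exclusive softmax could not produce.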

Concurrently, the correction network outputs a corresponding set of physical correction parameters

Θ = [α, β, γ, δ] \in R^{B \times 12}

(representing the gain, offset, gamma, and saturation factors, respectively). In the theoretical model, these parameters correspond to the inverse mapping

{\hat{I}}^{c} = δ^{c} \cdot {(α^{c} \cdot I^{c} + β^{c})}^{γ^{c}}

. To maintain the continuity of gradient propagation, we transform this mapping into the feature space and perform dynamic Feature Recalibration on the fused features using

Θ

, thereby obtaining the corrected illumination features

F_{illu}

.
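The image-space form of this inverse mapping can be sketched as follows. This is only an illustration of the formula, not the feature-space recalibration used in the network; the clipping steps and the function name are assumptions added to keep the power operation well defined.

```python
import numpy as np


def apply_correction(img, alpha, beta, gamma, delta, eps=1e-6):
    """Apply I_hat^c = delta^c * (alpha^c * I^c + beta^c) ** gamma^c per channel.

    img: (H, W, 3) array in [0, 1]; alpha, beta, gamma, delta: length-3 arrays.
    """
    # Assumption: clip the base to a small positive value before the power.
    base = np.clip(alpha * img + beta, eps, None)
    # Assumption: clip the result back to the displayable range.
    return np.clip(delta * base ** gamma, 0.0, 1.0)


img = np.full((2, 2, 3), 0.25)
identity = apply_correction(img, np.ones(3), np.zeros(3), np.ones(3), np.ones(3))
boosted = apply_correction(
    img,
    alpha=np.array([2.0, 1.0, 1.0]),   # double the red gain
    beta=np.zeros(3),
    gamma=np.array([1.0, 1.0, 0.5]),   # square-root curve on blue
    delta=np.ones(3),
)
```

With unit gain, zero offset, and unit gamma the mapping is the identity; the twelve values per sample correspond to four parameters for each of the three channels.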

Furthermore, to prevent gradient oscillations caused by inaccurate parameters during the early training stage, a linear warmup factor

λ_{warmup} = min (t / T_{warmup}, 1)

controlled by the training step t is introduced, scaling the probability distribution as

P_{bias} \leftarrow P_{bias} \cdot λ_{warmup}

. Finally, the network generates the Illumination Compensation Map for subsequent modules through a mapping head:

M_{comp} = σ (F_{map} (F_{illu})) \in {[0, 1]}^{B \times 3 \times H \times W} .

(5)
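The linear warmup applied to the probability distribution can be sketched in a few lines. The value of the warmup length used here (1000 steps) is a hypothetical placeholder, since the text does not state it.

```python
def warmup_factor(step, warmup_steps):
    """lambda_warmup = min(t / T_warmup, 1)."""
    return min(step / warmup_steps, 1.0)


def scale_probabilities(p_bias, step, warmup_steps):
    """Scale the color cast probabilities by the current warmup factor."""
    lam = warmup_factor(step, warmup_steps)
    return [p * lam for p in p_bias]


# Hypothetical warmup length of 1000 steps.
early = scale_probabilities([0.2, 0.8], step=250, warmup_steps=1000)
late = scale_probabilities([0.2, 0.8], step=5000, warmup_steps=1000)
```

At step 0 the scaled probabilities are exactly zero, so the correction is switched on gradually rather than at full strength.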

(3): Adaptive Illumination Residual Modulation

After obtaining the Illumination Compensation Map

M_{comp}

, to circumvent the numerical instability caused by the division operation, this paper employs a residual modulation strategy to construct the feature-guided input

I_{guided}

:

I_{guided} = I_{i n} \oplus (I_{i n} \otimes M_{comp}),

(6)

where ⊗ and ⊕ denote pixel-wise multiplication and addition, respectively. In this structure, the compensation term

(I_{i n} \otimes M_{comp})

acts as an attention mask based on illumination intensity. In dark regions with severe illumination attenuation, the response value of

M_{comp}

is relatively high, and the network applies stronger residual compensation to restore degraded details; conversely, in regions with relatively sufficient illumination,

M_{comp}

approaches zero, and the operation reduces to an identity mapping, which prevents local overexposure. This modulation avoids the noise amplification caused by traditional division operations and provides contrast-balanced input features for the subsequent Mamba Denoiser.
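Equation (6) can be illustrated with a two-pixel example. The sketch below is a direct transcription of the formula; the array values are made up for illustration.

```python
import numpy as np


def residual_modulation(i_in, m_comp):
    """I_guided = I_in + I_in * M_comp (element-wise), Eq. (6)."""
    return i_in + i_in * m_comp


i_in = np.array([[0.1, 0.8]])      # one dark pixel, one bright pixel
m_comp = np.array([[0.9, 0.0]])    # high response in the dark region only
guided = residual_modulation(i_in, m_comp)
```

Because the compensation map lies in [0, 1], each guided value lies between the input value and twice the input value, and no division by a small illumination estimate is involved.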

3.3. Mamba Denoiser

Under the Retinex theoretical framework, the observed image is decoupled into illumination and reflectance. After completing the residual modulation of the illumination distribution at the front end, the core task of the network is to reconstruct the underlying reflectance, which characterizes the intrinsic properties of the objects. However, the forward and backward non-uniform scattering caused by suspended particles in the water typically generates complex spatially variant noise in the images. Traditional Convolutional Neural Networks (CNNs), limited by their local receptive fields, are prone to losing deep texture details when filtering out such strong scattering noise. Conversely, models with global perception capabilities (e.g., Transformers) are constrained by their quadratic computational complexity of

O ({(H W)}^{2})

, making it difficult to meet the real-time inference requirements for high-resolution underwater images. To address the aforementioned conflict between feature representation and computational efficiency, this paper designs the Mamba Denoiser (MD). Guided by physical photometric priors, this module introduces the State Space Model (SSM) with a linear computational complexity of

O (H W)

. Combined with a dual-track parallel encoding strategy, it achieves feature reconstruction with long-range spatial dependencies while effectively controlling the computational overhead.

(1): Dual-Track Parallel Encoding

Conventional U-Net architectures typically downsample only a single stream of image features. When processing features guided by external priors, this single-track aggregation tends to dilute the injected physical information across multi-scale transformations. The MD module therefore introduces a dual-track parallel downsampling strategy in the encoding stage.

Let the initial input main feature of the denoiser be

F_{i m g}^{0} = H_{i n} (I_{g u i d e d})

, and the initial illumination feature be

F_{i l l u}^{0} = F_{i l l u}

(where

H_{i n}

denotes the initial projection convolutional layer). At the l-th scale level of the encoder, the network first facilitates the interaction between the main backbone features and the illumination features via the Interleaved Group Attention Block (IGAB):

\hat{F}_{img}^{l} = F_{IGAB}^{l}(F_{img}^{l}, F_{illu}^{l}).

(7)

After feature fusion, to maintain the independence of the physical prior information during spatial dimensionality reduction, the network employs two independent strided convolution operators (with a kernel size of

4 \times 4

and a stride of 2), coupled with Layer Normalization and a nonlinear activation function, to perform downsampling operations on the main features and illumination features, respectively:

F_{i m g}^{l + 1} = δ (N_{i m g} (C o n v_{↓ 2} ({\hat{F}}_{i m g}^{l}))),

(8)

F_{i l l u}^{l + 1} = δ (N_{i l l u} (C o n v_{↓ 2} (F_{i l l u}^{l}))) .

(9)

The parallel downsampling paths avoid excessive aliasing between the environmental illumination prior and the main features during the dimensionality reduction process, ensuring that the network can obtain photometric constraints strictly aligned with the spatial scales in the subsequent decoding stage.

(2): Physical Prior Injection

The Interleaved Group Attention Block (IGAB) serves as the core computational unit spanning across all scale levels. As illustrated in Figure 2, the internal computational flow of this module is decoupled into three stages: prior injection, linear projection, and state space modeling.

During the prior injection stage, for a given level’s main feature

F_{i m g} \in R^{B \times C \times H \times W}

and the corresponding illumination feature

F_{i l l u} \in R^{B \times C \times H \times W}

, the IGAB employs element-wise residual addition to guide the illumination distribution information into the main backbone feature stream:

F_{i n j e c t} = F_{i m g} \oplus F_{i l l u} .

(10)

After undergoing normalization and activation processing, the injected feature

F_{i n j e c t}

is mapped into a high-dimensional hidden feature space

X \in R^{B \times 2 C \times H \times W}

via a linear projection layer. Subsequently, this high-dimensional feature is uniformly split along the channel dimension into two parallel branches:

X_{1}, X_{2} = P_{s p l i t} (W_{p r o j} F_{i n j e c t} + B_{p r o j}) .

(11)

Here,

X_{1}

enters the 2D selective scan module as the main backbone information, whereas

X_{2}

passes through a SiLU activation function to serve as a gating branch, which is utilized for the final feature multiplication modulation.
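The injection, projection, and split steps of Equations (10) and (11) can be sketched with NumPy as follows. The normalization and activation before the projection are omitted for brevity, the channel-wise linear projection is written as an einsum, and all names and shapes other than those in the equations are assumptions.

```python
import numpy as np


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def inject_project_split(f_img, f_illu, w_proj, b_proj):
    """f_img, f_illu: (B, C, H, W); w_proj: (2C, C); b_proj: (2C,)."""
    f_inject = f_img + f_illu                                   # Eq. (10)
    # Channel-wise linear projection to 2C channels (normalization omitted).
    x = np.einsum("oc,bchw->bohw", w_proj, f_inject)
    x = x + b_proj[None, :, None, None]
    x1, x2 = np.split(x, 2, axis=1)                             # Eq. (11)
    return x1, silu(x2)                                         # backbone, gate


rng = np.random.default_rng(0)
B, C, H, W = 2, 4, 8, 8
x1, gate = inject_project_split(
    rng.normal(size=(B, C, H, W)),
    rng.normal(size=(B, C, H, W)),
    rng.normal(size=(2 * C, C)),
    np.zeros(2 * C),
)
```

Both branches keep the spatial size and have C channels each, so the gate can later be multiplied element-wise with the scan output.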

(3): Two-Dimensional Selective Scan and Dynamic Gating

To establish a global receptive field with a linear computational complexity of

O (H W)

, the IGAB integrates a 2D Selective Scan (SS2D) operator internally. This operator is built upon the continuous-time State Space Model (SSM). For a 1D input sequence

x (t) \in R

, the continuous SSM maps it to an output

y (t) \in R

via a hidden state

h (t) \in R^{N}

. Its continuous-time differential equations are defined as

\dot{h} (t) = A h (t) + B x (t),

(12)

y (t) = C h (t) + D x (t) .

(13)

To adapt to discrete digital image features, the model introduces a time-scale parameter

Δ

and employs the Zero-Order Hold (ZOH) rule to perform a discretization approximation on the system matrices

A

and

B

:

\bar{A} = exp (Δ A),

(14)

\bar{B} = {(Δ A)}^{- 1} (exp (Δ A) - I) \cdot Δ B .

(15)

The discretized sequence state transition equations are updated as

h_{i} = \bar{A} h_{i - 1} + \bar{B} x_{i},

(16)

y_{i} = C h_{i} + D x_{i} .

(17)
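Equations (14) to (17) can be checked numerically with a small example. The sketch below assumes a diagonal state matrix and a fixed time step, which simplifies the matrix inverse to an element-wise division; in the selective scan described in this section the discretized parameters are data-dependent, which this sketch does not model.

```python
import numpy as np


def zoh_discretize(a_diag, b, delta):
    """Zero-Order Hold for a diagonal A (assumption), Eqs. (14)-(15)."""
    a_bar = np.exp(delta * a_diag)
    b_bar = (a_bar - 1.0) / (delta * a_diag) * (delta * b)
    return a_bar, b_bar


def ssm_scan(x, a_diag, b, c, d, delta):
    """Run the discrete recurrence of Eqs. (16)-(17) over a 1D sequence."""
    a_bar, b_bar = zoh_discretize(a_diag, b, delta)
    h = np.zeros_like(a_diag)
    y = np.empty(len(x))
    for i, x_i in enumerate(x):
        h = a_bar * h + b_bar * x_i
        y[i] = c @ h + d * x_i
    return y


a = np.array([-1.0, -2.0])
b = np.array([1.0, 1.0])
c = np.array([1.0, 1.0])
y = ssm_scan(np.ones(200), a, b, c, d=0.0, delta=0.1)
```

For a constant unit input the output converges to the continuous-time steady state, here 1/1 + 1/2 = 1.5, because the Zero-Order Hold rule is exact for piecewise-constant inputs.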

Given that the standard SSM can only process 1D causal sequences, SS2D introduces a 2D Unfolding operation. The input feature map

X_{1}

is flattened into independent sequences

x^{(k)} = Φ_{k}(X_{1}), k \in \{1, 2, 3, 4\}

along four orthogonal directions (horizontal forward/backward, vertical forward/backward). These sequences are independently fed into the discretized SSM blocks for scanning, generating response sequences

y^{(k)}

. Subsequently, the system executes a 2D Merging operation, reorganizing the four 1D responses back into the 2D spatial dimension and fusing them:

Y_{s c a n} = \sum_{k = 1}^{4} Φ_{k}^{- 1} (y^{(k)}) .

(18)
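The unfolding and merging operations around Equation (18) can be sketched for a single-channel map. The ordering of the four directions is an assumption, and the per-direction SSM is replaced by an identity so that only the index bookkeeping is shown.

```python
import numpy as np


def unfold_4dir(x):
    """Flatten an (H, W) map into four 1D sequences (assumed ordering)."""
    return [
        x.reshape(-1),            # horizontal, forward
        x.reshape(-1)[::-1],      # horizontal, backward
        x.T.reshape(-1),          # vertical, forward
        x.T.reshape(-1)[::-1],    # vertical, backward
    ]


def merge_4dir(seqs, shape):
    """Invert each unfolding and sum the four responses, Eq. (18)."""
    h, w = shape
    y = seqs[0].reshape(h, w)
    y = y + seqs[1][::-1].reshape(h, w)
    y = y + seqs[2].reshape(w, h).T
    y = y + seqs[3][::-1].reshape(w, h).T
    return y


x = np.arange(6.0).reshape(2, 3)
seqs = unfold_4dir(x)
# Identity "SSM": every direction returns its input unchanged.
merged = merge_4dir(seqs, x.shape)
```

With an identity scan the merged result is four times the input, which confirms that every inverse mapping restores each pixel to its original position.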

Finally,

Y_{s c a n}

undergoes element-wise Hadamard product modulation with the gating branch

X_{2}

, followed by a reverse linear projection to form a residual closed loop with the initial input

F_{i m g}

:

F_{o u t} = W_{o u t} (Y_{s c a n} ⊙ SiLU (X_{2})) + F_{i m g} .

(19)

This transformation process fundamentally differs from traditional linear affine mappings. First,

Y_{s c a n}

is the result of nonlinear aggregation of the global context by the selective scan operator. During the discretization of the SSM, the system matrices

\bar{A}

and

\bar{B}

are not spatially constant; rather, they are data-dependent parameters dynamically generated from the input feature

X_{1}

. This mechanism allows the scanning operator to adaptively adjust the state transition weights according to the degradation characteristics of different regions in the underwater image, thereby achieving dynamic modeling of complex spatially variant degradation. Second, branch

X_{2}

carries the gating signal regulated by the illumination prior. By executing an element-wise Hadamard product with

Y_{s c a n}

, it achieves pixel-wise recalibration of the global response features.

Such a parallel interactive architecture based on scanning and gating effectively compensates for the limited receptive fields of local convolutional windows. Under the constraint of the illumination prior, the network aggregates global contextual dependencies while filtering out strong scattering noise. Consequently, under the premise of maintaining linear computational complexity, it significantly improves the recovery accuracy of high-frequency structural features in severely degraded scenarios.

3.4. Scene-Aware Adapter

The optical degradation in real-world marine environments exhibits high spatial heterogeneity (e.g., spectral absorption dominating in deep waters and suspended particle scattering dominating in shallow waters). Most existing underwater enhancement networks adopt fixed parameters and static loss weights, making them highly susceptible to local overexposure or under-correction when facing unknown waters. To alleviate the limited generalization of models in dynamic environments, this paper introduces the Scene-Aware Adapter (SAA). By quantitatively evaluating the degradation characteristics of the samples, this module achieves dynamic loss scheduling and adaptive feature gating covering both the training and inference stages.

(1): Physical Degradation Quantization and Index Modeling

To objectively measure the heterogeneous degradation status of underwater images, the SAA constructs a three-channel physical degradation vector

v = {[C I, L I, B I]}^{⊤}

. To reduce the interference of the initial color cast on feature evaluation, all indices are based on the initially physically brightened image

I_{c o r r}

and normalized to the

[0, 1]

interval.

Chrominance Index (CI): Since high-gradient regions are less affected by water scattering, the system calculates the grayscale gradient magnitude of the image

G(x, y) = ∥\nabla I_{corr}(x, y)∥_{2}

. Based on this, a high-confidence clear pixel set is defined:

Ω_{clear} = \{(x, y) ∣ G (x, y) > μ_{G} + 0.8 σ_{G}\},

(20)

where

μ_{G}

and

σ_{G}

are the global mean and standard deviation of G, respectively. Here, setting the threshold to

μ_{G} + 0.8 σ_{G}

is based on the statistical prior that natural image gradients typically follow a pseudo-Gaussian distribution [39]. Drawing inspiration from classical adaptive local thresholding theory [40], the mean

μ_{G}

serves as the baseline level of the gradients, while the introduction of the statistical offset

0.8 σ_{G}

shifts the threshold above the central mass of the gradient distribution. This adaptive threshold filters out the predominantly smooth background and retains the salient high-frequency edges located in the long-tail region. CI is quantified as the ratio of the L1 color cast energy to the brightness energy within this clear region:

C I = \frac{\frac{1}{| Ω_{clear} |} \sum_{(x, y) \in Ω_{clear}} {∥ I_{c o r r} (x, y) - μ_{RGB} ∥}_{1}}{\frac{1}{| Ω_{clear} |} \sum_{(x, y) \in Ω_{clear}} {∥ I_{c o r r} (x, y) ∥}_{1} + ε},

(21)

where the extremely small value

ε = 10^{- 8}

is utilized to prevent division by zero. This index, in the form of an energy ratio, objectively reflects the inherent chromatic distortion degree of the scene.
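A minimal sketch of Equations (20) and (21) is given below. Several details are assumptions because the text does not fix them: the grayscale image is taken as the channel mean, the gradient is computed with central differences, and the reference color μ_RGB is interpreted as the per-pixel mean over the three channels.

```python
import numpy as np


def chrominance_index(i_corr, k=0.8, eps=1e-8):
    """i_corr: (H, W, 3) float array in [0, 1]."""
    gray = i_corr.mean(axis=2)                     # assumed grayscale conversion
    gy, gx = np.gradient(gray)
    g = np.sqrt(gx ** 2 + gy ** 2)                 # gradient magnitude G(x, y)
    clear = g > g.mean() + k * g.std()             # Eq. (20)
    if not clear.any():
        return 0.0
    pix = i_corr[clear]                            # (n, 3) pixels in the clear set
    mu_rgb = pix.mean(axis=1, keepdims=True)       # assumed per-pixel channel mean
    cast = np.abs(pix - mu_rgb).sum(axis=1).mean()
    bright = np.abs(pix).sum(axis=1).mean()
    return float(cast / (bright + eps))            # Eq. (21)


# Vertical step edge so that the clear set is non-empty.
step = np.where(np.arange(8) < 4, 0.2, 0.8) * np.ones((8, 1))
gray_img = np.repeat(step[..., None], 3, axis=2)
blue_img = step[..., None] * np.array([0.2, 0.5, 1.0])
ci_gray = chrominance_index(gray_img)
ci_blue = chrominance_index(blue_img)
```

Under these assumptions a neutral image yields an index of zero, while the blue-tinted version of the same scene yields a clearly positive value.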

Low-Light Index (LI): A dark pixel set

Ω_{dark} = \{(x, y) ∣ L^{*} (x, y) < τ_{dark}\}

is defined on the

L^{*}

channel of the CIELAB color space, where the adaptive dark threshold is

τ_{dark} = max (μ_{L^{*}}, 15)

. Setting the hard lower bound to 15 is based on the objective properties of the CIELAB space, i.e.,

L^{*} < 15

belongs to the absolute dark region where the Human Visual System (HVS) can hardly distinguish details [41,42].

To eliminate the interference of smooth backgrounds, the system takes the intersection of the dark region, the low-gradient mask, and the low-contrast mask as the actual low-light degradation region

Ω_{degrad}

. LI is defined as the convex combination of the global proportion and the spatially weighted proportion of the degradation region:

L I = 0.7 \frac{| Ω_{degrad} |}{H W} + 0.3 \frac{\sum_{(x, y) \in Ω_{degrad}} ω (x, y)}{\sum_{(x, y)} ω (x, y)},

(22)

where the spatial weight

ω(x, y) = 1 - 0.3 \frac{dist(x, y, center)}{max\_dist}

assigns a higher evaluation weight to the visual center of the image.
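Equation (22) and the spatial weight can be sketched as follows. The degradation mask is taken as an input, since the low-gradient and low-contrast masks are not specified in detail; the Euclidean distance to the geometric image center and the function names are assumptions.

```python
import numpy as np


def center_weight(h, w):
    """omega(x, y) = 1 - 0.3 * dist / max_dist (assumes h, w > 1)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    return 1.0 - 0.3 * dist / dist.max()


def low_light_index(degrad_mask):
    """degrad_mask: boolean (H, W) mask of the degradation region."""
    h, w = degrad_mask.shape
    omega = center_weight(h, w)
    global_ratio = degrad_mask.sum() / (h * w)
    weighted_ratio = omega[degrad_mask].sum() / omega.sum()
    return float(0.7 * global_ratio + 0.3 * weighted_ratio)   # Eq. (22)


omega = center_weight(5, 5)
center_mask = np.zeros((5, 5), dtype=bool)
center_mask[2, 2] = True
corner_mask = np.zeros((5, 5), dtype=bool)
corner_mask[0, 0] = True
li_center = low_light_index(center_mask)
li_corner = low_light_index(corner_mask)
```

The weight falls from 1.0 at the center to 0.7 at the corners, so a degraded pixel near the center contributes slightly more to the index than one of equal area near the border.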

Blur Index (BI): Aiming at the forward scattering of suspended particles, multi-scale Gaussian blur residuals

R_{s} = I_{corr} - G_{σ_{s}} * I_{corr}, σ_{s} \in \{1, 2, 3\}

are constructed. Their weighted high-frequency energy is defined as

E_{high} = \sum_{s = 1}^{3} α_{s} {∥ R_{s} ∥}_{1}

. BI integrates the proportion of low-frequency smooth regions, the energy loss rate, and the attenuation ratio of edge density:

B I = 0.4 \frac{| Ω_{low - high} \cap Ω_{low - var} |}{H W} + 0.4 (1 - \frac{E_{high}}{σ_{gray}}) + 0.2 (1 - ρ_{edge}),

(23)

where

ρ_{edge}

is the proportion of edge pixels extracted by the Canny operator. All three terms are dimensionless ratios, ensuring that the final

B I \in [0, 1]

.

(2): Global Anchor Matching and Dynamic Loss Scheduling

After obtaining the degradation vector

v

, the traditional online clustering strategy is prone to causing weight oscillations under mini-batch training. Therefore, the system adopts a matching strategy based on Global Degradation Anchors. A 2D orthogonal decision space is constructed using CI and BI, and the space is divided into four optical subspaces (e.g., deep-clear, shallow-turbid) by an empirical threshold (

τ_{d e g} = 0.5

). Each sample is mapped to the corresponding quadrant anchor and obtains a base penalty weight

w = {[w_{c o l o r}, w_{l o w l i g h t}, w_{b l u r}]}^{⊤}

initialized by domain priors.
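The quadrant lookup can be sketched as a small table indexed by two threshold tests. The anchor names beyond the two examples given in the text, the pairing of a high CI with "deep" and a high BI with "turbid", and all weight values are hypothetical placeholders, since the domain-prior weights are not listed.

```python
TAU_DEG = 0.5

# Hypothetical anchor table keyed by (CI > tau, BI > tau).
# Names and weights are placeholders, not values from the paper.
ANCHORS = {
    (False, False): ("shallow-clear", {"color": 0.2, "lowlight": 0.2, "blur": 0.2}),
    (True, False): ("deep-clear", {"color": 0.6, "lowlight": 0.4, "blur": 0.2}),
    (False, True): ("shallow-turbid", {"color": 0.2, "lowlight": 0.2, "blur": 0.6}),
    (True, True): ("deep-turbid", {"color": 0.6, "lowlight": 0.4, "blur": 0.6}),
}


def match_anchor(ci, bi, tau=TAU_DEG):
    """Map a sample to its quadrant anchor and base penalty weights."""
    return ANCHORS[(ci > tau, bi > tau)]


name, weights = match_anchor(ci=0.8, bi=0.1)
```

Because the anchors are fixed before training, the weight assigned to a sample does not depend on the other samples in its mini-batch.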

During the model training phase, the SAA acts as the global scheduling center. The system applies hard threshold truncation based on the weight

w

and constructs a scene-adaptive penalty term

L_{s c e n e}

combined with the degradation descriptor

v

:

L_{scene} = \sum_{j \in \{c, l, b\}} I(w_{j} > τ) \cdot w_{j} \cdot L_{j} \cdot (1 + v_{j}),

(24)

where

τ = 0.3

is the activation threshold. When a certain degradation characteristic is significantly prominent, the network activates the corresponding specific loss and amplifies the penalty intensity using the sample’s own degradation index

v_{j}

. This mechanism enables the network to adaptively adjust the optimization focus according to the degradation distribution of the current batch, effectively alleviating the training shift caused by long-tailed degraded samples.

(3): Adaptive Feature Gated Fusion in the Inference Stage

In addition to the loss scheduling at the training end, the SAA introduces a lightweight weight perception module

F_{g a t e}

during the forward inference stage. Conditioned on the deep illumination feature

F_{i l l u}

, this module predicts a three-dimensional gating vector

α = {[α_{1}, α_{2}, α_{3}]}^{⊤}

(constrained to the standard simplex via Softmax activation, satisfying

\sum α_{i} = 1

). Subsequently, the system performs pixel-wise weighted fusion on image signals from three different abstraction levels:

I_{f i n a l} = α_{1} I_{i n} + α_{2} I_{o u t}^{M D} + α_{3} I_{c o r r}^{i l l u},

(25)

where

I_{i n}

is the original degraded input,

I_{o u t}^{M D}

is the full-band output of the Mamba Denoiser, and

I_{c o r r}^{i l l u}

is the primary corrected image based on the illumination prior. This design allows the model to dynamically allocate branch weights according to the optical characteristics of the input image without significantly increasing computational overhead. For instance, it tends to increase the proportion of the denoising branch

α_{2}

in turbid waters, while automatically increasing the response of the physical correction branch

α_{3}

in deep waters with extreme darkness. This adaptive gating mechanism at the inference end further enhances the zero-shot generalization capability of the model in unknown dynamic marine environments.
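Equation (25) is a convex combination of three images, which can be sketched as follows. The gate logits are supplied directly here; in the framework they would be predicted from the illumination feature, which this sketch does not model.

```python
import numpy as np


def softmax(z):
    """Numerically stable Softmax over a 1D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def gated_fusion(logits, i_in, i_md, i_corr):
    """I_final = a1 * I_in + a2 * I_out^MD + a3 * I_corr^illu, Eq. (25)."""
    a = softmax(np.asarray(logits, dtype=float))
    return a[0] * i_in + a[1] * i_md + a[2] * i_corr, a


i_in = np.zeros((2, 2))
i_md = np.ones((2, 2))
i_corr = np.full((2, 2), 0.5)
fused, alpha = gated_fusion([0.0, 0.0, 0.0], i_in, i_md, i_corr)
```

Since the weights are non-negative and sum to one, every fused pixel stays within the range spanned by the three candidate images.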

3.5. Pseudo-Label Generation and Caching

In real-world underwater environments, acquiring strictly spatially aligned clear-degraded image pairs is highly challenging. Relying entirely on purely data-driven unsupervised learning is highly prone to introducing structural distortions or local artifacts in the generated results. To provide supervision signals with regularization significance for the state space model, this paper introduces a Prior-Guided Pseudo-Label Generation mechanism at the front end of the framework. Specifically, this module employs the classical Hybrid Fusion Method (HFM) [43] as the core prior extraction operator. It extracts reliable color and structural priors from a single image to generate a reference image, guiding the stable convergence of the downstream Mamba network.

(1): Multi-Scale Prior Fusion

Based on the image statistics logic of the aforementioned operator, given a degraded training image

I_{i n}

, the pseudo-label generation process first constructs two complementary feature branches in parallel by incorporating domain priors. The first is the color correction branch

I_{c o l o r}

, which utilizes Gray-World white balance and dynamic color compensation to mitigate the selective absorption effect of water. The second is the contrast enhancement branch

I_{d e t a i l}

, which employs Contrast Limited Adaptive Histogram Equalization (CLAHE) and gamma correction to restore high-frequency textures suppressed by backward scattering.

Subsequently, the system calculates the normalized feature weight maps

W_{k}, k \in \{color, detail\}

for these two branches at the pixel level across the dimensions of Laplacian Contrast, Local Saliency, and Exposure. To circumvent the edge seam artifacts potentially caused by direct linear weighting (i.e.,

\sum W_{k} \otimes I_{k}

), this method adopts a multi-scale Laplacian Pyramid to perform frequency band fusion between the images and the weight maps processed by a Gaussian Pyramid:

\tilde{J} = \sum_{l = 1}^{N} P_{up}(\sum_{k \in \{color, detail\}} G^{l}(W_{k}) ⊙ L^{l}(I_{k})),

(26)

where

G^{l} (\cdot)

and

L^{l} (\cdot)

denote the Gaussian and Laplacian pyramid extraction operators at the l-th level, respectively, and

P_{u p} (\cdot)

is the pyramid reconstruction function. This fusion generates a color-balanced and edge-sharp reference image

\tilde{J}

.
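The band-wise fusion of Equation (26) can be sketched for single-channel images as follows. To stay self-contained, the sketch replaces Gaussian filtering with 2×2 average pooling and uses nearest-neighbor upsampling; these simplifications, the function names, and the requirement that the image sides be divisible by 2^(levels−1) are assumptions and differ from a standard Gaussian pyramid.

```python
import numpy as np


def down(x):
    """Halve the resolution by 2x2 average pooling (stand-in for Gaussian)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def up(x):
    """Double the resolution by nearest-neighbor repetition."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)


def gaussian_pyramid(x, levels):
    pyr = [x]
    for _ in range(levels - 1):
        pyr.append(down(pyr[-1]))
    return pyr


def laplacian_pyramid(x, levels):
    g = gaussian_pyramid(x, levels)
    return [g[i] - up(g[i + 1]) for i in range(levels - 1)] + [g[-1]]


def collapse(pyr):
    """Reconstruct an image from its band-pass levels (coarse to fine)."""
    out = pyr[-1]
    for band in reversed(pyr[:-1]):
        out = up(out) + band
    return out


def pyramid_fusion(images, weights, levels=3):
    total = sum(weights) + 1e-12
    weights = [w / total for w in weights]     # per-pixel normalization
    fused = None
    for img, w in zip(images, weights):
        gw = gaussian_pyramid(w, levels)       # smoothed weights per level
        lp = laplacian_pyramid(img, levels)    # band-pass image content
        bands = [g * l for g, l in zip(gw, lp)]
        fused = bands if fused is None else [f + b for f, b in zip(fused, bands)]
    return collapse(fused)


rng = np.random.default_rng(0)
x = rng.random((8, 8))
w1, w2 = rng.random((8, 8)) + 0.1, rng.random((8, 8)) + 0.1
same = pyramid_fusion([x, x], [w1, w2])
```

Fusing an image with itself returns that image regardless of the weight maps, which is a useful sanity check that the normalized weights are blended per band rather than per pixel.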

It should be noted that introducing the HFM operator to generate pseudo-labels is not equivalent to constructing a traditional knowledge distillation paradigm. To prevent the model’s performance from being constrained by the upper bound of the pseudo-label generator, the generated

\tilde{J}

merely serves as a heuristic prior constraint for the network. Traditional handcrafted fusion algorithms rely on local filtering and are prone to producing local artifacts or color discontinuities when handling extreme non-uniform degradation. In contrast, leveraging the global contextual receptive field of SS2D, the Mamba network can adaptively filter the prior information during optimization: it absorbs the macroscopic color balance and topological guidance while filtering out local flaws of the pseudo-labels in the high-dimensional feature space, thereby avoiding mechanical pixel-level fitting.

This global calibration mechanism enables SARM to overcome the inherent performance limitations of a single pseudo-label operator. As demonstrated by the quantitative experiments in Section 4.2.1, the overall performance of SARM across three benchmark datasets is superior to that of the prior-providing HFM algorithm, particularly on the CCF metric for color restoration and the URanker metric for deep perceptual quality. This indicates that the network does not overfit the prior labels at the pixel level; instead, guided by their macroscopic structure, it learns a more generalizable inversion mapping for underwater physical degradation.

(2): Memory Hash Caching

Although the prior-fused images mitigate the tendency of self-supervised learning to fall into local optima, multi-scale pyramid fusion involves intensive pixel-level computations. Generating the images on the fly during training, or loading them via conventional disk I/O, would cause significant latency and leave computational resources idle. To optimize data throughput during training, this paper introduces a memory hash caching mechanism at the initialization stage of the data loader.

Prior to training, the system offline generates the physical pseudo-labels

\tilde{J}

for the entire dataset and preloads them into RAM in a low-bitwidth format (uint8), constructing an index dictionary based on the hash values of the image files:

Cache[hash(I_{in})] = \tilde{J}_{uint8}.

(27)

During the network’s forward propagation phase, the model retrieves the corresponding pseudo-labels through hash addressing with

O (1)

complexity and dynamically converts them into floating-point tensors before entering the loss calculation. This system-level optimization effectively alleviates the data loading bottleneck, improving training efficiency and the stability of the optimization process without increasing the network’s inference computational overhead.
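The caching scheme of Equation (27) can be sketched with a dictionary keyed by a file hash. The choice of SHA-256 over the raw file bytes, the class name, and the rounding rule are assumptions; the text only states that hash values of the image files index uint8 pseudo-labels held in RAM.

```python
import hashlib

import numpy as np


def file_key(image_bytes):
    """Hash of the raw file content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(image_bytes).hexdigest()


class PseudoLabelCache:
    """In-memory store of uint8 pseudo-labels indexed by file hash."""

    def __init__(self):
        self._store = {}

    def put(self, image_bytes, pseudo_label):
        # Quantize a float image in [0, 1] to uint8 to save memory.
        q = np.clip(np.rint(pseudo_label * 255.0), 0, 255).astype(np.uint8)
        self._store[file_key(image_bytes)] = q

    def get(self, image_bytes):
        # Dictionary lookup, then conversion back to float for the loss.
        return self._store[file_key(image_bytes)].astype(np.float32) / 255.0


rng = np.random.default_rng(0)
label = rng.random((4, 4, 3))
cache = PseudoLabelCache()
cache.put(b"example-image-bytes", label)
restored = cache.get(b"example-image-bytes")
```

Storing uint8 instead of float32 reduces memory by a factor of four, at the cost of a quantization error of at most half a gray level (about 0.002) per value.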

3.6. Dynamic Composite Loss

After obtaining the offline-cached heuristic pseudo-labels

\tilde{J}

and the full-band output

I_{o u t}^{M D}

of the Mamba Denoiser, this paper designs multi-dimensional perceptual and structural constraint losses to synchronously improve the pixel-level accuracy, color fidelity, and detail clarity of the reconstructed images. Concurrently, addressing the limitations of traditional fixed-weight strategies in dynamic environments, a dynamic composite optimization objective is constructed in combination with the Scene-Aware Adapter (SAA).

(1): Multi-Dimensional Perceptual Constraints

This part comprises the base reconstruction loss

L_{b a s e}

and the color perceptual loss

L_{c o l o r}

. The base reconstruction loss fuses the Mean Squared Error (MSE) and the VGG-19 (relu5_1 layer) perceptual loss, forcing the network to align with the global topological structure while stripping away the interference of underlying noise:

L_{base} = \frac{1}{N} ∥I_{out}^{MD} - \tilde{J}∥_{2}^{2} + \frac{λ_{vgg}}{C_{j} H_{j} W_{j}} ∥Φ_{j}(I_{out}^{MD}) - Φ_{j}(\tilde{J})∥_{1}.

(28)

Aiming at the complex color distortion caused by the selective absorption of water, the color perceptual loss constructs a comprehensive constraint from four dimensions: color distribution, channel balance, local gradient, and color temperature:

L_{c o l o r} = λ_{h i s t} L_{h i s t} + λ_{w b} L_{w b} + λ_{c o n s i s t} L_{c o n s i s t} + λ_{t e m p} L_{t e m p},

(29)

where

L_{h i s t}

utilizes a Gaussian-kernel soft histogram to match the global color distribution;

L_{w b}

corrects RGB channel imbalance by constraining the ratio of single-channel to cross-channel means;

L_{c o n s i s t}

limits the horizontal and vertical L1 gradient differences between the predicted image and the pseudo-label to maintain local color smoothness; and

L_{t e m p}

constrains the deviations of the image on the B-Y and R-C hue axes to correct the color temperature.

(2): Structural Clarity Loss

Addressing the blur degradation caused by the scattering of suspended particles, this loss combines the Sobel operator with the first-order spatial derivative

\nabla_{x, y}

to construct a complementary gradient regularization constraint:

L_{clarity} = \frac{1}{N} ∥Sobel(I_{out}^{MD}) - Sobel(\tilde{J})∥_{1} + 0.5 \sum_{d \in \{x, y\}} ∥\nabla_{d} I_{out}^{MD} - \nabla_{d} \tilde{J}∥_{1}.

(30)

In strongly scattering underwater environments, conventional second-order derivatives (e.g., Laplacian) are overly sensitive to high-frequency abrupt changes, making them prone to amplifying background granular noise. By contrast, the Sobel operator inherently possesses local smoothing properties, enabling it to extract coherent geometric contours and provide stable mid-to-low-frequency structural guidance. To compensate for the loss of ultra-high-frequency details caused by the smoothing operation, the first-order spatial derivative

\nabla_{x, y}

, which is sensitive to adjacent pixels, is further introduced. Under the premise that the Sobel operator anchors the macroscopic structural prior,

\nabla_{x, y}

serves as an effective high-frequency compensation term to restore fine-grained textures. This joint constraint mechanism effectively balances noise suppression and high-frequency structural fidelity.
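The two terms of Equation (30) can be sketched for single-channel images as follows. The sketch assumes that Sobel(·) denotes the gradient magnitude, that borders are handled by edge replication, and that both terms are averaged over pixels; none of these details are stated in the text.

```python
import numpy as np


def sobel_magnitude(x):
    """Sobel gradient magnitude with edge-replicated borders (assumption)."""
    p = np.pad(x, 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) - (
        p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2]
    )
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) - (
        p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:]
    )
    return np.sqrt(gx ** 2 + gy ** 2)


def clarity_loss(pred, target):
    """Sobel term plus 0.5 times the forward-difference terms, Eq. (30)."""
    sobel_term = np.abs(sobel_magnitude(pred) - sobel_magnitude(target)).mean()
    dx = np.abs(np.diff(pred, axis=1) - np.diff(target, axis=1)).mean()
    dy = np.abs(np.diff(pred, axis=0) - np.diff(target, axis=0)).mean()
    return float(sobel_term + 0.5 * (dx + dy))


edge = np.where(np.arange(8) < 4, 0.2, 0.8) * np.ones((8, 1))
flat = np.full_like(edge, 0.5)
loss_same = clarity_loss(edge, edge)
loss_shift = clarity_loss(edge + 0.1, edge)
loss_flat = clarity_loss(flat, edge)
```

The loss ignores a constant brightness offset and penalizes only missing or displaced edges, which is why it is paired with the color and base losses rather than used alone.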

(3): Scene-Adaptive Dynamic Total Loss

Traditional training paradigms typically adopt fixed loss weights. When processing natural underwater images with significant heterogeneity, fixed optimization objectives are prone to inducing gradient conflicts (e.g., insufficient color cast correction in deep waters or excessive texture sharpening in shallow waters). Based on this, this paper adopts a mini-batch-level nonlinear dynamic loss scheduling strategy. Relying on the dynamic penalty weight

w = {[w_{c}, w_{l}, w_{b}]}^{⊤}

and the degradation index vector

v = {[C I, L I, B I]}^{⊤}

output by the SAA module, the composite total loss function is defined as

L_{total} = L_{base} + λ_{s} \sum_{k \in \{c, l, b\}} I(w_{k} > τ_{k}) \cdot w_{k} \cdot L_{k} \cdot (1 + γ v_{k}),

(31)

where

I (\cdot)

is the indicator function;

L_{c}

,

L_{l}

, and

L_{b}

correspond to the specific losses for color, low light, and clarity, respectively; and

γ

is the severity amplification factor. To adapt to the statistical distribution differences of various degradation types, the activation thresholds for each branch are set to

τ_{c} = 0.4

,

τ_{l} = 0.3

, and

τ_{b} = 0.3

, respectively. The higher color threshold (0.4) is utilized to prevent the network from over-correcting samples with slight color casts, whereas the lower low-light and blur thresholds (0.3) ensure that the model remains sensitive to detail degradation accompanied by irreversible information loss. This adaptive mechanism enables the network to dynamically focus on the most severe degradation dimension in the current batch, thereby reducing the reliance on manual parameter tuning and enhancing the framework’s robustness in real-world complex marine environments.
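Equation (31) can be evaluated by hand for a single sample, as in the sketch below. The thresholds follow the text; the default values of λ_s and γ (both 1.0) and the example loss values are hypothetical, since they are not given here.

```python
TAU = {"c": 0.4, "l": 0.3, "b": 0.3}   # activation thresholds from the text


def total_loss(l_base, losses, w, v, lambda_s=1.0, gamma=1.0):
    """L_total = L_base + lambda_s * sum_k 1[w_k > tau_k] * w_k * L_k * (1 + gamma * v_k)."""
    scene = 0.0
    for k in ("c", "l", "b"):
        if w[k] > TAU[k]:              # strict inequality, as in the indicator
            scene += w[k] * losses[k] * (1.0 + gamma * v[k])
    return l_base + lambda_s * scene


loss = total_loss(
    l_base=0.2,
    losses={"c": 0.5, "l": 0.4, "b": 0.3},
    w={"c": 0.35, "l": 0.5, "b": 0.3},
    v={"c": 0.9, "l": 0.5, "b": 0.2},
)
```

In this example only the low-light branch is active: the color weight 0.35 is below its threshold of 0.4 and the blur weight 0.3 does not strictly exceed 0.3, so the total is 0.2 + 0.5 · 0.4 · 1.5 = 0.5.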

Furthermore, beyond the aforementioned cross-sample adaptive adjustment driven by scene features, the proposed framework introduces a Convergence-Aware Decay mechanism along the temporal dimension of the training process. Given the highly complex nature of underwater degradation, maintaining fixed, high penalty weights during the later stages of training makes the network susceptible to gradient oscillations. Specifically, during each iteration, the framework dynamically monitors the magnitude of the base reconstruction loss,

L_{b a s e}

(i.e., the sum of the MSE and VGG losses). When

L_{b a s e}

drops below a predefined threshold (e.g., 0.1), it indicates that the model has completed the global structural fitting and transitioned into a fine-grained tuning phase. At this point, the system automatically decays the weight of the clarity-specific loss,

L_{b}

, to 50% of its initial value (and further down to 30% if

L_{b a s e} < 0.05

), while smoothly scaling the overall weight of the color perception module to 80%. This dual dynamic scheduling strategy, which couples spatial scene perception with temporal convergence decay, not only reduces the burden of manual hyperparameter tuning but also allows the model to fit the degradation distribution rapidly in the early stages and to converge stably in the later stages.
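The decay schedule can be sketched as a simple lookup on the base loss. The thresholds and percentages follow the text; the step-wise switch is a simplification, since the text describes the color scaling as smooth without giving the interpolation rule.

```python
def convergence_decay(l_base):
    """Return multiplicative scales for (clarity weight, color weight)."""
    if l_base < 0.05:
        return 0.3, 0.8
    if l_base < 0.1:
        return 0.5, 0.8
    return 1.0, 1.0   # early training: weights unchanged


early = convergence_decay(0.2)
mid = convergence_decay(0.08)
late = convergence_decay(0.01)
```

The returned factors would multiply the scene-dependent weights, so the two scheduling mechanisms compose rather than replace each other.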

4. Experiments

4.1. Experimental Settings

Implementation Details: All training and evaluation in this paper were performed on a workstation equipped with an AMD Ryzen 9 7845HX CPU (3.00 GHz), 64 GB RAM, and an NVIDIA GeForce RTX 4060 GPU (8 GB). The network was optimized with the Adam optimizer, with an initial learning rate of

1 \times 10^{- 4}

. Training ran for 300 epochs with a batch size of 8, and model weights were saved every 2 epochs. To balance computational overhead and feature preservation, the input images were uniformly resized to a resolution of

256 \times 256

.

Datasets: To evaluate the effectiveness and generalization performance of the proposed method, three representative underwater benchmark datasets encompassing various water quality conditions and degradation types were selected for the experiments. During the training phase, the Underwater Image Enhancement Benchmark (UIEB) and the Enhancing Underwater Visual Perception (EUVP) [6] datasets were utilized. Specifically, 800 pairs of real-scene images were randomly sampled from UIEB; simultaneously, paired and unpaired samples from the EUVP dataset (including subsets such as Underwater Dark, ImageNet, and Scenes) were introduced. The diverse illumination and degradation conditions help mitigate the domain shift problem during real-world deployment. The test set consists of three independent parts: first, the remaining 90 real reference images from UIEB, used to test performance in conventional scenes; second, 515 unseen samples from the EUVP dataset, to examine robustness in mixed degradation scenarios; third, the UCCS underwater color cast dataset containing 300 images. This dataset covers near-shore and deep-sea waters with extreme blue-green or yellow-brown color casts, serving as a zero-shot test set to primarily evaluate the model’s color cast correction capability.

Evaluation Metrics: Due to the unique optical scattering and absorption characteristics of underwater environments, general-purpose image quality metrics (e.g., PSNR and SSIM) do not always align with human subjective visual perception. Consequently, this paper adopts a multi-dimensional No-Reference evaluation system. For all of the following metrics, higher values indicate better image quality:

UCIQE and UIQM: These are the most classical quantitative standards in the field of underwater vision. UCIQE uses a linear combination of chroma standard deviation ( $σ_{c}$ ), luminance contrast ( $con_{l}$ ), and saturation mean ( $μ_{s}$ ) to quantify chromatic distortion, calculated as $UCIQE = c_{1} σ_{c} + c_{2} con_{l} + c_{3} μ_{s}$ . UIQM (Underwater Image Quality Measure) performs a weighted summation of three components, namely color richness (UICM), sharpness (UISM), and contrast (UIConM), with its typical formulation being $UIQM = c_{1} UICM + c_{2} UISM + c_{3} UIConM$ . The weights $c_{1}$, $c_{2}$, and $c_{3}$ are defined separately for each metric. Both indices reflect the fundamental capabilities of the model in color constancy and contrast stretching.
CCF and FDUM: To further evaluate dehazing and detail fidelity, the Colorfulness–Contrast–Fog (CCF) index is introduced, which penalizes residual color casts by evaluating the mapping relationship between local chromatic variance and fog density; FDUM objectively quantifies the model’s suppression effect on mid-to-low-frequency blur caused by underwater suspended particles by analyzing the high-frequency energy distribution in the frequency domain.
URanker: Considering that traditional hand-crafted metrics are prone to evaluation bias when evaluating complex generative architectures, this paper additionally introduces URanker, a perceptual metric benchmark based on deep neural networks. By undergoing alignment training on a large-scale human visual preference dataset, this model can reflect genuine human subjective aesthetics more accurately and robustly.
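The UCIQE formula above can be illustrated with a minimal sketch. It assumes the input is already a CIELab array, uses the coefficient values commonly quoted for UCIQE (this section does not list the values it uses), and takes luminance contrast as the gap between the 99th and 1st lightness percentiles. The function name `uciqe_from_lab` and the lack of normalization are choices made for this example, so absolute values will not match those reported in the tables.

```python
import numpy as np

# Coefficients commonly quoted for UCIQE; treated here as an assumption,
# since this section does not state the values it uses.
C1, C2, C3 = 0.4680, 0.2745, 0.2576


def uciqe_from_lab(lab: np.ndarray) -> float:
    """Return a UCIQE-style score for an H x W x 3 CIELab image.

    lab[..., 0] is lightness L; lab[..., 1] and lab[..., 2] are a and b.
    """
    lab = np.asarray(lab, dtype=np.float64)
    lightness = lab[..., 0]
    chroma = np.hypot(lab[..., 1], lab[..., 2])

    sigma_c = chroma.std()                        # chroma standard deviation
    low, high = np.percentile(lightness, [1, 99])
    con_l = high - low                            # luminance contrast
    # Saturation as chroma over lightness; pixels with L == 0 count as 0.
    saturation = np.divide(chroma, lightness,
                           out=np.zeros_like(chroma), where=lightness > 0)
    mu_s = saturation.mean()                      # mean saturation

    return float(C1 * sigma_c + C2 * con_l + C3 * mu_s)
```

For a perfectly flat image the first two terms vanish and only the saturation term remains, which shows why the metric rewards outputs with a wide spread of chroma and lightness.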

4.2. Comparison with State-of-the-Art Methods

To systematically evaluate the performance of the SARM framework, this paper conducts a comprehensive comparison with 13 state-of-the-art (SOTA) underwater image enhancement methods on the UIEB, EUVP, and UCCS datasets. The selected baselines cover the main lines of technical development in this field. They include classical traditional hybrid enhancement methods (e.g., Fusion [24]) and CNN-based multi-scale feature fusion networks (e.g., NU²Net [44], HFM [43], and the deep pyramid architecture PyUIE [45]). The comparison also incorporates models based on contrastive learning and semi-supervised strategies (e.g., Semi-UIR [46], HCLR [47], and HSR [48]), alongside heavy generative architectures and large vision models that have demonstrated robust performance in recent years: Transformer-based networks (e.g., Phaseformer [49], WWPF [25]), diffusion model-based approaches (e.g., DCD [50]), and models introducing cross-modal visual features (e.g., CLIP-UIE [15]). Finally, to assess the proposed framework against other State Space Models (SSMs), we additionally include the latest Mamba-based visual enhancement networks (e.g., D2Mamba [18] and FMambaIR [19]).

4.2.1. Quantitative Analysis

The objective quantitative comparison results of different methods on three benchmark datasets (UIEB, UCCS, and EUVP) are summarized in Table 1. On the URanker metric, which emphasizes alignment with human visual perception, SARM achieves the highest scores across all three datasets. On the CCF metric, which evaluates comprehensive color fidelity, SARM ranks first on the UCCS and EUVP datasets and remains highly competitive on UIEB. This indicates that the multi-color space illumination estimation mechanism can effectively adapt to optical attenuation in unknown waters, demonstrating robust generalization capabilities.

It should be noted that certain comparative algorithms, represented by WWPF, CLIP-UIE, and HFM, achieve higher scores on traditional no-reference metrics such as FDUM or UCIQE. From the perspective of computational mechanisms, these metrics rely heavily on chroma variance and local contrast. Consequently, they are easily catered to by over-enhancement processing that drastically stretches pixel distributions, which frequently introduces local overexposure and unnatural color artifacts in complex, real-world marine scenes. Furthermore, research by Guo et al. [44] explicitly points out that traditional evaluation metrics based on handcrafted features and statistical regression often fail to accurately reflect genuine human visual perception quality.

Considering the physical characteristics of underwater imaging, SARM introduces the Retinex prior as a strict physical constraint. It focuses on restoring natural scene tones that conform to physical laws rather than blindly pursuing the numerical maximization of specific metrics. Therefore, the slight concessions made by our method on certain traditional metrics represent a reasonable trade-off to avoid color distortion caused by over-enhancement. Overall, the quantitative and qualitative results demonstrate that SARM strikes a more reasonable balance between enhancing image visibility and maintaining physical fidelity.

4.2.2. Computational Efficiency Analysis

To verify the practical application value of the proposed framework, Table 2 quantitatively compares the parameter counts and FLOPs of different models at a $256 \times 256$ resolution. SARM, with 8.10 M parameters and 37.85 G FLOPs, maintains a moderate computational scale. Compared to heavy models such as DCD and CLIP-UIE, whose parameters approach one hundred million, and HCLR, which has an immense computational load, SARM’s theoretical overhead achieves an optimized balance between computational resources and model scale.

Although lightweight models like Semi-UIR and D2Mamba have lower theoretical parameter counts, such compression of model capacity often comes at the expense of color reconstruction accuracy and generalization capability in complex underwater scenes. This limitation is evidenced by the image quality metrics in Table 1. The reason SARM maintains its current parameter scale is its deep integration of Retinex physical priors and multi-color space feature extraction mechanisms, which are indispensable for high-fidelity color restoration. This indicates that in underwater image enhancement tasks, reasonable physical modeling overhead and balanced model capacity are necessary conditions for maintaining robust color mapping capabilities.

To further confirm the architectural efficiency of our framework, we conducted a controlled empirical study under identical experimental conditions using three baseline models based on CNN, Transformer, and Mamba architectures; detailed comparative analysis is provided in Section 4.3. Benefiting from the $O(HW)$ linear complexity advantage of State Space Models (SSMs) in processing global context, SARM achieves an actual inference speed of 136.52 FPS. Overall, while maintaining high-quality enhancement results, SARM offers superior advantages in flexible deployment, meeting the requirements of underwater robotic platforms for real-time visual perception tasks.
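An inference-speed figure such as the one quoted above is typically obtained by timing repeated forward passes. The sketch below shows one generic way to do this; the helper `measure_fps`, the warm-up and run counts, and the stand-in workload are assumptions for illustration, not the measurement protocol used in the paper.

```python
import time


def measure_fps(model_fn, sample, warmup: int = 5, runs: int = 50) -> float:
    """Average frames per second of model_fn(sample) over `runs` timed calls."""
    for _ in range(warmup):           # warm-up calls are excluded from timing
        model_fn(sample)
    start = time.perf_counter()
    for _ in range(runs):
        model_fn(sample)
    elapsed = time.perf_counter() - start
    return runs / elapsed


if __name__ == "__main__":
    # Stand-in workload: a hypothetical "model" that sums a 256 x 256 frame.
    frame = [[0.5] * 256 for _ in range(256)]
    fps = measure_fps(lambda x: sum(map(sum, x)), frame)
    print(f"{fps:.2f} FPS")
```

When the model runs on a GPU, the device should be synchronized before each timer reading (for example with `torch.cuda.synchronize()` in PyTorch); otherwise asynchronous kernel launches make the measured rate look higher than it is.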

4.2.3. Qualitative Analysis

In practical underwater visual tasks, intuitive visual quality assessment is the primary criterion for verifying the engineering reliability of algorithms. This section first conducts a qualitative evaluation of the enhancement effects of different methods through multi-scene visualization results and underlying pixel distributions. Combined with the comparison results in Figure 3, Figure 4 and Figure 5, the proposed method demonstrates advantages primarily in the following three aspects:

Accurate Color Correction and Artifact Suppression: When processing highly heterogeneous color-cast waters, existing methods frequently exhibit insufficient correction or introduce unnatural tones. For example, Phaseformer and NU²Net still retain a certain cyan-green color cast when processing blue-cast images, while Fusion and HFM significantly improve contrast but introduce reddish or purplish color artifacts in the background water body or on rocks (e.g., the lobster background in Figure 5). In the scenes in the 5th and 6th rows of Figure 4, reddish or orange tones also appear in the results of DCD and CLIP-UIE. In contrast, SARM utilizes multi-color space illumination estimation and the Retinex mechanism to separate the selective absorption effect of the water body. This enables SARM to suppress color shifts while restoring the grayish-white color of rocks and the white sand background.
Dehazing and Detail Enhancement: Backscattering in highly turbid waters drastically reduces image visibility. Conventional methods (such as Phaseformer and SHR) often exhibit incomplete dehazing and an overall dark visual effect in such scenes (e.g., the diver scene in the 2nd row of Figure 3). Meanwhile, models constrained by local receptive fields are prone to losing texture details when processing complex topological structures. Leveraging the global context modeling capability of the State Space Model (Mamba) combined with the dynamic weight adjustment of the Scene-Aware Adapter (SAA), SARM eliminates the global fogging phenomenon. As can be seen from the 4th row of Figure 3, the model maintains smooth spatial transitions while improving the distinguishability of high-frequency textures on rock surfaces.
Physical Realism and Over-exposure Suppression: In shallow-water highlight or complex light source scenes, some deep models lacking physical constraints are prone to highlight clipping when forcibly stretching local contrast (e.g., the over-exposure phenomenon of CLIP-UIE in certain regions). Due to the introduction of prior-based physical pseudo-labels to provide stable regularization supervision, SARM better follows the distribution laws of natural underwater illumination when brightening dark details (e.g., the dark parts of the lobster), avoiding over-saturation and local whitening. This visual performance also lays the foundation for the subsequent improvement of comprehensive perceptual metrics.

From the perspective of underlying pixel distributions, the RGB color histograms in Figure 6 visually reflect the color restoration performance of various methods. Due to the spectral absorption of the water column, the red light energy in the raw images is severely attenuated. Observing the histograms in the fourth row, it is evident that the enhancement results of baseline methods such as Phaseformer, DCD, and Fusion exhibit a high degree of overlap among the three-channel curves. However, this absolute cross-channel overlap is typically a byproduct of overfitting the Gray World Assumption. This comes at the expense of sacrificing the local color richness of the image, resulting in unnatural visual degradation toward grayscale (as illustrated in the third row of Figure 6).

In contrast, SARM presents a more rational pixel distribution profile. It not only effectively broadens the dynamic range of the R channel to compensate for the red light attenuation but also ensures that the RGB curves maintain independent and continuous envelope shapes across the [0, 255] interval, without being forcibly squeezed into overlap. This distributional characteristic demonstrates that, while effectively correcting the underwater color cast, SARM thoroughly preserves the original natural color gradients of the scene, thereby avoiding the color homogenization induced by over-enhancement.
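The degree of cross-channel overlap discussed in the histogram analysis can be quantified with a simple measure. The sketch below uses the mean pairwise histogram intersection of the R, G, and B channels; this measure and the function names are assumptions introduced for illustration and are not part of the paper's evaluation.

```python
import numpy as np


def channel_histograms(img: np.ndarray, bins: int = 256) -> np.ndarray:
    """Return a 3 x bins array of normalized R, G, B histograms (uint8 image)."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    hists = np.asarray(hists, dtype=np.float64)
    return hists / hists.sum(axis=1, keepdims=True)


def mean_pairwise_overlap(img: np.ndarray) -> float:
    """Mean histogram intersection over the pairs (R, G), (R, B), (G, B).

    1.0 means the three curves coincide; 0.0 means they never overlap.
    """
    h = channel_histograms(img)
    pairs = [(0, 1), (0, 2), (1, 2)]
    return float(np.mean([np.minimum(h[i], h[j]).sum() for i, j in pairs]))
```

A pure grayscale image scores 1.0 under this measure, so values very close to 1.0 on a natural scene would be consistent with the gray-world style flattening described above, while a raw image with a heavily attenuated red channel would score lower.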

Overall, qualitative and quantitative experimental results demonstrate that the proposed SARM method can effectively enhance image quality across various underwater scenarios, verifying the effectiveness of the proposed framework in practical engineering applications.

4.3. Ablation Study

To verify the effectiveness of each module in the SARM framework and their contributions to the “prior-data dual-driven” design, we conduct an orthogonal ablation study on the UIEB and EUVP datasets. The experiments primarily investigate two core issues: first, verifying whether the Mamba architecture can effectively model long-range spatial dependencies while maintaining low inference latency; second, analyzing the changes in the model’s robustness against specific degradations, such as spectral distortion and non-uniform fogging, after removing specific prior-guided components (e.g., the Illumination Estimator, the Scene-Aware Adapter, and pseudo-labels). The experiments adopt a leave-one-out strategy, which involves stripping away the core modules one by one from the complete model. The evaluation dimensions cover signal fidelity (FDUM, PSNR, SSIM), color restoration (UCIQE, CCF), perceptual quality (UIQM, URanker), and inference efficiency (FPS). Detailed data are presented in Table 3 and visual results are shown in Figure 7.

4.3.1. Architecture Analysis

This section primarily investigates the impacts of different backbone networks on the system’s foundational performance and inference efficiency. Under a pure backbone network setting without any auxiliary physical modules loaded (Rows 1–3), we compare the Base-Mamba adopted in this paper with representative CNN and Transformer baseline architectures. As shown in Table 3, although the Transformer-based baseline model (Baseline-Trans) achieves 0.6068 on the signal fidelity metric FDUM, outperforming the traditional CNN architecture (0.5996), this comes at the cost of a significantly increased computational load. Because the self-attention mechanism introduces a computational complexity of $O((HW)^{2})$ that grows quadratically with image resolution, its inference speed is only 31.29 FPS, making it difficult to meet the real-time requirements of edge devices such as Autonomous Underwater Vehicles (AUVs).

In contrast, Base-Mamba introduces the linear complexity characteristic $O(HW)$ of State Space Models (SSMs). Under the same hardware environment, it achieves an inference rate of 142.35 FPS, which is approximately 4.5 times that of the Transformer architecture and also exceeds that of the lightweight CNN (115.42 FPS). However, the Base-Mamba without physical constraints performs relatively weakly on the CCF (25.65) and FDUM (0.5583) metrics, indicating that a purely data-driven SSM architecture lacks sufficient inductive bias when processing underwater degradation.
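As a worked check of the figures quoted in this subsection, the arithmetic below compares the two asymptotic terms at the $256 \times 256$ input size and the measured frame rates. It counts only the leading-order term in the number of pixels and ignores constants, channel widths, and all other layers, which is an assumption of this sketch.

```latex
\begin{align*}
HW &= 256 \times 256 = 65{,}536, \\
(HW)^{2} &= 65{,}536^{2} = 4{,}294{,}967{,}296 \approx 4.29 \times 10^{9}, \\
\frac{(HW)^{2}}{HW} &= 65{,}536, \\
\frac{142.35\ \text{FPS}}{31.29\ \text{FPS}} &\approx 4.55, \qquad
\frac{142.35\ \text{FPS}}{115.42\ \text{FPS}} \approx 1.23 .
\end{align*}
```

The measured speed-up is far smaller than the asymptotic ratio, which is plausible because the quadratic term is only one contributor to the total cost of a network at this resolution.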

4.3.2. Ablation of Core Modules

Based on the Base-Mamba backbone network, this section quantitatively analyzes the specific contributions of different components to network performance by individually stripping each physical constraint module from the Full Model.

(1) Illumination Estimator: After removing this module (w/o Retinex, Row 4), the model’s frequency-domain uniformity metric FDUM decreases from 0.6340 to 0.6219. This decline indicates that, without an explicit illumination decomposition mechanism, the network’s ability to process non-uniform underwater illumination fields (such as local light spots or depth attenuation) is somewhat limited. By extracting illumination priors, the illumination estimation module assists the network in maintaining the global structure of the image to a certain extent, alleviating local over-exposure or the loss of texture in dark regions.
(2) Scene-Aware Adapter: When this adapter is removed (w/o Scene, Row 5), the color correlation metric CCF decreases from 35.76 in the Full Model to 27.45. This performance degradation indicates that static, fixed loss weights have limited generalization capability when dealing with highly heterogeneous natural water body distributions. Facing underwater environments with different turbidities and color cast tendencies, a single optimization objective struggles to achieve an adaptive balance between dehazing intensity and color correction. By constructing a dynamic mapping between degradation features and loss weights, this adapter enhances the network’s targeted adjustment capability for complex scenes, thereby improving the final color restoration.
(3) Physical Pseudo-Label Constraints: In the absence of real paired data, the selection of supervision signals has a decisive impact on the generation results. After removing the physical pseudo-labels (w/o Pseudo, Row 6), the model’s CCF metric drops to 23.70, and the perceptual metric URanker also decreases from 2.491 to 2.044. This data comparison demonstrates that relying solely on conventional self-supervised adversarial learning is prone to producing uncontrolled color shifts and local distortions during the feature mapping process. Introducing pseudo-labels generated based on physical fusion mechanisms provides the network with relatively reliable physical regularization constraints. Consequently, while improving local contrast, it better maintains a color distribution that conforms to natural laws.
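The idea of mapping degradation features to loss weights, as described for the Scene-Aware Adapter, can be illustrated with a deliberately simple sketch. Everything here is hypothetical: the function `scene_loss_weights`, the three input indices, the weight names, and the fixed softmax are stand-ins chosen for this example, and the adapter's actual mapping is not specified in this section.

```python
import numpy as np


def scene_loss_weights(color_idx: float, lowlight_idx: float, blur_idx: float,
                       temperature: float = 1.0) -> dict:
    """Map three degradation indices to loss weights that sum to 1 (softmax).

    A larger index (stronger degradation of that kind) yields a larger weight
    for the corresponding loss term. Hypothetical illustration only.
    """
    scores = np.array([color_idx, lowlight_idx, blur_idx], dtype=np.float64)
    scores = scores / temperature
    scores -= scores.max()                  # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return dict(zip(("color", "illumination", "sharpness"), weights.tolist()))
```

With a strongly color-cast but otherwise clean input, for example `scene_loss_weights(2.0, 0.5, 0.5)`, most of the weight moves to the color term, whereas a static scheme would apply the same split to every image.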

Synthesizing all ablation results, the proposed full model (SARM (Full), Row 7) achieves superior performance across multiple key metrics (e.g., FDUM 0.6340, CCF 35.76) while maintaining an inference rate of 136.52 FPS.
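To put the ablation numbers on a common scale, the short script below converts each quoted pair of values into a relative drop with respect to the full model. The values are copied from the text above; the dictionary layout and the helper `relative_drop` are introduced only for this example.

```python
# Metric values quoted in the ablation text: (full model, ablated variant).
ABLATION = {
    "w/o Retinex, FDUM": (0.6340, 0.6219),
    "w/o Scene, CCF": (35.76, 27.45),
    "w/o Pseudo, CCF": (35.76, 23.70),
    "w/o Pseudo, URanker": (2.491, 2.044),
}


def relative_drop(full: float, ablated: float) -> float:
    """Fractional decrease of a higher-is-better metric relative to the full model."""
    return (full - ablated) / full


if __name__ == "__main__":
    for name, (full, ablated) in ABLATION.items():
        print(f"{name}: {100 * relative_drop(full, ablated):.1f}% lower")
```

Expressed this way, removing the illumination estimator lowers FDUM by roughly 2%, removing the Scene-Aware Adapter lowers CCF by roughly 23%, and removing the pseudo-labels lowers CCF by roughly 34% and URanker by roughly 18%.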

4.4. Application Study

We apply SARM as a preprocessing technique in downstream tasks such as underwater image keypoint detection, edge recognition, and pixel-level segmentation. The relevant visual comparisons and statistical results are shown in Figure 8, Figure 9 and Figure 10.

First, we extract feature points from the raw and enhanced underwater images using the Scale-Invariant Feature Transform (SIFT) algorithm. Because the model outputs images at a fixed size of $256 \times 256$ pixels, the absolute number of feature points is inherently limited. Even so, in highly turbid raw images (e.g., the underwater arch scene in the 4th column), the algorithm detects only 191 feature points, whereas after enhancement by SARM the recovered high-frequency details allow the feature points to cover the arch structure, and the total rises to 761. Similarly, the feature points for the diver and measurement grid in the 1st column increase from 637 to 1344. This indicates that the proposed framework can provide more abundant keypoints for downstream feature matching tasks. In Canny edge detection, edges extracted from raw images often exhibit obvious discontinuities or even complete loss (e.g., the raw arch edges in the 4th column are almost entirely black). In contrast, the enhancement results of SARM yield more coherent and complete geometric contours. As can be seen from the enhanced edge maps, the straight boundaries of the seabed grid (1st column), the skeletal structure of the shipwreck (2nd column), and the independent contours of the fish school (5th column) are all clearly extracted. These details further support the advantage of the model in recovering high-frequency geometric features of underwater images.

To further evaluate the impact of this method on pixel-level classification tasks, we perform Fuzzy C-Means (FCM) clustering segmentation on the raw images and the enhanced results. As shown in Figure 10, due to severe blue-green color casts and low contrast, the segmentation results of the raw images exhibit obvious regional merging: the diver in the 1st column is difficult to distinguish from the background water body, the shipwreck and arch structures in the 2nd and 4th columns are lost over large areas, and the clay pots in the 3rd column blend into the sandy seabed. In contrast, SARM restores the color and local contrast of the images, enabling the segmentation algorithm to accurately separate the diver’s contour and restore the hollow structure of the arch. When processing the dense clay pots and fish schools in the 3rd and 5th columns, the boundaries of the enhanced segmentation maps are clearer, and the target merging phenomenon is significantly reduced. This demonstrates that the proposed framework can significantly improve underwater image quality, providing a more reliable preprocessing input for downstream visual tasks.
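For readers unfamiliar with Fuzzy C-Means, the sketch below shows the standard alternating update of cluster centers and soft memberships, applied to pixel colors. It is a generic NumPy illustration; the fuzzifier `m = 2`, the number of clusters, the random initialization, and the function names are assumptions, since this section does not report the settings used for the segmentation experiment.

```python
import numpy as np


def fuzzy_c_means(x: np.ndarray, n_clusters: int, m: float = 2.0,
                  n_iter: int = 100, tol: float = 1e-6, seed: int = 0):
    """Cluster the rows of x (n_samples x n_features) with Fuzzy C-Means.

    Returns (centers, memberships); memberships has shape
    n_samples x n_clusters and each row sums to 1.
    """
    x = np.asarray(x, dtype=np.float64)
    rng = np.random.default_rng(seed)
    u = rng.random((x.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)             # random initial memberships

    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0)[:, None]
        # Distance of every sample to every center (epsilon avoids division by 0).
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centers, u


def segment_image(img: np.ndarray, n_clusters: int = 3) -> np.ndarray:
    """Label each pixel of an H x W x 3 image with its highest-membership cluster."""
    pixels = img.reshape(-1, img.shape[-1])
    _, u = fuzzy_c_means(pixels, n_clusters)
    return u.argmax(axis=1).reshape(img.shape[:2])
```

Because memberships depend only on color distance, regions whose colors are compressed by a strong cast end up with similar memberships and merge, which matches the regional merging described for the raw images.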

5. Conclusions

To address the complex degradation issues of underwater images, this paper proposes a prior-guided and scene-aware underwater image enhancement framework, known as SARM. This network integrates the Retinex theory with the State Space Model (SSM), achieving global feature aggregation while maintaining a linear computational complexity of $O(HW)$. In terms of specific mechanisms, the combination of multi-color space illumination estimation and the dual-track Mamba Denoiser effectively decouples heterogeneous color casts and suppresses spatially variant noise. Furthermore, the Scene-Aware Adapter (SAA), combined with physical pseudo-labels, achieves dynamic loss scheduling and feature gating, enhancing the model’s generalization performance in unknown dynamic waters.

To verify the effectiveness of the proposed model, this paper conducts detailed ablation studies and comprehensive comparative experiments. The leave-one-out ablation analysis confirms the necessity of each core module in handling composite degradation. Quantitative results on the UIEB, EUVP, and UCCS datasets demonstrate that SARM outperforms existing methods in terms of color restoration (CCF 35.76) and perceptual quality (URanker 2.491), while possessing a high inference speed of 136.52 FPS. Moreover, in downstream vision task evaluations such as SIFT feature matching, Canny edge extraction, and FCM pixel segmentation, the images enhanced by SARM effectively improve the structural discriminability of targets, verifying the engineering feasibility of this framework as a real-time visual preprocessing front-end for edge devices.

Although this framework demonstrates favorable robustness in single-frame image restoration, certain limitations remain. First, the Scene-Aware Adapter currently relies on manually quantized statistical degradation indices (e.g., color, low-light, and blur indices). Its representational generalization capability may be limited when facing non-linear distortions such as extreme light sources in the deep sea or severe turbidity. Second, the current model has not been optimized for continuous video streams. When dealing with complex scenes containing water surface flickering or dynamic light sources, the adaptive mechanism relying solely on spatial features is prone to inducing transient correction deviations. Future research will explore the introduction of learnable implicit degradation representations to replace explicit physical quantities, thereby improving the model’s adaptability to long-tailed degraded samples. Concurrently, we plan to extend the framework to video-level enhancement tasks, attempting to introduce temporal state transitions and cross-frame consistency constraints to resolve the inter-frame discontinuity issues caused by dynamic water flow disturbances.

Author Contributions

Conceptualization, Z.F., S.Y., A.S., R.X., and N.C.; methodology, Z.F., S.Y. and N.C.; software, Z.F.; validation, Z.F.; formal analysis, Z.F., S.Y., and N.C.; investigation, Z.F.; resources, A.S., R.X., and N.C.; data curation, Z.F.; writing—original draft preparation, Z.F.; writing—review and editing, Z.F., S.Y., and N.C.; visualization, Z.F.; supervision, A.S., R.X., and N.C.; project administration, N.C.; funding acquisition, A.S., R.X., and N.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Science and Technology Project of the Changjiang River Administration of Navigational Affairs (MOT) (Grant No. 2025-020-6-Z-Y), the Key Research and Development Project of Hubei Province (Grant No. 2024BCB101), and the Science and Technology Project of Hubei Provincial Department of Transportation (Grant No. 2024-81-3-3).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their valuable comments and suggestions for improving the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest. Author Rongjun Xiong declares no commercial or financial ties that could influence the objectivity of this work.

References

Hsieh, Y.Z.; Chang, M.C. Underwater image enhancement and attenuation restoration based on depth and backscatter estimation. IEEE Trans. Comput. Imaging 2025, 11, 321–332.
Liang, Y.; Li, L.; Zhou, Z.; Tian, L.; Xiao, X.; Zhang, H. Underwater image enhancement via adaptive bi-level color-based adjustment. IEEE Trans. Instrum. Meas. 2025, 74, 5018916.
Zhang, W.; Liu, Q.; Lu, H.; Wang, J.; Liang, J. Underwater image enhancement via wavelet decomposition fusion of advantage contrast. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 7807–7820.
Liu, S.; Zheng, Y.; Li, J.; Lu, H.; An, D.; Shen, Z.; Wang, Z. Turbid underwater image enhancement with illumination-constrained and structure-preserved retinex model. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 10844–10861.
Chang, H.H.; Kuan, P.Y. Underwater image enhancement using illuminant intensity compensation with foreground edge map rectification. IEEE J. Ocean. Eng. 2025, 50, 835–850.
Zhou, J.; Wang, S.; Lin, Z.; Jiang, Q.; Sohel, F. A pixel distribution remapping and multi-prior retinex variational model for underwater image enhancement. IEEE Trans. Multimed. 2024, 26, 7838–7849.
Kong, D.; Zhang, Y.; Zhao, X.; Wang, Y.; Cai, L. MUFFNet: Lightweight dynamic underwater image enhancement network based on multi-scale frequency. Front. Mar. Sci. 2025, 12, 1541265.
Wu, Z.; Ji, P.; Chen, K.; Gao, F.; Zhao, H.; Sun, X. MSCT: Multi-Scale Conv-Transformer for Underwater Image Enhancement. IEEE Multimed. 2025, 32, 105–114.
Lu, L.; Wu, D.; Wang, L.; Zhang, W.; Liu, T. Underwater image enhancement based on transformer, attention and multi-color-space inputs. IEEE Access 2025, 13, 103682–103696.
Liu, X.; Xu, H.; Ju, Y.; Wang, S.; Liu, C.; Chen, L. DA-GAN: Dual-Attention GAN for Underwater Image Enhancement with Contrast and Color Correction. IEEE Trans. Geosci. Remote Sens. 2026, 64, 4200816.
Kong, D.; Mao, J.; Zhang, Y.; Zhao, X.; Wang, Y.; Wang, S. Dual-Domain Adaptive Synergy GAN for Enhancing Low-Light Underwater Images. J. Mar. Sci. Eng. 2025, 13, 1092.
Bi, H.; Chen, L.; Cao, J.; Wang, J.; Sun, J.; Rao, Y.; Dong, J. SeaDiff: Underwater image enhancement with degradation-aware diffusion model. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 12212–12226.
Ou, Y.; Esmaeilzehi, A.; Ahmad, M.O.; Swamy, M. UADiff: A deep underwater image enhancement network using generative diffusion prior and uncertainty-aware learning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4208114.
Cao, J.; Zeng, Z.; Zhang, X.; Zhang, H.; Fan, C.; Jiang, G.; Lin, W. Unveiling the underwater world: CLIP perception model-guided underwater image enhancement. Pattern Recognit. 2025, 162, 111395.
Liu, S.; Li, K.; Ding, Y.; Qi, Q. Underwater image enhancement by diffusion model with customized clip-classifier. Pattern Recognit. 2025, 112232.
Zhang, Y.; Yu, X.; Cai, Z. Uwmambanet: Dual-branch underwater image reconstruction based on w-shaped mamba. Mathematics 2025, 13, 2153.
Fang, Y.; Sun, H.; Li, Y.; Yuan, S.; Zhao, F. Symmetry-Constrained Dual-Path Physics-Guided Mamba Network: Balancing Performance and Efficiency in Underwater Image Enhancement. Symmetry 2025, 17, 1742.
Pramanick, A.; Roy, S.; Sur, A. D2Mamba: Dual Domain Guided Informed Search in State Space Model for Underwater Image Enhancement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 6–10 March 2026; pp. 7126–7136.
Luan, X.; Fan, H.; Wang, Q.; Yang, N.; Liu, S.; Li, X.; Tang, Y. FMambaIR: A hybrid state-space model and frequency domain for image restoration. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4201614.
Jaffe, J.S. Underwater optical imaging: The past, the present, and the prospects. IEEE J. Ocean. Eng. 2014, 40, 683–700.
Duntley, S.Q. Light in the sea. J. Opt. Soc. Am. 1963, 53, 214–233.
Drews, P.; Nascimento, E.; Moraes, F.; Botelho, S.; Campos, M. Transmission estimation in underwater single images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 1–8 December 2013; pp. 825–830.
Akkaynak, D.; Treibitz, T. Sea-thru: A method for removing water from underwater images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1682–1691.
Ancuti, C.O.; Ancuti, C.; De Vleeschouwer, C.; Bekaert, P. Color balance and fusion for underwater image enhancement. IEEE Trans. Image Process. 2017, 27, 379–393.
Zhang, W.; Zhou, L.; Zhuang, P.; Li, G.; Pan, X.; Zhao, W.; Li, C. Underwater image enhancement via weighted wavelet visual perception fusion. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 2469–2483.
Zhao, G.; Xiao, Y.; Huang, C.; Wang, Z.; Wu, H. Underwater image enhancement via adaptive white-balancing and multi-restoration image fusion. Opt. Rev. 2025, 32, 76–92.
Li, T.; Rong, S.; Zhao, W.; Chen, L.; Liu, Y.; Zhou, H.; He, B. Underwater image enhancement using adaptive color restoration and dehazing. Opt. Express 2022, 30, 6216–6235.
Li, C.; Anwar, S.; Hou, J.; Cong, R.; Guo, C.; Ren, W. Underwater image enhancement via medium transmission-guided multi-color space embedding. IEEE Trans. Image Process. 2021, 30, 4985–5000.
Islam, M.J.; Xia, Y.; Sattar, J. Fast underwater image enhancement for improved visual perception. IEEE Robot. Autom. Lett. 2020, 5, 3227–3234.
Peng, L.; Zhu, C.; Bian, L. U-shape transformer for underwater image enhancement. IEEE Trans. Image Process. 2023, 32, 3066–3079.
Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 17683–17693.
Li, T.; Rong, S.; Chen, L.; Zhou, H.; He, B. Underwater motion deblurring based on cascaded attention mechanism. IEEE J. Ocean. Eng. 2022, 49, 262–278.
Yan, H.; Zhang, Z.; Xu, J.; Wang, T.; An, P.; Wang, A.; Duan, Y. UW-CycleGAN: Model-driven CycleGAN for underwater image restoration. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4207517.
Zhang, L.; Chen, Y.; Lan, J.; Niu, Y. MSSCE-GAN: Multi-Scale Structural and Color Enhanced Generative Adversarial Network for Unpaired Underwater Image Enhancement. In Proceedings of the 2023 5th International Conference on Frontiers Technology of Information and Computer (ICFTIC), Qingdao, China, 17–19 November 2023; pp. 837–841.
Guan, M.; Xu, H.; Jiang, G.; Yu, M.; Chen, Y.; Luo, T.; Zhang, X. DiffWater: Underwater image enhancement based on conditional denoising diffusion probabilistic model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 2319–2335.
Song, J.; Xu, H.; Jiang, G.; Yu, M.; Chen, Y.; Luo, T.; Song, Y. Frequency domain-based latent diffusion model for underwater image enhancement. Pattern Recognit. 2025, 160, 111198.
Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396.
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063.
Hou, X.; Zhang, L. Saliency detection: A spectral residual approach. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
Sauvola, J.; Pietikäinen, M. Adaptive document image binarization. Pattern Recognit. 2000, 33, 225–236.
Guo, X.; Li, Y.; Ling, H. LIME: Low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 2016, 26, 982–993.
Reinhard, E.; Stark, M.; Shirley, P.; Ferwerda, J. Photographic tone reproduction for digital images. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2; ACM: New York, NY, USA, 2023; pp. 661–670.
An, S.; Xu, L.; Deng, Z.; Zhang, H. HFM: A hybrid fusion method for underwater image enhancement. Eng. Appl. Artif. Intell. 2024, 127, 107219.
Guo, C.; Wu, R.; Jin, X.; Han, L.; Zhang, W.; Chai, Z.; Li, C. Underwater ranker: Learn which is better and how to be better. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 702–709.
Jiang, W.; Tan, Y.; Qiu, Z.; Wang, Z.; Yu, Y.; Jiang, Q. PyUIE: A Coarse-to-Fine Deep Pyramid Network for Underwater Image Enhancement. IEEE Trans. Multimed. 2026, 28, 3054–3067.
Huang, S.; Wang, K.; Liu, H.; Chen, J.; Li, Y. Contrastive semi-supervised learning for underwater image restoration via reliable bank. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18145–18155.
Zhou, J.; Sun, J.; Li, C.; Jiang, Q.; Zhou, M.; Lam, K.M.; Zhang, W.; Fu, X. HCLR-Net: Hybrid contrastive learning regularization with locally randomized perturbation for underwater image enhancement. Int. J. Comput. Vis. 2024, 132, 4132–4156. [Google Scholar] [CrossRef]
Yu, M.; Shen, L.; Yu, Y.; Zhang, Y.; Le, R. Task-Driven Underwater Image Enhancement via Hierarchical Semantic Refinement. IEEE Trans. Image Process. 2026, 35, 42–56. [Google Scholar] [CrossRef]
Khan, R.; Negi, A.; Kulkarni, A.; Phutke, S.S.; Vipparthi, S.K.; Murala, S. Phaseformer: Phase-based attention mechanism for underwater image restoration and beyond. In Proceedings of the Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 26 February–6 March 2025; pp. 9600–9611. [Google Scholar]
Fan, G.; Zhou, Y.; Zhou, J.; Ju, Y.; Chen, G.Y.; Li, J.; Kot, A.C. DCD-UIE: Decoupled Chromatic Diffusion Model for Underwater Image Enhancement. IEEE Trans. Image Process. 2026, 35, 449–464. [Google Scholar] [CrossRef]

Figure 1. The overall network architecture of the proposed SARM framework.

Figure 2. Internal network structure of the Interleaved Group Attention Block (IGAB), which integrates the illumination prior and performs global contextual scanning via the 2D Selective Scan (SS2D) mechanism.

Figure 3. Qualitative comparisons of different methods on the UIEB benchmark dataset. The raw images exhibit significant contrast degradation and detail loss. Phaseformer and CLIP-UIE still suffer from residual noise in the dark, deep-water regions. In contrast, SARM more thoroughly removes the global haze effect. Moreover, while preserving the natural background tones, it significantly enhances high-frequency texture details, such as the underwater rocks (rows 3 and 4) and the measurement grid (row 1).

Figure 4. Qualitative comparison results on the UCCS color cast dataset, which features scenes with extreme blue and green color casts. On such scenes, DCD and Fusion are prone to introducing unnatural reddish or orange artifacts. SARM suppresses these heterogeneous color casts more effectively and largely restores the natural colors of marine organisms and reefs, demonstrating more robust tonal stability.

Figure 5. Visual enhancement comparisons of different methods on the EUVP dataset. In complex scenes featuring fish schools (row 2) and lobsters (row 3), SARM, leveraging the dynamic adjustment of the Scene-Aware Adapter (SAA), illuminates details in dark regions while effectively suppressing highlight clipping. Furthermore, SARM strikes a more reasonable balance between object edge fidelity and color saturation, generating images that align more closely with human subjective visual perception.

Figure 6. Comparison of RGB color histogram distributions for different enhancement methods. The first and third rows show examples of enhanced images, while the second and fourth rows display the corresponding pixel distribution curves for the three channels. The raw underwater images exhibit significant red-light attenuation. Phaseformer tends to forcibly align the three channels during pixel stretching, resulting in a loss of natural color and a grayscale appearance, or unnatural truncation peaks in the histograms. In contrast, SARM effectively compensates for the energy in the red channel while preserving reasonable morphological differences between the three channels, maintaining a natural color presentation.
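
The per-channel comparison described in the Figure 6 caption can be illustrated with a minimal sketch that computes RGB histograms and channel means for a handful of pixels. The pixel values, the bin count, and the function names below are hypothetical and are not taken from the paper's evaluation code.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def channel_histograms(pixels: List[Pixel], bins: int = 4) -> List[List[int]]:
    """Return one histogram per channel (R, G, B) with `bins` equal-width bins."""
    width = 256 // bins
    hists = [[0] * bins for _ in range(3)]
    for px in pixels:
        for c, value in enumerate(px):
            # Clamp so that the largest values fall into the last bin.
            hists[c][min(value // width, bins - 1)] += 1
    return hists


def channel_means(pixels: List[Pixel]) -> List[float]:
    """Return the mean intensity of each channel."""
    n = len(pixels)
    return [sum(px[c] for px in pixels) / n for c in range(3)]


# Hypothetical toy "raw underwater" pixels: red is strongly attenuated.
raw = [(20, 120, 160), (30, 140, 180), (10, 100, 150), (40, 130, 200)]

hists = channel_histograms(raw)
means = channel_means(raw)
print(hists)  # red mass sits in the lowest bin
print(means)  # red mean is far below green and blue
```

In this toy input the red histogram is concentrated in the lowest bin while green and blue occupy higher bins, which is the kind of red-channel deficit the caption attributes to raw underwater images.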

Figure 7. Qualitative comparison of ablation studies on the core components of the SARM framework.

Figure 8. Impact of enhancement on the SIFT keypoint detection task. Compared to raw underwater images, SARM significantly increases the number of valid extracted feature points and accurately covers target structures.

Figure 9. Comparison of Canny edge detection results. SARM enables the extraction of more coherent and complete geometric contours, verifying its effectiveness in preserving high-frequency structural features.

Figure 10. Comparison of pixel-level clustering segmentation results based on Fuzzy C-Means (FCM). By restoring image color and contrast, SARM enables the algorithm to isolate target contours more accurately and significantly improves segmentation boundaries.
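
For readers unfamiliar with the clustering used in Figure 10, the following is a minimal sketch of the textbook Fuzzy C-Means update rules applied to one-dimensional pixel intensities. The fuzzifier `m = 2`, the two initial centers, the iteration count, and the sample intensities are all assumptions made for illustration; the settings and features used in the paper's application study are not stated in this section.

```python
from typing import List, Tuple


def fuzzy_c_means(
    data: List[float],
    init_centers: List[float],
    m: float = 2.0,
    iters: int = 50,
) -> Tuple[List[float], List[List[float]]]:
    """Textbook Fuzzy C-Means on 1-D data.

    Returns (centers, memberships), where memberships[k][i] is the degree to
    which sample k belongs to cluster i. Each membership row sums to 1.
    """
    centers = list(init_centers)
    eps = 1e-12  # avoids division by zero when a sample sits on a center
    power = 2.0 / (m - 1.0)
    memberships: List[List[float]] = []
    for _ in range(iters):
        # Membership update: u_ki = 1 / sum_j (d_ki / d_kj)^(2 / (m - 1))
        memberships = []
        for x in data:
            dists = [abs(x - c) + eps for c in centers]
            row = [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]
            memberships.append(row)
        # Center update: mean of the data weighted by u_ki^m
        centers = [
            sum((u[i] ** m) * x for u, x in zip(memberships, data))
            / sum(u[i] ** m for u in memberships)
            for i in range(len(centers))
        ]
    return centers, memberships


# Hypothetical pixel intensities: a dark background and a bright target.
intensities = [0.10, 0.12, 0.15, 0.11, 0.80, 0.85, 0.82, 0.90]
centers, u = fuzzy_c_means(intensities, init_centers=[0.0, 1.0])

# Hard labels: assign each pixel to its highest-membership cluster.
labels = [max(range(len(row)), key=row.__getitem__) for row in u]
print(labels)
```

The better separated the intensity (or color) clusters are, the crisper the memberships become, which is consistent with the caption's point that restoring color and contrast helps the clustering isolate target contours.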

Table 1. Quantitative comparison results of different enhancement methods on the UIEB, UCCS, and EUVP benchmark datasets.

Methods	URanker ↑ (UIEB)	URanker ↑ (UCCS)	URanker ↑ (EUVP)	CCF ↑ (UIEB)	CCF ↑ (UCCS)	CCF ↑ (EUVP)	UCIQE ↑ (UIEB)	UCIQE ↑ (UCCS)	UCIQE ↑ (EUVP)	UIQM ↑ (UIEB)	UIQM ↑ (UCCS)	UIQM ↑ (EUVP)	FDUM ↑ (UIEB)	FDUM ↑ (UCCS)	FDUM ↑ (EUVP)
PyUIE	2.112	2.269	2.493	29.2996	29.4248	35.1891	0.6023	0.5976	0.6144	1.3508	1.3808	1.3881	0.5671	0.4702	0.5369
Phaseformer	1.441	0.272	1.527	18.1431	11.5598	17.9430	0.5773	0.4851	0.5791	1.1403	0.7416	1.0839	0.4462	0.1913	0.3394
Fusion	1.377	1.494	2.285	20.0503	20.9589	28.3084	0.5923	0.5407	0.5887	1.3454	1.2888	1.4045	0.5386	0.3278	0.4862
CLIP-UIE	2.371	2.214	2.712	34.8420	27.1263	40.8142	0.6227	0.5601	0.6305	1.4609	1.3152	1.4316	0.7246	0.4461	0.5893
DCD	1.232	1.565	2.667	29.8477	25.7823	27.3271	0.6201	0.5654	0.5867	1.2435	1.2636	1.3995	0.5972	0.4144	0.5950
HCLR	1.689	2.532	1.080	26.3193	19.5395	36.9285	0.6128	0.5288	0.6176	1.3600	1.1630	1.4106	0.6086	0.3581	0.5547
HFM	1.192	2.016	2.807	32.2691	32.9097	27.3271	0.6269	0.5741	0.5867	1.4786	1.4035	1.3995	0.6481	0.4311	0.5950
NU²Net	1.753	1.632	2.568	20.6933	20.1309	29.1227	0.5984	0.5534	0.6056	1.2588	1.2152	1.3886	0.5095	0.3697	0.5234
Semi-UIR	1.684	1.389	2.689	27.8894	21.7851	40.0093	0.6166	0.5537	0.6174	1.3860	1.2728	1.4713	0.6308	0.4130	0.6247
HSR	1.906	1.844	2.438	22.8217	19.8717	27.6385	0.5688	0.5085	0.5879	1.3574	1.0659	1.3480	0.5406	0.2354	0.4604
WWPF	2.050	2.179	2.818	38.7457	34.0076	51.6859	0.6142	0.5853	0.6175	1.5273	1.4581	0.4815	0.7047	0.5080	0.6390
D2Mamba	2.105	1.980	1.900	25.7562	22.9233	27.1353	0.6012	0.5565	0.5823	1.4937	1.3074	1.4097	0.6904	0.4027	0.5548
FmambaIR	2.260	1.869	2.576	32.9522	22.5091	37.1278	0.6194	0.5402	0.6271	1.5237	1.2084	1.4112	0.7825	0.3802	0.5697
SARM (Ours)	2.491	2.804	3.100	35.7609	38.2005	52.5657	0.6214	0.5615	0.6423	1.4758	1.4031	1.4491	0.6340	0.4033	0.5956

Note: Red, blue, and green indicate the top three performances for each metric, respectively. ↑ denotes that a higher value represents better image quality.
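
The top-three ranking mentioned in the note can be recomputed directly from the table values. The sketch below does this for one column, URanker on UIEB, using the scores copied from Table 1; the helper name `top_k` is hypothetical.

```python
from typing import Dict, List

# URanker scores on UIEB, copied from Table 1 (higher is better).
uranker_uieb: Dict[str, float] = {
    "PyUIE": 2.112,
    "Phaseformer": 1.441,
    "Fusion": 1.377,
    "CLIP-UIE": 2.371,
    "DCD": 1.232,
    "HCLR": 1.689,
    "HFM": 1.192,
    "NU²Net": 1.753,
    "Semi-UIR": 1.684,
    "HSR": 1.906,
    "WWPF": 2.050,
    "D2Mamba": 2.105,
    "FmambaIR": 2.260,
    "SARM (Ours)": 2.491,
}


def top_k(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k method names with the highest scores, best first."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)[:k]


top3 = top_k(uranker_uieb)
print(top3)
```

The same helper can be applied to any other column of the table, since every metric there is marked as higher-is-better.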

Table 2. Quantitative comparison of computational complexity (Params and FLOPs) for different algorithms.

Metrics	DCD	CLIP-UIE	HCLR	HSR	Semi-UIR	Phaseformer	D2Mamba	FmambaIR	SARM (Ours)
Params (M)	97.82	97.81	4.87	14.56	1.67	1.77	4.25	4.20	8.10
FLOPs (G)	360.13	358.67	5651.99	386.75	36.43	13.00	19.37	11.81	37.85

Table 3. Quantitative ablation analysis of core components and backbone architectures in the SARM framework. The first three rows compare the performance of different backbones, while the subsequent rows verify the contribution of each physical module through leave-one-out experiments.

ID	Method/Variant	Backbone	FDUM ↑ (Fidelity)	UCIQE ↑ (Color)	CCF ↑ (Color)	UIQM ↑ (Perceptual)	URanker ↑ (Perceptual)	FPS ↑ (Efficiency)
1	Baseline-CNN	U-Net	0.5996	0.5992	27.3529	1.3984	2.214	115.42
2	Baseline-Trans	Swin-T	0.6068	0.5991	28.2320	1.4021	2.225	31.29
3	Base-Mamba	SSM	0.5583	0.5752	25.6510	1.4117	2.187	142.35
4	w/o Estimator	SSM	0.6219	0.6067	34.7141	1.4746	2.499	136.80
5	w/o SAA	SSM	0.6303	0.6089	27.4480	1.4711	2.514	136.25
6	w/o Pseudo	SSM	0.5994	0.5930	23.7037	1.3790	2.044	135.92
7	Ours (Full)	SSM	0.6340	0.6132	35.7609	1.4758	2.491	136.52

Note: Bold indicates the best performance for each metric. ↑ denotes that a higher value represents better image quality or higher efficiency.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
