Abstract
Low-light images commonly suffer from insufficient contrast, accumulated noise, and colour shifts, which impair both human perception and downstream visual tasks. We propose MambaDPF-Net, a dual-path fusion framework grounded in Retinex theory that follows a ‘decoupling–denoising–coupling’ paradigm and incorporates a sharpening prior for texture stabilisation. Specifically, the decoupling branch estimates illumination and reflectance through dual-scale feature aggregation under physically interpretable constraints; the denoising branch suppresses noise primarily in the reflectance domain, employing an illumination-aware modulation mechanism to prevent over-smoothing in low-SNR regions; and the coupling branch utilises a selective state space module (Mamba) to adaptively fuse spatial- and frequency-domain representations, achieving non-local interactions and cross-region long-range dependency modelling with near-linear complexity. Extensive experiments on public datasets demonstrate that the method achieves state-of-the-art PSNR and SSIM, performs strongly in no-reference evaluations, and produces natural colours with enhanced details, validating the effectiveness and robustness of the proposed approach.
1. Introduction
Imaging in low-light conditions degrades contrast, amplifies sensor noise, and introduces color bias, which undermines both visual quality and the reliability of detection and segmentation in downstream tasks [1,2,3,4]. Classical Retinex-based methods model the image as the product of illumination and reflectance, and achieve dynamic range compression by estimating a smooth illumination and a detail-preserving reflectance. Despite progress from SSR/MSR/MSRCR [5,6,7] to more robust priors, simplifying assumptions and hand-crafted pipelines often struggle with complex noise and non-uniform illumination, leading to color instability and insufficient denoising.
Data-driven approaches have substantially improved robustness by learning decomposition and enhancement in an end-to-end manner [8,9,10,11,12,13,14]. Representative methods such as RetinexNet [15], KinD [16], RUAS [17], DeepUPE [18], and Transformer-based Retinexformer enhance global consistency and detail preservation to varying degrees. However, they typically operate within a single domain or weakly couple cross-domain priors, leaving a gap in jointly modeling long-range dependencies and cross-region interactions under heterogeneous illumination while simultaneously controlling noise and preserving textures.
We address these gaps with MambaDPF-Net, a Retinex-guided dual-path fusion network. Our design introduces a sharpening prior to stabilize edges under low illumination, decouples illumination and reflectance with dual-scale aggregation to obtain interpretable components, and performs illumination-aware denoising primarily in the reflectance domain. A selective state space coupling block adaptively fuses spatial- and frequency-domain features to capture non-local interactions at near-linear complexity. This unified framework improves global–local consistency, reduces artifacts in low-SNR regions, and preserves color fidelity. We further provide comprehensive ablations, cross-dataset generalization, and complexity-throughput analysis to substantiate the practicality of our approach.
The contributions of this paper are outlined below:
- We propose MambaDPF-Net, a dual-path fusion network for low-light enhancement guided by the Retinex model, establishing an integrated framework where sharpening, decoupling, denoising, and coupling sub-networks collaborate synergistically.
- The decoupling branch employs dual-scale feature aggregation to robustly estimate illumination and reflectance maps, achieving physically interpretable component representations.
- A dedicated denoising branch for the reflectance domain is constructed, incorporating illumination noise correction to suppress artefacts while avoiding excessive smoothing.
- The coupling branch incorporates a Mamba selective state space module. This mechanism is uniquely suited for the non-uniform nature of low-light images, enabling content-aware fusion: it dynamically models long-range dependencies in structured, well-lit regions while simultaneously suppressing noise propagation from dark, low-signal areas.
2. Related Work
Low-light image enhancement (LLIE) has been extensively studied, with methods evolving from traditional signal processing techniques to sophisticated deep learning architectures.
2.1. Traditional Methods
Early approaches to LLIE were dominated by two main categories. Histogram Equalization (HE) and its variants, such as Contrast Limited Adaptive Histogram Equalization (CLAHE), aim to improve contrast by redistributing pixel intensities. While simple and fast, they often amplify background noise and can lead to unnatural-looking results.
The second category is grounded in Retinex theory [19], which models an image as the product of an illumination map and a reflectance map. Methods like Single-Scale Retinex (SSR) [5] and Multi-Scale Retinex (MSR) [6] estimate the illumination component to recover the reflectance, which is assumed to be the enhanced image. Subsequent works, such as LIME [20], introduced more robust structural priors for illumination map estimation. However, these methods rely on hand-crafted priors and heuristics, often struggling with severe noise, color distortion, and halo artifacts around sharp edges.
2.2. Deep Learning-Based Methods
With the advent of deep learning, data-driven methods have become the dominant paradigm in LLIE, demonstrating superior performance and robustness.
CNN-based Approaches: Convolutional Neural Networks (CNNs) have been widely adopted [21,22,23,24]. Early works like LLNet [11] used autoencoders for direct end-to-end enhancement. A significant number of methods combine deep learning with Retinex theory. For instance, RetinexNet [25] and KinD [16] use CNNs to learn the decomposition into illumination and reflectance, followed by separate adjustment and denoising steps. While effective, these methods often suffer from the limited receptive fields of CNNs, making it difficult to model global illumination variations, and their multi-stage pipelines can be complex to optimize. More recent zero-reference methods like Zero-DCE [9] and its successor [26] reformulate enhancement as a curve estimation problem, offering impressive efficiency. However, without physical constraints, they may sometimes produce results with color deviations or unnatural contrast.
Transformer-based Approaches: To address the locality of CNNs, Transformer-based models like Retinexformer have been introduced. By leveraging self-attention, they can capture long-range dependencies, leading to better global consistency and reduced artifacts. Their primary drawback, however, is the quadratic computational complexity with respect to image resolution, which limits their efficiency and practical deployment on resource-constrained devices.
Recent Advances and Emerging Architectures: The field continues to evolve rapidly, with researchers exploring novel domains and architectures. Recognizing that the frequency domain is adept at capturing global structural information, recent works like Freqspatnet [3] propose to learn collaboratively across both spatial and frequency domains. They aim to leverage the complementary characteristics of each domain—spatial for texture and local details, frequency for global structure. As an alternative to Transformers, State Space Models (SSMs) like Mamba have recently gained attention for their ability to model long-range dependencies with linear complexity. In the context of LLIE, recent work such as MambaLLIE [27] has demonstrated the potential of SSMs for efficient and effective enhancement. While these methods are promising, they are still in their infancy, and their integration within a physically grounded framework that explicitly handles noise and color fidelity has not been fully explored.
In summary, despite significant progress, existing methods face a persistent trade-off: CNNs are efficient but local; Transformers are global but computationally expensive. Furthermore, most methods operate within a single domain (e.g., spatial or frequency) or lack a robust mechanism to jointly model cross-domain interactions and long-range dependencies. This gap motivates the design of a unified framework that can efficiently capture global context while being grounded in a physical model, which is the primary focus of our proposed work.
In addition to methods specifically designed for low-light conditions, the broader field of image enhancement continues to see rapid advancements. For instance, recent work in optical imaging has explored novel frameworks for image restoration and detail enhancement. A study in [28] introduced a physics-informed model that leverages wave-optical principles to correct aberrations, achieving high-fidelity image recovery. Similarly, research in [29] proposed an advanced network architecture tailored for removing complex noise patterns specific to certain laser imaging systems. Furthermore, a lightweight framework for real-time enhancement was recently presented in [30], focusing on computational efficiency for dynamic scenes. While these methods demonstrate excellent performance in their specific application domains, our MambaDPF-Net addresses a different and unique set of challenges inherent to low-light photography. Unlike approaches that target sensor-specific noise or optical aberrations, our work focuses on the holistic problem of non-uniform illumination, color distortion, and signal-dependent noise. By integrating the physically grounded Retinex model with the selective state-space capabilities of Mamba, our dual-path fusion architecture provides a specialized solution that distinguishes it from these recent, yet distinct, advancements in the broader image enhancement landscape.
3. Approach
3.1. Overview
For an input low-light image I, we estimate the illumination L and reflectance R according to the Retinex formula I = R ⊙ L. The network comprises four synergistic branches: the sharpening subnetwork enhances high-frequency texture representation by incorporating edge gradient priors, providing a robust structural foundation for subsequent processing; the decoupling subnetwork adopts a dual-scale feature aggregation mechanism, achieving robust estimation of the illumination and reflectance components while maintaining physical interpretability; the denoising subnetwork adopts the proven R2RNet architecture, focusing on noise suppression in the reflectance domain, where its multi-stage residual learning mechanism balances artefact removal with detail preservation; and the coupling subnetwork employs attention mechanisms and a selective state space module to fuse spatial- and frequency-domain features, achieving globally consistent reconstruction. The final enhanced output is obtained by coupling the corrected illumination with the denoised reflectance under reconstruction consistency constraints. The overall architecture of the proposed MambaDPF-Net is illustrated in Figure 1.
Figure 1.
The proposed network architecture for the MambaDPF-Net. The network is composed of four distinct sub-networks: the sharp network, the decoupling network, the denoising network, and the coupling network. The sharp network’s role is to enhance the details of the edges. The decoupling network’s task involves separating the input low-light images into illumination and reflection components. Subsequently, a denoising network is employed to diminish noise within the reflectivity map. Finally, the coupling network integrates the illumination map and the noise-reduced reflection map from the decoupled network to generate an enhanced output.
3.2. Sharp-Net
To achieve detail enhancement, the sharpening subnetwork follows a ‘contrast pre-enhancement, multi-directional edge extraction, adaptive fusion’ workflow, whose effect is illustrated in Figure 2. It begins by applying Contrast-Limited Adaptive Histogram Equalisation (CLAHE) for local contrast enhancement, which amplifies feature gradients in underexposed regions while a preset clipping threshold suppresses noise amplification. Subsequently, an eight-directional Sobel [31] operator computes gradients across a comprehensive set of orientations (θ ∈ {0°, 45°, …, 315°}). The final gradient magnitude at each pixel, G(x, y), is determined by the maximum response across all directions, as defined by:

G(x, y) = max_θ |G_θ(x, y)|
where G_θ(x, y) represents the gradient response at pixel coordinates (x, y) for a given orientation θ. The resultant raw edge map is then refined using morphological opening and closing operations to suppress isolated noise pixels and connect discontinuous edge segments. This refined map serves as a spatial confidence map, W, to guide the final fusion. The enhanced image, I_enh, is synthesized through a detail-preserving fusion process:

I_enh = (1 − W_s) ⊙ I + W_s ⊙ I_CLAHE
Here, I is the input image, I_CLAHE is the contrast-enhanced intermediate image, ⊙ denotes element-wise multiplication, and W_s is the confidence map W after undergoing Gaussian smoothing to ensure spatial coherence and prevent halo artifacts.
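As a reference, the following is a minimal sketch of the Sharp-Net pipeline described above, written with OpenCV for a single-channel input. The clip limit, tile size, edge threshold, and kernel sizes are illustrative choices rather than the authors’ settings, and the eight directional responses are approximated by projecting the horizontal and vertical Sobel gradients onto each orientation instead of applying eight rotated kernels.

```python
import cv2
import numpy as np

def sharp_net(img_gray: np.ndarray) -> np.ndarray:
    # 1. CLAHE pre-enhancement; the clip limit bounds noise amplification.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(img_gray)

    # 2. Eight-directional gradient responses; keep the per-pixel maximum magnitude.
    gx = cv2.Sobel(enhanced, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(enhanced, cv2.CV_32F, 0, 1, ksize=3)
    responses = [np.abs(np.cos(np.deg2rad(t)) * gx + np.sin(np.deg2rad(t)) * gy)
                 for t in range(0, 360, 45)]
    grad = np.max(np.stack(responses, axis=0), axis=0)

    # 3. Morphological opening/closing removes isolated noise and bridges edge gaps.
    edge = (grad > grad.mean()).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    edge = cv2.morphologyEx(edge, cv2.MORPH_OPEN, kernel)
    edge = cv2.morphologyEx(edge, cv2.MORPH_CLOSE, kernel)

    # 4. Gaussian-smoothed confidence map W_s guides the detail-preserving fusion.
    w_s = cv2.GaussianBlur(edge.astype(np.float32), (7, 7), 0)
    fused = (1.0 - w_s) * img_gray.astype(np.float32) + w_s * enhanced.astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)
```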
Figure 2.
Visualization of the intermediate and final outputs of the Sharp-Net module. From left to right, the columns display: (a) original low-light image, (b) CLAHE pre-enhanced image, (c) eight-directional Sobel gradient map, (d) refined edge map after morphological operations, (e) edge saliency map after non-maximum suppression, (f) final enhanced image, and (g) the corresponding adaptive weight map used for fusion. Each row demonstrates the process on a different input image, showcasing the robustness of the method across various scenes.
3.3. Decouple-Net
The Decouple-Net is designed to robustly decompose
low-light images into illumination and reflectance components, which is crucial
for subsequent enhancement and denoising processes. Building upon Retinex
theory, our network employs a dual-scale decomposition strategy to capture both
global illumination trends and fine-grained structural details, ensuring
physically interpretable and high-quality component maps.
As illustrated in Figure 3, the Decouple-Net processes dual-scale inputs: the primary resolution
at the main scale and 1.5× upsampled resolution at the auxiliary scale. This
two-stage decomposition framework first extracts finer-grained structural
information at the auxiliary scale before returning to the main scale for
robust refinement. Both scales employ a cascaded “Residual Module (RM) +
Efficient Channel Attention (ECA)” architecture, where each RM incorporates
stacked convolutional kernels with sizes {1, 3, 3, 3, 1} and channel counts
{64, 128, 256, 128, 64} (see Figure 3,
bottom right). A 64 × 1 × 1 convolutional layer is added at the shortcut
connection to ensure feature dimension matching. The RM structure effectively
suppresses redundant information while emphasizing key channel responses
through the ECA mechanism, which adaptively recalibrates channel-wise feature
weights based on global context.
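A minimal PyTorch sketch of the cascaded “RM + ECA” block is given below. Only the {1, 3, 3, 3, 1} kernel sizes, the {64, 128, 256, 128, 64} channel widths, and the 1 × 1 shortcut convolution follow the text; the activation functions, the ECA kernel size, and the exact layer ordering are assumptions.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: 1D conv over globally pooled channel descriptors."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))                       # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)     # local cross-channel interaction
        return x * torch.sigmoid(y)[:, :, None, None]

class ResidualModule(nn.Module):
    def __init__(self, in_ch: int = 64):
        super().__init__()
        ks, chs = [1, 3, 3, 3, 1], [64, 128, 256, 128, 64]
        layers, prev = [], in_ch
        for k, c in zip(ks, chs):
            layers += [nn.Conv2d(prev, c, k, padding=k // 2), nn.ReLU(inplace=True)]
            prev = c
        self.body = nn.Sequential(*layers)
        self.shortcut = nn.Conv2d(in_ch, chs[-1], kernel_size=1)  # 64 x 1 x 1 shortcut
        self.eca = ECA()

    def forward(self, x):
        return self.eca(self.body(x) + self.shortcut(x))

print(ResidualModule()(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```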
Figure 3.
The network processes inputs at two scales: main scale (×1) and auxiliary scale (×1.5). Both scales utilize cascaded Residual Modules (RMs) with Efficient Channel Attention (ECA). The Bimodal Integration Unit (BIU) fuses cross-scale features. Detailed structures of the BIU and RM are shown in the bottom left and right insets, respectively. The network outputs decoupled reflectance (RDecouple-low) and illumination (IDecouple-low) maps.
Figure 4.
High-level architecture of the Couple-Net. It consists of three main components: SDEM, FDAM, and DIM. These modules process features in parallel, and their outputs are fused by DIM for the final reconstruction.
Concurrently, an attention-aware Bimodal
Integration Unit (BIU, Figure 3, bottom
left) is employed to align, recalibrate, and gate-fuse cross-scale features.
The BIU utilizes max pooling and average pooling to capture multi-scale
contextual information, which is then processed through convolutional layers
and concatenated to generate attention weights. This mechanism effectively
injects auxiliary-scale high-frequency details into main-scale illumination
estimation, enabling more precise modeling of illumination and reflectance
maps.
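The following is a hedged sketch of the BIU fusion described above: max- and average-pooled context from the (resized) auxiliary-scale features is convolved, concatenated, and turned into gating weights that inject auxiliary detail into the main-scale features. The layer widths, the residual form of the injection, and the bilinear alignment are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BIU(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.ctx_max = nn.Conv2d(ch, ch, 3, padding=1)
        self.ctx_avg = nn.Conv2d(ch, ch, 3, padding=1)
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())

    def forward(self, f_main, f_aux):
        # Align the auxiliary-scale (x1.5) features to the main-scale resolution.
        f_aux = F.interpolate(f_aux, size=f_main.shape[-2:], mode='bilinear',
                              align_corners=False)
        # Multi-scale context from max- and average-pooled descriptors.
        c_max = self.ctx_max(F.max_pool2d(f_aux, 2))
        c_avg = self.ctx_avg(F.avg_pool2d(f_aux, 2))
        ctx = F.interpolate(torch.cat([c_max, c_avg], dim=1),
                            size=f_main.shape[-2:], mode='bilinear',
                            align_corners=False)
        w = self.gate(ctx)            # attention weights in [0, 1]
        # Inject auxiliary high-frequency detail into the main-scale features.
        return f_main + w * f_aux

main, aux = torch.randn(1, 64, 56, 56), torch.randn(1, 64, 84, 84)  # x1 and x1.5 scales
print(BIU()(main, aux).shape)   # torch.Size([1, 64, 56, 56])
```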
During training, the Decouple-Net learns
decomposition using paired low-light (at both primary and auxiliary scales) and
normal-light images. The network leverages the shared reflection prior, i.e.,
the constraint that low-light and normal-light images of the same scene should
share the same reflectance map. Although paired images are used, this paradigm eliminates
the need for ground-truth illumination or reflectance maps. Instead, the
network learns from the consistency of reflectance and the smoothness of illumination,
embedded into the loss function. Importantly, the reflection and illumination
maps derived from normal-light images serve solely as references during the
decomposition process and are excluded from subsequent training and inference
stages to prevent information leakage.
3.4. Denoise-Net
The primary function of the Denoise-Net is to
address the noise amplification that often occurs in the reflectance map (R)
after decomposition. This step is crucial for preventing noise from degrading
the final enhanced image. To ensure effective noise suppression without
sacrificing critical image details, we adopt the Denoise-Net architecture
consistent with that described in the R2RNet paper, leveraging its proven
efficacy in balancing noise removal and detail preservation.
3.5. Couple-Net
The Couple-Net constitutes the final stage of our
framework, meticulously designed to reconstruct the enhanced image by
synergistically fusing features from both the spatial and frequency domains.
Acknowledging that spatial information (e.g., local textures, edges) and
frequency information (e.g., global brightness, periodic patterns) are
complementary, the Couple-Net executes a sophisticated workflow. As illustrated
in Figure 4, this network is composed of
three core components operating in parallel: the Spatial Domain Enhancement
Module (SDEM), the Frequency Domain Augmentation Module (FDAM), and the
Dual-Domain Information Integration Module (DIM), which ultimately merges their
outputs.
3.5.1. Spatial Domain Enhancement Module (SDEM)
The SDEM, whose architecture is detailed in Figure 5, is tasked with enhancing the spatial
details of the denoised reflectance map. It is built upon a U-Net-like encoder–decoder
structure to effectively capture multi-scale contextual information. The
process begins with a 3 × 3 dilated convolution, which expands the initial
receptive field without adding extra parameters. The encoder path comprises
four downsampling stages. In each stage, a 2 × 2 stride convolution is used for
downsampling, followed by a Residual Module (RM) and an Efficient Channel
Attention (ECA) block. This design choice purposefully avoids max-pooling
layers to prevent the irreversible loss of feature information. Symmetrically,
the decoder path uses 2 × 2 deconvolution layers to upsample the features. A
long skip connection links the feature maps from the first RM in the encoder to
the last RM in the decoder, ensuring that low-level, high-resolution features
are preserved and reused for fine-grained reconstruction. The output of this
module is a spatially refined feature map, denoted as S.
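A compact sketch of the SDEM layout is shown below: a dilated 3 × 3 entry convolution, four stride-2 downsampling stages, symmetric 2 × 2 deconvolutions, and a long skip connection. The `rm_eca_stub` block is a placeholder standing in for the paper’s RM + ECA pair, the channel width is assumed, and the long skip is reattached at the matching scale rather than at an exact layer position.

```python
import torch
import torch.nn as nn

def rm_eca_stub(ch):
    # Placeholder for the Residual Module + Efficient Channel Attention pair.
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

class SDEM(nn.Module):
    def __init__(self, in_ch: int = 3, ch: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, ch, 3, padding=2, dilation=2)  # dilated 3 x 3 entry
        self.down = nn.ModuleList(nn.Conv2d(ch, ch, 2, stride=2) for _ in range(4))
        self.enc = nn.ModuleList(rm_eca_stub(ch) for _ in range(4))
        self.up = nn.ModuleList(nn.ConvTranspose2d(ch, ch, 2, stride=2) for _ in range(4))
        self.dec = nn.ModuleList(rm_eca_stub(ch) for _ in range(4))

    def forward(self, x):                    # H and W should be divisible by 16
        x = self.stem(x)
        x = self.enc[0](self.down[0](x))     # first encoder stage (H/2)
        first = x                            # source of the long skip connection
        for d, e in zip(self.down[1:], self.enc[1:]):
            x = e(d(x))                      # H/4 -> H/8 -> H/16
        for u, dec in zip(self.up, self.dec):
            x = u(x)
            if x.shape[-2:] == first.shape[-2:]:
                x = x + first                # reuse low-level features at the same scale
            x = dec(x)
        return x                             # spatially refined feature map S

# Shape check at the 112 x 112 training crop size used in the paper.
print(SDEM()(torch.randn(1, 3, 112, 112)).shape)  # torch.Size([1, 64, 112, 112])
```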
Figure 5.
Architecture of the Spatial Domain Enhancement Module (SDEM). It features a U-Net structure equipped with stride convolutions for downsampling, Residual Modules (RMs), and Efficient Channel Attention (ECA) blocks.
3.5.2. Frequency Domain Augmentation Module (FDAM)
The FDAM, depicted in Figure 6,
is engineered to augment the feature representation within the frequency
domain. This domain is particularly adept at capturing global periodic patterns
and can mitigate potential information loss that occurs during domain
transformations. The module first converts spatial features into the frequency
domain using the Fast Fourier Transform (FFT). All subsequent operations,
including convolutions and residual connections, are performed using
complex-valued arithmetic. This approach preserves both the magnitude and phase
information, which is critical for a complete representation.
Figure 6.
Architecture of the Frequency Domain Augmentation Module (FDAM). It operates in the frequency domain using FFT, Complex Residual Blocks, and complex convolutions to process features.
A complex convolution between an input feature h = x + iy and a convolutional kernel W = A + iB is mathematically defined as follows:

W ∗ h = (A ∗ x − B ∗ y) + i(B ∗ x + A ∗ y)

In this equation, x is the real part of the input feature and A is the real part of the convolutional kernel; correspondingly, y is the imaginary part of the input feature and B is the imaginary part of the kernel. This
operation ensures that the rich information encoded in the complex domain is
fully leveraged. The FDAM also employs an encoder–decoder structure with
Complex ResBlocks to process these features, ultimately producing a
frequency-augmented feature map, F.
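A minimal PyTorch sketch of this complex-valued processing follows: the real and imaginary parts of the FFT spectrum are convolved with the real and imaginary parts of a learned kernel according to the rule above. The channel counts and the round-trip through `torch.fft` are illustrative.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    def __init__(self, cin: int, cout: int, k: int = 3):
        super().__init__()
        self.conv_r = nn.Conv2d(cin, cout, k, padding=k // 2)  # real part A of the kernel
        self.conv_i = nn.Conv2d(cin, cout, k, padding=k // 2)  # imaginary part B of the kernel

    def forward(self, x_r, x_i):
        # (A + iB) * (x + iy) = (A*x - B*y) + i(B*x + A*y)
        real = self.conv_r(x_r) - self.conv_i(x_i)
        imag = self.conv_i(x_r) + self.conv_r(x_i)
        return real, imag

# Usage: transform spatial features with the FFT, process in the complex domain,
# and transform back, keeping both magnitude and phase information.
feat = torch.randn(1, 64, 56, 56)
spec = torch.fft.fft2(feat)                       # complex-valued spectrum
out_r, out_i = ComplexConv2d(64, 64)(spec.real, spec.imag)
restored = torch.fft.ifft2(torch.complex(out_r, out_i)).real
```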
3.5.3. Dual-Domain Information Integration Module (DIM)
The DIM, detailed in Figure 7, serves as the intelligent core of the
Couple-Net. Its primary function is to fuse the spatial features S from
SDEM with the frequency features F from FDAM in a sophisticated,
multi-stage manner.
Figure 7.
Architecture of the Dual-Domain Information Integration Module (DIM). It fuses spatial (S) and frequency (F) features using a combination of local and global attention, a central SS2D (Mamba) block for long-range modeling, and subsequent pixel/channel attention mechanisms for refinement.
The fusion process unfolds as follows:
Initial Attention Gating: The input features S
and F first pass through separate Local Channel Attention blocks to
recalibrate their channel-wise responses independently. Concurrently, a Global
Interaction Attention mechanism is applied between them to explicitly model
cross-domain dependencies.
Long-Range Dependency Modeling: The concatenated
features from the initial gating stage are then fed into a 2D Selective State
Space (SS2D) module. This module, an adaptation of the Mamba architecture for
2D visual data, is pivotal for our fusion task. Unlike Transformers which apply
uniform attention, the SS2D module leverages a selective state space model.
This allows it to dynamically modulate the propagation of information across
the image based on local feature characteristics. For low-light enhancement,
this means it can establish long-range connections between salient structural
features in brighter areas while simultaneously preventing noise amplification
from darker regions, achieving a more context-aware and robust feature fusion.
It captures global context with near-linear computational complexity,
overcoming the quadratic complexity limitations of standard self-attention
mechanisms in Transformers.
Hierarchical Feature Refinement: The output from
the SS2D module undergoes a final refinement step. It is processed through
parallel pixel attention and channel attention blocks, which work together to
adaptively highlight the most salient spatial locations and informative feature
channels.
The resulting features are fused and passed through
a final convolution layer to produce the integrated feature map. This map is
then processed by a shallow block of 3 × 3 and 1 × 1 convolutions to generate
the final, high-quality enhanced image.
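The sketch below mirrors this fusion flow at a schematic level. The SS2D block is treated as an abstract module (any Mamba/VMamba-style 2D selective-scan implementation with matching channel width could be plugged in), and the local channel attention, global interaction, and pixel/channel attention blocks are simplified stand-ins with assumed layer sizes.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.fc = nn.Sequential(nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.fc(x.mean(dim=(2, 3), keepdim=True))

class PixelAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.conv(x)

class DIM(nn.Module):
    def __init__(self, ch: int, ss2d_block: nn.Module):
        super().__init__()
        self.lca_s = ChannelAttention(ch)   # local channel attention for spatial features S
        self.lca_f = ChannelAttention(ch)   # local channel attention for frequency features F
        self.ss2d = ss2d_block              # selective state space block over 2*ch channels
        self.pa = PixelAttention(2 * ch)
        self.ca = ChannelAttention(2 * ch)
        self.out = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1),
                                 nn.Conv2d(ch, 3, 1))   # shallow 3x3 + 1x1 output block

    def forward(self, s, f):
        # Initial attention gating with a simple cross-domain interaction term.
        s, f = self.lca_s(s), self.lca_f(f)
        cross = torch.sigmoid(s) * f + torch.sigmoid(f) * s
        x = torch.cat([s + cross, f + cross], dim=1)
        x = self.ss2d(x)                    # long-range, content-selective fusion
        x = self.pa(x) + self.ca(x)         # parallel pixel/channel refinement
        return self.out(x)

# Shape test with an identity stand-in for the SS2D block.
s = f = torch.randn(1, 64, 56, 56)
print(DIM(64, nn.Identity())(s, f).shape)   # torch.Size([1, 3, 56, 56])
```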
3.6. Multi-Task Learning Framework
To enhance the convergence stability and visual consistency of low-light enhancement, this paper employs a multi-task joint training strategy that simultaneously incorporates the decoupling, denoising, and coupling objectives into the optimisation process. The overall loss function comprises L_Decouple, L_Denoise, and L_Couple, with each branch consisting of both content and
perceptual components. The content term emphasises measurable consistency at
the pixel and structural levels, while the perceptual term constrains
high-level semantic and subjective quality. This dual approach collaboratively
prevents overfitting or excessive smoothing that may arise from relying solely
on a single metric.
The content loss is an L2 loss applied to the outputs of the three sub-networks, measuring the pixel-wise distance between each branch output and its corresponding reference, where I_low and I_normal denote the input low-light image and the normal-light image, respectively.
The perceptual loss is subsequently computed as the distance between features extracted by the pre-trained VGG-16 model from each branch output and from its reference.
In summary, the final loss of MambaDPF-Net within the multi-task learning framework is defined as:

L_total = L_Decouple + L_Denoise + L_Couple

where the Decouple-Net estimates the reflectance and illumination maps (R_Decouple-low and I_Decouple-low) from the low-light input, the Denoise-Net produces the denoised reflectance map, the Couple-Net outputs the enhanced image, and the perceptual terms use the feature extractor of the pre-trained VGG-16 network.
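A hedged sketch of this multi-task objective is given below. Only the L2 content terms, the VGG-16 perceptual terms, and the three-branch structure follow the text; the exact prediction/reference pairings, the perceptual layer depth, and the weighting factor are assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    def __init__(self, layer: int = 16):
        super().__init__()
        # Frozen VGG-16 feature extractor (depth chosen for illustration).
        vgg = torchvision.models.vgg16(weights='IMAGENET1K_V1').features[:layer].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.l2 = vgg, nn.MSELoss()

    def forward(self, pred, target):
        return self.l2(self.vgg(pred), self.vgg(target))

def total_loss(out, ref, perceptual, w_perc: float = 0.1):
    """out/ref: dicts with 'decouple', 'denoise', 'couple' image tensors (B, 3, H, W)."""
    l2, loss = nn.MSELoss(), 0.0
    for key in ('decouple', 'denoise', 'couple'):
        # Each branch combines an L2 content term and a VGG-16 perceptual term.
        loss = loss + l2(out[key], ref[key]) + w_perc * perceptual(out[key], ref[key])
    return loss
```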
4. Experiments
4.1. Implementation Details
The experiments were implemented with the PyTorch 2.1 deep learning framework on the Windows operating system. Trained on the publicly available LSRW and LOL datasets, the model converged after 40 epochs on an NVIDIA 3080 Ti GPU and performed well across validation sets including LOL, LSRW, LIME, DICM, and VV. For parameter configuration, the Adam optimiser was employed for gradient updates with an initial learning rate of 0.001, and the L2 loss was adopted as the content term to ensure stable convergence. The batch size was fixed at 6, with training patches cropped to 112 × 112 pixels. In addition, a learning rate decay strategy reduced the learning rate to 10% of its previous value every 10 training epochs.
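The reported optimisation settings translate into the following minimal PyTorch sketch (Adam, initial learning rate 0.001, decay to 10% every 10 epochs over 40 epochs); the one-layer module merely stands in for MambaDPF-Net, and data loading and loss computation are omitted.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # placeholder for MambaDPF-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(40):
    # ... one pass over 112 x 112 crops with batch size 6, multi-task loss (Section 3.6) ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())   # learning rate drops to 10% every 10 epochs
```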
Datasets. For quantitative evaluation, we
utilized two paired datasets: the LOL dataset, which is divided into a training
set of 485 low/normal-light image pairs and a testing set of 15 pairs, and the
LSRW dataset, consisting of 5600 pairs for training and 100 for testing. To
assess the generalization capabilities of our method on real-world unpaired
data, our qualitative evaluations were performed on the LIME (10 images), VV
(24 images), and DICM (69 images) datasets.
4.2. Comparison with State-of-the-Art Methods on Real Datasets
To comprehensively evaluate the performance of our
proposed MambaDPF-Net, we conducted extensive comparisons with several
state-of-the-art (SOTA) low-light image enhancement methods. The evaluation is
twofold: quantitative analysis on paired datasets and qualitative analysis on
unpaired real-world datasets.
For the quantitative assessment, we benchmarked our
model against prominent methods on three widely used paired datasets: LOL-v1,
LOL-v2(real), and LSRW. The results summarized in Table 1 and the effects shown in Figure 8 clearly demonstrate the superiority of
our approach. As shown in the table, our method consistently achieves the
highest scores in both PSNR and SSIM across all three datasets.
Table 1.
Quantitative comparison with state-of-the-art methods on the LOL-v1, LOL-v2 and LSRW datasets. The best results are highlighted in bold.
Figure 8.
Visual comparison on image 55 from the LOL dataset and image 2060 from the LSRW dataset. From left to right: low-light input, Zero-DCE, RetinexNet, R2RNet, CNTNet, Retinexformer, ours, and normal light.
Notably, when compared to the current
state-of-the-art method, Retinexformer, our model shows significant
improvements. On the LOL-v2(real) dataset, our method surpasses Retinexformer
by 0.57 dB in PSNR and 0.013 in SSIM. The performance gain is even more
pronounced on the LSRW dataset, where our approach achieves an increase of
0.719 dB in PSNR and a substantial 0.252 in SSIM, showcasing its robustness. On
the LOL-v1 dataset, our method also maintains a competitive edge, outperforming
all other listed techniques. These consistent quantitative improvements
validate the effectiveness of our proposed network architecture in restoring
high-fidelity images from low-light conditions.
While quantitative metrics on paired datasets
provide a valuable measure of reconstruction fidelity, evaluating performance
on real-world, unpaired datasets is crucial for assessing a method’s practical
utility and generalization capabilities. To this end, we further conducted
qualitative comparisons on the challenging DICM and VV datasets, which feature
complex lighting conditions and lack ground-truth references.
As illustrated in Figure 9 and in Table 2, our method
exhibits remarkable generalization ability on these challenging real-world
images. In the DICM example (top row), our approach effectively enhances the
brightness of the indoor corridor while successfully preserving the details
visible through the glass doors without introducing over-exposure artifacts.
Compared to Retinexformer, which produces a slightly washed-out result with
less distinct colors, our method yields a more balanced contrast and superior
color fidelity.
Figure 9.
Qualitative comparison of the DICM and VV datasets. From left to right: Low-light Input, Retinexformer, and our method. Our proposed method demonstrates superior performance in restoring details and preserving natural colors in challenging real-world scenarios.
Table 2.
NIQE scores of the different methods on the LIME, VV and DICM dataset, with the best results highlighted in bold.
For the severely backlit image from the VV dataset
(bottom row), our model demonstrates its strength in handling extreme dynamic
ranges. It successfully illuminates the subject in the foreground while
retaining the rich details and vibrant colors of the sky and sea. In contrast,
the competing method struggles to balance the scene, leading to a loss of
texture in the highlight regions of the sky. These visual comparisons
underscore the robustness and superior performance of our model, confirming its
effectiveness in handling diverse and complex lighting conditions found in
practical applications.
4.3. Ablation Experiment
To meticulously validate the contribution of each
key component in our MambaDPF-Net, we conducted a comprehensive ablation study
on the LOL dataset. We started with a baseline model and progressively
integrated our proposed modules. The specific configurations are as follows,
with results presented in Table 3:
Baseline: A simplified version of our network,
using the basic U-Net structure from R2RNet as the backbone for both
decomposition and coupling, without the sharpening prior, BIU, SDEM, FDAM, and
SS2D.
Baseline + Sharp: The sharpening prior (Sharp-Net)
is added to the baseline.
Baseline + Sharp + BIU: The Bimodal Integration
Unit (BIU) is added to the decouple-net of the previous model.
Ours (MambaDPF-Net): The SS2D (Mamba) module is
added to the DIM, replacing a standard convolution-based fusion block.
Table 3.
Ablation study on the LOL(total) dataset.
| Methods | LOL(PSNR) | LOL(SSIM) | LSRW(PSNR) | LSRW(SSIM) |
|---|---|---|---|---|
| Baseline | 18.131 | 0.712 | 20.207 | 0.816 |
| Baseline + Sharp | 18.374 | 0.728 | 20.211 | 0.816 |
| Baseline + Sharp + BIU | 22.222 | 0.831 | 20.216 | 0.817 |
| Ours | 26.24 | 0.943 | 20.259 | 0.838 |
4.4. Complexity and Efficiency Analysis
To assess the computational profile of our model,
we benchmarked its complexity and inference speed against key counterparts on
an NVIDIA 3080 Ti GPU, using a uniform input size of 112 × 112 pixels. The
results are presented in Table 4.
Table 4.
Complexity and efficiency comparison.
The analysis highlights the exceptional efficiency
of our proposed MambaDPF-Net. Our model achieves a real-time speed of 30 FPS
with a computational load of only 13.7 GFLOPs. This compares favourably
with Retinexformer, which requires more computation
(15.6 GFLOPs) to attain its speed.
Crucially, our MambaDPF-Net strikes a superior
performance-efficiency trade-off. While delivering substantially higher
enhancement quality (as evidenced by PSNR/SSIM in Table 1 and Table 2), our model operates at a much
lower computational budget than Retinexformer. This demonstrates that our
architecture is not only more powerful but also more computationally efficient,
establishing it as a highly practical and advanced solution for real-world
deployment.
5. Discussion
In summary, this paper constructs a more robust low-light
enhancement pathway atop the R2RNet framework: through adaptive histogram
remapping coupled with modified Sobel gradient guidance, it performs prior
correction on input distribution and edge information; the dual-scale decomposition
captures global illumination trends and fine-grained reflectance details; while
the cross-domain integration unit organically couples frequency-domain
structure with spatial texture, mitigating the persistent trade-off between noise
amplification in dark regions and detail loss. The joint training strategy
establishes complementary constraints at the pixel, structural, and perceptual
levels, enabling more balanced improvements in contrast, colour constancy, and
local sharpness within the reconstructed results. Experiments demonstrate
consistent superiority over the original R2RNet across benchmarks, including
LOL, LSRW, LIME, and VV, delivering consistent subjective and objective gains
with faster convergence and reduced artefacts.
6. Conclusions
We proposed a Retinex-guided dual-path fusion network with selective state space modeling for low-light enhancement. By combining a sharpening prior, illumination–reflectance decoupling, illumination-aware denoising and frequency–spatial coupling via Mamba, the method achieves consistent improvements over recent Retinex and Transformer baselines, with practical complexity scaling. Comprehensive evaluations, ablations and efficiency analyses substantiate the effectiveness and robustness of the design.
Author Contributions
The authors confirm
contribution to the paper as follows: study conception and design: Z.Z.; data
collection: Z.Z.; analysis and interpretation of results: Z.Z. and S.Y.; draft
manuscript preparation: S.Y. All authors have read and agreed to the published
version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The data that support the findings of this study are available from the author, [Zikang Zhang].
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| CNNs | Convolutional Neural Networks |
| BIU | Bimodal Integration Unit |
| RM | Residual Module |
| SDEM | Spatial Domain Enhancement Module |
| FDAM | Frequency Domain Augmentation Module |
| DIM | Dual-Domain Information Integration Module |
| SFCB | Spatial-Frequency Conversion Block |
| DRB | Detail Recovery Block |
| CAM | Cross-attention Module |
| LCA | Local Channel Attention |
| GISA | Global Interaction Semantic Attention |
References
- Zhang, B.; Shu, D.; Fu, P.; Yao, S.; Chong, C.; Zhao, X.; Yang, H. Multi-Feature Fusion Yolo Approach for Fault Detection and Location of Train Running Section. Electronics 2025, 14, 3430. [Google Scholar] [CrossRef]
- Rodríguez-Lira, D.-C.; Córdova-Esparza, D.-M.; Terven, J.; Romero-González, J.-A.; Alvarez-Alvarado, J.M.; González-Barbosa, J.-J.; Ramírez-Pedraza, A. Recent Developments in Image-Based 3d Reconstruction Using Deep Learning: Methodologies and Applications. Electronics 2025, 14, 3032. [Google Scholar] [CrossRef]
- Guan, Y.; Liu, M.; Chen, X.; Wang, X.; Luan, X. Freqspatnet: Frequency and Spatial Dual-Domain Collaborative Learning for Low-Light Image Enhancement. Electronics 2025, 14, 2220. [Google Scholar] [CrossRef]
- Sun, Y.; Hu, S.; Xie, K.; Wen, C.; Zhang, W.; He, J. Enhanced Deblurring for Smart Cabinets in Dynamic and Low-Light Scenarios. Electronics 2025, 14, 488. [Google Scholar] [CrossRef]
- Choi, D.H.; Jang, I.H.; Kim, M.H.; Kim, N.C. Color Image Enhancement Based on Single-Scale Retinex with a Jnd-Based Nonlinear Filter. In Proceedings of the 2007 IEEE International Symposium on Circuits and Systems (ISCAS), New Orleans, LA, USA, 27–30 May 2007. [Google Scholar]
- Rahman, Z.; Jobson, D.J.; Woodell, G.A. Multi-Scale Retinex for Color Image Enhancement. In Proceedings of the 3rd IEEE International Conference on Image Processing, Lausanne, Switzerland, 19 September 1996; pp. 1003–1006. [Google Scholar]
- Parthasarathy, S.; Sankaran, P. An Automated Multi Scale Retinex with Color Restoration for Image Enhancement. In Proceedings of the 2012 National Conference on Communications (NCC), Kharagpur, India, 3–5 February 2012. [Google Scholar]
- Fu, Y.; Hong, Y.; Chen, L.; You, S. Le-Gan: Unsupervised Low-Light Image Enhancement Network Using Attention Module and Identity Invariant Loss. Knowl. Based Syst. 2022, 240, 108010. [Google Scholar] [CrossRef]
- Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1780–1789. [Google Scholar]
- Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. Enlightengan: Deep Light Enhancement without Paired Supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef]
- Lore, K.G.; Akintayo, A.; Sarkar, S. Llnet: A Deep Autoencoder Approach to Natural Low-Light Image Enhancement. Pattern Recognit. 2017, 61, 650–662. [Google Scholar] [CrossRef]
- Lv, F.; Lu, F.; Wu, J.; Lim, C.S. Mbllen: Low-Light Image/Video Enhancement Using Cnns. In Proceedings of the British Machine Vision Conference (BMVC 2018), Newcastle, UK, 3–6 September 2018. [Google Scholar]
- Moran, S.; Marza, P.; McDonagh, S.; Parisot, S.; Slabaugh, G. Deeplpf: Deep Local Parametric Filters for Image Enhancement. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Sharma, A.; Tan, R.T. Nighttime Visibility Enhancement by Increasing the Dynamic Range and Suppression of Light Effects. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021; pp. 11977–11986. [Google Scholar]
- Hai, J.; Hao, Y.; Zou, F.; Lin, F.; Han, S. Advanced Retinexnet: A Fully Convolutional Network for Low-Light Image Enhancement. Signal Process. Image Commun. 2023, 112, 116916. [Google Scholar] [CrossRef]
- Zhang, Y.; Zhang, J.; Guo, X. Kindling the Darkness: A Practical Low-Light Image Enhancer. In Proceedings of the 27th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1632–1640. [Google Scholar]
- Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-Inspired Unrolling with Cooperative Prior Architecture Search for Low-Light Image Enhancement. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021. [Google Scholar]
- Wang, R.; Zhang, Q.; Fu, C.W.; Shen, X.; Zheng, W.S.; Jia, J. Underexposed Photo Enhancement Using Deep Illumination Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Jobson, D.J.; Rahman, Z.; Woodell, G.A. A Multiscale Retinex for Bridging the Gap between Color Images and the Human Observation of Scenes. IEEE Trans. Image Process. 1997, 6, 965–976. [Google Scholar] [CrossRef] [PubMed]
- Guo, X.; Li, Y.; Ling, H. Lime: Low-Light Image Enhancement Via Illumination Map Estimation. IEEE Trans. Image Process. 2017, 26, 982–993. [Google Scholar] [CrossRef] [PubMed]
- Fu, X.; Zeng, D.; Huang, Y.; Liao, Y.; Ding, X.; Paisley, J. A Fusion-Based Enhancing Method for Weakly Illuminated Images. Signal Process. 2016, 129, 82–96. [Google Scholar] [CrossRef]
- Li, M.; Liu, J.; Yang, W.; Sun, X.; Guo, Z. Structure-Revealing Low-Light Image Enhancement Via Robust Retinex Model. IEEE Trans. Image Process. 2018, 27, 2828–2841. [Google Scholar] [CrossRef]
- Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep Retinex Decomposition for Low-Light Enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar] [CrossRef]
- Subramani, B.; Veluchamy, M. Fuzzy Gray Level Difference Histogram Equalization for Medical Image Enhancement. J. Med. Syst. 2020, 44, 103. [Google Scholar] [CrossRef]
- Weng, J.; Yan, Z.; Tai, Y.; Qian, J.; Yang, J.; Li, J. Mamballie: An Efficient Low-Light Image Enhancement Model Based on State Space. arXiv 2024, arXiv:2405.16105v1. [Google Scholar]
- Li, C.; Guo, C.; Loy, C.C. Learning to Enhance Low-Light Image Via Zero-Reference Deep Curve Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4225–4238. [Google Scholar] [CrossRef] [PubMed]
- Yang, S.; Zhou, D. Cybernetics Efficient Low-Light Image Enhancement with Model Parameters Scaled Down to 0.02M. Int. J. Mach. Learn. Cyber. 2024, 15, 1575–1589. [Google Scholar] [CrossRef]
- Zhou, H.; Zeng, X.; Lin, B.; Li, D.; Ali Shah, S.A.; Liu, B.; Guo, K.; Guo, Z. Polarization Motivating High-Performance Weak Targets’ Imaging Based on a Dual-Discriminator Gan. Opt. Express 2024, 32, 3835–3851. [Google Scholar] [CrossRef]
- Fan, X.; Ding, M.; Lv, T.; Sun, X.; Lin, B.; Guo, Z. Meta-Dnet-Upi: Efficient Underwater Polarization Imaging Combining Deformable Convolutional Networks and Meta-Learning. Opt. Laser Technol. 2025, 187, 112900. [Google Scholar] [CrossRef]
- Lin, B.; Qiao, L.; Fan, X.; Guo, Z. Large-Range Polarization Scattering Imaging with an Unsupervised Multi-Task Dynamic-Modulated Framework. Opt. Lett. 2025, 50, 3413–3416. [Google Scholar] [CrossRef]
- Chen, S.; Yang, X. An Enhanced Adaptive Sobel Edge Detector Based on Improved Genetic Algorithm and Non-Maximum Suppression. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021. [Google Scholar]
- Hai, J.; Xuan, Z.; Yang, R.; Hao, Y.; Zou, F.; Lin, F.; Han, S. R2rnet: Low-Light Image Enhancement Via Real-Low to Real-Normal Network. J. Vis. Commun. Image Represent. 2023, 90, 103712. [Google Scholar] [CrossRef]
- Cai, Y.; Bian, H.; Lin, J.; Wang, H.; Timofte, R.; Zhang, Y. Retinexformer: One-Stage Retinex-Based Transformer for Low-Light Image Enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023. [Google Scholar]