Abstract
Low-light images commonly suffer from insufficient contrast, accumulated noise, and colour shifts, which impair both human perception and downstream visual tasks. We propose MambaDPF-Net, a dual-path fusion framework grounded in Retinex theory that follows a ‘decoupling–denoising–coupling’ paradigm and incorporates a sharpening prior for texture stabilisation. Specifically, the decoupling branch estimates illumination and reflectance through dual-scale feature aggregation under physically interpretable constraints; the denoising branch suppresses noise primarily in the reflectance domain, employing an illumination-aware modulation mechanism to prevent over-smoothing in low-SNR regions; and the coupling branch utilises a selective state space module (Mamba) to adaptively fuse spatial- and frequency-domain representations, achieving non-local interactions and cross-region long-range dependency modelling with near-linear complexity. Extensive experiments on public datasets demonstrate that the method achieves state-of-the-art PSNR and SSIM, performs strongly in no-reference evaluations, and produces natural colours with enhanced details, validating the effectiveness and robustness of the proposed approach.
1. Introduction
Imaging in low-light conditions degrades contrast, amplifies sensor noise, and introduces color bias, which undermines both visual quality and the reliability of detection and segmentation in downstream tasks [1,2,3,4]. Classical Retinex-based methods model the image as the product of illumination and reflectance, and achieve dynamic range compression by estimating a smooth illumination and a detail-preserving reflectance. Despite progress from SSR/MSR/MSRCR [5,6,7] to more robust priors, simplifying assumptions and hand-crafted pipelines often struggle with complex noise and non-uniform illumination, leading to color instability and insufficient denoising.
Data-driven approaches have substantially improved robustness by learning decomposition and enhancement in an end-to-end manner [8,9,10,11,12,13,14]. Representative methods such as RetinexNet [15], KinD [16], RUAS [17], DeepUPE [18], and Transformer-based Retinexformer enhance global consistency and detail preservation to varying degrees. However, they typically operate within a single domain or weakly couple cross-domain priors, leaving a gap in jointly modeling long-range dependencies and cross-region interactions under heterogeneous illumination while simultaneously controlling noise and preserving textures.
We address these gaps with MambaDPF-Net, a Retinex-guided dual-path fusion network. Our design introduces a sharpening prior to stabilize edges under low illumination, decouples illumination and reflectance with dual-scale aggregation to obtain interpretable components, and performs illumination-aware denoising primarily in the reflectance domain. A selective state space coupling block adaptively fuses spatial- and frequency-domain features to capture non-local interactions at near-linear complexity. This unified framework improves global–local consistency, reduces artifacts in low-SNR regions, and preserves color fidelity. We further provide comprehensive ablations, cross-dataset generalization, and complexity-throughput analysis to substantiate the practicality of our approach.
The contributions of this paper are outlined below:
- We propose MambaDPF-Net, a dual-path fusion network for low-light enhancement guided by the Retinex model, establishing an integrated framework where sharpening, decoupling, denoising, and coupling sub-networks collaborate synergistically.
- The decoupling branch employs dual-scale feature aggregation to robustly estimate illumination and reflectance maps, achieving physically interpretable component representations.
- A dedicated denoising branch for the reflectance domain is constructed, incorporating illumination noise correction to suppress artefacts while avoiding excessive smoothing.
- The coupling branch incorporates a Mamba selective state space module. This mechanism is uniquely suited for the non-uniform nature of low-light images, enabling content-aware fusion: it dynamically models long-range dependencies in structured, well-lit regions while simultaneously suppressing noise propagation from dark, low-signal areas.
2. Related Work
Low-light image enhancement (LLIE) has been extensively studied, with methods evolving from traditional signal processing techniques to sophisticated deep learning architectures.
2.1. Traditional Methods
Early approaches to LLIE were dominated by two main categories. Histogram Equalization (HE) and its variants, such as Contrast Limited Adaptive Histogram Equalization (CLAHE), aim to improve contrast by redistributing pixel intensities. While simple and fast, they often amplify background noise and can lead to unnatural-looking results.
The second category is grounded in Retinex theory [19], which models an image as the product of an illumination map and a reflectance map. Methods like Single-Scale Retinex (SSR) [5] and Multi-Scale Retinex (MSR) [6] estimate the illumination component to recover the reflectance, which is assumed to be the enhanced image. Subsequent works, such as LIME [20], introduced more robust structural priors for illumination map estimation. However, these methods rely on hand-crafted priors and heuristics, often struggling with severe noise, color distortion, and halo artifacts around sharp edges.
2.2. Deep Learning-Based Methods
With the advent of deep learning, data-driven methods have become the dominant paradigm in LLIE, demonstrating superior performance and robustness.
CNN-based Approaches: Convolutional Neural Networks (CNNs) have been widely adopted [21,22,23,24]. Early works like LLNet [11] used autoencoders for direct end-to-end enhancement. A significant number of methods combine deep learning with Retinex theory. For instance, RetinexNet [25] and KinD [16] use CNNs to learn the decomposition into illumination and reflectance, followed by separate adjustment and denoising steps. While effective, these methods often suffer from the limited receptive fields of CNNs, making it difficult to model global illumination variations, and their multi-stage pipelines can be complex to optimize. More recent zero-reference methods like Zero-DCE [9] and its successor [26] reformulate enhancement as a curve estimation problem, offering impressive efficiency. However, without physical constraints, they may sometimes produce results with color deviations or unnatural contrast.
Transformer-based Approaches: To address the locality of CNNs, Transformer-based models like Retinexformer have been introduced. By leveraging self-attention, they can capture long-range dependencies, leading to better global consistency and reduced artifacts. Their primary drawback, however, is the quadratic computational complexity with respect to image resolution, which limits their efficiency and practical deployment on resource-constrained devices.
Recent Advances and Emerging Architectures: The field continues to evolve rapidly, with researchers exploring novel domains and architectures. Recognizing that the frequency domain is adept at capturing global structural information, recent works like Freqspatnet [3] propose to learn collaboratively across both spatial and frequency domains. They aim to leverage the complementary characteristics of each domain—spatial for texture and local details, frequency for global structure. As an alternative to Transformers, State Space Models (SSMs) like Mamba have recently gained attention for their ability to model long-range dependencies with linear complexity. In the context of LLIE, recent work such as MambaLLIE [27] has demonstrated the potential of SSMs for efficient and effective enhancement. While these methods are promising, they are still in their infancy, and their integration within a physically grounded framework that explicitly handles noise and color fidelity has not been fully explored.
In summary, despite significant progress, existing methods face a persistent trade-off: CNNs are efficient but local; Transformers are global but computationally expensive. Furthermore, most methods operate within a single domain (e.g., spatial or frequency) or lack a robust mechanism to jointly model cross-domain interactions and long-range dependencies. This gap motivates the design of a unified framework that can efficiently capture global context while being grounded in a physical model, which is the primary focus of our proposed work.
In addition to methods specifically designed for low-light conditions, the broader field of image enhancement continues to see rapid advancements. For instance, recent work in optical imaging has explored novel frameworks for image restoration and detail enhancement. A study in [28] introduced a physics-informed model that leverages wave-optical principles to correct aberrations, achieving high-fidelity image recovery. Similarly, research in [29] proposed an advanced network architecture tailored for removing complex noise patterns specific to certain laser imaging systems. Furthermore, a lightweight framework for real-time enhancement was recently presented in [30], focusing on computational efficiency for dynamic scenes. While these methods demonstrate excellent performance in their specific application domains, our MambaDPF-Net addresses a different and unique set of challenges inherent to low-light photography. Unlike approaches that target sensor-specific noise or optical aberrations, our work focuses on the holistic problem of non-uniform illumination, color distortion, and signal-dependent noise. By integrating the physically grounded Retinex model with the selective state-space capabilities of Mamba, our dual-path fusion architecture provides a specialized solution that distinguishes it from these recent, yet distinct, advancements in the broader image enhancement landscape.
3. Approach
3.1. Overview
For an input low-light image I, we estimate the illumination L and reflectance R according to the Retinex formula I = R ⊙ L. The network comprises four synergistic branches: the sharpening subnetwork enhances high-frequency texture representation by incorporating edge gradient priors, providing a robust structural foundation for subsequent processing; the decoupling subnetwork adopts a dual-scale feature aggregation mechanism, achieving robust estimation of the illumination and reflectance components while maintaining physical interpretability; the denoising subnetwork adopts the proven R2RNet architecture, focusing on noise suppression in the reflectance domain, where its multi-stage residual learning mechanism balances artefact removal with detail preservation; and the coupling subnetwork employs attention mechanisms and a selective state space module to fuse spatial- and frequency-domain features, achieving globally consistent reconstruction. The final enhanced output is obtained by coupling the corrected illumination with the denoised reflectance under reconstruction consistency constraints. The overall architecture of the proposed MambaDPF-Net is illustrated in Figure 1.
Figure 1.
The proposed network architecture for the MambaDPF-Net. The network is composed of four distinct sub-networks: the sharp network, the decoupling network, the denoising network, and the coupling network. The sharp network’s role is to enhance the details of the edges. The decoupling network’s task involves separating the input low-light images into illumination and reflection components. Subsequently, a denoising network is employed to diminish noise within the reflectivity map. Finally, the coupling network integrates the illumination map and the noise-reduced reflection map from the decoupled network to generate an enhanced output.
3.2. Sharp-Net
To achieve detail enhancement, the sharpening subnetwork follows a ‘contrast pre-enhancement, multi-directional edge extraction, adaptive fusion’ workflow, whose effect is illustrated in Figure 2. It begins by applying Contrast-Limited Adaptive Histogram Equalisation (CLAHE) for local contrast enhancement, which amplifies feature gradients in underexposed regions while a preset clipping threshold suppresses noise amplification. Subsequently, an eight-directional Sobel [31] operator computes gradients across a comprehensive set of orientations (θ ∈ {0°, 45°, …, 315°}). The final gradient magnitude at each pixel, G(x, y), is determined by the maximum response across all directions, as defined by:

G(x, y) = max_θ |G_θ(x, y)|
where G_θ(x, y) represents the gradient response at pixel coordinates (x, y) for a given orientation θ. The resultant raw edge map is then refined using morphological opening and closing operations to suppress isolated noise pixels and connect discontinuous edge segments. This refined map serves as a spatial confidence map, W, to guide the final fusion. The enhanced image, I_enh, is synthesized through a detail-preserving fusion process:

I_enh = (1 − W_s) ⊙ I + W_s ⊙ I_CLAHE
Here, I is the input image, I_CLAHE is the contrast-enhanced intermediate image, ⊙ denotes element-wise multiplication, and W_s is the confidence map W after undergoing Gaussian smoothing to ensure spatial coherence and prevent halo artifacts.
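As a reference, the following is a minimal sketch of the Sharp-Net pipeline described above, written with OpenCV for a single-channel input. The clip limit, tile size, edge threshold, and kernel sizes are illustrative choices rather than the authors’ settings, and the eight directional responses are approximated by projecting the horizontal and vertical Sobel gradients onto each orientation instead of applying eight rotated kernels.

```python
import cv2
import numpy as np

def sharp_net(img_gray: np.ndarray) -> np.ndarray:
    # 1. CLAHE pre-enhancement; the clip limit bounds noise amplification.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(img_gray)

    # 2. Eight-directional gradient responses; keep the per-pixel maximum magnitude.
    gx = cv2.Sobel(enhanced, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(enhanced, cv2.CV_32F, 0, 1, ksize=3)
    responses = [np.abs(np.cos(np.deg2rad(t)) * gx + np.sin(np.deg2rad(t)) * gy)
                 for t in range(0, 360, 45)]
    grad = np.max(np.stack(responses, axis=0), axis=0)

    # 3. Morphological opening/closing removes isolated noise and bridges edge gaps.
    edge = (grad > grad.mean()).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    edge = cv2.morphologyEx(edge, cv2.MORPH_OPEN, kernel)
    edge = cv2.morphologyEx(edge, cv2.MORPH_CLOSE, kernel)

    # 4. Gaussian-smoothed confidence map W_s guides the detail-preserving fusion.
    w_s = cv2.GaussianBlur(edge.astype(np.float32), (7, 7), 0)
    fused = (1.0 - w_s) * img_gray.astype(np.float32) + w_s * enhanced.astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)
```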
Figure 2.
Visualization of the intermediate and final outputs of the Sharp-Net module. From left to right, the columns display: (a) original low-light image, (b) CLAHE pre-enhanced image, (c) eight-directional Sobel gradient map, (d) refined edge map after morphological operations, (e) edge saliency map after non-maximum suppression, (f) final enhanced image, and (g) the corresponding adaptive weight map used for fusion. Each row demonstrates the process on a different input image, showcasing the robustness of the method across various scenes.
3.3. Decouple-Net
The Decouple-Net is designed to robustly decompose
low-light images into illumination and reflectance components, which is crucial
for subsequent enhancement and denoising processes. Building upon Retinex
theory, our network employs a dual-scale decomposition strategy to capture both
global illumination trends and fine-grained structural details, ensuring
physically interpretable and high-quality component maps.
As illustrated in Figure 3, the Decouple-Net processes dual-scale inputs: the primary resolution
at the main scale and 1.5× upsampled resolution at the auxiliary scale. This
two-stage decomposition framework first extracts finer-grained structural
information at the auxiliary scale before returning to the main scale for
robust refinement. Both scales employ a cascaded “Residual Module (RM) +
Efficient Channel Attention (ECA)” architecture, where each RM incorporates
stacked convolutional kernels with sizes {1, 3, 3, 3, 1} and channel counts
{64, 128, 256, 128, 64} (see Figure 3,
bottom right). A 64 × 1 × 1 convolutional layer is added at the shortcut
connection to ensure feature dimension matching. The RM structure effectively
suppresses redundant information while emphasizing key channel responses
through the ECA mechanism, which adaptively recalibrates channel-wise feature
weights based on global context.
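A minimal PyTorch sketch of the cascaded “RM + ECA” block is given below. Only the {1, 3, 3, 3, 1} kernel sizes, the {64, 128, 256, 128, 64} channel widths, and the 1 × 1 shortcut convolution follow the text; the activation functions, the ECA kernel size, and the exact layer ordering are assumptions.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: 1D conv over globally pooled channel descriptors."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))                       # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)     # local cross-channel interaction
        return x * torch.sigmoid(y)[:, :, None, None]

class ResidualModule(nn.Module):
    def __init__(self, in_ch: int = 64):
        super().__init__()
        ks, chs = [1, 3, 3, 3, 1], [64, 128, 256, 128, 64]
        layers, prev = [], in_ch
        for k, c in zip(ks, chs):
            layers += [nn.Conv2d(prev, c, k, padding=k // 2), nn.ReLU(inplace=True)]
            prev = c
        self.body = nn.Sequential(*layers)
        self.shortcut = nn.Conv2d(in_ch, chs[-1], kernel_size=1)  # 64 x 1 x 1 shortcut
        self.eca = ECA()

    def forward(self, x):
        return self.eca(self.body(x) + self.shortcut(x))

print(ResidualModule()(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```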
Figure 3.
The network processes inputs at two scales: main scale (×1) and auxiliary scale (×1.5). Both scales utilize cascaded Residual Modules (RMs) with Efficient Channel Attention (ECA). The Bimodal Integration Unit (BIU) fuses cross-scale features. Detailed structures of the BIU and RM are shown in the bottom left and right insets, respectively. The network outputs decoupled reflectance (RDecouple-low) and illumination (IDecouple-low) maps.
Figure 4.
High-level architecture of the Couple-Net. It consists of three main components: SDEM, FDAM, and DIM. These modules process features in parallel, and their outputs are fused by DIM for the final reconstruction.
Concurrently, an attention-aware Bimodal
Integration Unit (BIU, Figure 3, bottom
left) is employed to align, recalibrate, and gate-fuse cross-scale features.
The BIU utilizes max pooling and average pooling to capture multi-scale
contextual information, which is then processed through convolutional layers
and concatenated to generate attention weights. This mechanism effectively
injects auxiliary-scale high-frequency details into main-scale illumination
estimation, enabling more precise modeling of illumination and reflectance
maps.
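The following is a hedged sketch of the BIU fusion described above: max- and average-pooled context from the (resized) auxiliary-scale features is convolved, concatenated, and turned into gating weights that inject auxiliary detail into the main-scale features. The layer widths, the residual form of the injection, and the bilinear alignment are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BIU(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.ctx_max = nn.Conv2d(ch, ch, 3, padding=1)
        self.ctx_avg = nn.Conv2d(ch, ch, 3, padding=1)
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())

    def forward(self, f_main, f_aux):
        # Align the auxiliary-scale (x1.5) features to the main-scale resolution.
        f_aux = F.interpolate(f_aux, size=f_main.shape[-2:], mode='bilinear',
                              align_corners=False)
        # Multi-scale context from max- and average-pooled descriptors.
        c_max = self.ctx_max(F.max_pool2d(f_aux, 2))
        c_avg = self.ctx_avg(F.avg_pool2d(f_aux, 2))
        ctx = F.interpolate(torch.cat([c_max, c_avg], dim=1),
                            size=f_main.shape[-2:], mode='bilinear',
                            align_corners=False)
        w = self.gate(ctx)            # attention weights in [0, 1]
        # Inject auxiliary high-frequency detail into the main-scale features.
        return f_main + w * f_aux

main, aux = torch.randn(1, 64, 56, 56), torch.randn(1, 64, 84, 84)  # x1 and x1.5 scales
print(BIU()(main, aux).shape)   # torch.Size([1, 64, 56, 56])
```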
During training, the Decouple-Net learns
decomposition using paired low-light (at both primary and auxiliary scales) and
normal-light images. The network leverages the shared reflection prior, i.e.,
the constraint that low-light and normal-light images of the same scene should
share the same reflectance map. Although paired images are used, this paradigm eliminates
the need for ground-truth illumination or reflectance maps. Instead, the
network learns from the consistency of reflectance and the smoothness of illumination,
embedded into the loss function. Importantly, the reflection and illumination
maps derived from normal-light images serve solely as references during the
decomposition process and are excluded from subsequent training and inference
stages to prevent information leakage.
3.4. Denoise-Net
The primary function of the Denoise-Net is to
address the noise amplification that often occurs in the reflectance map (R)
after decomposition. This step is crucial for preventing noise from degrading
the final enhanced image. To ensure effective noise suppression without
sacrificing critical image details, we adopt the Denoise-Net architecture
consistent with that described in the R2RNet paper, leveraging its proven
efficacy in balancing noise removal and detail preservation.
3.5. Couple-Net
The Couple-Net constitutes the final stage of our
framework, meticulously designed to reconstruct the enhanced image by
synergistically fusing features from both the spatial and frequency domains.
Acknowledging that spatial information (e.g., local textures, edges) and
frequency information (e.g., global brightness, periodic patterns) are
complementary, the Couple-Net executes a sophisticated workflow. As illustrated
in Figure 4, this network is composed of
three core components operating in parallel: the Spatial Domain Enhancement
Module (SDEM), the Frequency Domain Augmentation Module (FDAM), and the
Dual-Domain Information Integration Module (DIM), which ultimately merges their
outputs.
3.5.1. Spatial Domain Enhancement Module (SDEM)
The SDEM, whose architecture is detailed in Figure 5, is tasked with enhancing the spatial
details of the denoised reflectance map. It is built upon a U-Net-like encoder–decoder
structure to effectively capture multi-scale contextual information. The
process begins with a 3 × 3 dilated convolution, which expands the initial
receptive field without adding extra parameters. The encoder path comprises
four downsampling stages. In each stage, a 2 × 2 stride convolution is used for
downsampling, followed by a Residual Module (RM) and an Efficient Channel
Attention (ECA) block. This design choice purposefully avoids max-pooling
layers to prevent the irreversible loss of feature information. Symmetrically,
the decoder path uses 2 × 2 deconvolution layers to upsample the features. A
long skip connection links the feature maps from the first RM in the encoder to
the last RM in the decoder, ensuring that low-level, high-resolution features
are preserved and reused for fine-grained reconstruction. The output of this
module is a spatially refined feature map, denoted as S.
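A compact sketch of the SDEM layout is shown below: a dilated 3 × 3 entry convolution, four stride-2 downsampling stages, symmetric 2 × 2 deconvolutions, and a long skip connection. The `rm_eca_stub` block is a placeholder standing in for the paper’s RM + ECA pair, the channel width is assumed, and the long skip is reattached at the matching scale rather than at an exact layer position.

```python
import torch
import torch.nn as nn

def rm_eca_stub(ch):
    # Placeholder for the Residual Module + Efficient Channel Attention pair.
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

class SDEM(nn.Module):
    def __init__(self, in_ch: int = 3, ch: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, ch, 3, padding=2, dilation=2)  # dilated 3 x 3 entry
        self.down = nn.ModuleList(nn.Conv2d(ch, ch, 2, stride=2) for _ in range(4))
        self.enc = nn.ModuleList(rm_eca_stub(ch) for _ in range(4))
        self.up = nn.ModuleList(nn.ConvTranspose2d(ch, ch, 2, stride=2) for _ in range(4))
        self.dec = nn.ModuleList(rm_eca_stub(ch) for _ in range(4))

    def forward(self, x):                    # H and W should be divisible by 16
        x = self.stem(x)
        x = self.enc[0](self.down[0](x))     # first encoder stage (H/2)
        first = x                            # source of the long skip connection
        for d, e in zip(self.down[1:], self.enc[1:]):
            x = e(d(x))                      # H/4 -> H/8 -> H/16
        for u, dec in zip(self.up, self.dec):
            x = u(x)
            if x.shape[-2:] == first.shape[-2:]:
                x = x + first                # reuse low-level features at the same scale
            x = dec(x)
        return x                             # spatially refined feature map S

# Shape check at the 112 x 112 training crop size used in the paper.
print(SDEM()(torch.randn(1, 3, 112, 112)).shape)  # torch.Size([1, 64, 112, 112])
```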
Figure 5.
Architecture of the Spatial Domain Enhancement Module (SDEM). It features a U-Net structure equipped with stride convolutions for downsampling, Residual Modules (RMs), and Efficient Channel Attention (ECA) blocks.
3.5.2. Frequency Domain Augmentation Module (FDAM)
The FDAM, depicted in Figure 6,
is engineered to augment the feature representation within the frequency
domain. This domain is particularly adept at capturing global periodic patterns
and can mitigate potential information loss that occurs during domain
transformations. The module first converts spatial features into the frequency
domain using the Fast Fourier Transform (FFT). All subsequent operations,
including convolutions and residual connections, are performed using
complex-valued arithmetic. This approach preserves both the magnitude and phase
information, which is critical for a complete representation.
Figure 6.
Architecture of the Frequency Domain Augmentation Module (FDAM). It operates in the frequency domain using FFT, Complex Residual Blocks, and complex convolutions to process features.
A complex convolution between an input feature h = x + iy and a convolutional kernel W = A + iB is mathematically defined as follows:

W ∗ h = (A ∗ x − B ∗ y) + i(B ∗ x + A ∗ y)

In this equation, x is the real part of the input feature and A is the real part of the convolutional kernel; correspondingly, y is the imaginary part of the input feature and B is the imaginary part of the kernel. This
operation ensures that the rich information encoded in the complex domain is
fully leveraged. The FDAM also employs an encoder–decoder structure with
Complex ResBlocks to process these features, ultimately producing a
frequency-augmented feature map, F.
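A minimal PyTorch sketch of this complex-valued processing follows: the real and imaginary parts of the FFT spectrum are convolved with the real and imaginary parts of a learned kernel according to the rule above. The channel counts and the round-trip through `torch.fft` are illustrative.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    def __init__(self, cin: int, cout: int, k: int = 3):
        super().__init__()
        self.conv_r = nn.Conv2d(cin, cout, k, padding=k // 2)  # real part A of the kernel
        self.conv_i = nn.Conv2d(cin, cout, k, padding=k // 2)  # imaginary part B of the kernel

    def forward(self, x_r, x_i):
        # (A + iB) * (x + iy) = (A*x - B*y) + i(B*x + A*y)
        real = self.conv_r(x_r) - self.conv_i(x_i)
        imag = self.conv_i(x_r) + self.conv_r(x_i)
        return real, imag

# Usage: transform spatial features with the FFT, process in the complex domain,
# and transform back, keeping both magnitude and phase information.
feat = torch.randn(1, 64, 56, 56)
spec = torch.fft.fft2(feat)                       # complex-valued spectrum
out_r, out_i = ComplexConv2d(64, 64)(spec.real, spec.imag)
restored = torch.fft.ifft2(torch.complex(out_r, out_i)).real
```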
3.5.3. Dual-Domain Information Integration Module (DIM)
The DIM, detailed in Figure 7, serves as the intelligent core of the
Couple-Net. Its primary function is to fuse the spatial features S from
SDEM with the frequency features F from FDAM in a sophisticated,
multi-stage manner.
Figure 7.
Architecture of the Dual-Domain Information Integration Module (DIM). It fuses spatial (S) and frequency (F) features using a combination of local and global attention, a central SS2D (Mamba) block for long-range modeling, and subsequent pixel/channel attention mechanisms for refinement.
The fusion process unfolds as follows:
Initial Attention Gating: The input features S
and F first pass through separate Local Channel Attention blocks to
recalibrate their channel-wise responses independently. Concurrently, a Global
Interaction Attention mechanism is applied between them to explicitly model
cross-domain dependencies.
Long-Range Dependency Modeling: The concatenated
features from the initial gating stage are then fed into a 2D Selective State
Space (SS2D) module. This module, an adaptation of the Mamba architecture for
2D visual data, is pivotal for our fusion task. Unlike Transformers which apply
uniform attention, the SS2D module leverages a selective state space model.
This allows it to dynamically modulate the propagation of information across
the image based on local feature characteristics. For low-light enhancement,
this means it can establish long-range connections between salient structural
features in brighter areas while simultaneously preventing noise amplification
from darker regions, achieving a more context-aware and robust feature fusion.
It captures global context with near-linear computational complexity,
overcoming the quadratic complexity limitations of standard self-attention
mechanisms in Transformers.
Hierarchical Feature Refinement: The output from
the SS2D module undergoes a final refinement step. It is processed through
parallel pixel attention and channel attention blocks, which work together to
adaptively highlight the most salient spatial locations and informative feature
channels.
The resulting features are fused and passed through
a final convolution layer to produce the integrated feature map. This map is
then processed by a shallow block of 3 × 3 and 1 × 1 convolutions to generate
the final, high-quality enhanced image.
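The sketch below mirrors this fusion flow at a schematic level. The SS2D block is treated as an abstract module (any Mamba/VMamba-style 2D selective-scan implementation with matching channel width could be plugged in), and the local channel attention, global interaction, and pixel/channel attention blocks are simplified stand-ins with assumed layer sizes.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.fc = nn.Sequential(nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.fc(x.mean(dim=(2, 3), keepdim=True))

class PixelAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.conv(x)

class DIM(nn.Module):
    def __init__(self, ch: int, ss2d_block: nn.Module):
        super().__init__()
        self.lca_s = ChannelAttention(ch)   # local channel attention for spatial features S
        self.lca_f = ChannelAttention(ch)   # local channel attention for frequency features F
        self.ss2d = ss2d_block              # selective state space block over 2*ch channels
        self.pa = PixelAttention(2 * ch)
        self.ca = ChannelAttention(2 * ch)
        self.out = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1),
                                 nn.Conv2d(ch, 3, 1))   # shallow 3x3 + 1x1 output block

    def forward(self, s, f):
        # Initial attention gating with a simple cross-domain interaction term.
        s, f = self.lca_s(s), self.lca_f(f)
        cross = torch.sigmoid(s) * f + torch.sigmoid(f) * s
        x = torch.cat([s + cross, f + cross], dim=1)
        x = self.ss2d(x)                    # long-range, content-selective fusion
        x = self.pa(x) + self.ca(x)         # parallel pixel/channel refinement
        return self.out(x)

# Shape test with an identity stand-in for the SS2D block.
s = f = torch.randn(1, 64, 56, 56)
print(DIM(64, nn.Identity())(s, f).shape)   # torch.Size([1, 3, 56, 56])
```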
3.6. Multi-Task Learning Framework
To enhance the convergence stability and visual consistency of low-light enhancement, this paper employs a multi-task joint training strategy that simultaneously incorporates the decoupling, denoising, and coupling objectives into the optimisation process. The overall loss function comprises L_Decouple, L_Denoise, and L_Couple, with each branch consisting of both content and
perceptual components. The content term emphasises measurable consistency at
the pixel and structural levels, while the perceptual term constrains
high-level semantic and subjective quality. This dual approach collaboratively
prevents overfitting or excessive smoothing that may arise from relying solely
on a single metric.
The content loss is an L2 loss applied to the outputs of the three sub-networks, measuring the pixel-wise distance between each branch output and its corresponding reference, where I_low and I_normal denote the input low-light image and the normal-light image, respectively.
The perceptual loss is subsequently computed as the distance between features extracted by the pre-trained VGG-16 model from each branch output and from its reference.
In summary, the final loss of MambaDPF-Net within the multi-task learning framework is defined as:

L_total = L_Decouple + L_Denoise + L_Couple

where the Decouple-Net estimates the reflectance and illumination maps (R_Decouple-low and I_Decouple-low) from the low-light input, the Denoise-Net produces the denoised reflectance map, the Couple-Net outputs the enhanced image, and the perceptual terms use the feature extractor of the pre-trained VGG-16 network.
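A hedged sketch of this multi-task objective is given below. Only the L2 content terms, the VGG-16 perceptual terms, and the three-branch structure follow the text; the exact prediction/reference pairings, the perceptual layer depth, and the weighting factor are assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    def __init__(self, layer: int = 16):
        super().__init__()
        # Frozen VGG-16 feature extractor (depth chosen for illustration).
        vgg = torchvision.models.vgg16(weights='IMAGENET1K_V1').features[:layer].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.l2 = vgg, nn.MSELoss()

    def forward(self, pred, target):
        return self.l2(self.vgg(pred), self.vgg(target))

def total_loss(out, ref, perceptual, w_perc: float = 0.1):
    """out/ref: dicts with 'decouple', 'denoise', 'couple' image tensors (B, 3, H, W)."""
    l2, loss = nn.MSELoss(), 0.0
    for key in ('decouple', 'denoise', 'couple'):
        # Each branch combines an L2 content term and a VGG-16 perceptual term.
        loss = loss + l2(out[key], ref[key]) + w_perc * perceptual(out[key], ref[key])
    return loss
```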
4. Experiments
4.1. Implementation Details
The experiments were implemented with the PyTorch 2.1 deep learning framework on the Windows operating system. Trained on the publicly available LSRW and LOL datasets, the model converged after 40 epochs on an NVIDIA 3080 Ti GPU and performed well across validation sets including LOL, LSRW, LIME, DICM, and VV. For parameter configuration, the Adam optimiser was employed for gradient updates with an initial learning rate of 0.001, and the L2 loss was adopted as the content term to ensure stable convergence. The batch size was fixed at 6, with training patches cropped to 112 × 112 pixels. In addition, a learning rate decay strategy reduced the learning rate to 10% of its previous value every 10 training epochs.
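The reported optimisation settings translate into the following minimal PyTorch sketch (Adam, initial learning rate 0.001, decay to 10% every 10 epochs over 40 epochs); the one-layer module merely stands in for MambaDPF-Net, and data loading and loss computation are omitted.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # placeholder for MambaDPF-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(40):
    # ... one pass over 112 x 112 crops with batch size 6, multi-task loss (Section 3.6) ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())   # learning rate drops to 10% every 10 epochs
```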
Datasets. For quantitative evaluation, we
utilized two paired datasets: the LOL dataset, which is divided into a training
set of 485 low/normal-light image pairs and a testing set of 15 pairs, and the
LSRW dataset, consisting of 5600 pairs for training and 100 for testing. To
assess the generalization capabilities of our method on real-world unpaired
data, our qualitative evaluations were performed on the LIME (10 images), VV
(24 images), and DICM (69 images) datasets.
4.2. Comparison with State-of-the-Art Methods on Real Datasets
To comprehensively evaluate the performance of our
proposed MambaDPF-Net, we conducted extensive comparisons with several
state-of-the-art (SOTA) low-light image enhancement methods. The evaluation is
twofold: quantitative analysis on paired datasets and qualitative analysis on
unpaired real-world datasets.
For the quantitative assessment, we benchmarked our
model against prominent methods on three widely used paired datasets: LOL-v1,
LOL-v2(real), and LSRW. The results summarized in Table 1 and the effects shown in Figure 8 clearly demonstrate the superiority of
our approach. As shown in the table, our method consistently achieves the
highest scores in both PSNR and SSIM across all three datasets.
Table 1.
Quantitative comparison with state-of-the-art methods on the LOL-v1, LOL-v2 and LSRW datasets. The best results are highlighted in bold.
Figure 8.
Visual comparison on image 55 from the LOL dataset and image 2060 from the LSRW dataset. From left to right: low-light input, Zero-DCE, RetinexNet, R2RNet, CNTNet, Retinexformer, ours, and normal light.
Notably, when compared to the current
state-of-the-art method, Retinexformer, our model shows significant
improvements. On the LOL-v2(real) dataset, our method surpasses Retinexformer
by 0.57 dB in PSNR and 0.013 in SSIM. The performance gain is even more
pronounced on the LSRW dataset, where our approach achieves an increase of
0.719 dB in PSNR and a substantial 0.252 in SSIM, showcasing its robustness. On
the LOL-v1 dataset, our method also maintains a competitive edge, outperforming
all other listed techniques. These consistent quantitative improvements
validate the effectiveness of our proposed network architecture in restoring
high-fidelity images from low-light conditions.
While quantitative metrics on paired datasets
provide a valuable measure of reconstruction fidelity, evaluating performance
on real-world, unpaired datasets is crucial for assessing a method’s practical
utility and generalization capabilities. To this end, we further conducted
qualitative comparisons on the challenging DICM and VV datasets, which feature
complex lighting conditions and lack ground-truth references.
As illustrated in Figure 9 and in Table 2, our method
exhibits remarkable generalization ability on these challenging real-world
images. In the DICM example (top row), our approach effectively enhances the
brightness of the indoor corridor while successfully preserving the details
visible through the glass doors without introducing over-exposure artifacts.
Compared to Retinexformer, which produces a slightly washed-out result with
less distinct colors, our method yields a more balanced contrast and superior
color fidelity.
Figure 9.
Qualitative comparison of the DICM and VV datasets. From left to right: Low-light Input, Retinexformer, and our method. Our proposed method demonstrates superior performance in restoring details and preserving natural colors in challenging real-world scenarios.
Table 2.
NIQE scores of the different methods on the LIME, VV and DICM dataset, with the best results highlighted in bold.
For the severely backlit image from the VV dataset
(bottom row), our model demonstrates its strength in handling extreme dynamic
ranges. It successfully illuminates the subject in the foreground while
retaining the rich details and vibrant colors of the sky and sea. In contrast,
the competing method struggles to balance the scene, leading to a loss of
texture in the highlight regions of the sky. These visual comparisons
underscore the robustness and superior performance of our model, confirming its
effectiveness in handling diverse and complex lighting conditions found in
practical applications.
4.3. Ablation Experiment
To meticulously validate the contribution of each
key component in our MambaDPF-Net, we conducted a comprehensive ablation study
on the LOL dataset. We started with a baseline model and progressively
integrated our proposed modules. The specific configurations are as follows,
with results presented in Table 3:
Baseline: A simplified version of our network,
using the basic U-Net structure from R2RNet as the backbone for both
decomposition and coupling, without the sharpening prior, BIU, SDEM, FDAM, and
SS2D.
Baseline + Sharp: The sharpening prior (Sharp-Net)
is added to the baseline.
Baseline + Sharp + BIU: The Bimodal Integration
Unit (BIU) is added to the decouple-net of the previous model.
Ours (MambaDPF-Net): The SS2D (Mamba) module is
added to the DIM, replacing a standard convolution-based fusion block.
Table 3.
Ablation study on the LOL(total) dataset.
| Methods | LOL(PSNR) | LOL(SSIM) | LSRW(PSNR) | LSRW(SSIM) |
|---|---|---|---|---|
| Baseline | 18.131 | 0.712 | 20.207 | 0.816 |
| Baseline + Sharp | 18.374 | 0.728 | 20.211 | 0.816 |
| Baseline + Sharp + BIU | 22.222 | 0.831 | 20.216 | 0.817 |
| Ours | 26.24 | 0.943 | 20.259 | 0.838 |
4.4. Complexity and Efficiency Analysis
To assess the computational profile of our model,
we benchmarked its complexity and inference speed against key counterparts on
an NVIDIA 3080 Ti GPU, using a uniform input size of 112 × 112 pixels. The
results are presented in Table 4.
Table 4.
Complexity and efficiency comparison.
The analysis highlights the exceptional efficiency
of our proposed MambaDPF-Net. Our model achieves a real-time speed of 30 FPS
with a computational load of only 13.7 GFLOPs. This compares favourably
with Retinexformer, which requires more computation
(15.6 GFLOPs) to attain its speed.
Crucially, our MambaDPF-Net strikes a superior
performance-efficiency trade-off. While delivering substantially higher
enhancement quality (as evidenced by PSNR/SSIM in Table 1 and Table 2), our model operates at a much
lower computational budget than Retinexformer. This demonstrates that our
architecture is not only more powerful but also more computationally efficient,
establishing it as a highly practical and advanced solution for real-world
deployment.
5. Discussion
In summary, this paper constructs a more robust low-light
enhancement pathway atop the R2RNet framework: through adaptive histogram
remapping coupled with modified Sobel gradient guidance, it performs prior
correction on input distribution and edge information; the dual-scale decomposition
captures global illumination trends and fine-grained reflectance details; while
the cross-domain integration unit organically couples frequency-domain
structure with spatial texture, mitigating the persistent trade-off between noise
amplification in dark regions and detail loss. The joint training strategy
establishes complementary constraints at the pixel, structural, and perceptual
levels, enabling more balanced improvements in contrast, colour constancy, and
local sharpness within the reconstructed results. Experiments demonstrate
consistent superiority over the original R2RNet across benchmarks, including
LOL, LSRW, LIME, and VV, delivering consistent subjective and objective gains
with faster convergence and reduced artefacts.
6. Conclusions
We proposed a Retinex-guided dual-path fusion network with selective state space modeling for low-light enhancement. By combining a sharpening prior, illumination–reflectance decoupling, illumination-aware denoising and frequency–spatial coupling via Mamba, the method achieves consistent improvements over recent Retinex and Transformer baselines, with practical complexity scaling. Comprehensive evaluations, ablations and efficiency analyses substantiate the effectiveness and robustness of the design.
Author Contributions
The authors confirm
contribution to the paper as follows: study conception and design: Z.Z.; data
collection: Z.Z.; analysis and interpretation of results: Z.Z. and S.Y.; draft
manuscript preparation: S.Y. All authors have read and agreed to the published
version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The data that support the findings of this study are available from the author, [Zikang Zhang].
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| CNNs | Convolutional Neural Networks |
| BIU | Bimodal Integration Unit |
| RM | Residual Module |
| SDEM | Spatial Domain Enhancement Module |
| FDAM | Frequency Domain Augmentation Module |
| DIM | Dual-Domain Information Integration Module |
| SFCB | Spatial-Frequency Conversion Block |
| DRB | Detail Recovery Block |
| CAM | Cross-attention Module |
| LCA | Local Channel Attention |
| GISA | Global Interaction Semantic Attention |
References
- Zhang, B.; Shu, D.; Fu, P.; Yao, S.; Chong, C.; Zhao, X.; Yang, H. Multi-Feature Fusion Yolo Approach for Fault Detection and Location of Train Running Section. Electronics 2025, 14, 3430. [Google Scholar] [CrossRef]
- Rodríguez-Lira, D.-C.; Córdova-Esparza, D.-M.; Terven, J.; Romero-González, J.-A.; Alvarez-Alvarado, J.M.; González-Barbosa, J.-J.; Ramírez-Pedraza, A. Recent Developments in Image-Based 3d Reconstruction Using Deep Learning: Methodologies and Applications. Electronics 2025, 14, 3032. [Google Scholar] [CrossRef]
- Guan, Y.; Liu, M.; Chen, X.; Wang, X.; Luan, X. Freqspatnet: Frequency and Spatial Dual-Domain Collaborative Learning for Low-Light Image Enhancement. Electronics 2025, 14, 2220. [Google Scholar] [CrossRef]
- Sun, Y.; Hu, S.; Xie, K.; Wen, C.; Zhang, W.; He, J. Enhanced Deblurring for Smart Cabinets in Dynamic and Low-Light Scenarios. Electronics 2025, 14, 488. [Google Scholar] [CrossRef]
- Choi, D.H.; Jang, I.H.; Kim, M.H.; Kim, N.C. Color Image Enhancement Based on Single-Scale Retinex with a Jnd-Based Nonlinear Filter. In Proceedings of the 2007 IEEE International Symposium on Circuits and Systems (ISCAS), New Orleans, LA, USA, 27–30 May 2007. [Google Scholar]
- Rahman, Z.; Jobson, D.J.; Woodell, G.A. Multi-Scale Retinex for Color Image Enhancement. In Proceedings of the 3rd IEEE International Conference on Image Processing, Lausanne, Switzerland, 19 September 1996; pp. 1003–1006. [Google Scholar]
- Parthasarathy, S.; Sankaran, P. An Automated Multi Scale Retinex with Color Restoration for Image Enhancement. In Proceedings of the 2012 National Conference on Communications (NCC), Kharagpur, India, 3–5 February 2012. [Google Scholar]
- Fu, Y.; Hong, Y.; Chen, L.; You, S. Le-Gan: Unsupervised Low-Light Image Enhancement Network Using Attention Module and Identity Invariant Loss. Knowl. Based Syst. 2022, 240, 108010. [Google Scholar] [CrossRef]
- Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1780–1789. [Google Scholar]
- Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. Enlightengan: Deep Light Enhancement without Paired Supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef]
- Lore, K.G.; Akintayo, A.; Sarkar, S. Llnet: A Deep Autoencoder Approach to Natural Low-Light Image Enhancement. Pattern Recognit. 2017, 61, 650–662. [Google Scholar] [CrossRef]
- Lv, F.; Lu, F.; Wu, J.; Lim, C.S. Mbllen: Low-Light Image/Video Enhancement Using Cnns. In Proceedings of the British Machine Vision Conference (BMVC 2018), Newcastle, UK, 3–6 September 2018. [Google Scholar]
- Moran, S.; Marza, P.; McDonagh, S.; Parisot, S.; Slabaugh, G. Deeplpf: Deep Local Parametric Filters for Image Enhancement. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Sharma, A.; Tan, R.T. Nighttime Visibility Enhancement by Increasing the Dynamic Range and Suppression of Light Effects. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021; pp. 11977–11986. [Google Scholar]
- Hai, J.; Hao, Y.; Zou, F.; Lin, F.; Han, S. Advanced Retinexnet: A Fully Convolutional Network for Low-Light Image Enhancement. Signal Process. Image Commun. 2023, 112, 116916. [Google Scholar] [CrossRef]
- Zhang, Y.; Zhang, J.; Guo, X. Kindling the Darkness: A Practical Low-Light Image Enhancer. In Proceedings of the 27th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1632–1640. [Google Scholar]
- Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-Inspired Unrolling with Cooperative Prior Architecture Search for Low-Light Image Enhancement. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021. [Google Scholar]
- Wang, R.; Zhang, Q.; Fu, C.W.; Shen, X.; Zheng, W.S.; Jia, J. Underexposed Photo Enhancement Using Deep Illumination Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Jobson, D.J.; Rahman, Z.; Woodell, G.A. A Multiscale Retinex for Bridging the Gap between Color Images and the Human Observation of Scenes. IEEE Trans. Image Process. 1997, 6, 965–976. [Google Scholar] [CrossRef] [PubMed]
- Guo, X.; Li, Y.; Ling, H. Lime: Low-Light Image Enhancement Via Illumination Map Estimation. IEEE Trans. Image Process. 2017, 26, 982–993. [Google Scholar] [CrossRef] [PubMed]
- Fu, X.; Zeng, D.; Huang, Y.; Liao, Y.; Ding, X.; Paisley, J. A Fusion-Based Enhancing Method for Weakly Illuminated Images. Signal Process. 2016, 129, 82–96. [Google Scholar] [CrossRef]
- Li, M.; Liu, J.; Yang, W.; Sun, X.; Guo, Z. Structure-Revealing Low-Light Image Enhancement Via Robust Retinex Model. IEEE Trans. Image Process. 2018, 27, 2828–2841. [Google Scholar] [CrossRef]
- Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep Retinex Decomposition for Low-Light Enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar] [CrossRef]
- Subramani, B.; Veluchamy, M. Fuzzy Gray Level Difference Histogram Equalization for Medical Image Enhancement. J. Med. Syst. 2020, 44, 103. [Google Scholar] [CrossRef]
- Weng, J.; Yan, Z.; Tai, Y.; Qian, J.; Yang, J.; Li, J. Mamballie: An Efficient Low-Light Image Enhancement Model Based on State Space. arXiv 2024, arXiv:2405.16105v1. [Google Scholar]
- Li, C.; Guo, C.; Loy, C.C. Learning to Enhance Low-Light Image Via Zero-Reference Deep Curve Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4225–4238. [Google Scholar] [CrossRef] [PubMed]
- Yang, S.; Zhou, D. Cybernetics Efficient Low-Light Image Enhancement with Model Parameters Scaled Down to 0.02M. Int. J. Mach. Learn. Cyber. 2024, 15, 1575–1589. [Google Scholar] [CrossRef]
- Zhou, H.; Zeng, X.; Lin, B.; Li, D.; Ali Shah, S.A.; Liu, B.; Guo, K.; Guo, Z. Polarization Motivating High-Performance Weak Targets’ Imaging Based on a Dual-Discriminator Gan. Opt. Express 2024, 32, 3835–3851. [Google Scholar] [CrossRef]
- Fan, X.; Ding, M.; Lv, T.; Sun, X.; Lin, B.; Guo, Z. Meta-Dnet-Upi: Efficient Underwater Polarization Imaging Combining Deformable Convolutional Networks and Meta-Learning. Opt. Laser Technol. 2025, 187, 112900. [Google Scholar] [CrossRef]
- Lin, B.; Qiao, L.; Fan, X.; Guo, Z. Large-Range Polarization Scattering Imaging with an Unsupervised Multi-Task Dynamic-Modulated Framework. Opt. Lett. 2025, 50, 3413–3416. [Google Scholar] [CrossRef]
- Chen, S.; Yang, X. An Enhanced Adaptive Sobel Edge Detector Based on Improved Genetic Algorithm and Non-Maximum Suppression. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021. [Google Scholar]
- Hai, J.; Xuan, Z.; Yang, R.; Hao, Y.; Zou, F.; Lin, F.; Han, S. R2rnet: Low-Light Image Enhancement Via Real-Low to Real-Normal Network. J. Vis. Commun. Image Represent. 2023, 90, 103712. [Google Scholar] [CrossRef]
- Cai, Y.; Bian, H.; Lin, J.; Wang, H.; Timofte, R.; Zhang, Y. Retinexformer: One-Stage Retinex-Based Transformer for Low-Light Image Enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023. [Google Scholar]