Article

GLMA: Global-to-Local Mamba Architecture for Low-Light Image Enhancement

1 The State Key Laboratory of Robotics and Intelligent System, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang 110159, China
4 The Key Laboratory of Manufacturing Industrial Integrated Automation, Shenyang University, Shenyang 110044, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(20), 10931; https://doi.org/10.3390/app152010931
Submission received: 4 September 2025 / Revised: 4 October 2025 / Accepted: 5 October 2025 / Published: 11 October 2025

Abstract

In recent years, Mamba has gained increasing importance in the field of image restoration, gradually outperforming traditional convolutional neural networks (CNNs) and Transformers. However, the existing Mamba-based networks mainly focus on capturing global contextual relationships and neglect the crucial impact of local feature interactions on restoration performance in low-light environments. These environments inherently require the joint optimization of multi-scale spatial dependencies and frequency-domain characteristics. The traditional CNNs and Transformers face challenges in modeling long-range dependencies, while State Space Models (SSMs) in Mamba demonstrate proficiency in sequential modeling yet exhibit limitations in fine-grained feature extraction. To address the limitations of existing methods in capturing global degradation patterns, this paper proposes a novel global-to-local feature extraction framework through systematic Mamba integration. The Low-Frequency Mamba Block (LFMBlock) is introduced first to perform refined feature extraction in the low-frequency domain. The High-Frequency Guided Enhancement Block (HFGBlock) is used, which utilizes low-frequency priors to compensate for texture distortions in high-frequency components. Comprehensive experiments on multiple benchmark datasets show that the Global-to-Local Mamba architecture achieves superior performance in low-light restoration and image enhancement. It significantly outperforms state-of-the-art methods in both quantitative metrics and visual quality preservation, especially in recovering edge details and suppressing noise amplification under extreme illumination conditions. The hierarchical design effectively bridges global structural recovery with local texture refinement, setting a new paradigm for frequency-aware image restoration.

1. Introduction

Within the realm of image restoration technologies [1,2], low-light enhancement [3,4], together with other crucial research areas like dehazing [5,6], motion deblurring [7], and super-resolution reconstruction [8,9], form the core technological spectrum for enhancing digital image quality. As a cutting-edge research domain in computer vision, low-light image enhancement primarily confronts technical challenges that stem from the dual constraints of inadequate illumination and the inherent physical limitations of imaging sensors. These constraints collectively lead to nonlinear radiometric distortions, loss of frequency-domain information, and spatially heterogeneous noise patterns in captured images. These degradations not only impair human visual perception but also degrade the performance of downstream vision tasks such as object detection and semantic segmentation. Therefore, effective low-light enhancement is essential for both human-centric and machine vision applications.
The advent of convolutional neural networks [10] (CNNs) has revolutionized the landscape of deep learning, propelling hierarchical feature representation-based methods [11] to become the dominant paradigm. By leveraging multi-scale convolutional kernels to aggregate local receptive fields, CNN architectures have achieved significant improvements in objective metrics like PSNR. Their inherent weight-sharing property effectively reduces the risk of model overfitting. However, the inductive bias imposed by fixed receptive fields in CNNs limits their ability to model global illumination, while the parameter-sharing mechanism restricts their adaptability to diverse degradation patterns, particularly in scenarios requiring spatially variant restoration of non-uniform illumination artifacts and noise distributions. This architectural limitation originates from the inherent conflict between the local operation paradigm of CNNs and the global–local interdependence inherent in low-light enhancement tasks.
Recent advancements in vision enhancement have underscored the transformative potential of Transformer architectures, which transcend spatial constraints through self-attention mechanisms. Although the global attention computation in Transformer-based architectures [12,13] establishes long-range pixel dependencies, the quadratic increase in computational demands for high-resolution images introduces a new trade-off between perceptual quality and model efficiency.
Amid growing interest in efficient long-range modeling, Mamba [14] has emerged as a compelling alternative to attention-based architectures, offering a global receptive field at linear complexity through its selective scanning mechanism. Nonetheless, its inherently unidirectional, sequential prior struggles to capture fine-grained local structures and is easily perturbed by complex noise distributions, particularly when simultaneous multi-scale texture preservation and illumination correction are required. Recent remedies have introduced non-causal scanning strategies, hybrid CNN-SSM blueprints, and frequency-domain gating to inject local inductive biases, yet a systematic framework that seamlessly integrates global context with localized, frequency-aware discrimination remains elusive. In the Mamba framework [15,16], the Visual State Space Module [17] (VSSM) is adept at modeling long-range dependencies with linear complexity, thus serving as a computationally efficient alternative to conventional Transformer architectures. Accordingly, we incorporate the VSSM as the global feature extraction module within our proposed framework. This integration capitalizes on the module’s ability to achieve a balance between computational efficiency and extensive contextual modeling. It enables robust hierarchical representation learning across both spatial and frequency domains, while preserving the fidelity of local texture reconstruction.
To address these limitations and enhance Mamba’s local feature extraction capabilities, we introduce the Global-to-Local Mamba framework. This framework integrates CNN-based modules to augment local feature extraction, alleviate localized color distortions, and refine restoration quality in fine-grained regions. The Multi-Scale Feedforward Network (MSFFN), which builds on the GDFFN foundation, employs a sequential structure to capture multi-scale features and enable precise recovery of textures and structural details. This framework proposes a novel approach for low-light image enhancement by combining Mamba, wavelet transforms, and a global-to-local feature extraction strategy within a U-Net architecture. The hierarchical design synergizes Mamba’s linear-complexity global modeling with CNN’s localized inductive bias, facilitating adaptive illumination correction and artifact suppression across both spatial and frequency domains.
Our main contributions are summarized as follows:
  • We propose a Global-to-Local Mamba network for low-light image restoration that effectively captures intricate global and local dependency relationships.
  • We incorporate wavelet transforms to avert information loss during downsampling and design a framework built around a Global Mamba Block, based on the Visual State Space Module (VSSM), that models long-range dependencies and progressively distills multi-level features.
  • We integrate a novel Multi-Scale Feedforward Network (MSFFN) to supply complementary structural features across scales, leveraging rich low-frequency feature information to guide the restoration of high-frequency details.
  • Extensive experiments conducted across multiple benchmark datasets demonstrate the superior performance of the proposed model, achieving state-of-the-art results in quantitative metrics while maintaining exceptional visual fidelity in restored images.

2. Related Works

2.1. Low-Light Image Enhancement

Low-light image restoration is dedicated to recovering image details and textures in low-light conditions. With the advent of novel paradigms and frameworks, this field has witnessed significant advancements. Traditional low-light image restoration methods [18,19,20] commonly utilize techniques such as histogram equalization, gamma correction, and Retinex theory [21]. These methodologies enhance image contrast and brightness by either adjusting grayscale distributions or applying nonlinear functions to modify pixel values. In recent years, the progression of deep learning [22] has spurred the continuous proposal of data-driven approaches for low-light image enhancement (LLIE), as illustrated in Figure 1. LLNet is a pioneering CNN-based method in this domain, employing a stacked sparse denoising autoencoder architecture to concurrently achieve noise suppression and illumination enhancement in low-light images. In the realm of LLIE, numerous methods focus on improving illumination through dynamic range expansion and contrast enhancement. Among these, Histogram Equalization (HE)-based approaches are particularly noteworthy, encompassing Global HE (GHE) and Local HE (LHE). Despite their popularity due to simplicity and computational efficiency, these methods often introduce artifacts such as over-enhancement or under-enhancement, which can result in detail loss in the processed images.
Zero-DCE [23] and its enhanced version, Zero-DCE++ [24], approach low-light image enhancement (LLIE) as an image-specific curve estimation task. They investigate unsupervised learning frameworks to enhance the generalization capability of networks. These methods achieve substantial progress in enhancing image brightness and visual appeal by adaptively estimating nonlinear pixel-wise mappings. They effectively circumvent the reliance on paired training data while maintaining natural color consistency. Moreover, diffusion models, such as the DiffLL network [25], have been employed to generate realistic details in low-light tasks through a denoising process. Recently, RetinexMamba [26] and MambaLLIE [27] have explored the use of Mamba-based networks to tackle LLIE challenges. Despite the promising performance of current LLIE methods, they predominantly concentrate on the influence of global features, often overlooking the crucial guiding role of local information in refining global representations. This oversight restricts their ability to integrate fine-grained details with holistic scene understanding, especially in scenarios that demand precise illumination recovery and texture preservation.

2.2. State Space Models (SSMs)

In recent years, there have been significant advancements in deep learning frameworks based on State Space Models (SSMs) [17]. The SSM architecture exhibits remarkable computational efficiency in modeling long-range dependencies, primarily owing to its linear computational complexity in sequence processing. This is a substantial advantage over self-attention, whose cost grows quadratically with sequence length, and over traditional Recurrent Neural Networks (RNNs) [28], which are prone to gradient decay in long-sequence scenarios. Given a one-dimensional input sequence $x(t) \in \mathbb{R}$, the system maps it to a one-dimensional output sequence $y(t) \in \mathbb{R}$ through a latent state $h(t) \in \mathbb{R}^N$. This dynamical process is formally defined by a linear ordinary differential equation (ODE), where the evolution of the latent state explicitly governs the input–output mapping:
$h'(t) = A h(t) + B x(t)$
$y(t) = C h(t) + D x(t)$
Owing to the intrinsically analog characteristics of continuous-time state-space models, discretization is essential for their computational implementation. The discrete analogs of the continuous parameters $A$ and $B$, represented as $\bar{A}$ and $\bar{B}$, are obtained via the zero-order hold (ZOH) method [29]. This transformation ensures compatibility with digital systems while maintaining the dynamic properties of the original model. The ZOH-based discretization process can be mathematically formulated as follows:
$h_t = \bar{A} h_{t-1} + \bar{B} x_t$
$y_t = C h_t + D x_t$
$\bar{A} = e^{\Delta A}$
$\bar{B} = (\Delta A)^{-1} (e^{\Delta A} - I) \cdot \Delta B$
where Δ denotes a step size.
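To make the discretization concrete, the following PyTorch sketch applies the ZOH formulas above to a toy diagonal SSM and runs the resulting recurrence over a short sequence. The diagonal parameterization, the step size, and all variable names are illustrative assumptions, not the implementation used in any particular Mamba variant.

```python
import torch

def zoh_discretize(A, B, delta):
    """Zero-order-hold discretization of a diagonal continuous SSM.

    A, B:   (N,) diagonal state matrix and input vector (toy, per-state).
    delta:  scalar step size.
    Returns the discrete parameters A_bar, B_bar used in the recurrence
    h_t = A_bar * h_{t-1} + B_bar * x_t.
    """
    A_bar = torch.exp(delta * A)
    # (Delta A)^-1 (exp(Delta A) - I) * (Delta B), element-wise for a diagonal system
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_scan(x, A, B, C, D, delta=0.1):
    """Run the discretized recurrence over a 1-D sequence x of shape (L,)."""
    A_bar, B_bar = zoh_discretize(A, B, delta)
    h = torch.zeros_like(A)
    ys = []
    for x_t in x:                      # sequential scan: linear in sequence length
        h = A_bar * h + B_bar * x_t
        ys.append((C * h).sum() + D * x_t)
    return torch.stack(ys)

# toy usage with a 4-dimensional latent state
A = -torch.rand(4) - 0.5               # stable (negative) diagonal dynamics
B, C, D = torch.rand(4), torch.rand(4), torch.tensor(1.0)
y = ssm_scan(torch.randn(32), A, B, C, D)
```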
This approach innovatively establishes a data-driven dynamic parameter selection mechanism. It not only outperforms Transformer models on natural language processing benchmarks but also retains the inherent linear complexity advantage with respect to input scale. Among these advancements, MambaIR [1] pioneers the first end-to-end optimization framework for image super-resolution tasks based on State Space Models (SSMs), while EVSSM constructs a lightweight SSM-based network for motion deblurring. More recently, RetinexMamba [26] reformulates the classical Retinex decomposition as a continuous-time state-space model and solves it with a discretized Mamba backbone, thereby endowing low-light enhancement with a clear physical interpretation and strict linear complexity. To fully realize the potential of Mamba in the low-illumination field, MambaLLIE [27] introduces a Local-Enhanced State Space Block that enriches the original 1-D Mamba scan with 2-D neighborhood residuals, while an implicit Retinex prior generated from max-mean statistics is injected through a lightweight depthwise gating mechanism. Nevertheless, the max-mean prior remains a single, scene-averaged statistic, so it mistakes overlapping sharp shadows for genuine illumination edges; the network then amplifies these false contours, producing noticeable color shifts. Wave-Mamba [30] is the first to explore the role of frequency-domain features in low-light enhancement: it casts the feature map into a wavelet packet domain and applies a frequency-selective Mamba scan that skips low-magnitude coefficients, achieving token reduction without sacrificing perceptual information. These Mamba-based methods have successfully extended SSM frameworks to low-light image enhancement (LLIE) by designing multi-scale feature fusion mechanisms. Building upon these technological advancements, our paper proposes a novel global–local feature extraction paradigm to enhance high-order semantic representation learning within SSMs for LLIE tasks. This approach addresses the limitations of existing methods in capturing hierarchical illumination–reflectance relationships.
Figure 1. We compare qualitative results on the LOL-v1 dataset using three strategies: (a) Only Retain High-Frequency Blocks, (b) Only Retain Low-Frequency Blocks, and (c) our proposed method. Zoom in for a better view.

2.3. Wavelet Transformation

The wavelet transform serves as a time-frequency analysis tool that captures localized signal features by decomposing signals into “wavelet basis functions” across various scales. Unlike the Fourier transform, which solely provides global frequency information, the wavelet transform offers localized information in both the time (or spatial) and frequency domains simultaneously. Within this context, DiffLL introduces the first enhancement framework based on a wavelet-domain diffusion model.
The wavelet transformation process can be illustrated as follows:
$W(a, b) = \frac{1}{\sqrt{a}} \int f(t)\, \psi\!\left(\frac{t - b}{a}\right) dt.$
In this formulation, a denotes the scale parameter that governs frequency resolution, b represents the translation parameter that controls spatial localization, and ψ corresponds to the mother wavelet function.
The Discrete Wavelet Transform (DWT) [31] is the discretized version of the wavelet transform, utilizing filter banks to achieve multi-scale decomposition. This process employs low-pass and high-pass filters to decompose a signal into low-frequency components (approximation coefficients) and high-frequency components (detail coefficients). The decomposition results in a low-frequency subband (LL) and three high-frequency subbands (LH, HL, HH). These subbands correspond to horizontal, vertical, and diagonal directional details, respectively. The DWT represents the two-dimensional Discrete Wavelet Transform operation. Subsequently, the decomposed frequency subbands can be reconstructed into the original signal via the Inverse Wavelet Transform (IWT) without information loss. This property enables effective downsampling and upsampling of images while preserving critical structural and textural details, which is the principal rationale for the iterative application of the DWT in such tasks.
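To illustrate the lossless decomposition–reconstruction property that motivates DWT-based down- and upsampling, the following sketch implements a single-level 2-D Haar DWT and its inverse in PyTorch. The Haar basis is an assumption for illustration; the text does not commit to a specific wavelet.

```python
import torch

def haar_dwt2d(x):
    """Single-level 2-D Haar DWT.  x: (B, C, H, W) with even H, W.
    Returns the LL, LH, HL, HH subbands, each of shape (B, C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]   # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]   # top-right
    c = x[..., 1::2, 0::2]   # bottom-left
    d = x[..., 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2
    lh = (-a - b + c + d) / 2
    hl = (-a + b - c + d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def haar_iwt2d(ll, lh, hl, hh):
    """Inverse Haar transform: perfectly reconstructs the input of haar_dwt2d."""
    a = (ll - lh - hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll + lh + hl + hh) / 2
    B, C, H, W = ll.shape
    out = torch.zeros(B, C, 2 * H, 2 * W, dtype=ll.dtype, device=ll.device)
    out[..., 0::2, 0::2], out[..., 0::2, 1::2] = a, b
    out[..., 1::2, 0::2], out[..., 1::2, 1::2] = c, d
    return out

x = torch.randn(1, 3, 8, 8)
assert torch.allclose(haar_iwt2d(*haar_dwt2d(x)), x, atol=1e-6)   # lossless round trip
```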

3. Method

In this section, we delineate our methodology through a tripartite exposition. First, we elaborate on the theoretical motivation underpinning the adopted approach. Second, we present the Local Feature Modeling (LFM) module, which constitutes the methodological cornerstone of the framework. Third, we detail the High-Frequency Guidance Block (HFGBlock), designed to refine feature interactions through spectral prioritization.

3.1. Framework Overview

The proposed network architecture utilizes the Discrete Wavelet Transform (DWT) to decompose input images into high-frequency and low-frequency components. A specialized Low-Frequency Mamba (LFM) Block is designed to extract semantically rich features from the low-frequency subbands. This configuration enables the network to simultaneously enhance and optimize both global contextual information and local structural details while maintaining linear computational complexity, a critical advantage inherited from the State Space Model (SSM) framework.
A high-frequency enhancement module is proposed, based on a low-frequency guided compensation strategy. This module constructs a frequency-domain attention constraint through the fusion of global and local low-frequency components. This mechanism reconstructs the spatial gradient distribution of high-frequency subbands, thereby establishing a cross-frequency feature coupling framework. This framework optimizes inter-band interactions while preserving edge coherence and spectral consistency.
The proposed method effectively addresses the high-frequency texture distortion inherent in conventional approaches. It leverages a constrained deconvolutional network to achieve precise reconstruction of local microstructural features. This framework ensures that multi-scale neural networks can accurately extract physically meaningful edge oscillation characteristics and gradient-direction-consistent texture patterns.
In the following sections, we will detail the comprehensive workflow of our methodology and provide an in-depth exposition of its core components.
Our network architecture is constructed based on a multi-scale U-Net [32] framework. For downsampling, we employ the Discrete Wavelet Transform (DWT) [31] instead of conventional methods, effectively avoiding information loss typically caused by traditional downsampling operations. This is attributed to the inherent ability of DWT to preserve texture details during resolution reduction. Leveraging the characteristic of decomposing images into high-frequency and low-frequency components, we apply a global-to-local feature extraction strategy to the low-frequency subbands, enabling the derivation of more precise feature representations. These representations are then processed through a Multi-Scale Feedforward Network to hierarchically aggregate structural information from the image features, thereby enhancing the model’s capability to recover fine-grained details and maintain spatial–spectral consistency.
For the high-frequency components, we refrain from performing dedicated feature extraction operations and instead leverage low-frequency components to correct and restore high-frequency information. This strategy effectively reduces computational costs while preserving critical edge details and spectral fidelity through cross-frequency interaction mechanisms.
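As a minimal sketch of this data flow, the following PyTorch module wires one level of the pipeline together: DWT split, low-frequency enhancement, low-frequency-guided high-frequency correction, and IWT merge. The dwt/iwt callables (e.g., the Haar pair sketched in Section 2.3) and the lfm_block/hfg_block modules are injected placeholders; their internal designs are described in Section 3.2 and Section 3.3, and this skeleton is an assumption about the wiring, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GlobalToLocalLevel(nn.Module):
    """One level of the described pipeline: DWT split, low-frequency enhancement,
    low-frequency-guided high-frequency correction, IWT merge.  The injected
    modules stand in for the paper's LFMBlock and HFGBlock."""

    def __init__(self, dwt, iwt, lfm_block, hfg_block):
        super().__init__()
        self.dwt, self.iwt = dwt, iwt
        self.lfm_block, self.hfg_block = lfm_block, hfg_block

    def forward(self, x):
        ll, lh, hl, hh = self.dwt(x)                 # lossless frequency decomposition
        ll_enh = self.lfm_block(ll)                  # global-to-local low-frequency path
        # the three high-frequency subbands are not modeled independently; they are
        # corrected using the enhanced low-frequency features as guidance
        high = torch.cat([lh, hl, hh], dim=1)
        high_enh = self.hfg_block(high, ll_enh)
        lh_e, hl_e, hh_e = torch.chunk(high_enh, 3, dim=1)
        return self.iwt(ll_enh, lh_e, hl_e, hh_e)    # merge back to full resolution
```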

3.2. Low-Frequency Mamba Block

The LFMBlock is designed to restore low-frequency features in low-light images. It extracts low-frequency information flows from the spatial domain of feature embeddings and models them through the LFMBlock architecture, as illustrated in Figure 2. Given the low-frequency input features $F_n \in \mathbb{R}^{H \times W \times C}$, we first apply Layer Normalization to stabilize feature distributions, followed by the Visual State Space Module (VSSM) to capture global contextual relationships and long-range dependencies. This hierarchical paradigm extracts a robust representation by progressively distilling global contextual cues, while simultaneously preserving the linear complexity and memory efficiency that are intrinsic to state-space formulations.
Building upon the global semantic relationships established by the VSSM, we provide contextual guidance for subsequent local feature extraction via the Local Feature Module (LFM). To avoid mutual gradient interference between the VSSM and the LFM, we adopt a sequential topology with a global-first, local-later inductive bias rather than parallel branches. This unidirectional flow keeps the gradient update directions of the two modules consistent, avoiding conflicting optima and ensuring stable joint optimization.
Finally, a Multi-Scale Feedforward Network (MSFFN) is employed to hierarchically learn and refine structural representations. The overall process can be formulated as follows:
$F_i' = W_{1 \times 1}(\mathrm{VSSM}(\mathrm{LN}(F_i))) + \mathrm{LFM}(\mathrm{LN}(F_i)) + F_i,$
$F_o = \mathrm{MSFFN}(\mathrm{LN}(F_i')) + F_i',$
where VSSM(·), MSFFN(·), and LFM(·) denote the functional operations of the Visual State Space Module, the MSFFN, and the LFM, respectively. $W_{1 \times 1}(\cdot)$ represents a convolutional layer with a 1 × 1 kernel, while LN indicates the layer normalization applied to the input features. The advantages of integrating the VSSM into the LFMBlock can be summarized as follows:
  • Computational Efficiency: Using state-space models like VSSM allows for linear complexity with respect to sequence length, which is beneficial for high-resolution images.
  • Improved Optimization: The sequential structure avoids the common issue of gradient cancellation when parallel global and local paths are used.
  • Contextual Coherence: Global context informs local processing, which is especially useful in low-light scenarios where local features might be noisy or ambiguous.
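A minimal sketch of this block, following the reconstructed formulas above, is given below. The vssm, lfm, and msffn arguments are the modules described in the remainder of this section (assumed here to map (B, C, H, W) tensors to the same shape), and the placement of LayerNorm over the channel dimension is an implementation assumption.

```python
import torch
import torch.nn as nn

class LFMBlockSketch(nn.Module):
    """Sketch of the low-frequency block: F' = W1x1(VSSM(LN(F))) + LFM(LN(F)) + F,
    followed by F_out = MSFFN(LN(F')) + F'.  vssm, lfm, msffn are injected modules."""

    def __init__(self, channels, vssm, lfm, msffn):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.vssm, self.lfm, self.msffn = vssm, lfm, msffn

    def _ln(self, norm, x):
        # LayerNorm over the channel dimension of a (B, C, H, W) tensor
        return norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    def forward(self, x):
        y = self._ln(self.norm1, x)
        x = self.proj(self.vssm(y)) + self.lfm(y) + x   # global + local + residual
        return self.msffn(self._ln(self.norm2, x)) + x
```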
Vision State Space Module: Drawing inspiration from the linear computational complexity demonstrated by the Mamba architecture in long-range dependency modeling, we integrate the Visual State Space Module (VSSM) into the framework of low-light image enhancement (LLIE) tasks. The proposed model systematically establishes inter-regional feature correlations through discrete state-space equations and achieves efficient low-light image restoration via a multi-phase synergistic processing mechanism. The architecture of the VSSM is illustrated in Figure 2.
The input features first undergo channel dimension adjustment via a linear projection layer, which projects the raw data into a high-dimensional representation space to enhance expressive capacity. Subsequently, a depthwise separable convolution (DW 3 × 3 ) performs lightweight local feature extraction through a decoupled strategy of spatial filtering and cross-channel fusion.
Following this, the two-dimensional state space modeling (2D-SSM) unfolds the image bidirectionally along the primary and secondary diagonals, as shown in Figure 3. It dynamically fuses multi-path features via learnable directional gating mechanisms. The processed features are stabilized through layer normalization to calibrate their distribution and are ultimately mapped to the target space via a linear output layer. In parallel, another branch incorporates a linear layer followed by a SiLU activation. The outputs from this branch are additively integrated into the target space to further refine feature representations.
$X_1 = \text{2D-SSM}(\mathrm{SiLU}(\mathrm{DWConv}(\mathrm{Linear}(X)))),$
$X_{\mathrm{out}} = \mathrm{Linear}(\mathrm{LN}(X_1) \odot \mathrm{SiLU}(\mathrm{Linear}(X))),$
where ⊙ denotes the Hadamard product, which generates an output with the same dimensionality as the input. Linear(.) and DWConv(.) represent the linear projection and depthwise convolution operations, respectively.
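The following sketch mirrors the two equations above. The ss2d argument stands in for the four-path diagonal selective scan of Figure 3, whose internals are omitted; the expansion factor and the channel-layout handling are assumptions.

```python
import torch
import torch.nn as nn

class VSSMSketch(nn.Module):
    """Sketch of the VSSM wrapper around a 2-D selective-scan core (`ss2d`):
    X1 = 2D-SSM(SiLU(DWConv(Linear(X)))),  X_out = Linear(LN(X1) * SiLU(Linear(X)))."""

    def __init__(self, dim, ss2d, expand=2):
        super().__init__()
        hidden = dim * expand
        self.in_proj = nn.Linear(dim, hidden)
        self.gate_proj = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.ss2d = ss2d                        # four-path diagonal selective scan
        self.norm = nn.LayerNorm(hidden)
        self.out_proj = nn.Linear(hidden, dim)
        self.act = nn.SiLU()

    def forward(self, x):                       # x: (B, C, H, W)
        x = x.permute(0, 2, 3, 1)               # to (B, H, W, C) for the linear layers
        y = self.in_proj(x)
        y = self.act(self.dwconv(y.permute(0, 3, 1, 2))).permute(0, 2, 3, 1)
        y = self.ss2d(y)                        # long-range modeling, linear complexity
        gate = self.act(self.gate_proj(x))      # parallel gating branch
        out = self.out_proj(self.norm(y) * gate)
        return out.permute(0, 3, 1, 2)          # back to (B, C, H, W)
```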
Local Feature Extraction: Given that the Visual State Space Module (VSSM) relies on selective state-space scanning to excel at capturing global contextual semantic relationships, we serially connect the Local Feature Module (LFM) after the VSSM’s global semantic modeling to avoid conflicting optimization directions between global and local feature extraction branches. This sequential architecture enables progressive stage-wise feature refinement: the LFM focuses on extracting fine-grained texture details and short-range features, thereby refining low-frequency components with higher precision. The enhanced low-frequency representations subsequently guide the High-Frequency Guided Enhancement Block (HFGBlock) to amplify high-frequency features through targeted spectral adjustments, ensuring synergistic interaction between global and local information flows.
As illustrated in Figure 4a, the input features are first processed through Global Average Pooling (GAP), which aggregates spatial information to generate a channel-wise global descriptor capturing the overall response intensity of each channel. Subsequently, two 1 × 1 convolutional operations are employed to establish a cross-channel interaction pathway: the first convolution reduces the channel dimensionality while incorporating a ReLU activation function to model inter-channel dependency patterns; the second convolution restores the original dimensionality, followed by a Sigmoid function to generate normalized channel attention weights $W$. This sequential design enables adaptive recalibration of channel-wise feature importance through learned nonlinear interactions. The resulting attention weights are then element-wise multiplied with the original features along the channel dimension, enabling adaptive feature enhancement of critical channels and dynamic suppression of redundant information. Consequently, we formalize the Local Feature Module (LFM) as follows:
$W = \sigma(W_{1 \times 1}(\mathrm{ReLU}(W_{1 \times 1}(\mathrm{GAP}(F_n))))),$
$F_{\mathrm{local}} = W \odot F_n,$
where σ (.) denotes the Sigmoid function, ReLU(.) represents the Rectified Linear Unit activation function, and GAP(.) corresponds to Global Average Pooling.
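This channel-attention design can be sketched directly from the two formulas above; the reduction ratio of the first 1 × 1 convolution is an assumption, as the text does not specify it.

```python
import torch
import torch.nn as nn

class LocalFeatureModule(nn.Module):
    """Sketch of the LFM channel attention: GAP -> 1x1 conv -> ReLU ->
    1x1 conv -> Sigmoid, then channel-wise re-weighting of the input."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # GAP: (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        return x * self.attn(x)                              # F_local = W ⊙ F_n

y = LocalFeatureModule(32)(torch.randn(2, 32, 64, 64))       # shape-preserving sanity check
```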
Multi-Scale Feedforward Network: To augment the capacity for extracting structural information from image features, this study introduces a Multi-Scale Feedforward Network (MSFFN), depicted in Figure 4b. Initially, the module promotes cross-channel interaction of the input features via a 1 × 1 convolutional layer, which is succeeded by the formation of a multi-branch parallel processing architecture.
In the main branch, a 3 × 3 depthwise separable convolution coupled with a GELU activation function is utilized to conduct fundamental feature extraction. The gated branch extends the conventional GDFFN framework by incorporating 3 × 3 and 5 × 5 depthwise separable convolutional layers, which facilitates the acquisition of multi-granularity spatial features.
Subsequently, features from all branches are activated by ReLU, concatenated along the channel dimension, and then compressed back to the original dimensionality through a 1 × 1 convolutional layer. The concluding step involves element-wise multiplication to adaptively fuse the primary branch features with gating weights, resulting in an optimized output that retains both local fine-grained details and global structural coherence. The operational workflow of the MSFFN is formulated as follows:
$F_1, F_2, F_3 = \mathrm{Chunk}(W_{1 \times 1}(F_n)),$
$F_a = \mathrm{GELU}(DW_{3 \times 3}(F_1)),$
$F_b = W_{1 \times 1}(\mathrm{Concat}(\mathrm{ReLU}(DW_{3 \times 3}(F_2)), \mathrm{ReLU}(DW_{5 \times 5}(F_3)))),$
$F = W_{1 \times 1}(F_a \odot F_b),$
where Chunk(.) denotes the chunk function along the channel dimension.
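A sketch consistent with the formulas above follows; the hidden-channel expansion factor is an assumption.

```python
import torch
import torch.nn as nn

class MSFFNSketch(nn.Module):
    """Sketch of the Multi-Scale Feedforward Network: a 1x1 projection chunked into
    three branches, a GELU-gated 3x3 depthwise main branch, a concatenated
    3x3/5x5 depthwise gating branch, and a gated 1x1 fusion."""

    def __init__(self, dim, expand=2):
        super().__init__()
        hidden = dim * expand
        self.in_proj = nn.Conv2d(dim, 3 * hidden, 1)        # chunked into F1, F2, F3
        self.dw3_a = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.dw3_b = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.dw5_b = nn.Conv2d(hidden, hidden, 5, padding=2, groups=hidden)
        self.fuse_b = nn.Conv2d(2 * hidden, hidden, 1)       # compress the gated branch
        self.out_proj = nn.Conv2d(hidden, dim, 1)
        self.gelu, self.relu = nn.GELU(), nn.ReLU(inplace=True)

    def forward(self, x):
        f1, f2, f3 = torch.chunk(self.in_proj(x), 3, dim=1)
        fa = self.gelu(self.dw3_a(f1))                       # main branch
        fb = self.fuse_b(torch.cat([self.relu(self.dw3_b(f2)),
                                    self.relu(self.dw5_b(f3))], dim=1))
        return self.out_proj(fa * fb)                        # gated multi-scale fusion
```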

3.3. High Frequency Guidance Block

To enhance the feature representation capability of high-frequency components, this study introduces a low-frequency-guided cross-frequency feature migration mechanism. By leveraging the semantic consistency of low-frequency components, this approach performs collaborative optimization on high-frequency features, addressing the issue of weakened high-frequency details often seen in conventional methodologies.
The core component of this mechanism is the High-Frequency Guided Block (HFGBlock), which employs a dual-path collaborative architecture comprising two critical components: the Frequency-Matching Attention Module (FMAM) and the Frequency-Correction Feedforward Network (FCFN).
The FMAM performs frequency-domain alignment operations to identify cross-frequency semantic cues within low-frequency components that exhibit strong correlations with high-frequency features. Concurrently, the FCFN applies nonlinear mapping and distribution calibration to the fused cross-frequency features, effectively mitigating phase misalignment artifacts introduced during inter-band integration.
Through iterative interaction between low-frequency and high-frequency features, this framework dynamically fills information-deficient regions in high-frequency subbands while preserving edge sharpness and suppressing structural artifacts. The detailed workflow is illustrated as follows:
$F_H' = \mathrm{FMAM}(\mathrm{LN}(F_H^{\mathrm{in}}), F_L) + F_H^{\mathrm{in}},$
$F_H^{\mathrm{out}} = \mathrm{FCFN}(\mathrm{LN}(F_H'), F_L) + F_H',$
where $F_H^{\mathrm{in}}$ and $F_H^{\mathrm{out}}$ represent the input and output high-frequency features of the HFGBlock, respectively, and FMAM(·) and FCFN(·) correspond to the operational functions of the FMAM and FCFN.
Frequency Matching Attention Module: Building on prior research into query effectiveness in attention mechanisms, this study proposes a low-frequency feature-guided semantic enhancement strategy to optimize the attention weight generation mechanism. As depicted in Figure 5a, the Frequency-Matching Attention Module (FMAM) employs a dual-stage optimization design. Specifically, the FMAM takes the enhanced low-frequency components produced by the LFM and the high-frequency components produced by wavelet decomposition as inputs, denoted as $F_L$ and $I_H$, respectively. The module then projects the high-frequency features through 1 × 1 convolutional transformations $W_1$ and 3 × 3 depthwise convolutions $W_3$ to obtain the queries (Q), keys (K), and value projections (V). Subsequently, the Frequency Matching Transformation (FMT) is executed between Q and $F_L$, dynamically incorporating optimized low-frequency features into the query representations while preserving high-frequency details. This process establishes cross-band semantic correlations through adaptive frequency interaction. This design enhances the physical interpretability of attention weights and the perception of fine detail through a feature-level low-frequency and high-frequency cooperative mapping strategy, thereby providing a novel approach for high-frequency feature restoration in complex illumination scenarios. The workflow of the FMAM is illustrated as follows:
$\mathrm{FMAM}(F_H^m, F_L^e) = A(\mathrm{FMT}(Q, F_L^e), K, V),$
$Q, K, V = \mathrm{Split}(W_1 W_3 (F_H^m)),$
$A(Q, K, V) = V \cdot \mathrm{softmax}\!\left(\frac{K \cdot Q}{\alpha}\right).$
The notations FMT(·), Split(·), and Softmax(·), respectively, denote the Frequency Matching Transformation, the feature splitting operation, and the normalized exponential function, where α serves as a learnable scaling parameter to regulate the magnitude of the dot product between keys (K) and queries (Q).
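The module can be sketched as a transposed (channel-wise) attention whose query is modulated by the FMT, following the reconstructed formulas; the head count, the exact reshaping, and the use of a learnable per-head scale α are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FMAMSketch(nn.Module):
    """Sketch of the Frequency-Matching Attention Module.  Q/K/V come from a
    1x1 conv + 3x3 depthwise conv on the high-frequency features; the query is
    modulated by the injected `fmt` module using the enhanced low-frequency
    features; attention is computed channel-wise: A = V . softmax(K . Q^T / alpha)."""

    def __init__(self, channels, fmt, heads=8):
        super().__init__()
        self.heads = heads
        self.alpha = nn.Parameter(torch.ones(heads, 1, 1))    # learnable scale per head
        self.qkv = nn.Sequential(
            nn.Conv2d(channels, 3 * channels, 1),
            nn.Conv2d(3 * channels, 3 * channels, 3, padding=1, groups=3 * channels),
        )
        self.fmt = fmt                                         # frequency matching transform
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, f_high, f_low):
        b, c, h, w = f_high.shape
        q, k, v = torch.chunk(self.qkv(f_high), 3, dim=1)
        q = self.fmt(q, f_low)                                 # inject low-frequency cues into Q
        r = lambda t: t.reshape(b, self.heads, c // self.heads, h * w)
        q, k, v = map(r, (q, k, v))                            # (B, heads, C/heads, HW)
        attn = F.softmax((k @ q.transpose(-2, -1)) / self.alpha, dim=-1)
        out = (attn @ v).reshape(b, c, h, w)                   # channel-wise attention output
        return self.out(out)
```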
Frequency Correction Forward Network: Subsequent to the processing through the Frequency-Matching Attention Module (FMAM), a hierarchical cross-domain feature refinement architecture is implemented to further facilitate the enhancement of high-frequency components. This architecture achieves joint spatial-frequency domain enhancement via multi-stage collaborative processing, as illustrated in Figure 5a. The processing pipeline is formulated as follows: Initially, Layer Normalization (LN) is applied to the output features of the FMAM to stabilize channel-wise distributions and accelerate model convergence. Subsequently, a 1 × 1 convolution operation performs cross-channel information interaction and dimensional adaptation, constructing an efficient feature representation space for subsequent operations. Following this, a local spatial context extraction module based on 3 × 3 convolutions captures fine-grained texture and edge features. The Frequency Matching Transformation (FMT) is then executed to establish cross-band semantic correlations between low- and high-frequency features, thereby reinforcing the synergistic representation of structural patterns and detailed components. The pipeline ultimately outputs optimized results with enhanced visual consistency, as formally described below:
$\mathrm{FCFN}(F_H, F_L^e) = \mathrm{FMT}(W_3(W_1(\mathrm{LN}(F_H))), F_L^e)$
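A corresponding sketch follows; the channel expansion of the 1 × 1 projection and the use of a plain (non-depthwise) 3 × 3 convolution are assumptions, and fmt is the matching module described next.

```python
import torch
import torch.nn as nn

class FCFNSketch(nn.Module):
    """Sketch of the Frequency-Correction Feedforward Network:
    LN -> 1x1 conv -> 3x3 conv -> FMT with the enhanced low-frequency features."""

    def __init__(self, channels, fmt, expand=2):
        super().__init__()
        hidden = channels * expand
        self.norm = nn.LayerNorm(channels)
        self.w1 = nn.Conv2d(channels, hidden, 1)
        self.w3 = nn.Conv2d(hidden, channels, 3, padding=1)
        self.fmt = fmt

    def forward(self, f_high, f_low):                       # both: (B, C, H, W)
        y = self.norm(f_high.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return self.fmt(self.w3(self.w1(y)), f_low)
```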
Frequency Matching Transformation: As illustrated in Figure 5b, the Frequency Matching Transformation (FMT) primarily functions to transfer low-frequency features into enhanced high-frequency representations through the following computational workflow: Initially, a similarity matrix between high-frequency and low-frequency components is computed to identify the optimal feature vector D. Subsequently, the most semantically aligned low-frequency components are selected as channel-specific outputs based on D. These selected features are then concatenated with the original high-frequency features via parallel processing branches.
$Y_s = \mathrm{Select}(F_L \mid \mathrm{Indices}(\mathrm{Top}_1(\mathrm{Sim}(F_L, F_H)))),$
$Y_{sc} = \mathrm{Concat}(Y_s, F_H),$
where the notation Sim(·, ·) denotes the similarity computation measured by Euclidean distance, Select(·|·) represents the feature selection operation with conditional filtering, and Indices(.) indicates the index retrieval operation.
Specifically, one branch computes attention maps using 1 × 1 convolutional layers coupled with Sigmoid activation to emphasize critical spatial locations, while the other branch employs 3 × 3 convolutions to capture contextual attention patterns. Finally, the outputs from both branches are multiplied and integrated through an additional 1 × 1 convolutional layer to produce the refined output features. The complete procedural pipeline is formally described as follows:
$F_H^{\mathrm{out}} = W_1(\mathrm{Sigmoid}(W_1(Y_{sc})) \odot W_3(Y_{sc}))$
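A sketch of this matching-and-fusion step is given below. The per-channel Euclidean matching (via torch.cdist over flattened spatial maps) is our reading of the "channel-specific outputs" description, and the channel widths of the two branches are assumptions.

```python
import torch
import torch.nn as nn

class FMTSketch(nn.Module):
    """Sketch of the Frequency Matching Transformation.  For every high-frequency
    channel, the closest low-frequency channel (Euclidean distance over flattened
    spatial maps) is selected, concatenated with the high-frequency input, and
    fused by a dual-branch attention: out = W1(Sigmoid(W1(Ysc)) * W3(Ysc))."""

    def __init__(self, channels):
        super().__init__()
        self.attn_1x1 = nn.Sequential(nn.Conv2d(2 * channels, 2 * channels, 1), nn.Sigmoid())
        self.ctx_3x3 = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_high, f_low):                   # both: (B, C, H, W)
        b, c, h, w = f_high.shape
        hi, lo = f_high.flatten(2), f_low.flatten(2)    # (B, C, HW)
        dist = torch.cdist(hi, lo)                      # (B, C, C) pairwise distances
        idx = dist.argmin(dim=-1)                       # best low-freq channel per high-freq channel
        y_s = torch.gather(lo, 1, idx.unsqueeze(-1).expand(-1, -1, h * w)).view(b, c, h, w)
        y_sc = torch.cat([y_s, f_high], dim=1)          # Concat(Ys, FH)
        return self.fuse(self.attn_1x1(y_sc) * self.ctx_3x3(y_sc))
```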
Compared with conventional attention mechanisms, the proposed Frequency-Matching Transformation (FMT) offers three distinct advantages:
  • The Frequency-Matching Transformation module ensures that only semantically coherent low-frequency cues are injected into the high-frequency branch, eliminating semantic misalignment.
  • The Frequency-Matching Transformation module offers statistically reliable cues for fine-grained detail recovery, resulting in perceptually faithful edge sharpening.
  • The residual pathway ensures unimpeded gradient flow, facilitating stable end-to-end back-propagation.

4. Experiment

The proposed Global-to-Local Mamba (GL-Mamba) framework is subjected to rigorous evaluation across a variety of vision tasks, utilizing several widely adopted datasets for low-light enhancement benchmarking. Furthermore, comprehensive ablation studies are conducted to systematically quantify the efficacy of individual architectural components. In the result tables, the best and second-best results are highlighted in red and blue, respectively.

4.1. Implementation Details

The network architecture adopts a hierarchical configuration, comprising [1, 2, 4] LFMBlocks and [1, 1, 1] HFGBlocks per layer within the encoder–decoder framework. The model employs eight attention heads and maintains a channel dimension C of 32 throughout. Training is conducted using the AdamW optimizer, initialized with a learning rate of $5 \times 10^{-4}$, which is progressively decayed to $1 \times 10^{-7}$ via cosine annealing over 100k iterations. Data augmentation includes random rotations (90°, 180°, 270°), random flips, and random cropping to 512 × 512 patches. The Global-to-Local Mamba framework is optimized under an L1 loss constraint. All experiments are executed on an NVIDIA RTX 3090 GPU (24 GB).
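For reference, the reported optimization setup roughly corresponds to the training-loop sketch below; model and train_loader are placeholders, and stepping the cosine scheduler once per iteration is an assumption.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, total_iters=100_000, device="cuda"):
    """Sketch of the reported setup: AdamW at 5e-4, cosine decay to 1e-7, L1 loss."""
    model = model.to(device)
    optimizer = AdamW(model.parameters(), lr=5e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_iters, eta_min=1e-7)
    l1 = torch.nn.L1Loss()

    it = 0
    while it < total_iters:
        for low, normal in train_loader:          # random 512x512 crops, flips, rotations
            low, normal = low.to(device), normal.to(device)
            optimizer.zero_grad()
            loss = l1(model(low), normal)         # L1 reconstruction constraint
            loss.backward()
            optimizer.step()
            scheduler.step()                      # cosine decay per iteration
            it += 1
            if it >= total_iters:
                break
```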

4.2. Datasets

The experimental validation utilizes two benchmark datasets: LOL-v1 and LOL-v2-synthetic. The original Low-Light Dataset version 1, specifically curated for low-light image enhancement research, consists of 500 aligned low/normal-light image pairs with a fixed resolution of 400 × 600 pixels. This dataset comprises 485 training pairs and 15 testing pairs, primarily capturing indoor scenarios. The enhanced LOL-v2 version introduces two distinct subsets: LOL-v2-real, captured under authentic low-light conditions, and LOL-v2-synthetic, generated through illumination distribution analysis of RAW images. Our experiments specifically employ the LOL-v2-synthetic subset, which contains 1000 synthetically generated low- or normal-light image pairs. These pairs are divided into 900 training pairs and 100 testing pairs through systematic data partitioning.

4.3. Comparisons with State-of-the-Art Methods

In this section, we provide a comprehensive evaluation of the proposed Global-to-Local Mamba framework through quantitative and qualitative comparisons with state-of-the-art methods. The assessment utilizes three well-established metrics: PSNR, SSIM, and LPIPS. PSNR quantifies reconstruction quality by calculating the pixel intensity differences between images, with higher values indicating greater reconstruction fidelity. SSIM evaluates perceptual quality by analyzing luminance, contrast, and structural similarity, where values closer to 1 signify better preservation of image characteristics. LPIPS measures perceptual similarity using deep neural networks, with lower scores indicating better alignment with human visual perception. Collectively, these metrics offer a holistic view of both pixel-level accuracy and perceptual authenticity across different enhancement paradigms.
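These three metrics can be computed with standard open-source tools, e.g., scikit-image for PSNR/SSIM and the lpips package for LPIPS, as in the sketch below; the [0, 1] value range and the AlexNet backbone are conventional choices rather than settings specified by the paper.

```python
import torch
import lpips                                            # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")                      # deep perceptual metric

def evaluate(pred, gt):
    """pred, gt: float32 numpy arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None] * 2 - 1   # map to [-1, 1]
    with torch.no_grad():
        lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```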
Quantitative Evaluation: The quantitative evaluation results on the LOL-v1 dataset are systematically summarized in Table 1. The proposed Global-to-Local Mamba framework is benchmarked against 12 state-of-the-art (SOTA) low-light enhancement methodologies, including RetinexNet [26], KinD [33], Zero-DCE [23], RUAS [34], EnlightenGAN [35], UFormer [36], IAT [37], PairLIE [38], SCI [39], LLFormer [40], Wave-Mamba [30], and HVI [41]. As demonstrated quantitatively, the Global-to-Local Mamba achieves the highest PSNR, exceeding the second-best method (HVI, 23.80 dB) by 0.11 dB. This empirical evidence substantiates the method's capability to preserve structural fidelity during illumination recovery.
The quantitative evaluation results on the LOL-v2-synthetic dataset are detailed in Table 2. This table juxtaposes the proposed Global-to-Local Mamba framework against 11 state-of-the-art low-light image enhancement methods, namely RetinexNet [26], KinD [33], ZeroDCE [23], RUAS [34], UFormer [36], Bread [42], PairLIE [38], LLFormer [40], GSAD [44], QuadPrior [43], and Wave-Mamba [30]. As evidenced in Table 2, the Global-to-Local Mamba attains optimal performance in both PSNR and SSIM on this benchmark, thereby establishing new advancements in synthetic low-light image restoration.
Qualitative Evaluation: Figure 6 presents a visual comparison between our method and existing approaches. Current methods often exhibit insufficient illumination, failing to restore fine details effectively. Additionally, color distortion and image degradation further compromise the enhancement outcomes of prior techniques. In contrast, our Global-to-Local Mamba framework not only effectively enhances brightness but also reconstructs intricate details with high precision. It demonstrates superior capability in amplifying low-visibility and low-contrast regions. The proposed method reliably eliminates noise without introducing artificial artifacts and robustly preserves original chromatic information throughout the enhancement process. This dual-scale architecture achieves balanced performance by integrating global contextual modeling with local feature refinement, ensuring both radiometric consistency and structural fidelity in challenging illumination conditions.
Model Parameters and Efficiency: We also conduct experiments on the parameters and efficiency of the model. The experimental results are shown in Table 3. It can be seen that the proposed GLMA achieves the best balance between model complexity and computational efficiency. With only 1.9 M parameters and 9.02 G FLOPs, it significantly outperforms most competing methods, including both lightweight models (e.g., PairLIE, KinD) and large-scale models (e.g., GSAD, QuadPrior). Despite its compact size, GLMA delivers superior performance in terms of PSNR and SSIM, demonstrating the effectiveness of the Global-to-Local Mamba design and wavelet-based frequency decomposition strategy. This lightweight yet powerful architecture makes GLMA highly suitable for deployment on resource-limited devices and real-time applications.

4.4. Ablation Study

Low-frequency Mamba Block: The LFMBlock is the core component of the Global-to-Local Mamba framework, and we conduct ablation experiments to validate the functional roles of the individual modules. The results in Table 4 show that the proposed architecture, which prioritizes global feature extraction followed by local refinement, achieves a 0.64 dB PSNR improvement over the Wave-Mamba baseline. These findings substantiate the practical significance of enhancing low-frequency components for subsequent high-frequency refinement, while also confirming the intrinsic value of low-frequency features in low-light image restoration. Compared to the LFMBlock, the High-Frequency Guidance Blocks (HFGBlocks) impose a more substantial impact on model parameters but yield a relatively minor performance gain. This underscores the efficiency superiority of our low-frequency-centric design paradigm.

5. Conclusions

In this study, we introduce an effective framework for low-light image enhancement (LLIE) based on a Global-to-Local Mamba Architecture. Motivated by the ability of wavelet transforms to decompose high- and low-frequency features, we employ wavelet-based downsampling to mitigate information loss during feature decomposition. The framework adopts a global-to-local feature extraction strategy to holistically capture structural information in the low-frequency domain. These low-frequency features further guide the refinement of high-frequency details through cross-domain interaction. Extensive experiments on multiple datasets demonstrate the compelling performance of our method in enhancing low-light images.
Nevertheless, several limitations of the proposed approach deserve further discussion. First, the current single-level Discrete Wavelet Transform (DWT) may fall short in representing mid-frequency components, which can lead to aliasing artifacts in highly textured regions. Second, the Frequency Matching Transformation (FMT) relies on a hand-crafted similarity metric, which can be sensitive to noise and complex illumination variations. Third, the lack of an explicit noise modeling module limits the model’s robustness under real-world noisy conditions. Additionally, while the method performs well on static images, its behavior on video sequences—such as temporal consistency and processing speed—has not been evaluated, which is critical for deployment in video applications.
In future work, we will explore multi-level wavelet decomposition to enhance mid-frequency representation. We also plan to design a learnable frequency matching mechanism to replace the fixed similarity metric in FMT, and integrate an explicit noise modeling branch to improve robustness. Extending the current framework to video-based low-light enhancement and restoration will be another important direction, with the goal of maintaining temporal coherence and enabling real-time processing. We will also investigate model compression and acceleration strategies to facilitate practical deployment on resource-constrained devices.

Author Contributions

Conceptualization, W.L.; data curation, S.L. and Y.G.; formal analysis, W.L. and Y.T.; funding acquisition, W.L.; investigation, Q.W.; methodology, W.L. and N.D.; project administration, Y.T.; resources, Y.G. and X.W.; software, W.L. and N.D.; supervision, Q.W.; validation, S.L.; visualization, W.L., X.W. and N.D.; writing—review and editing, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by 14th Five Year National Key R & D Program Project (No. 2023YFB3211001), the National Natural Science Foundation of China (Grant No. 62073205), the General Talents Project for Scientific Research grant of the Educational Department of Liaoning Province (LJ212410144062), and the Science and Technology Program of Liaoning Province (2023JH26/10300015).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The LOL-v1 dataset is an open dataset accessed on 18 February 2020 and can be downloaded at https://arxiv.org/abs/1808.04560. The LOL-v2 dataset is an open dataset accessed on 28 May 2018 and can be downloaded at https://doi.org/10.1109/TIP.2021.3050850.

Acknowledgments

This work is supported by 14th Five Year National Key R & D Program Project (No. 2023YFB3211001), the National Natural Science Foundation of China (Grant No. 62073205), the General Talents Project for Scientific Research grant of the Educational Department of Liaoning Province (LJ212410144062), and the Science and Technology Program of Liaoning Province (2023JH26/10300015).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.T. Mambair: A simple baseline for image restoration with state-space model. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 222–241. [Google Scholar]
  2. Shi, Y.; Xia, B.; Jin, X.; Wang, X.; Zhao, T.; Xia, X.; Xiao, X.; Yang, W. VmambaIR: Visual State Space Model for Image Restoration. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5560–5574. [Google Scholar] [CrossRef]
  3. Li, J.; Li, B.; Tu, Z.; Liu, X.; Guo, Q.; Juefei-Xu, F.; Xu, R.; Yu, H. Light the night: A multi-condition diffusion framework for unpaired low-light enhancement in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15205–15215. [Google Scholar]
  4. Liang, D.; Xu, Z.; Li, L.; Wei, M.; Chen, S. PIE: Physics-inspired low-light enhancement. Int. J. Comput. Vis. 2024, 132, 3911–3932. [Google Scholar] [CrossRef]
  5. Goyal, B.; Dogra, A.; Lepcha, D.C.; Goyal, V.; Alkhayyat, A.; Chohan, J.S.; Kukreja, V. Recent advances in image dehazing: Formal analysis to automated approaches. Inf. Fusion 2024, 104, 102151. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Zhou, S.; Li, H. Depth information assisted collaborative mutual promotion network for single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2846–2855. [Google Scholar]
  7. Xiang, Y.; Zhou, H.; Li, C.; Sun, F.; Li, Z.; Xie, Y. Deep learning in motion deblurring: Current status, benchmarks and future prospects. Vis. Comput. 2024, 41, 3801–3827. [Google Scholar] [CrossRef]
  8. Su, H.; Li, Y.; Xu, Y.; Fu, X.; Liu, S. A review of deep-learning-based super-resolution: From methods to applications. Pattern Recognit. 2024, 157, 110935. [Google Scholar] [CrossRef]
  9. Lei, X.; Zhang, W.; Cao, W. Dvmsr: Distillated vision mamba for efficient super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 6536–6546. [Google Scholar]
  10. Zhao, X.; Wang, L.; Zhang, Y.; Han, X.; Deveci, M.; Parmar, M. A review of convolutional neural networks in computer vision. Artif. Intell. Rev. 2024, 57, 99. [Google Scholar] [CrossRef]
  11. Shou, Y.; Cao, X.; Liu, H.; Meng, D. Masked contrastive graph representation learning for age estimation. Pattern Recognit. 2025, 158, 110974. [Google Scholar] [CrossRef]
  12. Feng, K.; Ma, Y.; Wang, B.; Qi, C.; Chen, H.; Chen, Q.; Wang, Z. Dit4edit: Diffusion transformer for image editing. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 2969–2977. [Google Scholar]
  13. Xia, C.; Wang, X.; Lv, F.; Hao, X.; Shi, Y. Vit-comer: Vision transformer with convolutional multi-scale feature interaction for dense predictions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5493–5502. [Google Scholar]
  14. Ma, X.; Zhang, X.; Pun, M.O. Rs 3 mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405. [Google Scholar] [CrossRef]
  15. Zhao, H.; Zhang, M.; Zhao, W.; Ding, P.; Huang, S.; Wang, D. Cobra: Extending mamba to multi-modal large language model for efficient inference. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 10421–10429. [Google Scholar]
  16. Shen, H.; Wan, Z.; Wang, X.; Zhang, M. Famba-v: Fast vision mamba with cross-layer token fusion. In Proceedings of the European Conference on Computer Vision, Dublin, Ireland, 22–23 October 2025; pp. 268–278. [Google Scholar]
  17. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  18. Chang, M.; Feng, H.; Xu, Z.; Li, Q. Low-light image restoration with short-and long-exposure raw pairs. IEEE Trans. Multimed. 2021, 24, 702–714. [Google Scholar] [CrossRef]
  19. Xu, K.; Yang, X.; Yin, B.; Lau, R.W. Learning to restore low-light images via decomposition-and-enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2281–2290. [Google Scholar]
  20. Wu, X.; Lai, Z.; Yu, S.; Zhou, J.; Liang, Z.; Shen, L. Coarse-to-fine low-light image enhancement with light restoration and color refinement. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 8, 591–603. [Google Scholar] [CrossRef]
  21. Land, E.H.; McCann, J.J. Lightness and retinex theory. J. Opt. Soc. Am. 1971, 61, 1–11. [Google Scholar] [CrossRef]
  22. Zhao, L.; Wang, K.; Zhang, J.; Wang, A.; Bai, H. Learning deep texture-structure decomposition for low-light image restoration and enhancement. Neurocomputing 2023, 524, 126–141. [Google Scholar] [CrossRef]
  23. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1780–1789. [Google Scholar]
  24. Li, C.; Guo, C.; Loy, C.C. Learning to enhance low-light image via zero-reference deep curve estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4225–4238. [Google Scholar] [CrossRef]
  25. Jiang, H.; Luo, A.; Fan, H.; Han, S.; Liu, S. Low-light image enhancement with wavelet-based diffusion models. ACM Trans. Graph. (TOG) 2023, 42, 238. [Google Scholar] [CrossRef]
  26. Bai, J.; Yin, Y.; He, Q.; Li, Y.; Zhang, X. Retinexmamba: Retinex-based mamba for low-light image enhancement. In Proceedings of the ICONIP 2024, Auckland, New Zealand, 2–6 December 2024; pp. 427–442. [Google Scholar]
  27. Weng, J.; Yan, Z.; Tai, Y.; Qian, J.; Yang, J.; Li, J. Mamballie: Implicit retinex-aware low light enhancement with global-then-local state space. Adv. Neural Inf. Process. Syst. 2024, 37, 27440–27462. [Google Scholar]
  28. Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 64–67. [Google Scholar]
  29. Mai, H.; Yin, Z. A new zero-order algorithm to solve the maximum hands-off control. IEEE Trans. Autom. Control 2023, 69, 2761–2768. [Google Scholar] [CrossRef]
  30. Zou, W.; Gao, H.; Yang, W.; Liu, T. Wave-mamba: Wavelet state space model for ultra-high-definition low-light image enhancement. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 1534–1543. [Google Scholar]
  31. Osadchiy, A.; Kamenev, A.; Saharov, V.; Chernyi, S. Signal processing algorithm based on discrete wavelet transform. Designs 2021, 5, 41. [Google Scholar] [CrossRef]
  32. Azad, R.; Aghdam, E.K.; Rauland, A.; Jia, Y.; Avval, A.H.; Bozorgpour, A.; Karimijafarbigloo, S.; Cohen, J.P.; Adeli, E.; Merhof, D. Medical Image Segmentation Review: The Success of U-Net. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10076–10095. [Google Scholar] [CrossRef] [PubMed]
  33. Zhang, Y.; Zhang, J.; Guo, X. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1632–1640. [Google Scholar]
  34. Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10561–10570. [Google Scholar]
  35. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. Enlightengan: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef] [PubMed]
  36. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17683–17693. [Google Scholar]
  37. Cui, Z.; Li, K.; Gu, L.; Su, S.; Gao, P.; Jiang, Z.; Qiao, Y.; Harada, T. You only need 90k parameters to adapt light: A light weight transformer for image enhancement and exposure correction. arXiv 2022, arXiv:2205.14871. [Google Scholar] [CrossRef]
  38. Fu, Z.; Yang, Y.; Tu, X.; Huang, Y.; Ding, X.; Ma, K.K. Learning a simple low-light image enhancer from paired low-light instances. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22252–22261. [Google Scholar]
  39. Ma, L.; Ma, T.; Liu, R.; Fan, X.; Luo, Z. Toward fast, flexible, and robust low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5637–5646. [Google Scholar]
  40. Jie, H.; Zuo, X.; Gao, J.; Liu, W.; Hu, J.; Cheng, S. Llformer: An efficient and real-time lidar lane detection method based on transformer. In Proceedings of the 2023 5th International Conference on Pattern Recognition and Intelligent Systems, Shenyang, China, 28–30 July 2023; pp. 18–23. [Google Scholar]
  41. Yan, Q.; Feng, Y.; Zhang, C.; Pang, G.; Shi, K.; Wu, P.; Dong, W.; Sun, J.; Zhang, Y. Hvi: A new color space for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 5678–5687. [Google Scholar]
  42. Guo, X.; Hu, Q. Low-light image enhancement via breaking down the darkness. Int. J. Comput. Vis. 2023, 131, 48–66. [Google Scholar] [CrossRef]
  43. Wang, W.; Yang, H.; Fu, J.; Liu, J. Zero-reference low-light enhancement via physical quadruple priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26057–26066. [Google Scholar]
  44. Hou, J.; Zhu, Z.; Hou, J.; Liu, H.; Zeng, H.; Yuan, H. Global structure-aware diffusion process for low-light image enhancement. Adv. Neural Inf. Process. Syst. 2023, 36, 79734–79747. [Google Scholar]
Figure 2. The proposed Global-to-Local Mamba architecture constructs a hierarchical framework that integrates wavelet transform-based upsampling and downsampling operations. Specifically, the Low-Frequency Mamba Block (LFMBlock) performs multi-scale feature extraction on frequency-decoupled components, while the High-Frequency Guided Enhancement Block (HFGBlock) refines details through gradient-aware attention mechanisms.
Figure 3. The 2D-SSM framework functions by decomposing the input image via bidirectional diagonal scanning mechanisms. Specifically, the spatial tensor is disentangled along two principal axes: the primary diagonal (extending from the top-left to the bottom-right) and the secondary diagonal (extending from the bottom-left to the top-right). These scans are executed in both forward and reverse orientations, thereby generating four distinct scanning trajectories.
Figure 4. (a) Local Feature Extraction (LFM). (b) Multi-Scale Feed-Forward Network (MSFFN).
Figure 5. (a) Frequency-Matching Attention Module. (b) Frequency Matching Transformation. The design of these two modules is an updated version based on [30].
Figure 6. Visualization of the low-light enhancement model. Each column is a different image example, and each row is the prediction of the models. Comparative visualization style is an updated version based on [30].
Table 1. Quantitative comparisons on the LOL-v1 dataset. The best result is in red color while the second best result is in blue color.
Methods | Venue | PSNR | SSIM | LPIPS
RetinexNet [26] | BMVC 2018 | 16.77 | 0.56 | 0.47
KinD [33] | MM 2019 | 17.65 | 0.72 | 0.17
ZeroDCE [23] | CVPR 2020 | 14.86 | 0.58 | 0.33
RUAS [34] | CVPR 2021 | 16.40 | 0.50 | 0.27
EnlightenGAN [35] | TIP 2021 | 17.48 | 0.76 | 0.32
UFormer [36] | CVPR 2022 | 16.36 | 0.77 | 0.32
IAT [37] | BMVC 2022 | 21.30 | 0.81 | 0.32
PairLIE [38] | CVPR 2023 | 19.51 | 0.73 | 0.24
SCI [39] | CVPR 2022 | 14.78 | 0.52 | 0.33
LLFormer [40] | AAAI 2023 | 23.65 | 0.84 | 0.16
WaveMamba [30] | ACM MM 2024 | 23.27 | 0.84 | 0.14
HVI [41] | CVPR 2025 | 23.80 | 0.85 | 0.08
Ours | - | 23.91 | 0.84 | 0.14
Table 2. Quantitative comparisons on the LOL-v2-synthetic dataset. The best result is in red color while the second best result is in blue color.
Methods | Venue | PSNR | SSIM | LPIPS
RetinexNet [26] | BMVC 2018 | 17.13 | 0.76 | 0.25
KinD [33] | MM 2019 | 18.32 | 0.79 | 0.25
RUAS [34] | CVPR 2021 | 13.76 | 0.63 | 0.30
ZeroDCE [23] | CVPR 2020 | 17.71 | 0.81 | 0.17
Uformer [36] | CVPR 2022 | 19.66 | 0.87 | -
PairLIE [38] | CVPR 2023 | 19.07 | 0.79 | 0.23
LLFormer [40] | AAAI 2023 | 24.03 | 0.90 | 0.06
Bread [42] | IJCV 2023 | 17.63 | 0.91 | 0.09
GSAD [44] | NeurIPS 2023 | 24.47 | 0.92 | 0.05
QuadPrior [43] | CVPR 2024 | 16.10 | 0.75 | 0.11
Wave-Mamba [30] | ACM MM 2024 | 24.76 | 0.92 | 0.06
Ours | - | 24.87 | 0.93 | 0.06
Table 3. Model parameters and computational efficiency.
Method | Param (M) | FLOPs (G)
KinD | 8.02 | 34.99
PairLIE | 0.33 | 20.81
LLFormer | 24.55 | 22.52
Uformer | 50.88 | 45.90
GSAD | 217.36 | 442.02
QuadPrior | 1252.71 | 103.2
Ours | 1.9 | 9.02
Table 4. Ablation study on the proposed modules. The highest scores are indicated in red.
Method | PSNR | SSIM
w/o LFMBlock | 22.81 | 0.82
w/o HFGBlock | 22.95 | 0.83
w/o LFE | 23.06 | 0.83
w/o MSFFN | 23.12 | 0.83
w/o FMT | 23.58 | 0.84
Full Model | 23.91 | 0.84

Share and Cite

MDPI and ACS Style

Li, W.; Wu, X.; Guan, Y.; Lin, S.; Ding, N.; Wang, Q.; Tang, Y. GLMA: Global-to-Local Mamba Architecture for Low-Light Image Enhancement. Appl. Sci. 2025, 15, 10931. https://doi.org/10.3390/app152010931

AMA Style

Li W, Wu X, Guan Y, Lin S, Ding N, Wang Q, Tang Y. GLMA: Global-to-Local Mamba Architecture for Low-Light Image Enhancement. Applied Sciences. 2025; 15(20):10931. https://doi.org/10.3390/app152010931

Chicago/Turabian Style

Li, Wentao, Xinhao Wu, Yu Guan, Sen Lin, Naida Ding, Qiang Wang, and Yandong Tang. 2025. "GLMA: Global-to-Local Mamba Architecture for Low-Light Image Enhancement" Applied Sciences 15, no. 20: 10931. https://doi.org/10.3390/app152010931

APA Style

Li, W., Wu, X., Guan, Y., Lin, S., Ding, N., Wang, Q., & Tang, Y. (2025). GLMA: Global-to-Local Mamba Architecture for Low-Light Image Enhancement. Applied Sciences, 15(20), 10931. https://doi.org/10.3390/app152010931
