1. Introduction
Low-light images are commonly captured under insufficient illumination and are degraded by low ambient light, sensor noise, and uneven exposure, leading to missing details in dark regions, amplified noise, and color distortion [1]. Such images arise in nighttime surveillance, autonomous driving, robotics, mobile photography, and medical imaging, where degraded visibility severely hinders both visual quality and downstream vision tasks. Consequently, Low-Light Image Enhancement (LLIE) aims to recover perceptually pleasing images by improving brightness and contrast, restoring hidden details, correcting colors, and suppressing noise without introducing overexposure or color shifts.
Existing LLIE methods face two primary challenges. The first is the limited adaptability of color spaces. Conventional color spaces (e.g., HSV [2]) are not sufficiently robust in low-light conditions and often suffer from unstable color behavior. In particular, the hue-axis discontinuity and low-intensity noise can induce color artifacts and noticeable color shifts, degrading visual quality and downstream reliability. The second is balancing performance and efficiency in feature modeling. CNN-based models [3, 4, 5] are effective for local details but are less capable of modeling long-range dependencies, which may lead to non-uniform exposure and inconsistent color. Transformers [6] improve global consistency via long-range interactions, but their computational overhead can be substantial for high-resolution inputs, limiting practical deployment. It remains challenging to achieve a favorable trade-off among color fidelity, detail preservation, global consistency, and efficiency.
To address these two challenges, this study proposes HVIFormer, a dual-stage low-light image enhancement framework based on the HVI (Horizontal/Vertical-Intensity) representation [7]. It leverages a trainable HVI representation to provide an explicit illumination prior while preserving chroma/structure cues. Unlike fixed-rule color spaces (e.g., HSV [2]), HVI is trainable and tailored for LLIE, allowing it to adapt to diverse brightness scales and color variations. Specifically, the input sRGB image is mapped into HVI and decomposed into a global intensity map I and an HVI feature map. The intensity map I serves as a unified reference for exposure calibration, while the HVI feature map retains detail and color cues. Based on this decomposition, HVIFormer adopts a two-stage coarse-to-fine design: Stage I performs intensity-guided Transformer pre-recovery using I as an illumination prior, and introduces an Intensity-Conditioned Block (ICB) to calibrate global exposure and suppress dominant noise. Stage II refines details and color based on the Stage-I result, using lightweight Complementary Cross-Attention (CCA) for efficient cross-branch fusion. This division of labor reduces over-exposure and color shift while preserving local fidelity. Extensive experiments on 10 paired/unpaired low-light datasets demonstrate robust performance, producing uniformly exposed images with rich details and natural colors.
Our contributions can be summarized as follows:
We propose a novel two-stage deep learning framework for low-light image enhancement based on the HVI color space representation to address the two challenges above. This framework improves the visual quality of enhanced low-light images.
The first stage of the framework is the Transformer pre-recovery stage. At this stage, we introduce intensity prior conditions and adaptive mechanisms, combined with Intensity-Conditioned Blocks (ICBs), to improve the stability of image enhancement.
The second stage performs image refinement and enhancement. At this stage, we introduce Complementary Cross-Attention (CCA), which reduces over-enhancement, unnatural color shifts, and dark-region noise artifacts.
We conduct comparative experiments on 10 datasets, analyzing visual effects qualitatively and comparing 10 quantitative metrics. The results show that the overall performance of the proposed HVIFormer is superior to that of all compared state-of-the-art (SOTA) methods.
3. Method
The primary challenge in low-light enhancement lies in the need to significantly brighten dark regions to restore visible structure; however, this brightening simultaneously amplifies noise and causes color distortion. At the same time, uneven local exposure results in different brightness and color distributions for the same object in different regions, making it difficult for models that rely solely on local operations to maintain global consistency. To address this, this study proposes a dual-stage enhancement framework in the HVI domain (Figure 1), consisting of Stage-I intensity-conditioned Transformer pre-recovery and Stage-II dual-branch refinement enhancement. By decoupling global brightness recovery and feature preprocessing from image detail enhancement and color refinement, the proposed framework improves the model’s stability in extremely dark and high-noise scenarios.
The overall approach of our method is as follows: Given an input low-light image, HVIFormer first maps it to the HVI space, explicitly decomposing it into the intensity map I and the HVI map containing color and structural information. This color space transformation effectively decouples illumination information from color and structural information.
Stage-I: Intensity-Conditioned Transformer Pre-Recovery Stage. In this stage, the intensity component is introduced as an explicit illumination prior for intensity-conditioned feature interaction. Through element-wise scaling, the intensity features dynamically guide and adjust the attention mechanism to aggregate features, enabling feature interactions in different brightness regions to adopt different strategies. In dark regions, the HVI feature map integrates reliable contextual information more effectively to recover the image structure and suppress noise amplification. In relatively bright regions, over-enhancement is avoided to preserve the original color and details of the image. By leveraging the Transformer architecture, Stage-I can effectively model global dependencies, ensuring global lighting consistency and optimizing detail recovery when processing regions with varying brightness.
Furthermore, Stage-I introduces a bidirectional interaction update mechanism between content (HVI feature) and intensity (intensity feature). In this mechanism, the intensity information first serves as a guide, using the Intensity-Conditioned Block (ICB) to control and direct the update of the image content features (such as color and structure). More importantly, the content features, in turn, adjust the intensity, making it not a fixed condition but one that can be continuously optimized based on the actual structure and object information in the image. This significantly improves the stability and consistency of the algorithm when handling real-world scenarios with issues such as local exposure inconsistency and complex noise. Through its self-attention mechanism, the Transformer effectively captures long-range dependencies between different regions of the image, providing stronger support for global recovery and detail processing of the image.
Stage-I finally outputs two results, the corrected intensity component and the denoised HVI feature, providing a more reliable and consistent input for Stage-II.
Stage-II: Dual-Branch Refinement Enhancer. In this stage, based on the input provided by Stage-I, a dual-branch U-Net architecture with six lightweight Complementary Cross-Attention (CCA) blocks is used to further refine the image’s brightness, recover details, and stabilize the color.
Finally, through the Perceptual Inverse HVI Transform (PHVIT), the enhanced HVI features are stably mapped back to the sRGB space, yielding the final enhanced image.
The primary logic of the dual-stage design is as follows: The first stage serves as a pre-recovery phase, focusing on unifying the overall brightness of the image and performing initial noise and degradation suppression in dark regions, thereby reducing the learning difficulty for the subsequent network. The second stage focuses on fine-grained processing to recover local details and ensure color stability and consistency. This dual-stage structure effectively decouples the tasks, avoiding conflicts that arise when a single network is tasked with handling both global brightness adjustment and local detail enhancement, thus preventing unstable artifacts in the final result.
Algorithm 1 outlines the primary process of the proposed HVIFormer in the HVI domain, from coarse to fine enhancement. This process follows a closed-loop design of color-space transformation, dual-stage optimization, and inverse transformation, adhering to the coarse-to-fine enhancement logic and jointly optimizing brightness consistency, detail integrity, and color naturalness.
Table 2 summarizes the modules of HVIFormer, their inputs/outputs, and their roles in improving brightness consistency, noise suppression, and color stability.
| Algorithm 1 Proposed HVI-domain coarse-to-fine pipeline. |
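- Require: Low-light sRGB image L
- Ensure: Enhanced sRGB image
- 1: Map L to the HVI space to obtain the intensity map I and the HVI map ▹ sRGB→HVI
- 2: Apply Stage-I to obtain the corrected intensity map and the denoised HVI features ▹ Stage-I intensity-conditioned Transformer pre-recovery
- 3: Apply Stage-II to obtain the enhanced HVI representation ▹ dual-branch refinement enhancement
- 4: Apply PHVIT to map the enhanced HVI representation back to sRGB ▹ HVI→sRGB
- 5: return the enhanced sRGB image
|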
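The data flow of Algorithm 1 can be sketched as a composition of four callables. This is only an illustration of the wiring: the names `enhance`, `to_hvi`, `stage1`, `stage2`, and `phvit`, as well as the trivial stand-ins used in the usage lines, are hypothetical placeholders and not the authors' implementation.

```python
from typing import Callable, Tuple

import numpy as np

Array = np.ndarray


def enhance(
    low: Array,
    to_hvi: Callable[[Array], Tuple[Array, Array]],
    stage1: Callable[[Array, Array], Tuple[Array, Array]],
    stage2: Callable[[Array, Array], Array],
    phvit: Callable[[Array], Array],
) -> Array:
    """Coarse-to-fine pipeline: sRGB -> HVI -> Stage-I -> Stage-II -> sRGB."""
    intensity, hvi = to_hvi(low)             # step 1: sRGB -> HVI
    intensity, hvi = stage1(intensity, hvi)  # step 2: pre-recovery
    hvi_enhanced = stage2(intensity, hvi)    # step 3: dual-branch refinement
    return phvit(hvi_enhanced)               # step 4: HVI -> sRGB


# Hypothetical stand-ins that only demonstrate how the stages connect.
def _to_hvi(img: Array) -> Tuple[Array, Array]:
    return img.max(axis=-1), img


def _stage1(i: Array, f: Array) -> Tuple[Array, Array]:
    return i, f


def _stage2(i: Array, f: Array) -> Array:
    return f


def _phvit(f: Array) -> Array:
    return np.clip(f, 0.0, 1.0)


dummy = np.random.default_rng(0).random((4, 4, 3))
out = enhance(dummy, _to_hvi, _stage1, _stage2, _phvit)
```

Passing the stages as arguments keeps the sketch independent of any particular network definition; each placeholder can be swapped for a trained module with the same input and output structure.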
3.1. HVI Representation of the Image
To ensure the completeness of the method description, we first restate the HVI representation of an image following [39]. The trainable HVI color space transformation begins by calculating the intensity map of the image as the maximum sRGB channel value at each pixel:
$$ I = \max_{c \in \{R, G, B\}} L_c, $$
where $L_c$ denotes channel $c$ of the input sRGB image, thus extracting the scene’s brightness information. Then, using the intensity map and the original image, an HV color map is generated that combines color and structural information. The horizontal and vertical components are computed as follows:
where ⊙ denotes element-wise multiplication;
is a low-intensity color plane density adjusted by the trainable density parameter
, given by the following formula:
where
S represents color saturation; and
is the saturation adjusted by the training function
, with
. The horizontal and vertical components
and
are computed using the hue value
. The hue mapping
is defined as follows:
where
, and
represents the hue value. The adaptive linear color perception mapping adjusts for color shift and thereby mitigates the color distortion caused by cameras in low-light environments. The resulting representation avoids the hue discontinuity and black-plane issues of the traditional HSV color space, making it better suited to low-light image enhancement.
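As a rough illustration of this kind of transform, the sketch below computes a max-channel intensity map and a polarized HV plane with NumPy. Several parts are assumptions rather than the formulation used here: the density term is taken as `(sin(pi * I / 2) + eps) ** (1 / k)`, the HV plane is taken as the cosine and sine of the hue angle scaled by density and saturation, and the trainable saturation and hue mappings are omitted entirely. The function name `rgb_to_hvi` is hypothetical.

```python
import numpy as np


def rgb_to_hvi(rgb, k=1.0, eps=1e-8):
    """Map an sRGB image in [0, 1], shape (H, W, 3), to (h_map, v_map, intensity)."""
    i_max = rgb.max(axis=-1)          # intensity: maximum sRGB channel
    i_min = rgb.min(axis=-1)
    delta = i_max - i_min
    sat = delta / (i_max + eps)       # standard HSV saturation

    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    safe = np.where(delta > 0, delta, 1.0)   # avoid division by zero on gray pixels
    hue = np.zeros_like(i_max)
    hue = np.where(i_max == r, ((g - b) / safe) % 6.0, hue)
    hue = np.where((i_max == g) & (i_max != r), (b - r) / safe + 2.0, hue)
    hue = np.where((i_max == b) & (i_max != r) & (i_max != g),
                   (r - g) / safe + 4.0, hue)
    hue = np.where(delta > 0, hue / 6.0, 0.0)  # hue in [0, 1)

    # ASSUMED density: collapses the color plane toward zero in dark regions.
    c_k = (np.sin(np.pi * i_max / 2.0) + eps) ** (1.0 / k)
    h_map = c_k * sat * np.cos(2.0 * np.pi * hue)
    v_map = c_k * sat * np.sin(2.0 * np.pi * hue)
    return h_map, v_map, i_max
```

Under these assumptions, gray and black pixels map to the origin of the HV plane, and the hue angle is continuous across the red boundary because it enters only through its cosine and sine.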
3.2. Stage-I Intensity-Conditioned Transformer Pre-Recovery
In extremely dark scenes, image degradation is not merely a result of underexposure. In fact, the issue involves a complex interplay among artifacts, including uneven brightness distribution, increased noise in dark areas, and severe color distortion.
If the image converted to the HVI space is directly fed into the subsequent enhancement network for processing, the network must solve two problems simultaneously: correcting global illumination inconsistencies, and restoring local details and color information. This coupling of tasks makes training more difficult and can lead to unstable phenomena, such as local overexposure, color bias, and artifacts in dark areas.
Therefore, an intensity-conditioned pre-recovery phase is introduced before performing refinement enhancement. The goal of this phase is not to complete the final enhancement, but rather to use intensity information to guide the preprocessing and normalization of the input features. Through this process, the subsequent enhancement network can focus more on restoring image details and optimizing color, thereby avoiding the aforementioned problems.
In the HVI representation, we extract the intensity component I, which clearly reflects the image’s lighting conditions. I is strongly associated with exposure levels and thus helps us distinguish brightness differences in various regions of the image. Additionally, when noise in the dark areas of the image is high, I provides a natural guide, helping us determine which areas require stronger recovery processing and which areas should avoid excessive enhancement.
Based on this, we design Stage-I as an intensity-based feature recovery phase rather than relying on the network to infer illumination information on its own. By leveraging the illumination information provided by I, we can process the image more precisely, ensuring a more stable and reasonable recovery.
3.2.1. Multi-Scale U-Shaped Pre-Recovery Module with Dual Output
In low-light images, image degradation is not only reflected in overall underexposure but also includes loss of details in local regions and increased noise. To address these issues, Stage-I adopts a multi-scale encoder–decoder structure (Figure 1 (Stage-I)). This structure expands the receptive field through the encoder, enabling global consistency, while the decoder restores spatial details via skip connections, preventing excessive smoothing and ensuring image details are not blurred.
The encoder of Stage-I contains two levels of downsampling, forming feature maps at three resolutions (full, 1/2, and 1/4 of the input resolution). At each resolution, several Intensity-Conditioned Blocks (ICBs) update the features, followed by downsampling through strided convolutions to enlarge the receptive field and increase channel capacity. Specifically, we map the input image from the standard sRGB color space to the HVI color space, obtaining its HVI representation while explicitly extracting the intensity map I. A convolution layer first maps the HVI representation and I to the same channel dimension C, resulting in two feature representations: a content feature extracted from the HVI representation and an intensity feature extracted from the intensity map. Since both have the same spatial resolution and channel size C, they can readily be used in the subsequent conditional interaction modeling. On top of this, another ICB is applied, followed by a strided convolution that downsamples the features and increases the channel size. The second encoder level stacks two ICBs at 1/2 resolution and applies the same convolution operation to reduce the resolution to 1/4, again increasing the channel size. At the lowest resolution, we further stack two ICBs as a bottleneck module to better model global dependencies and restore consistency between regions.
The decoder mirrors the encoder and includes two levels of upsampling. Starting from the bottleneck features, it progressively upsamples through 2 × 2 transposed convolutions (stride 2), with skip connections that fuse features from the corresponding encoder level at each scale. The number of channels gradually decreases back to C. Finally, a 3 × 3 convolution is applied to obtain the output features.
Rather than directly outputting the enhanced image, Stage-I is designed as a pre-recovery module at the feature level and generates two outputs. First, we predict the residual of the HVI feature map and obtain the adjusted HVI features through a residual update, which makes them more consistent with real lighting conditions. Second, we predict the residual of the intensity map and add it to the original intensity map I, obtaining the corrected intensity map, which provides more reliable guidance for subsequent enhancement.
Thus, the output of Stage-I is the pair formed by the adjusted HVI features and the corrected intensity map, which serves as input to Stage-II. This design separates the tasks of correcting brightness consistency and optimizing content features, avoiding coupling between the two. In this way, Stage-I provides more stable and controllable input conditions for subsequent image refinement and enhancement, improving the robustness and reliability of the entire process.
3.2.2. Cross-Branch Interaction Update Between Content and Intensity
Although the intensity feature provides an effective illumination-condition signal, in extremely dark regions it may become unreliable due to noise or exposure inconsistencies, causing the conditional signal to fail. In contrast, the HVI content feature usually contains more stable structural and texture information, which can help correct the bias in the intensity map. To overcome this issue, we introduce a bidirectional interaction update mechanism between HVI features and intensity features in the multi-scale U-shaped structure: on one hand, the intensity feature guides the update of the content feature, while on the other hand, the content feature in turn corrects the intensity feature, allowing the intensity map to adaptively adjust under the guidance of structural and texture information, rather than remaining fixed.
Specifically, at each scale, the content feature and the intensity feature are iteratively updated through the same interaction module (ICB, Figure 2). In addition to the intensity-guided update of the content feature, the content feature is also used to correct the intensity feature, which enhances the intensity map’s ability to perceive structure and texture and improves its stability.
In the t-th iteration of the interaction, the same basic block is used to update both features. This block is the ICB, a module consisting of Intensity-Conditioned Multi-Head Self-Attention (IC-MHSA) and a Feed-Forward Network (FFN). This cross-branch interaction update mechanism continuously corrects the intensity map based on structural and texture information, so that even regions with large illumination differences in real low-light scenes can receive more reliable enhancement, thereby improving the stability of the enhancement effect.
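The alternating update described above can be sketched as a short loop in which one shared block is applied twice per iteration. This is a simplified illustration: `bidirectional_update` and `toy_block` are hypothetical names, the toy block is a stand-in for an ICB rather than its actual structure, and the choice to correct the intensity feature with the already-updated content feature is an assumption about the update order.

```python
from typing import Callable, Tuple

import numpy as np

Block = Callable[[np.ndarray, np.ndarray], np.ndarray]


def bidirectional_update(
    content: np.ndarray,
    intensity: np.ndarray,
    block: Block,
    steps: int = 2,
) -> Tuple[np.ndarray, np.ndarray]:
    """Alternate content and intensity updates using one shared block."""
    for _ in range(steps):
        content = block(content, intensity)    # intensity guides content
        intensity = block(intensity, content)  # content corrects intensity
    return content, intensity


def toy_block(x: np.ndarray, cond: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for an ICB: residual update gated by `cond`."""
    return x + 0.1 * np.tanh(cond) * x
```

Because the same callable serves both directions, the intensity branch is no longer a fixed condition: it changes after every iteration along with the content branch.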
3.2.3. Intensity-Conditioned Multi-Head Self-Attention
Specifically, we first reshape the input HVI image feature into a token tensor $X \in \mathbb{R}^{N \times C}$, where $N$ is the number of spatial positions. Next, we adopt h independent attention heads, where each attention head has a feature dimension of d, satisfying the channel dimension constraint $C = h \times d$.
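We first obtain the queries (Q), keys (K), and values (V) via linear projections (Figure 2b):
$$ Q = X W_Q, \quad K = X W_K, \quad V = X W_V, $$
where $X \in \mathbb{R}^{N \times C}$ denotes the reshaped token features with $N$ spatial positions, and $W_Q, W_K, W_V \in \mathbb{R}^{C \times C}$ are learnable linear projection matrices, so $Q, K, V \in \mathbb{R}^{N \times C}$.
Then, we split Q, K, and V into h groups along the channel dimension according to the number of heads:
$$ Q = [Q_1, \dots, Q_h], \quad K = [K_1, \dots, K_h], \quad V = [V_1, \dots, V_h], $$
where each $Q_i, K_i, V_i \in \mathbb{R}^{N \times d}$, $i = 1, \dots, h$.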
Next, we generate fusion weights G aligned with the HVI image features from the previously extracted intensity features. The weights dynamically adjust the update magnitude across spatial locations and channels based on intensity information, enabling intensity-dependent local enhancement. Specifically, we re-weight the value features in the attention mechanism element-wise:
$$ \hat{V} = V \odot G, $$
where ⊙ denotes element-wise multiplication, and G is produced by a function that processes the intensity feature.
This strategy can be understood as adaptive fusion guided by intensity information: in low-light regions, G tends to amplify the contribution of effective context, helping to restore structural details in the image and suppress noise diffusion; in brighter regions, G suppresses excessive updates to avoid over-enhancement, maintaining the naturalness and consistency of color and texture. To achieve this, we do not impose explicit range constraints on G, but instead allow the model to learn the appropriate scaling magnitude and direction through end-to-end training, enabling it to dynamically adjust its enhancement strategy in different environments.
We reshape and split the previously generated fusion weight into h parts, obtaining one weight map $G_i$ per head with the same shape as the corresponding value features $V_i$.
We then perform conditional scaling of the V features for each head by multiplying each head’s values element-wise with its weight map $G_i$:
$$ \hat{V}_i = V_i \odot G_i, \quad i = 1, \dots, h. $$
Next, each head computes attention independently using its queries, keys, and scaled values. Finally, the output features of all heads are concatenated, and feature fusion is completed via a linear mapping, incorporating 2D positional encoding to preserve image spatial location information, which yields the output of the intensity-conditioned multi-head self-attention.
Unlike simply statically concatenating intensity-guided features with HVI space image features, this intensity-prior-based adaptive fusion mechanism recalculates the fusion weights G during each global aggregation. This allows us to explicitly model relationships across different lighting regions: in dark areas, the model relies more on effective contextual information to help restore details and suppress noise; in bright areas, the model updates more conservatively to prevent over-enhancement or color distortion.
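The steps of this subsection (projection, head splitting, intensity-conditioned value scaling, per-head attention, concatenation, and output projection) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the attention is taken to be standard scaled dot-product attention over spatial tokens, the fusion weights `g` are passed in directly instead of being produced from the intensity feature, and the 2D positional encoding is omitted. The name `ic_mhsa` and all parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def ic_mhsa(x, g, w_q, w_k, w_v, w_o, heads):
    """x, g: (N, C) tokens and fusion weights; w_*: (C, C); returns (N, C)."""
    n, c = x.shape
    d = c // heads
    assert d * heads == c, "channel size must equal heads * d"

    def split(t):
        # (N, C) -> (heads, N, d): consecutive channel groups form the heads
        return t.reshape(n, heads, d).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    v = v * split(g)                        # intensity-conditioned value scaling
    # ASSUMED: scaled dot-product attention over the N spatial tokens
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (heads, N, N)
    out = attn @ v                          # (heads, N, d)
    out = out.transpose(1, 0, 2).reshape(n, c)  # concatenate heads
    return out @ w_o                        # output projection
```

In this form the output is linear in the fusion weights: setting `g` to all ones recovers ordinary multi-head self-attention, while larger or smaller entries scale the contribution of the corresponding values before aggregation.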
3.3. Dual-Branch Refinement Enhancement
After completing the intensity-conditioned Transformer pre-recovery in Stage-I, we feed its output as a more stable input condition into Stage-II for refinement enhancement (Figure 1 (Stage-II)). Stage-II adopts a dual-branch structure to process intensity and color/structural information separately: the I branch learns the brightness mapping to avoid underexposure or overexposure, while the HV branch focuses on dark-region denoising and color stabilization to suppress color bias and noise textures. The two branches exchange complementary information through cross-branch interaction, allowing brightness enhancement, color correction, and detail recovery to be jointly optimized. To further strengthen the interaction between the brightness branch and the color branch, we adopt a lightweight Complementary Cross-Attention (CCA) module (Figure 3). CCA learns the complementary information between the HV branch and the intensity branch through the Cross-Attention Block (CAB) mechanism, promoting their collaborative optimization during enhancement. Specifically, the HV branch handles HVI features, while the I branch processes intensity features, and the CAB establishes a mutually guiding relationship between the two branches.
The CAB has a symmetrical structure between the I-branch and the HV-branch [39]. We use the HV-branch as an example to describe the details. Given the input of the HV-branch, the CAB first derives the query (Q) through a feature embedding convolution layer, while the key (K) and value (V) are obtained through two further feature embedding convolution layers. The attention output is then computed from Q, K, and V using a multi-head factor [36] and feature embedding convolutions.
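To make the idea of cross-branch attention concrete, the following single-head NumPy sketch lets one branch query the other. It is a simplified illustration and not the CAB itself: dense matrices replace the feature embedding convolutions, the assignment of the query to the branch being updated and of the key and value to the complementary branch is an assumption, and so is the residual connection. The name `cross_attention` and the `scale` parameter (standing in for the multi-head factor) are hypothetical.

```python
import numpy as np


def cross_attention(own, other, w_q, w_k, w_v, scale):
    """own: (N, C) tokens of the updated branch; other: (M, C) tokens of the
    complementary branch; w_*: (C, C); returns (N, C)."""
    q = own @ w_q                    # query from the branch being updated
    k = other @ w_k                  # key from the complementary branch
    v = other @ w_v                  # value from the complementary branch
    logits = (q @ k.T) / scale       # (N, M) similarity scores
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return own + weights @ v         # ASSUMED residual keeps the branch's own content
```

Calling the function twice with the roles of the two branches swapped gives the symmetric behavior described for the I-branch and the HV-branch.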
Based on Retinex theory, the color denoise layer (CDL) decomposes the updated feature tensor into an illumination component and a reflectance component, which are obtained through two separate feature embedding convolution layers. The CDL then combines these two components using element-wise multiplication (⊙) and depth-wise convolution layers. Finally, a residual connection is added to the output of the CDL to mitigate the vanishing gradient problem in deep network training and simplify the model training process.
Instead of directly using the original HVI input, we use the intensity map corrected by Stage-I as a more reliable guidance signal and input it along with the Stage-I enhanced HVI image features into the dual-branch enhancer. Since Stage-I has already performed global illumination consistency correction and preliminarily suppressed degradation, Stage-II can focus more on local detail recovery and color refinement, thus reducing problems such as overexposure, dark-region noise amplification, and color distortion. Specifically, in Stage-II, we input the output from Stage-I into the dual-branch refinement enhancer for further enhancement, resulting in the enhanced HVI representation.
PHVIT (Perceptual Inverse HVI Transform) maps the HVI representation back to the HSV color space and is used to obtain the final sRGB enhancement result by restoring the Stage-II output from the HVI space to the sRGB space. Overall, PHVIT forms a surjective mapping, thereby covering the valid representation domain of HSV; meanwhile, by introducing controllable parameters, it enables the saturation and brightness of an image to be adjusted independently. To ensure that the mapping is injective (and thus invertible) in computation, PHVIT first constrains the output components to valid numerical ranges to avoid outliers that may cause color overflow. It then defines two intermediate variables, which include a term that improves numerical stability. Next, according to the estimated intensity component, the polarized-plane components are de-normalized, and the hue and saturation are recovered by inverting the 2D coordinates: the hue map is computed from the inverse polar angle, while the saturation is obtained from the planar radius. Specifically, the hue map is formulated using an inverse piecewise-linear function, with quantities defined in Equation (4). The saturation and value maps are perceptually estimated using two customizable linear parameters that adjust the image saturation and brightness, respectively, with the restored intensity used as the HSV value channel. Finally, the HSV image is converted to an sRGB image [40] via the standard HSV→sRGB mapping, yielding the final enhanced image. This step ensures that the Stage-II output can be stably transformed back to the sRGB space, closing the two-stage pipeline and facilitating both visualization and quantitative evaluation.
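The inversion steps described above (range clamping, de-normalization by an intensity-dependent density, hue from the polar angle, saturation from the planar radius, and a standard HSV to sRGB conversion) can be sketched as follows. Several parts are assumptions: the density is taken as `(sin(pi * I / 2) + eps) ** (1 / k)`, the inverse piecewise-linear hue function is omitted, and `alpha_s` and `alpha_i` stand in for the customizable linear saturation and brightness parameters. The function name `hvi_to_rgb` is hypothetical.

```python
import colorsys

import numpy as np


def hvi_to_rgb(h_map, v_map, intensity, k=1.0, alpha_s=1.0, alpha_i=1.0, eps=1e-8):
    """Map HV-plane components and an intensity map back to sRGB in [0, 1]."""
    intensity = np.clip(intensity, 0.0, 1.0)          # clamp to valid range
    # ASSUMED density, used here to de-normalize the HV plane.
    c_k = (np.sin(np.pi * intensity / 2.0) + eps) ** (1.0 / k)
    h_hat = np.clip(h_map, -1.0, 1.0) / (c_k + eps)   # eps avoids division by zero
    v_hat = np.clip(v_map, -1.0, 1.0) / (c_k + eps)

    hue = (np.arctan2(v_hat, h_hat) / (2.0 * np.pi)) % 1.0       # polar angle
    sat = np.clip(alpha_s * np.sqrt(h_hat ** 2 + v_hat ** 2), 0.0, 1.0)  # radius
    val = np.clip(alpha_i * intensity, 0.0, 1.0)      # intensity as HSV value

    # Per-pixel standard HSV -> RGB conversion from the Python standard library.
    to_rgb = np.vectorize(colorsys.hsv_to_rgb)
    r, g, b = to_rgb(hue, sat, val)
    return np.stack([r, g, b], axis=-1)
```

Setting `alpha_s` or `alpha_i` away from 1 changes saturation or brightness independently, which reflects the independent adjustment mentioned in the text. The per-pixel `np.vectorize` call is chosen for clarity and is slow on large images.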
3.4. Comparison with the HVI-CIDNet Method
Although our method is built upon the HVI representation, HVIFormer does not claim novelty in the HVI space itself. Instead, our contributions focus on how intensity is explicitly modeled and exploited to drive restoration, together with a two-stage collaborative design that is absent in CIDNet-style HVI pipelines. The key differences are as follows:
Two-stage collaboration: We separate restoration into Stage-I intensity-conditioned global pre-recovery (illumination calibration and dominant noise suppression) and Stage-II refinement (detail/color restoration), which is more stable for extremely dark inputs where single-stage pipelines often over-amplify noise or drift in color.
ICB: Rather than using intensity as a simple auxiliary cue, ICB couples intensity and content features bi-directionally, enabling mutual correction between illumination structure and scene content under severe low-light conditions.
IC-MHSA: We introduce intensity-conditioned MHSA where the intensity prior gates the attention update, yielding region-adaptive enhancement and mitigating over-enhancement and color shift.
CCA-based dual-branch refinement: Stage-II adopts dual branches (intensity-focused vs. chroma/detail-focused) and uses Complementary Cross-Attention (CCA) for controlled information exchange, improving denoising, texture recovery, and color fidelity beyond single-stream CIDNet-style designs.
3.5. Loss Function
To simultaneously improve overall exposure and dark-region detail recovery, we apply joint supervision on the final enhanced result during training of the two-stage framework, measuring its quality in both the sRGB and HVI spaces.
In the HVI color space, we use an L1 loss $\mathcal{L}_1$ [41], an edge loss $\mathcal{L}_e$ [42], and a perceptual loss $\mathcal{L}_p$ [43] for the low-light enhancement task. Given the network’s final enhanced result in the HVI space and the HVI map of the corresponding ground truth, the goal is to minimize the difference between them. The loss function is expressed as follows:
$$ \mathcal{L}_{\mathrm{HVI}} = \lambda_1 \mathcal{L}_1 + \lambda_e \mathcal{L}_e + \lambda_p \mathcal{L}_p, $$
where $\lambda_1$, $\lambda_e$, and $\lambda_p$ are weights used to balance each loss term. In the sRGB space, for the restored sRGB image and the original sRGB ground truth, we use the same loss function. The final overall loss function combines the HVI-space loss and the sRGB-space loss, with a hyperparameter that balances the strength of supervision in the two color spaces.
In this way, we ensure that the output image maintains natural, accurate brightness and color in the sRGB space, while effectively enhancing details in dark regions and suppressing noise and color bias in the HVI space.
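The structure of this joint supervision can be sketched as a weighted sum of three terms per color space, combined across the two spaces. The sketch is illustrative only: the gradient-difference `edge_loss` is a simple stand-in rather than the cited edge loss, the perceptual term is passed in as a callable because it depends on a pretrained network, all weights default to 1 as placeholders, and applying `lam_c` to the HVI-space term is an assumption. All function and parameter names are hypothetical.

```python
import numpy as np


def l1_loss(pred, target):
    """Mean absolute error."""
    return float(np.mean(np.abs(pred - target)))


def edge_loss(pred, target):
    """Stand-in edge term: L1 distance between finite-difference gradients."""
    dy = np.diff(pred, axis=0) - np.diff(target, axis=0)
    dx = np.diff(pred, axis=1) - np.diff(target, axis=1)
    return float(np.mean(np.abs(dy)) + np.mean(np.abs(dx)))


def space_loss(pred, target, perceptual, lam_1=1.0, lam_e=1.0, lam_p=1.0):
    """Weighted sum of L1, edge, and perceptual terms in one color space."""
    return (lam_1 * l1_loss(pred, target)
            + lam_e * edge_loss(pred, target)
            + lam_p * perceptual(pred, target))


def total_loss(pred_hvi, gt_hvi, pred_rgb, gt_rgb, perceptual, lam_c=1.0):
    """Joint supervision in the HVI and sRGB spaces."""
    # ASSUMPTION: lam_c scales the HVI-space term; the text only says that a
    # hyperparameter balances the supervision in the two color spaces.
    return (lam_c * space_loss(pred_hvi, gt_hvi, perceptual)
            + space_loss(pred_rgb, gt_rgb, perceptual))
```

A constant brightness offset is penalized by the L1 term but not by the edge term, which shows how the two terms target exposure and structure separately.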
3.6. Evaluation of Image Enhancement Performance
In low-light image enhancement tasks, both qualitative and quantitative evaluations are important for assessing an approach’s performance. Qualitative evaluation mainly relies on visual comparisons to judge the quality of enhanced images, focusing on exposure and brightness distribution, detail recovery, noise suppression, color fidelity, and overall visual consistency. Specifically, the enhanced image should maintain uniform brightness while avoiding overexposure, underexposure, or noise amplification. Dark-region details and textures should be effectively recovered, and the image’s colors should remain consistent with the original, avoiding color bias or excessive saturation. In practical applications, by displaying the enhanced image, comparing it with the original image and other methods, and showing zoomed-in views, the algorithm’s advantages can be more intuitively demonstrated.
Quantitative evaluation relies on multiple evaluation metrics, commonly including PSNR, SSIM [44], and LPIPS [45]. On datasets with ground truth (GT), we first calculate the Peak Signal-to-Noise Ratio (PSNR) to measure the pixel-wise difference between the enhanced image and the ground truth. The PSNR is computed as follows:
$$ \mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right), $$
where MSE is the mean squared error and MAX is the maximum pixel value (typically 255 or 1). A higher PSNR value indicates better image quality.
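A minimal NumPy sketch of this metric, using the standard definition PSNR = 10 · log10(MAX² / MSE), is given below. The function name `psnr` and the default `max_val=1.0` (images scaled to [0, 1]) are choices made for this example.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between two images of equal shape."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

For example, a uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and therefore a PSNR of 20 dB; for 8-bit images, `max_val=255` should be passed instead.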
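Next, we use SSIM (Structural Similarity) to assess the similarity of images in terms of structure, brightness, and contrast. The SSIM is calculated as follows:
$$ \mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, $$
where $\mu_x$ and $\mu_y$ are the mean values of images x and y, $\sigma_x^2$ and $\sigma_y^2$ are the variances, $\sigma_{xy}$ is the covariance, and $C_1$ and $C_2$ are constants, often set as $C_1 = (0.01 D)^2$ and $C_2 = (0.03 D)^2$, where D is the dynamic range of the image.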
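The SSIM expression can be evaluated directly from image statistics, as in the sketch below. Note the simplification: statistics are computed once over the whole image, whereas SSIM as usually reported averages the same expression over local (often Gaussian-weighted) windows, so the values will differ from library implementations. The function name `global_ssim` is hypothetical, and the constants use the common choices 0.01 and 0.03.

```python
import numpy as np


def global_ssim(x, y, dynamic_range=1.0):
    """Single-window SSIM computed from global means, variances, and covariance."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2.0 * mu_x * mu_y + c1) * (2.0 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)
```

The score equals 1 when the two images are identical and decreases as their means, contrasts, or structures diverge.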
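Finally, LPIPS (Learned Perceptual Image Patch Similarity) is used to measure perceptual differences based on deep feature differences. The LPIPS is computed as follows:
$$ \mathrm{LPIPS}(x, y) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\| w_l \odot \left( \hat{f}^{\,l}_{x, hw} - \hat{f}^{\,l}_{y, hw} \right) \right\|_2^2, $$
where $\hat{f}^{\,l}_{x}$ and $\hat{f}^{\,l}_{y}$ are the deep features of images x and y at the l-th layer (unit-normalized along the channel dimension, with spatial size $H_l \times W_l$), $w_l$ is the weight for each layer, and ⊙ denotes element-wise multiplication. A lower LPIPS value indicates that the enhancement result is perceptually closer to the reference.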
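The aggregation step of LPIPS can be illustrated on precomputed features, as sketched below. This is not a full LPIPS implementation: the pretrained backbone that extracts the features and the learned per-channel weights are not included and must be supplied by the caller. The function name `lpips_from_features` and the argument layout are hypothetical.

```python
import numpy as np


def lpips_from_features(feats_x, feats_y, weights, eps=1e-10):
    """feats_*: lists of (C_l, H_l, W_l) arrays; weights: list of (C_l,) arrays."""
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        # Unit-normalize each spatial position along the channel dimension.
        fx = fx / (np.linalg.norm(fx, axis=0, keepdims=True) + eps)
        fy = fy / (np.linalg.norm(fy, axis=0, keepdims=True) + eps)
        diff = w[:, None, None] * (fx - fy)        # channel-wise weighting
        # Squared L2 norm over channels, averaged over spatial positions.
        total += float(np.mean(np.sum(diff ** 2, axis=0)))
    return total
```

Identical feature stacks give a distance of 0, and each layer contributes independently to the sum.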
For low-light images without GT, we use no-reference metrics such as NIQE [46] or BRISQUE [47] to evaluate image quality. These metrics provide auxiliary evaluations by modeling the naturalness and quality of the image. Although they do not rely on GT, they reflect image naturalness and tend to agree with subjective evaluation.
In the experiments, we use two types of datasets to evaluate the model’s performance. The first type consists of datasets with ground-truth references (GT), such as LOLv1, where quantitative evaluation is performed using metrics such as PSNR, SSIM, and LPIPS, and visual comparisons highlight the method’s detail recovery and noise suppression capabilities. The second type consists of real-world datasets without GT, where evaluation mainly relies on no-reference metrics (such as NIQE and BRISQUE), along with qualitative comparisons, to comprehensively demonstrate the method’s enhancement effects.
4. Experiment
4.1. Datasets
We evaluated our model on several commonly used low-light image enhancement (LLIE) benchmark datasets, aiming to test three different scenarios: (1) supervised training using paired images for easy quantitative comparison; (2) testing the model’s generalization ability in real-world scenarios without paired images; and (3) examining the model’s robustness in extremely dark environments.
LOL is one of the most commonly used benchmarks in the LLIE field. It includes low-light images and their corresponding normal-exposure images, making it convenient for objective quantitative evaluation. Specifically, LOLv1 [3] contains 500 image pairs under standard splits, typically using 485 pairs for training and 15 pairs for testing. The image resolution is commonly 400 × 600. LOLv2 [48] is further divided into two subsets, Real and Synthetic, to assess the model’s adaptation to real-world and synthetic degradation distributions. LOLv2-Real typically uses 689 pairs for training and 100 pairs for testing, while LOLv2-Synthetic typically uses 900 pairs for training and 100 pairs for testing. Overall, LOL is mainly used to evaluate the model’s reconstruction performance, structural fidelity, and color recovery under paired supervision, while also reflecting the model’s stability when handling different types of degradation.
The unpaired datasets DICM [
49], LIME [
16], MEF [
50], NPE [
14], and VV [
51] typically only provide low-light images without strictly paired reference images. These datasets are closer to real-world applications of single-image enhancement scenarios. Thus, this setting focuses more on the model’s ability to adapt to different data distributions: the model needs to output natural, clean, and not overly enhanced images without a reference. We evaluate these datasets using no-reference metrics, measuring enhancement performance based on naturalness, noise control, and distortion levels.
The original SICE [
52] dataset contains 589 sets of low-light and over-exposed images. Following a commonly used protocol, we split SICE into training/validation/test sets at a 7:1:2 ratio. Unless otherwise stated, all methods are trained on the SICE training set and evaluated on the official evaluation subsets SICE-Mix and SICE-Grad [
53]. This split protocol and evaluation setting are adopted to avoid ambiguity and ensure reproducibility.
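A reproducible 7:1:2 split of the 589 SICE sets could be produced along the following lines. This is only a sketch: the seed, the rounding rule (floor for training and validation, remainder to test), and the function name are assumptions, since the text does not specify how fractional counts are resolved.

```python
import random

def split_indices(n_items, ratios=(7, 1, 2), seed=0):
    """Shuffle indices with a fixed seed and cut them into train/val/test parts."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the split repeatable
    total = sum(ratios)
    n_train = n_items * ratios[0] // total  # floor; assumed rounding rule
    n_val = n_items * ratios[1] // total
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # remainder goes to the test part
    return train, val, test

train_ids, val_ids, test_ids = split_indices(589)
```

Under these assumptions the three parts are disjoint and together cover all 589 sets; a different rounding rule would shift one or two sets between parts.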
Sony-Total-Dark [
39] is a customized version of the Sony subset of the SID dataset [
54]. It contains 2697 short–long-exposure RAW image pairs. Following the commonly used setting, we convert RAW images to extremely dark sRGB inputs without gamma correction, as shown in the first row of
Figure 4, which significantly increases the difficulty in dark regions and amplifies sensor noise.
4.2. Experiment Settings
To ensure a fair comparison, we follow the mainstream low-light image enhancement (LLIE) evaluation protocols for training and testing, and use relatively matched cropping and training epoch settings for different datasets.
For LOLv1 and LOLv2-Real, we crop the training images into patches, set the batch size to 4, and train for 1500 epochs. For LOLv2-Synthetic, we use a batch size of 1, train for 500 epochs, and do not apply cropping. For SICE, we use patches for training, set the batch size to 10, and train for 1000 epochs. Testing is conducted on both SICE-Mix and SICE-Grad. For Sony-Total-Dark, we crop the training images into patches, set the batch size to 4, and train for 1000 epochs.
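The per-dataset settings above can be collected in a small configuration table, sketched below. The class and field names are hypothetical, the patch size is left unset because it is not given here, and the steps-per-epoch helper assumes the last partial batch is kept.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TrainConfig:
    batch_size: int
    epochs: int
    crop_patches: bool
    patch_size: Optional[int] = None  # not stated in the text; left unset

# Batch sizes, epochs, and cropping choices as described in the paragraph above.
CONFIGS = {
    "LOLv1": TrainConfig(batch_size=4, epochs=1500, crop_patches=True),
    "LOLv2-Real": TrainConfig(batch_size=4, epochs=1500, crop_patches=True),
    "LOLv2-Synthetic": TrainConfig(batch_size=1, epochs=500, crop_patches=False),
    "SICE": TrainConfig(batch_size=10, epochs=1000, crop_patches=True),
    "Sony-Total-Dark": TrainConfig(batch_size=4, epochs=1000, crop_patches=True),
}

# Training-pair counts for the LOL benchmarks, as listed in Section 4.1.
TRAIN_PAIRS = {"LOLv1": 485, "LOLv2-Real": 689, "LOLv2-Synthetic": 900}

def steps_per_epoch(n_samples, batch_size):
    """Number of optimizer steps per epoch if the last partial batch is kept (assumption)."""
    return math.ceil(n_samples / batch_size)

lol_steps = {
    name: steps_per_epoch(TRAIN_PAIRS[name], CONFIGS[name].batch_size)
    for name in TRAIN_PAIRS
}
```

Keeping the settings in one table makes it easier to check that each dataset is trained with the intended batch size and epoch count.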
In terms of implementation, we train the models based on PyTorch using the Adam optimizer [
55] (
). The initial learning rate is set to
, and we use the cosine annealing strategy [
56] to gradually decay it to
. All experiments are conducted on third-party cloud GPU HPC compute nodes equipped with Intel(R) Xeon(R) Platinum CPUs, 80 GB RAM, and Ubuntu 20.04. We train and evaluate HVIFormer using a single NVIDIA GeForce RTX 4090 (24 GB) or RTX 3090 (24 GB). To ensure reproducibility, we keep the software stack identical across GPU instances (CUDA 11.8, PyTorch 2.0.1) and use the same training protocol and fixed random seeds.
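For readers unfamiliar with cosine annealing, the schedule can be written in closed form as below. The learning-rate values in the example are placeholders chosen only for illustration; they are not the values used in our experiments, and the function name is hypothetical.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min):
    """Learning rate at `step`, decaying from lr_max to lr_min along a half cosine."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

# Placeholder values (assumptions, not the paper's settings).
LR_MAX, LR_MIN, TOTAL = 1e-4, 1e-7, 1000

lr_start = cosine_annealing_lr(0, TOTAL, LR_MAX, LR_MIN)
lr_mid = cosine_annealing_lr(TOTAL // 2, TOTAL, LR_MAX, LR_MIN)
lr_end = cosine_annealing_lr(TOTAL, TOTAL, LR_MAX, LR_MIN)
```

The rate starts at the maximum, passes through the midpoint of the two bounds halfway through training, and reaches the minimum at the final step, so the decay is slow at both ends and fastest in the middle.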
4.3. Evaluation Metrics
For paired datasets, we use PSNR and SSIM to measure image distortion and, additionally, LPIPS (with AlexNet as the feature extraction network) to evaluate perceptual similarity. PSNR and SSIM focus on pixel-level errors and structural consistency, while LPIPS emphasizes human perception, reflecting similarity in texture and semantics. In this way, we evaluate enhancement performance in terms of both fidelity and perceptual quality.
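As a small worked example of the distortion side of this evaluation, the sketch below computes PSNR in its standard form from the mean squared error, with `dynamic_range` playing the role of D. The flat-list image representation and the function name are simplifying assumptions.

```python
import math

def psnr(x, y, dynamic_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0:
        return math.inf  # identical images have no error
    return 10.0 * math.log10(dynamic_range ** 2 / mse)

# Two-pixel toy images in [0, 1]: each pixel differs by 0.1, so the MSE is 0.01.
toy_psnr = psnr([0.5, 0.5], [0.4, 0.6])
```

With an MSE of 0.01 and a dynamic range of 1, the result is about 20 dB; halving the per-pixel error would raise it by roughly 6 dB.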
For unpaired datasets, because ground truth images are unavailable, we use the no-reference quality metric NIQE to evaluate single-image enhancement results. Such metrics are usually more sensitive to natural image statistics, noise, and artifacts, and can indicate whether the enhanced result appears natural or whether over-enhancement and distortion are present to some extent.
4.4. Results Analysis
4.4.1. Quantitative Comparison on LOL Datasets
Table 3 presents the quantitative evaluation results and model complexity comparison on the LOLv1, LOLv2-Real, and LOLv2-Synthetic datasets. The results show that HVIFormer achieves the best PSNR and SSIM values across all three LOL benchmarks, and the LPIPS value is also the lowest, indicating that the method strikes a better balance between pixel-level accuracy and perceptual quality.
Specifically, compared to the second-best method, HVIFormer achieves a PSNR improvement of 2.78% on the LOLv1 dataset; a PSNR improvement of 4.52% on the LOLv2-Real dataset; and a PSNR improvement of 19.74%, SSIM improvement of 2.58%, and a 2.86% reduction in LPIPS on the LOLv2-Synthetic dataset.
4.4.2. Quantitative Comparison on Unpaired Datasets
Table 4 summarizes the NIQE results on five unpaired datasets: DICM, LIME, MEF, NPE, and VV. HVIFormer achieves the best average NIQE of 3.395, outperforming all other methods. Compared to the second-best method (AVG 3.457), it reduces the NIQE by 1.79%, indicating that, in the absence of real reference images, our method generates results that are more natural and better aligned with the distribution of real images.
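The relative reduction quoted above follows directly from the two average NIQE values. The short sketch below reproduces the arithmetic; the helper name is an assumption.

```python
def relative_change_percent(ours, baseline):
    """Signed relative change of `ours` with respect to `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

# Average NIQE of HVIFormer versus the second-best method (lower is better).
niqe_change = relative_change_percent(3.395, 3.457)
niqe_reduction = round(-niqe_change, 2)
```

The change is negative because NIQE decreases, and its magnitude rounds to the 1.79% reported in the text.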
4.4.3. Quantitative Comparison on Extreme Low-Light Datasets
To further validate the performance of HVIFormer under extreme low-light conditions,
Table 5 presents the evaluation results on the SICE and Sony-Total-Dark datasets. HVIFormer improves on both: on SICE, PSNR reaches 21.079 dB and SSIM is 0.765, a PSNR improvement of approximately 58.66% over the strongest baseline; on Sony-Total-Dark, PSNR is 24.234 dB and SSIM is 0.697, the best among the compared methods. These results suggest that under extremely low illumination, severe information loss, and more complex noise, HVIFormer avoids both under-enhancement (overall grayness, invisible details) and over-enhancement (overexposure, color bias, noise amplification) while recovering detail and preserving structure. They also point to the role of the dual-stage design in handling extreme low-light inputs.
4.4.4. Qualitative Comparison
The visual comparisons in
Figure 5 and
Figure 6 illustrate the advantage of HVIFormer across scenes of varying complexity. On paired datasets such as LOL, compared to traditional Retinex-based methods, HVIFormer shows better global exposure control: it brightens dark regions while avoiding common issues such as over-enhancement, halos, and visual artifacts caused by discontinuities in local brightness. On the more challenging unpaired datasets, HVIFormer suppresses dark-region noise and texture distortion more effectively than common end-to-end sRGB-space enhancement networks. It also removes noticeable color biases, such as the yellowish or greenish tints often introduced during enhancement, bringing the hue and saturation of the output closer to the natural distribution.
4.5. Ablation Experiment
The ablation experiment in
Table 6 shows the role of each primary component in the model. First, when only Stage-II is retained (Setting A), the model must handle both illumination consistency calibration and detail/color recovery, resulting in an LPIPS value of 0.115. This indicates that the lack of prior illumination optimization leads to over-enhancement. In contrast, when only Stage-I is used (Setting B), overall exposure and consistency improve to some extent, but texture and color restoration is limited because there is no dedicated detail recovery stage. These results support the decoupled design of Stage-I and Stage-II, in which the division of tasks prevents a single stage from being overloaded.
Furthermore, after removing the Intensity-Conditioned Block (Setting C), the model’s performance declines noticeably relative to the complete model (Setting G), indicating that Stage-I’s correction of the intensity map provides a more reliable illumination baseline. This not only improves the consistency of the image structure but also suppresses over-enhancement and color bias. When the Intensity-Conditioned Multi-Head Self-Attention is further removed (Setting D), the LPIPS value rises to 0.083, which supports the role of the intensity-guided adaptive fusion weights in suppressing dark-region noise and avoiding local overexposure. These weights control the enhancement amplitude and help balance local and overall brightness.
Moreover, after removing the Complementary Cross-Attention mechanism (Setting E), all metrics deteriorate, and the RGB two-stage variant without HVI space modeling (Setting F) performs even worse. This indicates that the two aspects work together: first, the bidirectional interaction of content and intensity in Stage-I makes the conditional information more reliable, providing a solid foundation for subsequent enhancement; second, the cross-branch interaction between the brightness and color branches in Stage-II strengthens detail recovery and color restoration.
As shown in
Figure 7, the enhancement results of HVIFormer are closest to the ground truth (GT), preserving image details and color consistency. In contrast, the outputs of other methods exhibit varying degrees of distortion and noise and deviate more noticeably from the GT. This supports the advantage of decoupled modeling and enhancement in the HVI space: HVIFormer optimizes illumination, details, and color separately and then fuses them through cross-module interactions, yielding more precise and natural low-light image enhancement.
4.6. Application Prospects and Cross-Domain Transferability
Many real-world vision systems require robust perception under poor illumination. The proposed intensity-guided Transformer pre-recovery stabilizes global exposure and suppresses severe noise, while the subsequent refinement stage restores fine structures and improves color fidelity; therefore, the framework is potentially transferable beyond low-light enhancement. Typical downstream scenarios include UAV-based construction inspection [
61] (e.g., rebar counting) and building façade analysis [
62], where challenging illumination is common. Future work will investigate cross-domain adaptation and evaluation on relevant public datasets, including a labelled UAV dataset for rebar counting inspection on construction sites and building façade datasets for analyzing building characteristics.
5. Conclusions
To improve the quality of low-light image enhancement, we propose a new two-stage deep learning framework built on the HVI color space. The framework offers an alternative for users who need high-quality low-light enhancement with a moderate number of model parameters. The ablation experiments show that each module of the proposed method contributes to enhancement quality. Comparative experiments show that HVIFormer outperforms the 10 compared state-of-the-art methods in visual quality and on the 10 quantitative indicators; in particular, the proposed method demonstrates greater stability in dark-region noise suppression and color restoration, avoiding common issues such as overexposure, color bias, and detail loss.
Despite achieving good enhancement results, our model has a relatively large number of parameters and a high computational cost, especially when processing high-resolution images, which may lead to longer processing times. Future work will therefore focus on reducing computational complexity through techniques such as model pruning, quantization, or knowledge distillation, to make the model more widely applicable in real-world scenarios. It is well known that transforming images between color spaces may cause information loss. The lack of an in-depth theoretical analysis of this issue is a limitation of our paper, and we believe it is a topic worth further study. Additionally, although this study focuses on static image enhancement, the framework could be extended to video enhancement. In video scenes, HVIFormer could leverage temporal information for cross-frame enhancement, which may improve the visual quality of individual frames and reduce the impact of motion blur, lighting changes, and other factors, leading to more stable and natural results. Beyond low-light enhancement, our framework may also benefit downstream vision applications under poor illumination (e.g., UAV-based construction inspection and building façade analysis), which we will explore in future work.