HVIFormer: A Dual-Stage Low-Light Image Enhancement Method Based on HVI Representation

Li, Yimei; Luo, Liuhong; Li, Hongjun

doi:10.3390/app16052450

by Yimei Li, Liuhong Luo * and Hongjun Li *

College of Science, Beijing Forestry University, Beijing 100083, China

* Authors to whom correspondence should be addressed.

Appl. Sci. 2026, 16(5), 2450; https://doi.org/10.3390/app16052450

Submission received: 8 February 2026 / Revised: 28 February 2026 / Accepted: 2 March 2026 / Published: 3 March 2026

Abstract

Low-light image enhancement improves the quality of video surveillance and image analysis and, as a result, has long been a hot topic in image processing. However, current research on this topic faces a difficult challenge: effectively suppressing noise while improving brightness and maintaining color consistency, especially in extremely dark scenes, where dark noise amplification, uneven exposure, and color shifts often interact, leading to detail loss and color distortion. To address this issue, we propose a dual-stage low-light enhancement framework based on the HVI (Horizontal/Vertical-Intensity) color space. The low-light image is first mapped to the HVI space, obtaining the intensity component I and the HVI-based feature map, with I being explicitly extracted as an intensity prior. A Transformer-based pre-recovery module is introduced for global dependency modeling, guided by the intensity prior I through an Intensity-Conditioned Block (ICB) for conditional feature interaction. Subsequently, a dual-branch enhancement network utilizes lightweight Complementary Cross-Attention (CCA) blocks for brightness refinement and color denoising. Finally, the enhanced image is remapped to the sRGB color space. The proposed framework decouples global brightness recovery and feature preprocessing from detail enhancement and color refinement, improving stability in extremely dark and high-noise scenarios. Through 18 quantitative and qualitative experiments, we demonstrate that our proposed method achieves superior performance in dark noise suppression and color restoration across multiple low-light datasets.

Keywords:

low-light image enhancement; HVI color space; intensity-guided attention; transformer

1. Introduction

Low-light images are commonly captured under insufficient illumination and are degraded by low ambient light, sensor noise, and uneven exposure, leading to missing details in dark regions, amplified noise, and color distortion [1]. Such images arise in nighttime surveillance, autonomous driving, robotics, mobile photography, and medical imaging, where degraded visibility severely hinders both visual quality and downstream vision tasks. Consequently, Low-Light Image Enhancement (LLIE) aims to recover perceptually pleasing images by improving brightness and contrast, restoring hidden details, correcting colors, and suppressing noise without introducing overexposure or color shifts.

Existing LLIE methods face two primary challenges. On the one hand, the adaptability of color spaces is limited. Conventional color spaces (e.g., HSV [2]) are not sufficiently robust in low-light conditions and often suffer from unstable color behavior. In particular, the hue-axis discontinuity and low-intensity noise can induce color artifacts and noticeable color shifts, degrading visual quality and downstream reliability. On the other hand, it is difficult to balance performance and efficiency in feature modeling. CNN-based models [3,4,5] are effective for local details but are less capable of modeling long-range dependencies, which may lead to non-uniform exposure and inconsistent color. Transformers [6] improve global consistency via long-range interactions, but their computational overhead can be substantial for high-resolution inputs, limiting practical deployment. It remains challenging to achieve a favorable trade-off among color fidelity, detail preservation, global consistency, and efficiency.

To address the two primary challenges mentioned above, this study proposes a dual-stage low-light image enhancement framework, HVIFormer, based on the HVI representation [7] (Horizontal/Vertical-Intensity). It leverages a trainable HVI representation to provide an explicit illumination prior while preserving chroma/structure cues. Unlike fixed-rule color spaces (e.g., HSV [2]), HVI is trainable and tailored for LLIE, allowing it to adapt to diverse brightness scales and color variations. Specifically, the input sRGB image is mapped into HVI and decomposed into a global intensity map I and an HVI feature map. The intensity map I serves as a unified reference for exposure calibration, while the HVI feature map retains detail and color cues. Based on this decomposition, HVIFormer adopts a two-stage coarse-to-fine design: Stage I performs intensity-guided Transformer pre-recovery using I as an illumination prior, and introduces an Intensity-Conditioned Block (ICB) to calibrate global exposure and suppress dominant noise. Stage II refines details and color based on the Stage-I result, using lightweight Complementary Cross-Attention (CCA) for efficient cross-branch fusion. This division of labor reduces over-exposure and color shift while preserving local fidelity. Extensive experiments on 10 paired/unpaired low-light datasets demonstrate robust performance, producing uniformly exposed images with rich details and natural colors.

Our contributions can be summarized as follows:

We propose a novel two-stage deep learning framework for low-light image enhancement based on the HVI color space representation. This framework significantly improves the visual quality of enhanced low-light images.
The first stage of our framework is a Transformer pre-recovery stage. In this stage, we introduce intensity prior conditions and adaptive mechanisms, combined with Intensity-Conditioned Blocks (ICBs), to significantly improve the stability of image enhancement.
The second stage performs image refinement and enhancement. In this stage, we introduce Complementary Cross-Attention (CCA), which effectively reduces excessive enhancement, unnatural color shifts, and dark noise artifacts.
We conduct comparative experiments on 10 datasets, qualitatively analyzing visual effects and quantitatively comparing 10 metrics. The results show that the overall performance of the proposed HVIFormer is superior to that of the compared state-of-the-art (SOTA) methods.

2. Related Work

The key literature related to this study can be summarized into three areas: low-light image enhancement, color spaces and decoupled representations, and vision Transformers.

2.1. Low-Light Image Enhancement

Early low-light enhancement methods mainly relied on traditional image processing techniques and prior assumptions, such as tone mapping [8,9], gamma transformation [10], histogram equalization [11,12,13], and illumination–reflection decomposition based on the Retinex theory [14,15,16,17,18,19,20,21]. While these methods are computationally efficient and interpretable, they fail to provide a unified model for handling complex degradations in real low-light scenarios, such as noise amplification, color distortion, and uneven exposure. As a result, they often lead to over-enhancement, loss of texture details, or unnatural color distortions in practical applications.

Deep learning methods overcome the limitations of traditional methods that rely on hand-crafted features by learning a complex non-linear mapping between low-light and normal-light images. This has significantly improved the quality, robustness, and generalization of low-light enhancement, making it the mainstream approach in the field. Several representative technical routes have emerged, covering various network architectures, including CNN [22], Generative Adversarial Networks (GAN) [23], Transformers [24], and hybrid architectures.

In 2017, LLNet [25] proposed by Lore et al. became a classic deep learning-based low-light enhancement tool. It constructed a deep network by stacking convolutional layers, integrating noise suppression and brightness enhancement, and was the first to demonstrate the potential of CNNs for low-light image processing. Liu et al. introduced RUAS [26], which incorporated an adaptive illumination-aware attention mechanism. This mechanism dynamically identifies dark and bright areas in the image, assigning differentiated enhancement weights to different lighting regions, thereby improving the brightness of dark regions while preventing overexposure in bright areas, further optimizing lighting uniformity. Xu et al. proposed SNR-Aware [6], which uses a hybrid CNN-Transformer collaborative mechanism. The CNN module captures local details and noise features, while the Transformer module models long-range dependencies through global self-attention. Additionally, a signal-to-noise ratio (SNR)-aware module dynamically adjusts the network’s enhancement strength to adapt to different noise levels in low-light scenarios.

Furthermore, MBLLEN, proposed by Li et al. [5], employs a multi-branch, multi-scale convolutional network architecture to provide dedicated enhancement paths for different brightness ranges in the image, enabling adaptive brightness adjustment and detail preservation. EnlightenGAN, proposed by Jiang et al. [23], leverages the advantages of GANs and employs an unsupervised training mode using unpaired data. The generator simulates the mapping from low-light to normal-light images, while the discriminator supervises the authenticity of the enhanced results, effectively addressing the issue of scarce paired training data and producing results closer to real-world visual experiences.

In recent years, with the rise of Transformer models in computer vision, RetinexFormer, introduced by Cai et al. [27], leverages the global modeling power of Transformers to accurately separate the image’s reflectance (details and color) from its illumination map (lighting information), optimizing each component separately and significantly improving color fidelity and detail restoration. LLFormer, proposed by Wang et al. [24], is specifically designed for low-light enhancement. It combines local window attention with global attention, reducing computational complexity while enhancing the model’s ability to capture dark region details and light distribution.

2.2. Color Spaces and Decoupled Representations

The essence of low-light enhancement tasks is not merely to improve brightness, but more fundamentally to ensure color accuracy and effectively suppress image noise. In addition to directly learning the mapping relationship in the traditional sRGB space, recent studies have also explored decoupled enhancement in more suitable color spaces. The primary idea of this approach is to leverage the luminance–chrominance decoupling characteristics of different color spaces to simplify the enhancement task.

The sRGB space is the most commonly used representation, highly compatible with display devices. However, its limitation lies in the tight coupling of luminance and color information in the red (R), green (G), and blue (B) channels [28,29]. As a result, under extremely low light conditions, even slight noise or exposure changes captured by the sensor can cause significant variations in the relative proportions of the three channels, leading to unnatural color shifts. This makes the entire enhancement process highly sensitive to noise and lighting changes. To address the coupling problem of sRGB, many studies have attempted to switch to decoupled spaces such as HSV and YCbCr [2,30,31,32,33].

The advantage of the HSV space lies in its ability to separate luminance from color information, where the value (V) component directly corresponds to brightness enhancement, greatly simplifying the brightness enhancement task. However, the main limitation of HSV for low-light enhancement is the discontinuity of the hue axis and stability issues at extremely low brightness levels. Hue is a periodic variable, causing the same color (e.g., red) to appear “disjointed” at the two ends of the hue axis ($0^{\circ}$ and $360^{\circ}$), resulting in discontinuity in distance-based calculations. Even more critically, at very low brightness, where the value (V) is close to zero, color information is easily dominated and distorted by noise, leading to visual artifacts such as dark color noise and black-plane noise, ultimately making color information highly unstable.
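The hue-axis discontinuity can be illustrated with a small numerical sketch. The two hue values below are hypothetical examples of nearly identical reds; the helper names are illustrative only. The second distance maps each hue onto the unit circle with cosine and sine, the same kind of polar mapping that the HVI representation uses for its horizontal and vertical components.

```python
import math

def hue_distance(h1, h2):
    """Naive absolute difference between two hue values in [0, 1]."""
    return abs(h1 - h2)

def polar_distance(h1, h2):
    """Euclidean distance after mapping each hue to (cos, sin) on the unit circle."""
    x1, y1 = math.cos(2 * math.pi * h1), math.sin(2 * math.pi * h1)
    x2, y2 = math.cos(2 * math.pi * h2), math.sin(2 * math.pi * h2)
    return math.hypot(x1 - x2, y1 - y2)

# Two nearly identical reds that sit at opposite ends of the hue axis
# (about 3.6 degrees and 356.4 degrees). These values are illustrative.
red_a, red_b = 0.01, 0.99

naive = hue_distance(red_a, red_b)    # large, although the colors look alike
polar = polar_distance(red_a, red_b)  # small, matching visual similarity
print(f"naive hue distance: {naive:.3f}, polar distance: {polar:.3f}")
```

On the raw hue axis the two reds appear almost maximally distant, whereas on the unit circle they are close neighbors.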

The YCbCr space is also a commonly used decoupled space [2,33]. Its advantage lies in the separation of luminance (Y) and chrominance (Cb, Cr), which facilitates independent processing. Originally designed for video compression and transmission, it has advantages in data compression. However, its limitation is that the Y component is not fully linearly related to human visual perception of brightness, and it lacks explicit modeling of hue information, thus having limited capability for complex color adjustments and fine color distortion correction.

Horizontal/Vertical-Intensity (HVI) [7] is a trainable color space that, through polarization mapping, reduces the Euclidean distance between similar colors that lie far apart in the HSV space, thereby suppressing the color noise caused by the hue discontinuity. HVI also introduces a learnable adaptive collapse mechanism specifically designed for low-intensity areas, significantly enhancing the color stability of dark regions. Experimental results show that the HVI representation can significantly improve image enhancement performance.

2.3. Vision Transformer

In recent years, the Transformer [34] has been widely applied to vision tasks (including image classification [35,36], semantic segmentation [37,38], and low-light enhancement) due to its powerful ability to model global dependencies. Unlike traditional convolutional neural networks (CNNs), which rely on fixed receptive fields, Transformers can dynamically capture long-range contextual information across the entire image when computing the features of the current pixel through their self-attention mechanism. This characteristic gives them a significant advantage in image restoration tasks: they enable feature interactions across distant pixel regions and use global context to guide local recovery, which is especially helpful in restoring large structural and blurred areas.

In LLIE tasks, Transformers can also better leverage reliable semantic information from bright regions to guide the recovery of dark regions with high noise and low visibility, thereby improving overall visual consistency. However, there are some notable limitations when applying Transformers to image restoration tasks. The computational complexity of the standard self-attention mechanism scales quadratically with the input image size, making Transformers require enormous computational resources and memory, which can become a bottleneck in practical deployment. Furthermore, while Transformers are adept at global modeling, they are less precise at capturing local high-frequency details compared to CNNs. In detail-recovery tasks, this may lead to issues such as local blurring or insufficient sharpening in the restored image, often requiring the integration of CNN structures to compensate for this limitation [6].
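The quadratic growth mentioned above can be made concrete with a short calculation. This is a minimal sketch that assumes every pixel (or every `patch` × `patch` block, a hypothetical parameter) is one token and counts the pairwise attention scores of a single head; it ignores any windowing or other efficiency techniques.

```python
def attention_matrix_entries(height, width, patch=1):
    """Return (tokens, pairwise scores) for one head of full self-attention.

    Each patch x patch block of pixels is treated as one token (an assumption).
    """
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens

for side in (64, 128, 256):
    tokens, entries = attention_matrix_entries(side, side)
    print(f"{side}x{side}: {tokens} tokens, {entries} attention scores per head")
```

Doubling each side of the image quadruples the number of tokens and multiplies the number of attention scores by sixteen.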

More importantly, when handling degradations such as low-light conditions, which exhibit spatial non-uniformities, if no explicit prior knowledge of factors like illumination or noise is incorporated into the conditional control, Transformers may apply global interactions uniformly across all regions. This could result in overenhancement or noise amplification, leading to instability. Therefore, current research in Transformer-based image restoration focuses on maintaining its global modeling capability while reducing computational complexity and effectively incorporating local details and scene priors.

For convenient comparison with our method, we list the features of several typical low-light enhancement methods in Table 1. From the table, it can be seen that our method is different from those existing methods in both model technology and color space. Therefore, our method can serve as a new exploration and an alternative method in the application of low-light image enhancement technology.

3. Method

The primary challenge in low-light enhancement lies in the need to significantly brighten dark regions to restore visible structure; however, this brightening simultaneously amplifies noise and causes color distortion. At the same time, uneven local exposure results in different brightness and color distributions for the same object in different regions, making it difficult for models that rely solely on local operations to maintain global consistency. To address this, this study proposes a dual-stage enhancement framework in the HVI domain (Figure 1), consisting of Stage-I intensity-conditioned Transformer pre-recovery and Stage-II dual-branch refinement enhancement. By decoupling global brightness recovery and feature preprocessing from image detail enhancement and color refinement, the proposed framework improves the model’s stability in extremely dark and high-noise scenarios.

The overall approach of our method is as follows: Given an input low-light image $L \in \mathbb{R}^{H \times W \times 3}$, HVIFormer first maps it to the HVI space, explicitly decomposing it into the intensity map $I \in \mathbb{R}^{H \times W \times 1}$ and the HVI map $X_{hvi} \in \mathbb{R}^{H \times W \times 3}$ containing color and structural information. This color space transformation effectively decouples illumination information from color and structural information.

Stage-I: Intensity-Conditioned Transformer Pre-Recovery Stage. In this stage, the intensity component is introduced as an explicit illumination prior for intensity-conditioned feature interaction. Through element-wise scaling, the intensity features dynamically guide and adjust the attention mechanism to aggregate features, enabling feature interactions in different brightness regions to adopt different strategies. In dark regions, the HVI feature map integrates reliable contextual information more effectively to recover the image structure and suppress noise amplification. In relatively bright regions, over-enhancement is avoided to preserve the original color and details of the image. By leveraging the Transformer architecture, Stage-I can effectively model global dependencies, ensuring global lighting consistency and optimizing detail recovery when processing regions with varying brightness.

Furthermore, Stage-I introduces a bidirectional interaction update mechanism between content (HVI feature) and intensity (intensity feature). In this mechanism, the intensity information first serves as a guide, using the Intensity-Conditioned Block (ICB) to control and direct the update of the image content features (such as color and structure). More importantly, the content features, in turn, adjust the intensity, making it not a fixed condition but one that can be continuously optimized based on the actual structure and object information in the image. This significantly improves the stability and consistency of the algorithm when handling real-world scenarios with issues such as local exposure inconsistency and complex noise. Through its self-attention mechanism, the Transformer effectively captures long-range dependencies between different regions of the image, providing stronger support for global recovery and detail processing of the image.

Stage-I finally outputs two results: the repaired intensity component $I^{'} \in \mathbb{R}^{H \times W \times 1}$ and the denoised HVI feature $X_{hvi}^{'} \in \mathbb{R}^{H \times W \times 3}$, providing a more reliable and consistent input for Stage-II.

Stage-II: Dual-Branch Refinement Enhancer. In this stage, based on the input provided by Stage-I, a dual-branch U-Net architecture with six lightweight Complementary Cross-Attention (CCA) blocks is used to further refine the image’s brightness, recover details, and stabilize the color.

Finally, through the Perceptual Inverse HVI Transform (PHVIT), the enhanced HVI features are stably mapped back to the sRGB space, outputting the final enhanced image $\hat{L}$.

The primary logic of the dual-stage design is as follows: The first stage serves as a pre-recovery phase, focusing on unifying the overall brightness of the image and performing initial noise and degradation suppression in dark regions, thereby reducing the learning difficulty for the subsequent network. The second stage focuses on fine-grained processing to recover local details and ensure color stability and consistency. This dual-stage structure effectively decouples the tasks, avoiding conflicts that arise when a single network is tasked with handling both global brightness adjustment and local detail enhancement, thus preventing unstable artifacts in the final result.

Algorithm 1 outlines the main process of the proposed HVIFormer in the HVI domain, from coarse to fine enhancement. This process follows a closed-loop design of forward color-space transformation, dual-stage optimization, and inverse transformation, adhering to the coarse-to-fine enhancement logic and jointly optimizing brightness consistency, detail integrity, and color naturalness. Table 2 summarizes the modules of HVIFormer, their inputs/outputs, and their roles in improving brightness consistency, noise suppression, and color stability.

Algorithm 1 Proposed HVI-domain coarse-to-fine pipeline.

Require: Low-light sRGB image $L$
Ensure: Enhanced sRGB image $\hat{L}$
1: $(X_{hvi}, I) \leftarrow \mathrm{HVIT}(L)$ ▹ sRGB→HVI
2: $(I^{'}, X_{hvi}^{'}) \leftarrow P(X_{hvi}, I)$ ▹ Stage-I intensity-conditioned Transformer pre-recovery
3: $\tilde{L}_{hvi} \leftarrow E(I^{'}, X_{hvi}^{'})$ ▹ Stage-II dual-branch refinement enhancement
4: $\hat{L} \leftarrow \mathrm{PHVIT}(\tilde{L}_{hvi})$ ▹ HVI→sRGB
5: return $\hat{L}$
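The data flow of Algorithm 1 can be sketched as a plain function composition. This is only an illustration of the call order and argument order; the four callables and the toy stand-ins below are hypothetical placeholders, not the learned modules themselves.

```python
def hviformer_pipeline(low_light, hvit, pre_recovery, enhancer, phvit):
    """Compose the four steps of Algorithm 1.

    All four callables are hypothetical placeholders for the real modules:
      hvit(L)               -> (x_hvi, intensity)     sRGB -> HVI
      pre_recovery(x, i)    -> (i_rec, x_rec)         Stage-I
      enhancer(i_rec, x_rec)-> enhanced HVI map       Stage-II
      phvit(hvi_map)        -> enhanced sRGB image    HVI -> sRGB
    """
    x_hvi, intensity = hvit(low_light)
    intensity_rec, x_hvi_rec = pre_recovery(x_hvi, intensity)
    enhanced_hvi = enhancer(intensity_rec, x_hvi_rec)
    return phvit(enhanced_hvi)

# Toy stand-ins that only record the call order; they perform no enhancement.
calls = []

def toy_hvit(image):
    calls.append("HVIT")
    return "x_hvi", "I"

def toy_pre_recovery(x_hvi, intensity):
    calls.append("Stage-I")
    return intensity + "'", x_hvi + "'"

def toy_enhancer(intensity, x_hvi):
    calls.append("Stage-II")
    return f"E({intensity}, {x_hvi})"

def toy_phvit(hvi_map):
    calls.append("PHVIT")
    return f"PHVIT({hvi_map})"

result = hviformer_pipeline("L", toy_hvit, toy_pre_recovery, toy_enhancer, toy_phvit)
print(result, calls)
```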

3.1. HVI Representation of the Image

To ensure the completeness of the method description, we first list the HVI representation formula of an image based on the work by [39]. The trainable HVI color space transformation method begins by calculating the intensity map of the image using the maximum sRGB channel value:

$$ I_{\max} = \max_{c \in \{R, G, B\}} (I_{c}), \tag{1} $$

thus extracting the scene’s brightness information. Then, using the intensity map and the original image, an HV color map is generated that combines color and structural information. The horizontal and vertical components are computed as follows:

$$ \hat{H} = C_{k} \odot S \odot D_{T} \odot h, \qquad \hat{V} = C_{k} \odot S \odot D_{T} \odot v, \tag{2} $$

where $\odot$ denotes element-wise multiplication and $C_{k}$ is a low-intensity color plane density adjusted by the trainable density parameter $k \in \mathbb{Q}^{+}$, given by the following formula:

$$ C_{k} = \sqrt[k]{\sin\left(\frac{\pi I_{\max}}{2}\right) + \epsilon}. \tag{3} $$

Here, $S$ represents color saturation, and $D_{T}$ is the saturation adjusted by the trainable function $T(x)$, with $D_{T} = T(P_{\gamma})$. The horizontal and vertical components $h = \cos(2 \pi P_{\gamma})$ and $v = \sin(2 \pi P_{\gamma})$ are computed using the mapped hue value $P_{\gamma}$. The hue mapping $P_{\gamma}$ is defined as follows:

$$ P_{\gamma} = \begin{cases} 3 \gamma_{G} H, & \text{if } 0 \leq H < \frac{1}{3}, \\ 3 (\gamma_{B} - \gamma_{G}) \left(H - \frac{1}{3}\right) + \gamma_{G}, & \text{if } \frac{1}{3} \leq H < \frac{2}{3}, \\ 3 (1 - \gamma_{B}) (H - 1) + 1, & \text{if } \frac{2}{3} \leq H \leq 1, \end{cases} \tag{4} $$

where $\gamma_{G}, \gamma_{B} \in (0, 1)$, and $H \in [0, 1]$ represents the hue value. The adaptive piecewise-linear color perception mapping $P_{\gamma}$ adjusts the hue distribution to mitigate the color distortion introduced by cameras in low-light environments. This representation avoids the hue discontinuity and black-plane issues of the traditional HSV color space, making it better suited to low-light image enhancement.
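Equations (1) to (4) can be traced for a single pixel with the following minimal sketch. Several choices here are assumptions rather than details from the text: the hue $H$ and saturation $S$ are taken from Python's `colorsys.rgb_to_hsv`, the trainable function $T$ is replaced by a constant 1, and the defaults `k=1`, `gamma_g=1/3`, `gamma_b=2/3` (which make $P_{\gamma}$ the identity) and `eps=1e-8` are illustrative values, not trained parameters.

```python
import colorsys
import math

def hue_mapping(hue, gamma_g, gamma_b):
    """Piecewise-linear hue mapping P_gamma of Eq. (4); hue is in [0, 1]."""
    if hue < 1 / 3:
        return 3 * gamma_g * hue
    if hue < 2 / 3:
        return 3 * (gamma_b - gamma_g) * (hue - 1 / 3) + gamma_g
    return 3 * (1 - gamma_b) * (hue - 1) + 1

def rgb_to_hvi(r, g, b, k=1.0, gamma_g=1 / 3, gamma_b=2 / 3,
               saturation_fn=lambda p: 1.0, eps=1e-8):
    """Map one sRGB pixel (channels in [0, 1]) to (H_hat, V_hat, I_max).

    saturation_fn stands in for the trainable function T (assumed constant).
    """
    i_max = max(r, g, b)                                    # Eq. (1)
    hue, sat, _ = colorsys.rgb_to_hsv(r, g, b)              # HSV hue, saturation
    c_k = (math.sin(math.pi * i_max / 2) + eps) ** (1 / k)  # Eq. (3)
    p = hue_mapping(hue, gamma_g, gamma_b)                  # Eq. (4)
    d_t = saturation_fn(p)                                  # D_T = T(P_gamma)
    h_hat = c_k * sat * d_t * math.cos(2 * math.pi * p)     # Eq. (2)
    v_hat = c_k * sat * d_t * math.sin(2 * math.pi * p)
    return h_hat, v_hat, i_max

bright_red = rgb_to_hvi(1.0, 0.0, 0.0)
near_red = rgb_to_hvi(1.0, 0.0, 0.02)   # hue just below 1 on the HSV axis
dark_red = rgb_to_hvi(0.01, 0.0, 0.0)   # same hue, very low intensity
print(bright_red, near_red, dark_red)
```

In this sketch, the two reds on either side of the hue wrap-around land close together in the $(\hat{H}, \hat{V})$ plane, and the dark pixel is pulled towards the origin by the small $C_{k}$, which is how the representation sidesteps the hue discontinuity and black-plane issues described above.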

3.2. Stage-I Intensity-Conditioned Transformer Pre-Recovery

In extremely dark scenes, image degradation is not merely a result of underexposure. In fact, the issue involves a complex interplay among artifacts, including uneven brightness distribution, increased noise in dark areas, and severe color distortion.

If the image converted to the HVI space is directly fed into the subsequent enhancement network for processing, the network must solve two problems simultaneously: correcting global illumination inconsistencies, and restoring local details and color information. This coupling of tasks makes training more difficult and can lead to unstable phenomena, such as local overexposure, color bias, and artifacts in dark areas.

Therefore, an intensity-conditioned pre-recovery phase is introduced before performing refinement enhancement. The goal of this phase is not to complete the final enhancement, but rather to use intensity information to guide the preprocessing and normalization of the input features. Through this process, the subsequent enhancement network can focus more on restoring image details and optimizing color, thereby avoiding the aforementioned problems.

In the HVI representation, we extract the intensity component I, which clearly reflects the image’s lighting conditions. I is strongly associated with exposure levels and thus helps us distinguish brightness differences in various regions of the image. Additionally, when noise in the dark areas of the image is high, I provides a natural guide, helping us determine which areas require stronger recovery processing and which areas should avoid excessive enhancement.

Based on this, we design Stage-I as an intensity-based feature recovery phase rather than relying on the network to infer illumination information on its own. By leveraging the illumination information provided by I, we can process the image more precisely, ensuring a more stable and reasonable recovery.

3.2.1. Multi-Scale U-Shaped Pre-Recovery Module with Dual Output

In low-light images, image degradation is not only reflected in overall underexposure but also includes loss of details in local regions and increased noise. To address these issues, Stage-I adopts a multi-scale encoder–decoder structure (Figure 1 (Stage-I)). This structure expands the receptive field through the encoder, enabling global consistency, while the decoder restores spatial details via skip connections, preventing excessive smoothing and ensuring image details are not blurred.

The encoder part of Stage-I contains two levels of downsampling, forming feature maps at three resolutions ($H \times W$, $H/2 \times W/2$, and $H/4 \times W/4$). At each resolution, several Intensity-Conditioned Blocks (ICBs) are used to update the features, followed by downsampling through strided convolutions to enlarge the receptive field and increase channel capacity. Specifically, we map the input image from the standard sRGB color space to the HVI color space, obtaining its representation $X_{hvi}$ while explicitly extracting the intensity map $I$, and apply a convolution to both $X_{hvi}$ and $I$. First, a $3 \times 3$ convolution layer maps them to the same channel dimension $C$, resulting in two feature representations:

$$ F_{hvi} = \phi(X_{hvi}) \in \mathbb{R}^{H \times W \times C}, \qquad F_{I} = \phi_{I}(I) \in \mathbb{R}^{H \times W \times C}, \tag{5} $$

where $F_{hvi}$ represents the higher-level feature map extracted through convolution, and $F_{I}$ is the intensity feature extracted from the intensity map. Since both have the same spatial resolution and channel dimension $C$, they can be passed directly to the subsequent conditional interaction modeling. On top of these features, an ICB module is applied, followed by a $4 \times 4$ convolution for downsampling, which increases the channel dimension from $C$ to $2C$. The second encoder level stacks two ICBs at the $H/2 \times W/2$ resolution and applies the same convolution operation to reduce the resolution to $H/4 \times W/4$, increasing the channel dimension from $2C$ to $4C$. At the lowest resolution ($H/4 \times W/4$), we further stack two ICBs as a bottleneck module to better model global dependencies and restore consistency between regions.

The decoder is designed symmetrically to the encoder and includes two levels of upsampling. Starting from $H/4 \times W/4$, it progressively upsamples through $2 \times 2$ transposed convolutions (stride 2), with skip connections to fuse features from the corresponding encoder level at each scale. The number of channels gradually decreases from $4C$ to $C$. Finally, a $3 \times 3$ convolution is applied to obtain $(I^{'}, X_{hvi}^{'})$.
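The resolutions and channel widths described for the encoder and decoder can be checked with simple shape bookkeeping. This sketch assumes stride 2 and padding 1 for the $4 \times 4$ downsampling convolution and no padding for the $2 \times 2$ transposed convolution; the text only states the kernel sizes and the stride of the transposed convolution. It also ignores the temporary channel increase that a concatenating skip connection would cause.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a standard convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def tconv_out(size, kernel, stride, padding=0):
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel

def stage1_shapes(height, width, channels):
    """Trace (name, H, W, C) through the two-level U-shaped module."""
    h, w, c = height, width, channels
    shapes = [("embed 3x3", h, w, c)]
    for level in (1, 2):   # encoder: 4x4 conv, stride 2, padding 1 (assumed)
        h, w, c = conv_out(h, 4, 2, 1), conv_out(w, 4, 2, 1), 2 * c
        shapes.append((f"down {level}", h, w, c))
    for level in (1, 2):   # decoder: 2x2 transposed conv, stride 2
        h, w, c = tconv_out(h, 2, 2), tconv_out(w, 2, 2), c // 2
        shapes.append((f"up {level}", h, w, c))
    return shapes

for name, h, w, c in stage1_shapes(256, 256, 32):
    print(f"{name:10s} {h:4d} x {w:4d} x {c}")
```

With a hypothetical $256 \times 256$ input and $C = 32$, the trace goes from $256 \times 256 \times 32$ down to $64 \times 64 \times 128$ at the bottleneck and back to $256 \times 256 \times 32$.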

Rather than directly outputting the final enhanced image, Stage-I is designed as a pre-recovery module at the feature level that generates two types of information. First, we predict the residual of the HVI feature map, $\Delta_{hvi}$, and obtain the adjusted $X_{hvi}^{'}$ through a residual update, which makes the HVI features more consistent with real lighting conditions. Second, we predict the residual of the intensity map, $\Delta I$, and add it to the original intensity map $I$ to obtain the corrected intensity map $I^{'}$, which provides more reliable guidance for subsequent enhancement.

Thus, the output of Stage-I is $(I^{'}, X_{hvi}^{'})$, which serves as input to Stage-II. This design separates the tasks of correcting brightness consistency and optimizing content features, avoiding task coupling between the two. In this way, Stage-I provides more stable and controllable input conditions for subsequent image refinement and enhancement, thereby ensuring the robustness and reliability of the entire process.
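Written out, the two residual updates of Stage-I would take the following form. Only the intensity update is stated explicitly as an addition; the additive form for $X_{hvi}^{'}$ is an assumption inferred from the phrase "residual update".

```latex
\begin{aligned}
X_{hvi}^{'} &= X_{hvi} + \Delta_{hvi}, \\
I^{'} &= I + \Delta I.
\end{aligned}
```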

3.2.2. Cross-Branch Interaction Update Between Content and Intensity

Although $F_{I}$ provides an effective illumination-condition signal, the intensity features may become unreliable in extremely dark regions because of noise or exposure inconsistencies, causing the conditional signal to fail. In contrast, $F_{hvi}$ usually contains more stable structural and texture information, which can help correct the bias in the intensity map. To overcome this issue, we introduce a bidirectional interaction update mechanism between HVI features and intensity features in the multi-scale U-shaped structure: on the one hand, $F_{I}$ is used to guide the update of $F_{hvi}$; on the other hand, $F_{hvi}$ in turn corrects $F_{I}$, allowing the intensity map to adjust adaptively under the guidance of structural and texture information rather than remaining fixed.

Specifically, at each scale, $F_{hvi}$ and $F_{I}$ are iteratively updated through the same interaction module (ICB, Figure 2). In addition to the update of $F_{hvi}$ guided by $F_{I}$, we also perform the process of $F_{hvi}$ correcting $F_{I}$ (e.g., $F_{I} \leftarrow B(F_{I}, F_{hvi})$), which enhances the intensity map’s ability to perceive structure and texture, improving its stability.

In the $t$-th iteration of the interaction, we use the same basic block $B(\cdot)$ to update these two features:

$$ F_{hvi}^{t+1} = B(F_{hvi}^{t}, F_{I}^{t}), \qquad F_{I}^{t+1} = B(F_{I}^{t}, F_{hvi}^{t}), \tag{6} $$

where $B(\cdot)$ is the ICB, a module consisting of Intensity-Conditioned Multi-Head Self-Attention (IC-MHSA) and a Feed-Forward Network (FFN). This cross-branch interaction update mechanism continuously corrects the intensity map based on structural and texture information, so that even regions with large illumination differences in real low-light scenes receive more reliable enhancement, thereby improving the stability of the enhancement effect.
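The update rule of Eq. (6) uses the features of iteration $t$ on both right-hand sides, so the two branches are updated simultaneously rather than one after the other. The sketch below illustrates this with a hypothetical toy block in place of the ICB; the blending rule and the value 0.25 are made up purely for demonstration.

```python
def interact(features_hvi, features_i, block, steps):
    """Apply the update of Eq. (6) for a given number of iterations.

    Both right-hand sides are evaluated from the features of iteration t
    before either variable is overwritten (simultaneous update).
    """
    for _ in range(steps):
        features_hvi, features_i = (
            block(features_hvi, features_i),
            block(features_i, features_hvi),
        )
    return features_hvi, features_i

def toy_block(target, condition):
    """Hypothetical stand-in for the ICB: move target 25% towards condition."""
    return [t + 0.25 * (c - t) for t, c in zip(target, condition)]

f_hvi, f_i = interact([1.0, 0.0], [0.0, 1.0], toy_block, steps=1)
print(f_hvi, f_i)
```

If the second update were computed from the already updated $F_{hvi}^{t+1}$ instead, the result would differ, which is why the tuple assignment evaluates both calls first.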

3.2.3. Intensity-Conditioned Multi-Head Self-Attention

Specifically, we first reshape the input HVI image feature

X_{hvi} \in R^{H \times W \times C}

into a feature tensor

X \in R^{H \times W \times C}

. Next, we adopt h independent attention heads, where each attention head has a feature dimension of d, satisfying the channel dimension constraint

C = h \cdot d

.

We first obtain the queries (Q), keys (K), and values (V) via linear projections (Figure 2b):

Q = X W_{Q}, K = X W_{K}, V = X W_{V},

(7)

where

W_{Q}, W_{K}, W_{V} \in R^{C \times C}

are learnable linear projection matrices, so

Q, K, V \in R^{H \times W \times C}

.

Then, we split Q, K, and V into h groups according to the number of heads:

Q = {Q^{(i)}}_{i = 1}^{h}, K = {K^{(i)}}_{i = 1}^{h}, V = {V^{(i)}}_{i = 1}^{h},

(8)

where each

Q^{(i)}, K^{(i)}, V^{(i)} \in R^{H \times W \times d}

,

(i = 1, 2, \dots, h)

.

Next, we generate fusion weights G aligned with the HVI image features based on the previously extracted intensity features

F_{I}

, where the weights adapt the update magnitude at each spatial location and channel to the local intensity. Concretely, we re-weight the value features of the attention mechanism element-wise:

V^{'} = V ⊙ G, G = g (F_{I}),

(9)

where ⊙ denotes element-wise multiplication, and

g (\cdot)

represents a function that processes the intensity feature

F_{I}

, generating the fusion weight G.

This strategy can be understood as adaptive fusion guided by intensity information: in low-light regions, G tends to amplify the contribution of effective context, helping to restore structural details in the image and suppress noise diffusion; in brighter regions, G suppresses excessive updates to avoid over-enhancement, maintaining the naturalness and consistency of color and texture. To achieve this, we do not impose explicit range constraints (such as

[0, 1]

) on G, but instead allow the model to learn the appropriate scaling magnitude and direction through end-to-end training, enabling it to dynamically adjust its enhancement strategy in different environments.

We reshape and split the previously generated fusion weight

G \in R^{H \times W \times C}

into h parts to obtain

{G^{(i)}}_{i = 1}^{h}

, where each

G^{(i)} \in R^{H \times W \times d}

.

We then perform conditional scaling of the V features for each head:

V^{' (i)} = V^{(i)} ⊙ G^{(i)} .

(10)

Next, each head computes attention independently:

O^{(i)} = Softmax (\frac{Q^{(i)} K^{(i) ⊤}}{\sqrt{d}}) V^{' (i)}, O^{(i)} \in R^{H \times W \times d} .

(11)

Then, we concatenate the outputs of all heads and obtain the final result via a linear mapping:

Attn (X; G) = Concat (O^{(1)}, \dots, O^{(h)}) W_{O}, W_{O} \in R^{C \times C} .

(12)

Finally, 2D positional encoding is incorporated to preserve spatial location information, which yields the output of the intensity-conditioned multi-head self-attention.

Unlike static concatenation of intensity-guided features with HVI-space image features, this intensity-prior-based adaptive fusion mechanism recalculates the fusion weights G during each global aggregation. This allows us to explicitly model relationships across different lighting regions: in dark areas, the model relies more on effective contextual information to help restore details and suppress noise; in bright areas, the model updates more conservatively to prevent over-enhancement or color distortion.
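Equations (7) to (12) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the H × W grid is assumed to be flattened into N = H·W tokens, the function g(·) is assumed to be a single linear projection `w_g` (the text does not specify its form), all weights are random, and the 2D positional encoding is omitted.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def ic_mhsa(x, f_i, w_q, w_k, w_v, w_g, w_o, heads):
    """x, f_i: (N, C) token matrices; every weight matrix: (C, C)."""
    n, c = x.shape
    d = c // heads                        # per-head width, C = h * d
    q, k, v = x @ w_q, x @ w_k, x @ w_v   # Equation (7)
    g = f_i @ w_g                         # G = g(F_I); linear g is an assumption
    v = v * g                             # Equations (9)-(10): unbounded gate on V

    def split(t):                         # (N, C) -> (heads, N, d), Equation (8)
        return t.reshape(n, heads, d).transpose(1, 0, 2)

    qh, kh, vh = split(q), split(k), split(v)
    attn = softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(d))  # (heads, N, N)
    out = attn @ vh                                          # Equation (11)
    out = out.transpose(1, 0, 2).reshape(n, c)               # concatenate heads
    return out @ w_o                                         # Equation (12)


rng = np.random.default_rng(0)
H, W, C, heads = 4, 4, 8, 2
x = rng.standard_normal((H * W, C))
f_i = rng.standard_normal((H * W, C))
w_q, w_k, w_v, w_g, w_o = (0.1 * rng.standard_normal((C, C)) for _ in range(5))
y = ic_mhsa(x, f_i, w_q, w_k, w_v, w_g, w_o, heads)
```

Because the gate is element-wise, multiplying V by G before splitting into heads is equivalent to the per-head form of Equation (10). The gate acts only on the values, so the attention map itself is unchanged by the intensity features.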

3.3. Dual-Branch Refinement Enhancement

After the intensity-conditioned Transformer pre-recovery in Stage-I, its output is fed into Stage-II as a more stable input condition for refinement (Figure 1 (Stage-II)). Stage-II adopts a dual-branch structure that processes intensity and color-structure information separately: the I branch learns the brightness mapping to avoid underexposure or overexposure, while the HV branch focuses on dark-region denoising and color stabilization to suppress color bias and noise textures. The two branches exchange complementary information, so that brightness enhancement, color correction, and detail recovery are jointly optimized. To strengthen this exchange, we adopt a lightweight Complementary Cross-Attention (CCA) module (Figure 3). CCA learns the complementary information between the HV branch and the I branch through the Cross-Attention Block (CAB): the HV branch handles HVI features, the I branch processes intensity features, and the CAB establishes a mutually guiding relationship between them.

The CAB has a symmetrical structure between the I-way and the HV-way [39]. We use the HV-branch as an example to describe the details.

Y_{H V} \in R^{H \times W \times C}

represents the input of the HV-branch. The CAB first derives the query (Q) by

Q = W^{(Q)} Y_{H V}

. Meanwhile, the CAB splits the key (K) and value (V) by

K = W^{(K)} Y_{H V}

and

V = W^{(V)} Y_{H V}

.

W^{(Q)}

,

W^{(K)}

, and

W^{(V)}

represent the feature embedding convolution layers. The output of the CAB can be expressed as follows:

{\tilde{Y}}_{H V} = W (V \otimes Softmax (Q \otimes K / α_{H V}) + Y_{H V})

(13)

where

α_{H V}

is the multi-head factor [36], and

W (\cdot)

denotes the feature embedding convolutions.

Based on Retinex theory, the color denoise layer (CDL) decomposes the updated feature tensor

{\tilde{Y}}_{H V}

into illumination and reflectance components, which are obtained through the feature embedding convolution layers

W^{(I)}

and

W^{(R)}

, respectively, i.e.,

Y_{I} = W^{(I)} {\tilde{Y}}_{H V}

(illumination component) and

Y_{R} = W^{(R)} {\tilde{Y}}_{H V}

(reflectance component). Based on these two components, the CDL is defined as follows:

{\hat{Y}}_{H V} = W_{D} ((tanh (W_{D} Y_{I}) + Y_{I}) ⊙ (tanh (W_{D} Y_{R}) + Y_{R}))

(14)

where ⊙ denotes element-wise multiplication, and

W_{D}

represents the depth-wise convolution layers. Finally, a residual connection is added to the output of the CDL to mitigate the vanishing gradient problem in deep network training and simplify the model training process.

Unlike directly using the original HVI input, we use the intensity map

I^{'}

corrected by Stage-I as a more reliable guidance signal and input it along with the Stage-I enhanced HVI image features

X_{hvi}^{'}

into the dual-branch enhancer. Since Stage-I has already performed global illumination consistency correction and preliminarily suppressed degradation, Stage-II can focus more on local detail recovery and color refinement, thus reducing problems such as overexposure, dark region noise amplification, and color distortion. Specifically, in Stage-II, we input the output

(I^{'}, X_{hvi}^{'})

from Stage-I into the dual-branch refinement enhancer for further enhancement, resulting in the enhanced HVI representation

{\tilde{L}}_{h v i}

.

PHVIT (Perceptual-invert HVI Transformation) maps the HVI representation back to the HSV color space, and is used to obtain the final sRGB enhancement result by restoring the Stage-II output

{\tilde{L}}_{hvi}

from the HVI space to the sRGB space. Overall, PHVIT forms a surjective mapping, thereby covering the valid representation domain of HSV; meanwhile, by introducing controllable parameters, it enables the saturation and brightness of an image to be adjusted independently. To ensure that the mapping is injective (and thus invertible) in computation, PHVIT first constrains the output components to valid numerical ranges to avoid outliers that may cause color overflow. It then defines

\hat{h}

and

\hat{v}

as intermediate variables:

\hat{h} = \frac{{\hat{I}}_{H}}{D^{T} C_{k} + ε}, \hat{v} = \frac{{\hat{I}}_{V}}{D^{T} C_{k} + ε},

(15)

where

ε = 1 \times 10^{- 8}

is used to improve numerical stability. Next, according to the estimated intensity component, the polarized-plane components are de-normalized, and the hue and saturation are recovered by inverting the 2D coordinates: the hue map is computed from the inverse polar angle, while the saturation is obtained from the planar radius. Specifically, the hue map is formulated as

H = F_{γ} (arctan (\frac{\hat{v}}{\hat{h}}) mod 1),

(16)

where

F_{γ}

is an inverse piecewise-linear function:

F_{γ} (X) = \begin{cases} \frac{X}{3 γ_{G}}, & 0 \leq X < γ_{G}, \\ \frac{X - γ_{G}}{3 (γ_{B} - γ_{G})} + \frac{1}{3}, & γ_{G} \leq X < γ_{B}, \\ \frac{X - 1}{3 (1 - γ_{B})} + 1, & γ_{B} \leq X \leq 1, \end{cases}

(17)

where

γ_{G}

and

γ_{B}

are defined in Equation (4). The saturation and value maps are perceptually estimated as

S = α_{S} \sqrt{{\hat{h}}^{2} + {\hat{v}}^{2}}, V = α_{I} \tilde{I},

(18)

where

α_{S}

and

α_{I}

are customizable linear parameters for adjusting the image saturation and brightness, respectively, and

\tilde{I}

denotes the restored intensity (used as the HSV value channel). Finally, the HSV image is converted to an sRGB image [40] via the standard HSV→sRGB mapping, yielding the final enhanced image

\hat{L}

. This step ensures that the Stage-II output can be stably transformed back to the sRGB space, closing the two-stage pipeline and facilitating both visualization and quantitative evaluation.
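The piecewise-linear map in Equation (17) and the saturation/value recovery in Equation (18) can be sketched in plain Python. This is a minimal scalar sketch: the parameters γ_G and γ_B are defined in Equation (4), so they are passed in as arguments, and the numeric values in the usage lines are illustrative placeholders rather than the paper's settings.

```python
import math
from typing import Tuple


def f_gamma(x: float, gamma_g: float, gamma_b: float) -> float:
    """Piecewise-linear map of Equation (17) on [0, 1]."""
    if not 0.0 < gamma_g < gamma_b < 1.0:
        raise ValueError("expected 0 < gamma_g < gamma_b < 1")
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if x < gamma_g:
        return x / (3.0 * gamma_g)
    if x < gamma_b:
        return (x - gamma_g) / (3.0 * (gamma_b - gamma_g)) + 1.0 / 3.0
    return (x - 1.0) / (3.0 * (1.0 - gamma_b)) + 1.0


def saturation_value(h_hat: float, v_hat: float, intensity: float,
                     alpha_s: float = 1.0, alpha_i: float = 1.0) -> Tuple[float, float]:
    """Equation (18): S from the planar radius, V from the restored intensity."""
    return alpha_s * math.hypot(h_hat, v_hat), alpha_i * intensity


# Illustrative placeholder breakpoints (not taken from the paper).
hue_at_first_break = f_gamma(0.25, gamma_g=0.25, gamma_b=0.75)
s, v = saturation_value(0.3, 0.4, 0.5)
```

The three segments meet at 1/3 and 2/3, so the map is continuous and sends [0, 1] onto [0, 1]. With γ_G = 1/3 and γ_B = 2/3 it reduces to the identity.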

3.4. Comparison with the HVI-CIDNet Method

Although our method is built upon the HVI representation, HVIFormer does not claim novelty in the HVI space itself. Instead, our contributions focus on how intensity is explicitly modeled and exploited to drive restoration, together with a two-stage collaborative design that is absent in CIDNet-style HVI pipelines. The key differences are as follows:

Two-stage collaboration: We separate restoration into Stage-I intensity-conditioned global pre-recovery (illumination calibration and dominant noise suppression) and Stage-II refinement (detail/color restoration), which is more stable for extremely dark inputs where single-stage pipelines often over-amplify noise or drift in color.
ICB: Rather than using intensity as a simple auxiliary cue, ICB couples intensity and content features bi-directionally, enabling mutual correction between illumination structure and scene content under severe low-light conditions.
IC-MHSA: We introduce intensity-conditioned MHSA where the intensity prior gates the attention update, yielding region-adaptive enhancement and mitigating over-enhancement and color shift.
CCA-based dual-branch refinement: Stage-II adopts dual branches (intensity-focused vs. chroma/detail-focused) and uses Complementary Cross-Attention (CCA) for controlled information exchange, improving denoising, texture recovery, and color fidelity beyond single-stream CIDNet-style designs.

3.5. Loss Function

To simultaneously improve both overall exposure and dark-region detail recovery, we apply joint supervision on the final enhanced result during training of the two-stage framework, considering performance in both the sRGB and HVI spaces.

In the HVI color space, we use L1 loss

L_{1}

[41], edge loss

L_{e}

[42], and perceptual loss

L_{p}

[43] for the low-light enhancement task. Given the network’s final output enhanced image

X_{h v i}^{'}

and its corresponding HVI feature map

X_{hvi}

, the goal is to minimize the difference between them. The specific loss function is expressed as follows:

\begin{aligned} l (X_{h v i}^{'}, X_{hvi}) = {} & λ_{1} \cdot L_{1} (X_{h v i}^{'}, X_{hvi}) + λ_{e} \cdot L_{e} (X_{h v i}^{'}, X_{hvi}) \\ & + λ_{p} \cdot L_{p} (X_{h v i}^{'}, X_{hvi}) \end{aligned}

(19)

where

λ_{1}

,

λ_{e}

, and

λ_{p}

are weights used to balance each loss term. In the sRGB space, for the restored sRGB image

\hat{L}

and the original sRGB ground truth

L_{g t} \in R^{H \times W \times 3}

, we use the same loss function. The final overall loss function is

L = l (\hat{L}, L_{g t}) + λ_{h} l (X_{h v i}^{'}, X_{hvi}),

(20)

where

λ_{h}

is a hyperparameter that balances the strength of supervision in both color spaces.

In this way, we ensure that the output image maintains natural, accurate brightness and color in the sRGB space, while effectively enhancing details in dark regions and suppressing noise and color bias in the HVI space.
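The structure of Equations (19) and (20) can be sketched in plain Python on 1D signals. All names here are hypothetical: `edge_loss_stub` and `perceptual_loss_stub` are simple placeholders for the edge loss [42] and perceptual loss [43], and the default weights of 1.0 are illustrative, since the section does not state the values of λ_1, λ_e, λ_p, and λ_h.

```python
from typing import Sequence

Signal = Sequence[float]


def l1_loss(pred: Signal, target: Signal) -> float:
    """Mean absolute error between two equally long signals."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def edge_loss_stub(pred: Signal, target: Signal) -> float:
    """Placeholder edge term: L1 distance between first differences."""
    dp = [b - a for a, b in zip(pred, pred[1:])]
    dt = [b - a for a, b in zip(target, target[1:])]
    return l1_loss(dp, dt) if dp else 0.0


def perceptual_loss_stub(pred: Signal, target: Signal) -> float:
    """Placeholder: a real perceptual loss compares deep network features."""
    return 0.0


def space_loss(pred: Signal, target: Signal,
               lam_1: float = 1.0, lam_e: float = 1.0, lam_p: float = 1.0) -> float:
    """Equation (19): weighted sum of the three terms in one color space."""
    return (lam_1 * l1_loss(pred, target)
            + lam_e * edge_loss_stub(pred, target)
            + lam_p * perceptual_loss_stub(pred, target))


def total_loss(rgb_pred: Signal, rgb_gt: Signal,
               hvi_pred: Signal, hvi_ref: Signal, lam_h: float = 1.0) -> float:
    """Equation (20): sRGB term plus lam_h times the HVI term."""
    return space_loss(rgb_pred, rgb_gt) + lam_h * space_loss(hvi_pred, hvi_ref)


loss = total_loss([0.5, 0.5], [1.0, 0.0], [0.2, 0.4], [0.2, 0.4], lam_h=2.0)
```

In this toy call the HVI prediction matches its reference exactly, so only the sRGB term contributes to the total.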

3.6. Evaluation of Image Enhancement Performance

In low-light image enhancement tasks, both qualitative and quantitative evaluations are important for assessing an approach’s performance. Qualitative evaluation mainly relies on visual comparisons to judge the quality of enhanced images, focusing on exposure and brightness distribution, detail recovery, noise suppression, color fidelity, and overall visual consistency. Specifically, the enhanced image should maintain uniform brightness while avoiding overexposure, underexposure, or noise amplification. Dark-region details and textures should be effectively recovered, and the image’s colors should remain consistent with the original, avoiding color bias or excessive saturation. In practical applications, by displaying the enhanced image, comparing it with the original image and other methods, and showing zoomed-in views, the algorithm’s advantages can be more intuitively demonstrated.

Quantitative evaluation relies on multiple evaluation metrics, commonly including PSNR, SSIM [44], LPIPS [45], etc. On datasets with ground truth (GT), we first calculate the Peak Signal-to-Noise Ratio (PSNR) to measure the pixel-wise difference between the enhanced image and the ground truth. The PSNR is computed as follows:

PSNR = 10 {log}_{10} (\frac{{MAX}^{2}}{MSE})

(21)

where MSE is the mean squared error and MAX is the maximum pixel value (typically 255 or 1). A higher PSNR value indicates better image quality.
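As a worked illustration of Equation (21), the following plain-Python sketch computes the PSNR of two flattened images. The function name and the default of MAX = 1 (images scaled to [0, 1]) are choices made for this example.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """PSNR in dB following Equation (21); inputs are flattened pixel values."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives MSE = 0.01, i.e. about 20 dB.
value = psnr([0.0, 0.0], [0.1, 0.1])
```

Because PSNR is logarithmic in the MSE, a gain of 10 dB corresponds to a tenfold reduction of the mean squared error.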

Next, we use SSIM (Structural Similarity) to assess the similarity of images in terms of structure, brightness, and contrast. The SSIM is calculated as follows:

SSIM (x, y) = \frac{(2 μ_{x} μ_{y} + C_{1}) (2 σ_{x y} + C_{2})}{(μ_{x}^{2} + μ_{y}^{2} + C_{1}) (σ_{x}^{2} + σ_{y}^{2} + C_{2})}

(22)

where

μ_{x}

and

μ_{y}

are the mean values of images x and y,

σ_{x}^{2}

and

σ_{y}^{2}

are the variances,

σ_{x y}

is the covariance, and

C_{1}

and

C_{2}

are constants, often set as

C_{1} = {(K_{1} D)}^{2}

and

C_{2} = {(K_{2} D)}^{2}

, where D is the dynamic range of the image.
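Equation (22) can be evaluated directly once the means, variances, and covariance are available. The sketch below is a simplification: it computes these statistics over the whole (flattened) image, whereas SSIM in practice is usually averaged over local windows. The defaults K_1 = 0.01 and K_2 = 0.03 are the commonly used values and are an assumption here, as this section does not state them.

```python
from typing import Sequence


def ssim_global(x: Sequence[float], y: Sequence[float],
                dynamic_range: float = 1.0, k1: float = 0.01, k2: float = 0.03) -> float:
    """Equation (22) evaluated with global image statistics (no sliding window)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


same = ssim_global([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
flipped = ssim_global([0.1, 0.5, 0.9], [0.9, 0.5, 0.1])
```

Identical inputs give a score of 1, while the reversed signal has negative covariance and therefore a much lower score despite having the same mean and variance.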

Finally, LPIPS (Learned Perceptual Image Patch Similarity) is used to measure perceptual differences based on deep feature differences. The LPIPS is computed as follows:

LPIPS (x, y) = \sum_{l} \frac{1}{H_{l} W_{l}} \sum_{h, w} {∥w_{l} ⊙ (ϕ_{l} {(x)}_{h w} - ϕ_{l} {(y)}_{h w})∥}_{2}^{2}

(23)

where

ϕ_{l} (x)

and

ϕ_{l} (y)

are the deep features of images x and y at the l-th layer,

w_{l}

is the channel-wise weight of the l-th layer, and ⊙ denotes element-wise multiplication. A lower LPIPS value indicates that the enhanced result is perceptually closer to the reference image.
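The aggregation in Equation (23) can be sketched in plain Python when the per-layer features are already given. This is only the distance computation: in actual LPIPS the features come from a pretrained network and the weights are learned, whereas the feature values and weights below are made-up toy numbers.

```python
from typing import List, Sequence

# features[l][h][w] is the channel vector at spatial position (h, w) of layer l
LayerFeatures = List[List[Sequence[float]]]


def lpips_from_features(feats_x: Sequence[LayerFeatures],
                        feats_y: Sequence[LayerFeatures],
                        weights: Sequence[Sequence[float]]) -> float:
    """Equation (23): per-layer spatial mean of weighted squared differences."""
    total = 0.0
    for fx, fy, w_l in zip(feats_x, feats_y, weights):
        h_l, w_dim = len(fx), len(fx[0])
        layer_sum = 0.0
        for row_x, row_y in zip(fx, fy):
            for px, py in zip(row_x, row_y):
                layer_sum += sum((w * (a - b)) ** 2 for w, a, b in zip(w_l, px, py))
        total += layer_sum / (h_l * w_dim)
    return total


# One layer, 1 x 2 spatial grid, 2 channels (toy values).
fx = [[[[1.0, 0.0], [0.0, 1.0]]]]
fy = [[[[0.0, 0.0], [0.0, 0.0]]]]
score = lpips_from_features(fx, fy, weights=[[1.0, 2.0]])
```

In this toy case the two positions contribute 1 and 4, and averaging over the two spatial locations gives 2.5.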

For low-light images without GT, we use no-reference metrics such as NIQE [46] or BRISQUE [47] to evaluate image quality. These metrics provide an auxiliary evaluation by modeling the statistics of natural images: although they do not rely on GT, they reflect the naturalness of the image and are generally consistent with subjective evaluation.

In the experiments, we used two types of datasets to evaluate the model’s performance. The first type comprises paired datasets with ground-truth (GT) reference images, such as LOLv1, where quantitative evaluation is performed using metrics like PSNR, SSIM, and LPIPS, and visual comparisons highlight detail recovery and noise suppression. The second type comprises real-world datasets without GT, where evaluation mainly relies on no-reference metrics (such as NIQE and BRISQUE), along with qualitative comparisons, to comprehensively demonstrate the method’s enhancement effects.

4. Experiment

4.1. Datasets

We evaluated our model on several commonly used low-light image enhancement (LLIE) benchmark datasets, aiming to test three different scenarios: (1) supervised training using paired images for easy quantitative comparison; (2) testing the model’s generalization ability in real-world scenarios without paired images; and (3) examining the model’s robustness in extremely dark environments.

LOL is one of the most commonly used benchmarks in the LLIE field. It includes low-light images and their corresponding normal-exposure images, making it convenient for objective quantitative evaluation. Specifically, LOLv1 [3] contains 500 image pairs under the standard split, typically using 485 pairs for training and 15 pairs for testing. The image resolution is commonly 400 × 600. LOLv2 [48] is further divided into two subsets, Real and Synthetic, to assess the model’s adaptation to real-world and synthetic degradation distributions. LOLv2-Real typically uses 689 pairs for training and 100 pairs for testing, while LOLv2-Synthetic typically uses 900 pairs for training and 100 pairs for testing. Overall, LOL is mainly used to evaluate the model’s reconstruction performance, structural fidelity, and color recovery under paired supervision, while also reflecting the model’s stability when handling different types of degradation.

The unpaired datasets DICM [49], LIME [16], MEF [50], NPE [14], and VV [51] typically only provide low-light images without strictly paired reference images. These datasets are closer to real-world applications of single-image enhancement scenarios. Thus, this setting focuses more on the model’s ability to adapt to different data distributions: the model needs to output natural, clean, and not overly enhanced images without a reference. We evaluate these datasets using no-reference metrics, measuring enhancement performance based on naturalness, noise control, and distortion levels.

The original SICE [52] dataset contains 589 sets of low-light and over-exposed images. Following a commonly used protocol, we split SICE into training/validation/test sets at a 7:1:2 ratio. Unless otherwise stated, all methods are trained on the SICE training set and evaluated on the official evaluation subsets SICE-Mix and SICE-Grad [53]. This split protocol and evaluation setting are adopted to avoid ambiguity and ensure reproducibility.

Sony-Total-Dark [39] is a customized version of the Sony subset of the SID dataset [54]. It contains 2697 short–long-exposure RAW image pairs. Following the commonly used setting, we convert the RAW images to extremely dark sRGB inputs without gamma correction, as shown in the first row of Figure 4, which significantly increases the difficulty in dark regions and amplifies sensor noise.

4.2. Experiment Settings

To ensure a fair comparison, we follow the mainstream low-light image enhancement (LLIE) evaluation protocols for training and testing, and use relatively matched cropping and training epoch settings for different datasets.

For LOLv1 and LOLv2-Real, we crop the training images into

256 \times 256

patches, set the batch size to 4, and train for 1500 epochs. For LOLv2-Synthetic, we use a batch size of 1, train for 500 epochs, and do not apply cropping. For SICE, we use

160 \times 160

patches for training, set the batch size to 10, and train for 1000 epochs. Testing is conducted on both SICE-Mix and SICE-Grad. For Sony-Total-Dark, we crop the training images into

256 \times 256

patches, set the batch size to 4, and train for 1000 epochs.

In terms of implementation, we train the models based on PyTorch using the Adam optimizer [55] (

β_{1} = 0.9, β_{2} = 0.999

). The initial learning rate is set to

1 \times 10^{- 4}

, and we use the cosine annealing strategy [56] to gradually decay it to

1 \times 10^{- 7}

. All experiments are conducted on third-party cloud GPU HPC compute nodes equipped with Intel(R) Xeon(R) Platinum CPUs, 80 GB RAM, and Ubuntu 20.04. We train and evaluate HVIFormer using a single NVIDIA GeForce RTX 4090 (24 GB) or RTX 3090 (24 GB). To ensure reproducibility, we keep the software stack identical across GPU instances (CUDA 11.8, PyTorch 2.0.1) and use the same training protocol and fixed random seeds.
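The learning-rate schedule described above can be illustrated with the closed form of cosine annealing without restarts. This is a sketch of the standard formula, not the authors' training script: whether the rate is updated per epoch or per iteration is not stated in the text, so `step` and `total_steps` are generic counters and the function name is chosen for this example.

```python
import math


def cosine_lr(step: int, total_steps: int,
              lr_max: float = 1e-4, lr_min: float = 1e-7) -> float:
    """Cosine annealing from lr_max (step 0) down to lr_min (step total_steps)."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


# Sample the schedule at the start, midpoint, and end of a 1000-step run.
schedule = [cosine_lr(s, 1000) for s in (0, 500, 1000)]
```

The schedule starts at the initial rate, passes through the mean of the two endpoints at the midpoint, and ends at the minimum rate. The decay is slow near both ends and fastest in the middle.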

4.3. Evaluation Metrics

For paired datasets, we use PSNR and SSIM to measure image distortion and, additionally, LPIPS (with AlexNet as the feature extraction network) to evaluate perceptual similarity. PSNR and SSIM focus more on pixel-level errors and structural consistency, while LPIPS emphasizes human perceptual effects, reflecting the similarity in texture and semantics. In this way, we can comprehensively evaluate the image enhancement performance from both precision and perceptual quality aspects.

For unpaired datasets, because ground truth images are unavailable, we use the no-reference quality metric NIQE to evaluate single-image enhancement results. Such metrics are usually more sensitive to natural image statistics, noise, and artifacts, and can indicate whether the enhanced result appears natural or whether over-enhancement and distortion are present to some extent.

4.4. Results Analysis

4.4.1. Quantitative Comparison on LOL Datasets

Table 3 presents the quantitative evaluation results and model complexity comparison on the LOLv1, LOLv2-Real, and LOLv2-Synthetic datasets. The results show that HVIFormer achieves the best PSNR and SSIM values across all three LOL benchmarks, and the LPIPS value is also the lowest, indicating that the method strikes a better balance between pixel-level accuracy and perceptual quality.

Specifically, compared to the second-best method, HVIFormer achieves a PSNR improvement of 2.78% on the LOLv1 dataset; a PSNR improvement of 4.52% on the LOLv2-Real dataset; and a PSNR improvement of 19.74%, SSIM improvement of 2.58%, and a 2.86% reduction in LPIPS on the LOLv2-Synthetic dataset.

4.4.2. Quantitative Comparison on Unpaired Datasets

Table 4 summarizes the NIQE results on five unpaired datasets: DICM, LIME, MEF, NPE, and VV. HVIFormer achieves the best average NIQE of 3.395, outperforming all other methods. Compared to the second-best method (AVG 3.457), it reduces the NIQE by 1.79%, indicating that, in the absence of real reference images, our method generates results that are more natural and better aligned with the distribution of real images.
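The relative improvement quoted above follows from the two average NIQE values. A short Python check, using only the numbers stated in the text (the helper name is chosen for this example):

```python
def relative_change(ours: float, baseline: float) -> float:
    """Signed relative change of `ours` with respect to `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# Average NIQE: 3.395 (HVIFormer) versus 3.457 (second-best method).
niqe_change = relative_change(3.395, 3.457)
```

The result is about -1.79%, i.e. a 1.79% reduction, which is an improvement because lower NIQE is better.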

4.4.3. Quantitative Comparison on Extreme Low-Light Datasets

To further validate the performance of HVIFormer under extreme low-light conditions, Table 5 presents the evaluation results on the SICE and Sony-Total-Dark datasets. HVIFormer improves on both: on the SICE dataset, PSNR reaches 21.079 dB and SSIM is 0.765, a PSNR improvement of approximately 58.66% over the strongest baseline method; on the Sony-Total-Dark dataset, PSNR is 24.234 dB and SSIM is 0.697, the best performance among the compared methods. These results indicate that under extremely low illumination, severe information loss, and more complex noise, HVIFormer balances under-enhancement (overall grayness, invisible details) against over-enhancement (overexposure, color bias, noise amplification), with stronger detail recovery and structural preservation. They also highlight the role of the dual-stage design in handling extreme low-light inputs.

4.4.4. Qualitative Comparison

The visual comparisons in Figure 5 and Figure 6 illustrate the advantage of HVIFormer across scenes of varying complexity. On paired datasets such as LOL, compared to traditional Retinex-based methods, HVIFormer demonstrates superior global exposure control, significantly improving the brightness of dark regions while avoiding common issues such as over-enhancement, halos, and visual artifacts caused by discontinuities in local brightness. On unpaired datasets, HVIFormer outperforms common end-to-end sRGB space enhancement networks by more effectively suppressing dark-region noise and texture distortions. It also removes noticeable color biases, such as yellowish or greenish tints, that are often introduced during enhancement, making the final images closer to the natural distribution in terms of hue and saturation.

4.5. Ablation Experiment

The ablation experiment in Table 6 demonstrates the role of each primary component in the model. First, when only Stage-II is retained (Setting A), the model must handle both illumination consistency calibration and detail/color recovery, resulting in an LPIPS value of 0.115. This indicates that the lack of prior illumination optimization leads to over-enhancement in the image. In contrast, when only Stage-I is used (Setting B), overall exposure and consistency improve to some extent, but texture and color restoration is limited by the absence of a dedicated detail recovery stage. These results support the decoupled design of Stage-I and Stage-II, whose division of tasks prevents overload in a single stage.

Furthermore, after removing the Intensity-Conditioned Block (Setting C), the model’s performance declines significantly, with a noticeable gap compared to the complete model (Setting G), demonstrating that Stage-I’s correction of the intensity map provides a more reliable illumination baseline. This not only enhances the consistency of the image structure but also suppresses over-enhancement and color bias. When the Intensity-Conditioned Multi-Head Self-Attention is further removed (Setting D), the LPIPS value rises to 0.083, verifying the primary role of the intensity-guided adaptive fusion weights in suppressing dark-region noise and avoiding local overexposure. These weights control the enhancement amplitude, balancing local and overall brightness in the image.

Moreover, after removing the Complementary Cross-Attention mechanism (Setting E), the model’s performance deteriorates in all metrics, and the RGB two-stage approach without HVI space modeling (Setting F) performs even worse. This demonstrates the synergistic effect of two aspects: first, the bidirectional interaction of content and intensity in Stage-I enhances the credibility of conditional information, providing a solid foundation for subsequent enhancement; second, the cross-branch interaction between the brightness and color branches in Stage-II strengthens detail recovery and color restoration. The two stages thus complement each other.

As shown in Figure 7, the enhancement results of HVIFormer are closest to the ground truth (GT), preserving image details and color consistency. In contrast, the outputs of other methods exhibit varying degrees of distortion and significant noise interference, deviating more noticeably from the GT. This supports the advantage of decoupled modeling in the HVI space: HVIFormer optimizes illumination, details, and color separately and then fuses them through cross-module interactions, yielding more precise and natural low-light image enhancement.

4.6. Application Prospects and Cross-Domain Transferability

Many real-world vision systems require robust perception under poor illumination. The proposed intensity-guided Transformer pre-recovery stabilizes global exposure and suppresses severe noise, while the subsequent refinement stage restores fine structures and improves color fidelity; therefore, the framework is potentially transferable beyond low-light enhancement. Typical downstream scenarios include UAV-based construction inspection [61] (e.g., rebar counting) and building façade analysis [62], where challenging illumination is common. Future work will investigate cross-domain adaptation and evaluation on relevant public datasets, including a labelled dataset for rebar counting inspection on construction sites using unmanned aerial vehicles and building façade datasets for analyzing building characteristics using deep learning.

5. Conclusions

To improve the quality of low-light image enhancement, we propose a new two-stage deep learning framework based on the HVI color space. The framework offers an alternative for users who seek high-quality low-light image enhancement with a moderate number of model parameters. The ablation experiment shows that each module of the proposed method contributes to the enhancement of low-light images. Comparative experiments show that the proposed HVIFormer method is superior to the 10 compared state-of-the-art methods in terms of visual effects and the 10 quantitative indicators; in particular, the proposed method demonstrates greater stability in dark-region noise suppression and color restoration, avoiding common issues such as overexposure, color bias, and detail loss.

Despite achieving good enhancement results, our model has a relatively large number of parameters and a high computational cost, especially when processing high-resolution images, which may lead to longer processing times. Therefore, future work will focus on optimizing the model, such as reducing computational complexity through techniques such as model pruning, quantization, or knowledge distillation, to make it more widely applicable in real-world scenarios. It is well known that transforming images between color spaces may cause information loss. The lack of in-depth theoretical analysis of the issue in our proposed method is a limitation of our paper, and we believe it is a topic worth further study. Additionally, although this study primarily focuses on static image enhancement, our framework demonstrates good scalability and can be further explored for video enhancement tasks. In video scenes, HVIFormer can leverage temporal information for cross-frame enhancement, not only improving the visual quality of individual frames but also effectively reducing the impact of motion blur, lighting changes, and other factors on the enhancement effect, thereby achieving more stable and natural video enhancement results. Beyond low-light enhancement, our framework may also benefit downstream vision applications under poor illumination (e.g., UAV-based construction inspection and building façade analysis), which we will explore in future work.

Author Contributions

Conceptualization, Y.L. and H.L.; methodology, Y.L. and H.L.; software, Y.L.; validation, L.L., Y.L. and H.L.; formal analysis, Y.L. and L.L.; investigation, Y.L.; resources, L.L. and H.L.; data curation, Y.L.; writing—original draft preparation, Y.L. and H.L.; writing—review and editing, Y.L. and L.L.; visualization, Y.L.; supervision, L.L. and H.L.; project administration, H.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data can be downloaded from the website of the referenced database; the enhanced data we have processed can be downloaded from GitHub (PyTorch version 2.0.1, CUDA version 11.8) or are available on request from the first author.

Acknowledgments

We gratefully acknowledge the anonymous reviewers for their insightful comments and constructive suggestions, which have significantly improved the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, C.; Guo, C.; Han, L.; Jiang, J.; Cheng, M.M.; Gu, J.; Loy, C.C. Low-Light Image and Video Enhancement Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 9396–9416.
2. Guo, X.; Hu, Q. Low-Light Image Enhancement via Breaking Down the Darkness. Int. J. Comput. Vis. 2023, 131, 48–66.
3. Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep Retinex Decomposition for Low-Light Enhancement. arXiv 2018, arXiv:1808.04560.
4. Zhang, Y.; Guo, X.; Ma, J.; Liu, W.; Zhang, J. Beyond Brightening Low-Light Images. Int. J. Comput. Vis. 2021, 129, 1013–1037.
5. Lv, F.; Lu, F.; Wu, J.; Lim, C. MBLLEN: Low-Light Image/Video Enhancement Using CNNs. In BMVC; Number 1; Northumbria University: Newcastle upon Tyne, UK, 2018; Volume 220, p. 4.
6. Xu, X.; Wang, R.; Fu, C.; Jia, J. SNR-Aware Low-Light Image Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2022; pp. 17714–17724.
7. Yan, Q.; Feng, Y.; Zhang, C.; Pang, G.; Shi, K.; Wu, P.; Dong, W.; Sun, J.; Zhang, Y. HVI: A New Color Space for Low-light Image Enhancement. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2025; pp. 5678–5687.
8. Ok, J.; Lee, C. HDR Tone Mapping Algorithm Based on Difference Compression with Adaptive Reference Values. J. Vis. Commun. Image Represent. 2017, 43, 61–76.
9. Feng, W.; Liu, H.D.; Wu, G.M.; Zhao, D.A. Gradient Domain Adaptive Tone Mapping Algorithm Based on Color Correction Model. Laser Optoelectron. Prog. 2020, 57, 081007.
10. Li, X.; Liu, M.; Ling, Q. Pixel-Wise Gamma Correction Mapping for Low-Light Image Enhancement. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 681–694.
11. Dale-Jones, R.; Tjahjadi, T. A Study and Modification of the Local Histogram Equalization Algorithm. Pattern Recognit. 1993, 26, 1373–1381.
12. Rao, B.S. Dynamic Histogram Equalization for Contrast Enhancement for Digital Images. Appl. Soft Comput. 2020, 89, 106114.
13. Kang, L.; Chen, X. Image Contrast Enhancement Based on Multi-Level Histogram Shape Segmentation. Comput. Appl. Softw. 2022, 39, 207–212.
14. Wang, S.; Zheng, J.; Hu, H.-M.; Li, B. Naturalness Preserved Enhancement Algorithm for Non-Uniform Illumination Images. IEEE Trans. Image Process. 2013, 22, 3538–3548.
15. Fu, X.; Liao, Y.; Zeng, D.; Huang, Y.; Zhang, X.-P.; Ding, X. A Probabilistic Method for Image Enhancement with Simultaneous Illumination and Reflectance Estimation. IEEE Trans. Image Process. 2015, 24, 4965–4977.
16. Guo, X.; Li, Y.; Ling, H. LIME: Low-Light Image Enhancement via Illumination Map Estimation. IEEE Trans. Image Process. 2017, 26, 982–993.
17. Park, S.; Yu, S.; Moon, B.; Ko, S.; Paik, J. Low-Light Image Enhancement Using Variational Optimization-Based Retinex Model. IEEE Trans. Consum. Electron. 2017, 63, 178–184.
18. Li, M.; Liu, J.; Yang, W.; Sun, X.; Guo, Z. Structure-Revealing Low-Light Image Enhancement Via Robust Retinex Model. IEEE Trans. Image Process. 2018, 27, 2828–2841.
19. Gu, Z.; Li, F.; Fang, F.; Zhang, G. A Novel Retinex-Based Fractional-Order Variational Model for Images With Severely Low Light. IEEE Trans. Image Process. 2020, 29, 3239–3253.
20. Ren, X.; Yang, W.; Cheng, W.-H.; Liu, J. LR3M: Robust Low-Light Enhancement via Low-Rank Regularized Retinex Model. IEEE Trans. Image Process. 2020, 29, 5862–5876.
21. Hao, S.; Han, X.; Guo, Y.; Xu, X.; Wang, M. Low-Light Image Enhancement With Semi-Decoupled Decomposition. IEEE Trans. Multimed. 2020, 22, 3025–3038.
22. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 1780–1789.
23. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. EnlightenGAN: Deep Light Enhancement Without Paired Supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349.
24. Wang, T.; Zhang, K.; Shen, T.; Luo, W.; Stenger, B.; Lu, T. Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2023; Volume 37, pp. 2654–2662.
25. Lore, K.G.; Akintayo, A.; Sarkar, S. LLNet: A Deep Autoencoder Approach to Natural Low-Light Image Enhancement. Pattern Recognit. 2017, 61, 650–662.
26. Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-Inspired Unrolling With Cooperative Prior Architecture Search for Low-Light Image Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2021; pp. 10561–10570.
27. Cai, Y.; Bian, H.; Lin, J.; Wang, H.; Timofte, R.; Zhang, Y. Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2023; pp. 12504–12513.
28. Gevers, T.; Gijsenij, A.; Van de Weijer, J.; Geusebroek, J.-M. Color in Computer Vision: Fundamentals and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2012.
29. Lee, J.; Park, J.; Baik, S.; Lee, K.M. Rethinking RGB Color Representation for Image Restoration Models. arXiv 2024, arXiv:2402.03399.
30. Li, Z.; Jia, Z.; Yang, J.; Kasabov, N. Low Illumination Video Image Enhancement. IEEE Photonics J. 2020, 12, 1–13.
31. Zhang, Y.; Di, X.; Zhang, B.; Ji, R.; Wang, C. Better Than Reference in Low-Light Image Enhancement: Conditional Re-Enhancement Network. IEEE Trans. Image Process. 2022, 31, 759–772.
32. Zhou, L.; Chen, X.; Ye, B.; Jiang, X.; Zou, S.; Ji, L.; Yu, Z.; Wei, J.; Zhao, Y.; Wang, T. A Low-Light Image Enhancement Method Based on HSV Space. Imaging Sci. J. 2025, 73, 16–29.
33. Brateanu, A.; Balmez, R.; Avram, A.; Orhei, C.; Ancuti, C. LYT-NET: Lightweight YUV Transformer-Based Network for Low-Light Image Enhancement. IEEE Signal Process. Lett. 2025, 32, 2065–2069.
34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems; Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 12 August 2025).
35. Ali, A.; Touvron, H.; Caron, M.; Bojanowski, P.; Douze, M.; Joulin, A.; Laptev, I.; Neverova, N.; Synnaeve, G.; Verbeek, J.; et al. XCiT: Cross-Covariance Image Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 20014–20027.
36. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
37. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Proceedings of Computer Vision – ECCV 2022 Workshops; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 205–218.
38. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.E.; Keutzer, K.; Vajda, P. Visual Transformers: Where Do Transformers Really Belong in Vision Models? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 599–609.
39. Feng, Y.; Zhang, C.; Wang, P.; Wu, P.; Yan, Q.; Zhang, Y. You Only Need One Color Space: An Efficient Network for Low-light Image Enhancement. arXiv 2024, arXiv:2402.05809.
40. Foley, J.D.; Van Dam, A. Fundamentals of Interactive Computer Graphics; Addison-Wesley Longman Publishing Co., Inc.: Reading, MA, USA, 1982. Available online: https://api.semanticscholar.org/CorpusID:62562590/ (accessed on 19 September 2025).
41. Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 2016, 3, 47–57.
42. Seif, G.; Androutsos, D. Edge-based loss function for single image super-resolution. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2018; pp. 1468–1472.
43. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision – ECCV 2016; Springer International Publishing: Cham, Switzerland, 2016.
44. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
45. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2018; pp. 586–595.
46. Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “Completely Blind” Image Quality Analyzer. IEEE Signal Process. Lett. 2013, 20, 209–212.
47. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-Reference Image Quality Assessment in the Spatial Domain. IEEE Trans. Image Process. 2012, 21, 4695–4708.
48. Yang, W.; Wang, W.; Huang, H.; Wang, S.; Liu, J. Sparse Gradient Regularized Deep Retinex Network for Robust Low-Light Image Enhancement. IEEE Trans. Image Process. 2021, 30, 2072–2086.
49. Lee, C.; Lee, C.; Kim, C.-S. Contrast Enhancement Based on Layered Difference Representation of 2D Histograms. IEEE Trans. Image Process. 2013, 22, 5372–5384.
50. Ma, K.; Zeng, K.; Wang, Z. Perceptual Quality Assessment for Multi-Exposure Image Fusion. IEEE Trans. Image Process. 2015, 24, 3345–3356.
51. Vonikakis, V.; Kouskouridas, R.; Gasteratos, A. On the Evaluation of Illumination Compensation Algorithms. Multimed. Tools Appl. 2018, 77, 9211–9231.
52. Cai, J.; Gu, S.; Zhang, L. Learning a Deep Single Image Contrast Enhancer from Multi-Exposure Images. IEEE Trans. Image Process. 2018, 27, 2049–2062.
53. Zheng, S.; Ma, Y.; Pan, J.; Lu, C.; Gupta, G. Low-Light Image and Video Enhancement: A Comprehensive Survey and Beyond. arXiv 2024, arXiv:2212.10772.
54. Chen, C.; Chen, Q.; Xu, J.; Koltun, V. Learning to See in the Dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 3291–3300.
55. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
56. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2017, arXiv:1608.03983.
57. Ma, L.; Ma, T.; Liu, R.; Fan, X.; Luo, Z. Toward Fast, Flexible, and Robust Low-Light Image Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2022; pp. 5637–5646.
58. Hou, J.; Zhu, Z.; Hou, J.; Liu, H.; Zeng, H.; Yuan, H. Global Structure-Aware Diffusion Process for Low-light Image Enhancement. In Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 79734–79747. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/fc034d186280f55370b6aca7a3285a65-Paper-Conference.pdf (accessed on 20 September 2025).
59. Zhang, Y.; Zhang, J.; Guo, X. Kindling the Darkness: A Practical Low-light Image Enhancer. In Proceedings of the 27th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1632–1640.
60. Zhang, T.; Liu, P.; Zhao, M.; Lv, H. DMFourLLIE: Dual-Stage and Multi-Branch Fourier Network for Low-Light Image Enhancement. In Proceedings of the 32nd ACM International Conference on Multimedia; ACM: New York, NY, USA, 2024; pp. 7434–7443.
61. Wang, S.; Eum, I.; Park, S.; Kim, J. A Labelled Dataset for Rebar Counting Inspection on Construction Sites Using Unmanned Aerial Vehicles. Data Brief 2024, 55, 110720.
62. Wang, S.; Park, S.; Kim, J. Building Façade Datasets for Analyzing Building Characteristics Using Deep Learning. Data Brief 2024, 57, 110885.

Figure 1. Overall architecture of the HVIFormer framework. (a) The HVI color transformation (HVIT) converts the sRGB image into the intensity map and HVI feature map. (b) The enhancement network is divided into two stages: Stage-I performs intensity-conditioned Transformer pre-recovery, and Stage-II conducts dual-branch refinement enhancement. Finally, the enhanced HVI image is converted back to the sRGB enhanced image through PHVIT.

Figure 2. Content-intensity cross-branch interaction module: (a) Intensity-Conditioned Block (ICB), (b) Intensity-Conditioned Multi-Head Self-Attention (IC-MHSA).

Figure 3. Dual-branch Complementary Cross-Attention (CCA) block (i.e., I branch and HV branch). The CCA block includes a Cross-Attention Block (CAB), an Intensity Enhancement Layer (IEL), and a Color Denoising Layer (CDL). The feature embedding convolution layers consist of a 1 × 1 depthwise convolution and a 3 × 3 group convolution. The design of these modules is an updated version of that in [39].

Figure 4. Input images from the extreme low-light dataset Sony-Total-Dark and the corresponding images enhanced by HVIFormer.

Figure 5. Visual comparisons of the results enhanced by different methods on LOLv1 and LOLv2. The non-English text in the images can be treated as part of the picture; its textual meaning is not relevant to the comparison.

Figure 6. Visual comparisons on five unpaired datasets. One image is randomly selected from each dataset and compared with the results of the other methods. HVIFormer enhances dark details and raises illumination to a suitable range, giving better results than the other methods.

Figure 7. Visualizing component ablation in HVIFormer. Visual comparisons show noticeable differences against the ground truth when specific components of HVIFormer are removed. Each omitted component causes varying levels of color distortion and noise, highlighting their importance for high-fidelity image enhancement.

Table 1. Comparison of features of several typical low-light enhancement methods (Part I). (TIP: traditional priors/image processing; SML: supervised machine learning; U/WML: unsupervised or weakly supervised machine learning).

Method	Year	Category	Model Features	Color Space
Tone mapping [8,9]	2017 [8]; 2020 [9]	TIP	Gradient domain adaptive tone mapping	RGB; grayscale (intensity)
Gamma correction [10]	2023	TIP	Power-law nonlinear luminance curve transformation	RGB; grayscale (intensity)
Histogram equalization [11,12,13]	1993 [11]; 2020 [12]; etc.	TIP	Histogram redistribution for contrast adjustment	RGB; grayscale (intensity)
Retinex decomposition [14,15,16,17,18,19,20,21]	2013 [14]; 2015 [15]; etc.	TIP	Illumination and reflectance decomposition with priors/regularization	Mostly RGB (also log domain or luminance channel)
LLNet [25]	2017	SML	End-to-end mapping with stacked denoising autoencoders for brightening and denoising	RGB
MBLLEN [5]	2018	SML	Multi-branch CNN with feature extraction, enhancement, and fusion	RGB
EnlightenGAN [23]	2021	U/WML	Unpaired GAN-based framework with generator and discriminator	RGB
Zero-DCE [22]	2020	U/WML	Curve-parameter estimation with self-supervised no-reference losses	RGB
RetinexFormer [27]	2023	Transformer	Transformer self-attention with Retinex formulation and illumination guidance	Retinex
LLFormer [24]	2023	Transformer	High-resolution LLIE with self-attention for long-range modeling	RGB
HVIFormer (Ours)	–	Ours	HVI representation with intensity-guided transformer pre-recovery and detail/color refinement	HVI

Table 2. Module-level summary of HVIFormer (inputs, outputs, and function).

Module	Inputs	Outputs	Function
HVIT (sRGB → HVI)	Low-light sRGB image $L$	$X_{hvi}$ and $I$	Extracts intensity prior $I$ and decouples chroma for more stable low-light representation (brightness + color).
Stage-I	$(X_{hvi}, I)$	Pre-recovered $(I', X'_{hvi})$	Global exposure calibration and coarse denoising for a stable base before refinement (brightness + noise).
ICB	Content/intensity features (within Stage-I)	Bi-directionally updated content/intensity features	Bi-directional intensity–content correction to stabilize illumination cues and suppress noise (brightness + noise).
IC-MHSA	Q/K/V features and intensity-driven gating (within Stage-I)	Intensity-gated attention output	Intensity-gated attention: stronger updates in dark regions, conservative updates in bright regions (noise + color).
Stage-II	$(I', X'_{hvi})$	Refined $\tilde{L}_{hvi}$	Refines details and color fidelity on the pre-recovered result; suppresses residual noise (color + noise).
CCA	Features from intensity branch and chroma/detail branch (Stage-II)	Complementarily fused cross-branch features	Complementary cross-branch fusion between intensity and chroma/detail features to reduce artifacts (color + noise).
PHVIT (HVI → sRGB)	Enhanced $\tilde{L}_{hvi}$	Enhanced sRGB image $\hat{L}$	Reconstructs sRGB with range constraints to avoid overflow and preserve visual consistency (color).
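
The IC-MHSA row describes gating that applies stronger updates in dark regions and conservative updates in bright regions. The following is only a minimal NumPy sketch of that idea, not the authors' implementation: the function name `intensity_gated_update`, the linear gate `1 - I`, and the toy arrays are all illustrative assumptions, and the gating function actually used in HVIFormer may differ.

```python
import numpy as np

def intensity_gated_update(features, update, intensity):
    """Add `update` to `features`, scaled per pixel by darkness.

    features, update: arrays of shape (H, W, C)
    intensity: array of shape (H, W) with values in [0, 1]
    """
    # Hypothetical gate (assumption): close to 1 in dark pixels, close to 0 in bright ones.
    gate = 1.0 - np.clip(intensity, 0.0, 1.0)
    # Broadcast the (H, W) gate over the channel axis.
    return features + gate[..., None] * update

# Toy example: one dark pixel (I = 0.1) and one bright pixel (I = 0.9).
features = np.zeros((1, 2, 3))
update = np.ones((1, 2, 3))
intensity = np.array([[0.1, 0.9]])
out = intensity_gated_update(features, update, intensity)
print(out[0, 0], out[0, 1])
```

In this toy case the dark pixel receives most of the update and the bright pixel only a small fraction of it, which is the qualitative behavior the table describes.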

Table 3. Quantitative results of PSNR/SSIM ↑ and LPIPS ↓ on the LOL (v1 and v2) datasets. Best results are in red and second best in blue.

Methods	Color Model	Params (M)	FLOPs (G)	LOLv1 PSNR ↑	LOLv1 SSIM ↑	LOLv1 LPIPS ↓	LOLv2-Real PSNR ↑	LOLv2-Real SSIM ↑	LOLv2-Real LPIPS ↓	LOLv2-Synthetic PSNR ↑	LOLv2-Synthetic SSIM ↑	LOLv2-Synthetic LPIPS ↓
RetinexNet [3]	Retinex	0.44	772.01	16.774	0.419	0.474	16.097	0.401	0.543	17.137	0.762	0.225
LLFormer [24]	RGB	12.54	12.90	23.649	0.829	0.169	27.749	0.873	0.143	17.715	0.789	0.243
ZeroDCE [22]	RGB	0.076	4.83	14.121	0.502	0.433	12.594	0.446	0.484	17.175	0.817	0.197
SCI [57]	RGB	0.0056	10.67	22.015	0.566	0.293	23.866	0.687	0.202	24.726	0.941	0.096
EnlightenGAN [23]	RGB	114.35	61.01	17.484	0.651	0.322	18.489	0.672	0.311	17.598	0.720	0.265
GSAD [58]	RGB	17.43	439.46	27.630	0.875	0.091	22.295	0.794	0.191	21.737	0.874	0.060
RUAS [26]	Retinex	0.003	0.83	16.411	0.426	0.519	18.112	0.407	0.517	15.621	0.423	0.523
KinD [59]	Retinex	8.02	183.72	20.624	0.817	0.193	18.554	0.800	0.259	22.199	0.891	0.118
SNR-Net [6]	RGB	4.01	26.35	24.609	0.850	0.150	21.479	0.857	0.158	22.875	0.903	0.118
RetinexFormer [27]	Retinex	1.61	16.72	26.644	0.851	0.128	21.444	0.835	0.237	25.467	0.931	0.035
CIDNet [7]	HVI	1.88	7.57	27.656	0.864	0.122	22.847	0.818	0.246	24.341	0.912	0.043
HVIFormer (Ours)	HVI	3.40	26.63	28.778	0.891	0.067	29.001	0.902	0.096	30.491	0.955	0.034
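
Table 3 reports PSNR in dB. For readers who want to recall what that number measures, the standard definition can be sketched as follows. This is a generic NumPy illustration under the assumption that images are scaled to [0, 1]; the function name `psnr`, the `max_value` parameter, and the toy arrays are not taken from the paper, and the authors' evaluation code may differ in details such as color handling.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy example: a constant error of 0.1 on a [0, 1] image gives MSE = 0.01.
reference = np.full((4, 4, 3), 0.5)
estimate = reference + 0.1
value = psnr(reference, estimate)
print(round(value, 3))
```

Because PSNR is logarithmic in the mean squared error, a gain of about 1 dB corresponds to roughly a 20% reduction in MSE.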

Table 4. NIQE scores on LIME, VV, DICM, NPE, and MEF datasets. “AVG” denotes the average NIQE scores across these five datasets. Best results are in red and second best in blue.

Methods	DICM	LIME	MEF	NPE	VV	AVG
RetinexNet [3]	4.413	4.611	4.243	4.529	4.622	4.483
RetinexFormer [27]	3.695	3.864	3.456	3.941	2.961	3.583
ZeroDCE [22]	3.670	4.089	3.504	4.362	3.302	3.785
CIDNet [7]	4.074	4.122	3.360	3.627	3.247	3.686
RUAS [26]	6.088	5.269	5.256	5.934	5.682	5.646
DMFourLLIE [60]	3.625	3.233	3.565	3.564	3.298	3.457
HVIFormer	3.616	3.536	3.211	3.402	3.212	3.395
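
The AVG column of Table 4 is the arithmetic mean of the five per-dataset NIQE scores. A short sketch that recomputes it from the table values is given below. The dictionary layout and variable names are illustrative, and the 0.001 tolerance is an assumption that allows for the per-dataset scores having been rounded before averaging.

```python
# NIQE scores copied from Table 4: (DICM, LIME, MEF, NPE, VV) and the reported AVG.
niqe = {
    "RetinexNet":    ((4.413, 4.611, 4.243, 4.529, 4.622), 4.483),
    "RetinexFormer": ((3.695, 3.864, 3.456, 3.941, 2.961), 3.583),
    "ZeroDCE":       ((3.670, 4.089, 3.504, 4.362, 3.302), 3.785),
    "CIDNet":        ((4.074, 4.122, 3.360, 3.627, 3.247), 3.686),
    "RUAS":          ((6.088, 5.269, 5.256, 5.934, 5.682), 5.646),
    "DMFourLLIE":    ((3.625, 3.233, 3.565, 3.564, 3.298), 3.457),
    "HVIFormer":     ((3.616, 3.536, 3.211, 3.402, 3.212), 3.395),
}

# Recompute each mean and compare it with the reported AVG.
recomputed = {name: sum(scores) / len(scores) for name, (scores, _) in niqe.items()}
consistent = {
    name: abs(recomputed[name] - reported) <= 0.001
    for name, (_, reported) in niqe.items()
}

# Lower NIQE is better, so the best method has the smallest average.
best = min(recomputed, key=recomputed.get)
print(best, all(consistent.values()))
```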

Table 5. Results on extreme low-light datasets. The arrow ↑ indicates that a higher value is better. Best results are in red and second best in blue.

Methods	SICE PSNR ↑	SICE SSIM ↑	Sony-Total-Dark PSNR ↑	Sony-Total-Dark SSIM ↑
RetinexNet [3]	11.845	0.612	13.842	0.350
RetinexFormer [27]	12.845	0.620	16.468	0.375
ZeroDCE [22]	12.302	0.634	15.876	0.405
CIDNet [7]	13.302	0.633	22.654	0.668
RUAS [26]	8.380	0.505	12.246	0.078
HVIFormer	21.079	0.765	24.234	0.697

Table 6. Ablation studies of HVIFormer on the LOLv1 dataset. The term ‘w/o’ denotes the absence of a specific component. The arrows ↑ and ↓ indicate that higher and lower values are better, respectively. Best results are in red and second best in blue.

Setting	PSNR ↑	SSIM ↑	LPIPS ↓
A: w/o Stage-I	27.86	0.874	0.090
B: w/o Stage-II	27.42	0.871	0.099
C: w/o Intensity-Conditioned Block	28.18	0.882	0.078
D: w/o Intensity-Conditioned Multi-Head Self-Attention	28.03	0.878	0.083
E: w/o Complementary Cross-Attention	28.09	0.880	0.081
F: w/o HVI (sRGB two-stage)	26.94	0.858	0.115
G: Full (Ours)	28.778	0.891	0.067
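
To read Table 6 as the contribution of each component, one can compute the PSNR drop of every ablated setting relative to the full model. The sketch below does this with values copied from the table; the variable names are illustrative, and ranking components by PSNR drop alone is a simplification, since SSIM and LPIPS could be compared in the same way.

```python
# PSNR values on LOLv1 copied from Table 6.
full_psnr = 28.778
ablation_psnr = {
    "A: w/o Stage-I": 27.86,
    "B: w/o Stage-II": 27.42,
    "C: w/o Intensity-Conditioned Block": 28.18,
    "D: w/o Intensity-Conditioned Multi-Head Self-Attention": 28.03,
    "E: w/o Complementary Cross-Attention": 28.09,
    "F: w/o HVI (sRGB two-stage)": 26.94,
}

# PSNR lost (in dB) when each component is removed.
psnr_drop = {name: round(full_psnr - value, 3) for name, value in ablation_psnr.items()}

# Setting whose removal hurts PSNR the most.
largest = max(psnr_drop, key=psnr_drop.get)

for name, drop in sorted(psnr_drop.items(), key=lambda item: -item[1]):
    print(f"{name}: -{drop:.3f} dB")
```

On these numbers, removing the HVI representation (setting F) costs the most PSNR, followed by removing Stage-II (setting B).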

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
