Article

Noise-Resilient Low-Light Image Enhancement with CLIP Guidance and Pixel-Reordering Subsampling

School of Integrated Technology, College of Computing, Yonsei University, 402, 12, Baumoe-ro 6-gil, Seocho-gu, Seoul 06763, Republic of Korea
Electronics 2025, 14(24), 4839; https://doi.org/10.3390/electronics14244839
Submission received: 16 November 2025 / Revised: 4 December 2025 / Accepted: 4 December 2025 / Published: 8 December 2025

Abstract

Low-light image enhancement (LLIE) is an essential task for improving image quality that ultimately supports crucial downstream tasks such as autonomous driving and mobile photography. Despite notable advances achieved by traditional and Retinex-based methods, existing approaches still struggle to maintain globally consistent illumination and to suppress sensor noise under extremely dark conditions. To overcome these limitations, we propose a noise-resilient LLIE framework that integrates a CLIP-guided loss (CLIP-LLA) and a pixel-reordering subsampling (PRS) scheme into the Retinexformer backbone. The CLIP-LLA loss exploits the semantic prior of a large-scale vision–language model to align enhanced outputs with the manifold of well-illuminated natural images, leading to faithful global tone rendering and perceptual realism. In parallel, the PRS-based multi-scale training strategy effectively regularizes the network by augmenting structural diversity, thereby improving denoising capability without architectural modification or inference cost. Extensive experiments on both sRGB and RAW benchmarks validate the effectiveness of our design. The proposed method achieves consistent improvements over state-of-the-art techniques, including a +2.73 dB PSNR gain on the SMID dataset and superior perceptual scores, while maintaining computational efficiency. These results demonstrate that fusing foundation-model priors with transformer-based Retinex frameworks offers a practical and scalable pathway toward perceptually faithful low-light image enhancement.

1. Introduction

Low-light image enhancement (LLIE) is one of the most fundamental restoration tasks in computer vision. It directly influences the reliability of downstream applications such as nighttime surveillance [1,2], autonomous driving [3,4], and remote sensing [5,6]. Real-world low-light scenes suffer from severe degradations—including sensor noise, detail loss, and color imbalance—that cannot be remedied by naïve exposure or gamma correction alone. This makes LLIE a core challenge for robust visual perception in both scientific and consumer imaging pipelines.
Early efforts in LLIE explored a wide spectrum of enhancement paradigms, ranging from histogram equalization [7] and image-adaptive lookup-table approaches [8] to learning-based mappings [9,10]. Among them, Retinex-based methods [11,12,13,14,15] hold particular significance because they explicitly model illumination–reflectance interaction—a physical interpretation of image formation that promotes decomposition stability and generalization across domains. This formulation not only bridges traditional physics-based reasoning and modern deep learning, but also provides a principled way to incorporate semantic priors and transformer-based architectures (e.g., RetinexFormer [16]). Consequently, Retinex-guided frameworks serve as a unifying platform capable of accommodating varied enhancement strategies while preserving interpretable illumination estimation.
Despite these advances, current LLIE models still struggle to produce globally consistent illumination and to suppress sensor-induced noise in extreme darkness. Pixel-level optimization schemes tend to overfit to dataset-specific brightness distributions and lack semantic awareness, resulting in color shift and unnatural tone rendering. Furthermore, diffusion-based approaches [17,18,19,20] achieve strong restoration quality but often hallucinate structures and incur excessive computational cost, limiting their practical use in real-time or mobile scenarios. These limitations highlight the need for models that combine high-level semantic priors with efficient transformer-based architectures for robust, perceptually faithful enhancement.
To address these challenges, we extend the RetinexFormer framework with two key components, as shown in Figure 1. First, we introduce a CLIP-guided low-light assessment loss (CLIP-LLA) that leverages the vision–language priors of CLIP, or contrastive language-image pre-training [21], to align enhanced images with the semantic distribution of well-illuminated natural scenes. By optimizing in the CLIP embedding space, the network learns global tone consistency and color semantics that are difficult to capture through pixel-wise objectives alone. Second, we propose a pixel-reordering subsampling (PRS) strategy for multi-scale training. This strategy creates diverse sub-samples from single images without altering the model architecture, serving as an implicit regularizer that reduces noise and enhances local detail preservation. Together, these components address the two dominant limitations in LLIE—global tone ambiguity and noise resilience—through a synergistic combination of semantic guidance and multi-scale representation learning.
The main contributions of this work are summarized as follows:
  • We introduce the CLIP-LLA loss to inject semantic and perceptual priors from a foundation model into LLIE training, achieving faithful illumination and color consistency.
  • We design a pixel-reordering subsampling (PRS) scheme that enhances robustness and noise suppression through multi-scale data regularization.
  • We conduct extensive experiments on representative benchmarks such as the LOL variants and SID. The results demonstrate consistent performance boosts, including over 2.7 dB in PSNR and 0.010 in LPIPS on the SMID benchmark, validating the effectiveness of the proposed method.

2. Related Works

Traditional algorithmic approaches, such as gamma correction [22] and histogram equalization [7], operate by remapping pixel intensities to adjust the dynamic range. However, these methods are insufficient for aesthetically enhancing real-world RGB images that contain complex color components. To address this limitation, Retinex-based techniques [11,23] have been introduced, which decompose low-light images into reflectance and illumination maps. Despite their effectiveness, these approaches face significant challenges: designing explicit priors [12] that can generalize across diverse scenes is inherently difficult, and they remain vulnerable to noise artifacts [13] introduced by low exposure conditions.
The introduction of deep learning to LLIE began with autoencoder-based models trained on synthetically darkened data [9]. Multi-branch CNNs further addressed brightness restoration and noise reduction [10]. To ease paired-data dependency, supervised and adversarial methods exploited expert-enhanced references [24,25], while zero-reference strategies estimated pixel-wise curves without ground truth [26]. Retinex theory was soon incorporated into learning frameworks: Retinex-Net jointly decomposed reflectance and illumination [14], and KinD extended this with practical pipelines for color fidelity and denoising [15]. These works established Retinex-guided enhancement as a strong foundation for LLIE.
Transformers, first introduced for sequence modeling [27], were later adapted to vision tasks in ViT [28] and enhanced by Swin Transformer with shifted-window attention [29]. In image restoration, architectures like Uformer [30] and Restormer [31] enabled efficient local–global modeling. LLIE soon adopted these designs, with LLFormer exploiting axis-based attention for high-resolution enhancement [32]. Retinexformer advanced this trend by combining Retinex decomposition with Transformer attention [16], guiding illumination estimation and achieving strong performance under uneven lighting.
RetinexFormer has become the dominant baseline for LLIE, inspiring a range of studies aimed at addressing its training inefficiency. RetinexFormer+ [33] incorporates a dual-channel paradigm and a Nested U-Net [34] to increase efficiency, while RetinexMamba [35] follows the recent shift toward state-space models [36] by substituting the transformer components with Mamba blocks [37]. Spike-RetinexFormer [38] leverages spiking neural networks [39] to reduce energy consumption. These variants typically deliver comparable or slightly improved pixel-wise metrics while lowering training cost. Additionally, several works seek to extend RetinexFormer to real-world deployment, including methods that reduce domain gaps using unpaired real data [40] and unsupervised fine-tuning techniques based solely on extra-low-light images [41]. Through these developments, RetinexFormer continues to serve as a strong and widely adopted gold standard for LLIE.
In the field of generative vision, the success of diffusion models has brought increasing attention to generation-based approaches for low-light image enhancement. Among these, Diff-Retinex [17] employs the diffusion process to reconstruct both the illumination map and the reflectance map. Other approaches [18,19,20] have demonstrated performance improvements by directly learning a representation of the degradation itself. Nevertheless, some of these diffusion-based methods tend to produce hallucinations, i.e., non-existent structures in extremely low-light regions. Furthermore, a critical limitation of these methods lies in their excessive computational cost and prolonged inference time, rendering them impractical for real-world applications.
By jointly training for LLIE and deblurring, the very recent deterministic LLIE models DarkIR [42] and URWKV [43] report more efficient training and considerably better pixel-based evaluation metrics on several benchmark datasets; however, their training code is not yet publicly available. We therefore selected the relatively popular and widely verified benchmark model Retinexformer as our baseline.

3. Method

3.1. Motivation and Preliminaries

In LLIE tasks, incorporating a prior that enables the separation of illumination and reflectance contributes to improved training stability and better generalization. Consequently, Retinex theory decomposes a low-light image $I \in \mathbb{R}^{H \times W \times 3}$ into a reflectance image $R \in \mathbb{R}^{H \times W \times 3}$ and an illumination map $L \in \mathbb{R}^{H \times W}$ as

$$I = R \odot L,$$

where $\odot$ denotes element-wise multiplication. Traditional approaches estimate $L$ using handcrafted priors such as the Gray-world or Max-RGB assumptions. In contrast, CNN-based supervised methods adopt a data-driven strategy to predict the normal-light illumination map, which is subsequently multiplied with the reflectance component $R$ to obtain the normal-light image $I_{lu}$.
To account for the inevitable corruption in low-light images, Retinexformer [16] models the reflectance $R$ as the normal-light target image, obtained by removing the overall corruption $C$ from the light-up image $I_{lu} \in \mathbb{R}^{H \times W \times 3}$. The light-up image is formulated as

$$I_{lu} = I \odot L_0,$$

where the network aims to predict the light-up map $L_0 \in \mathbb{R}^{H \times W}$. Corruption reduction applied to the obtained $I_{lu}$ thus effectively serves as a post-processing step. To simultaneously estimate the corruption term $C$, Retinexformer employs a U-net structure with skip connections and transformer blocks, in which the self-attention mechanism is modified to utilize the illumination prior for guidance.
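Because the light-up step is just an element-wise product, it can be written in a few lines; the following PyTorch sketch is illustrative only (the batch/channel tensor layout is an assumption, and the corruption-removal restorer is omitted).

```python
import torch

def light_up(low: torch.Tensor, light_up_map: torch.Tensor) -> torch.Tensor:
    """Element-wise light-up I_lu = I * L0. `low` is (B, 3, H, W); `light_up_map` is
    (B, 1, H, W) and is broadcast over the channels (the paper defines L0 as H x W,
    so this layout is an assumption)."""
    return low * light_up_map

# Illustration on random tensors; removing the corruption C from I_lu is the job of the
# transformer restorer and is omitted here.
low = torch.rand(1, 3, 64, 64)
l0 = 1.0 + torch.rand(1, 1, 64, 64)   # values > 1 brighten the input
print(light_up(low, l0).shape)        # torch.Size([1, 3, 64, 64])
```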
Despite its effectiveness, Retinexformer’s optimization remains limited by pixel-level supervision, which cannot capture the global semantics or perceptual realism of well-lit natural scenes. Moreover, its training instability under low SNR inputs often leads to oscillatory convergence (Figure 2), causing inconsistencies between pixel-level and perceptual metrics.
To overcome these issues, our method introduces two complementary components: (1) CLIP-guided low-light assessment (CLIP-LLA) to align restored images with the semantic manifold of well-lit images via foundation-model priors and (2) pixel-reordering subsampling (PRS) to improve generalization and noise resilience through efficient multi-scale regularization. These two components are designed to be modular and orthogonal, improving Retinexformer’s perceptual alignment and local detail robustness without modifying its inference pipeline.

3.2. CLIP-LLA Loss

CLIP [21] is a large-scale vision–language foundation model designed to align visual and textual embeddings. We leverage this semantic alignment capability to design a CLIP-guided loss (CLIP-LLA) that encourages the enhanced image to resemble a semantically well-lit image both textually and visually.
Textual Guidance. We take inspiration from CLIP-IQA [44], which exploited CLIP’s capability and introduced single-image visual quality assessment. Specifically, the approach employs predefined positive–negative keyword pairs and evaluates image quality by measuring the cosine similarity of the image embedding to the positive keywords and its dissimilarity to the negative keywords, thereby quantifying the overall positivity of the image.
Following CLIP-IQA [44], we compute the cosine similarity between L2-normalized image features $I_{\text{feat}} \in \mathbb{R}^{1 \times d}$ and the text embeddings $T_{\text{feat}} \in \mathbb{R}^{2 \times d}$ of a positive–negative prompt pair as follows:

$$z = I_{\text{feat}} \, T_{\text{feat}}^{\top} \cdot t,$$

where $z = [z_{\text{good}}, z_{\text{bad}}]$ represents the logits corresponding to the positive–negative prompt pair, and $t$ is the CLIP temperature scaling factor, which is fixed to 100. The probabilities are obtained through the softmax function, $p = [p_{\text{good}}, p_{\text{bad}}] = \mathrm{softmax}(z)$. The CLIP loss is then defined as

$$\mathcal{L}_{\text{CLIP-IQA}} = 1 - \frac{1}{N} \sum_{i=1}^{N} p_{\text{good}}^{(i)},$$

where $N$ denotes the number of prompt pairs and $p_{\text{good}}^{(i)}$ is the probability assigned to the positive prompt in the $i$-th pair.
To design the proposed CLIP-Low-Light Assessment (CLIP-LLA) loss, the five general prompt pairs $z_{\text{general}}$ for overall quality enhancement are set as follows:
  • {“Good image”, “bad image”},
  • {“Sharp image”, “blurry image”},
  • {“sharp edges”, “blurry edges”},
  • {“High resolution image”, “low resolution image”},
  • {“Noise-free image”, “noisy image”}.
Additionally, we utilize a task-specific prompt pair $z_{\text{LLIE}}$, namely
  • {“A well lit image”, “A low light image”}.
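A minimal sketch of the textual CLIP-LLA loss built from these prompt pairs is given below; it operates on pre-extracted, L2-normalized CLIP features, and the helper names are ours rather than part of any released code. The trailing comment indicates how the features could be obtained, assuming OpenAI’s `clip` package.

```python
import torch
import torch.nn.functional as F

# The five general prompt pairs plus the task-specific LLIE pair (positive prompt first).
PROMPT_PAIRS = [
    ("Good image", "bad image"),
    ("Sharp image", "blurry image"),
    ("sharp edges", "blurry edges"),
    ("High resolution image", "low resolution image"),
    ("Noise-free image", "noisy image"),
    ("A well lit image", "A low light image"),
]

def clip_lla_textual(img_feat: torch.Tensor, pair_feats: torch.Tensor, t: float = 100.0) -> torch.Tensor:
    """Textual CLIP-LLA loss: 1 minus the mean probability assigned to the positive prompts.
    img_feat:   (B, d) CLIP image features of the enhanced outputs.
    pair_feats: (N, 2, d) CLIP text features, one (positive, negative) pair per row."""
    img_feat = F.normalize(img_feat, dim=-1)
    pair_feats = F.normalize(pair_feats, dim=-1)
    logits = torch.einsum("bd,npd->bnp", img_feat, pair_feats) * t  # cosine similarities, scaled by t = 100
    p_good = logits.softmax(dim=-1)[..., 0]                         # probability of each positive prompt
    return 1.0 - p_good.mean()

# Features would come from a frozen CLIP encoder, e.g. (assuming OpenAI's `clip` package):
#   tokens     = clip.tokenize([p for pair in PROMPT_PAIRS for p in pair])
#   pair_feats = model.encode_text(tokens).view(len(PROMPT_PAIRS), 2, -1)
#   img_feat   = model.encode_image(enhanced_batch)
```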
Visual–Semantic Guidance. In addition to the aforementioned textual guidance, we propose a variation that exploits the average feature of the target reference images, $R_{\text{feat}}$, as an additional positive–negative pair. Specifically, we form the logits

$$T_{\text{feat}}^{\text{vis}} = \left[ +R_{\text{feat}};\; -R_{\text{feat}} \right] \in \mathbb{R}^{2 \times d}, \qquad z_{\text{mix}} = I_{\text{feat}} \, (T_{\text{feat}}^{\text{vis}})^{\top} \cdot t,$$

which yields $p_{\text{mix}} = [p_{\text{ref}}, p_{\text{non-ref}}]$ through the softmax function.
This formulation can be used on its own as a visual variation of the loss, or treated as one additional prompt pair when computing CLIP-LLA, written as

$$\mathcal{L}_{\text{CLIP-LLA-mix}} = 1 - \frac{1}{N+1} \left( \sum_{i=1}^{N} p_{\text{good}}^{(i)} + p_{\text{ref}} \right),$$

which constitutes the mixed variation.
Then, $\mathcal{L}_{\text{CLIP}}$ is added to the original Retinexformer training loss $\mathcal{L}_{L1}$ to construct the final training objective

$$\mathcal{L}_{\text{Final}} = \mathcal{L}_{L1} + \lambda \, \mathcal{L}_{\text{CLIP}},$$

where $\lambda$ denotes the CLIP-LLA loss weight, which is set to 1 in our experiments. The visual guidance directs the learning process toward the ground truth of the overall dataset within a high-level perceptual embedding space. With textual guidance, the joint objective balances pixel fidelity and perceptual alignment, ensuring that enhanced images match the ground truth not only in the intensity space but also in the semantic embedding space of high-quality images.
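Under the same assumptions as above, the mixed variation and the final objective can be sketched as follows; the averaging scope of the reference feature (a mini-batch here) and all function names are illustrative rather than the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def clip_lla_mixed(img_feat, text_pair_feats, ref_feat, t: float = 100.0) -> torch.Tensor:
    """Mixed CLIP-LLA variation (sketch). The averaged reference-image feature forms one extra
    positive-negative pair [+R_feat, -R_feat] appended to the N textual pairs."""
    img_feat = F.normalize(img_feat, dim=-1)                                   # (B, d)
    r = F.normalize(ref_feat.mean(dim=0, keepdim=True), dim=-1)                # (1, d)
    vis_pair = torch.stack([r, -r], dim=1)                                     # (1, 2, d)
    pairs = torch.cat([F.normalize(text_pair_feats, dim=-1), vis_pair], dim=0) # (N+1, 2, d)
    p_pos = (torch.einsum("bd,npd->bnp", img_feat, pairs) * t).softmax(dim=-1)[..., 0]
    return 1.0 - p_pos.mean()                                                  # averages over the N+1 pairs

# Final objective with lambda = 1 (enhanced/target tensors and the CLIP encoder are assumed to exist):
#   loss = F.l1_loss(enhanced, target) + 1.0 * clip_lla_mixed(img_feat, text_pair_feats, ref_feat)
```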

3.3. Pixel-Reordering Subsampling (PRS): Multi-Scale Training for Noise Robustness

To enhance the local detail preservation of Retinexformer and guide consistent output, we introduce pixel-reordering subsampling (PRS). This approach is inspired by pretext task design strategies in self-supervised learning, drawing upon the image partitioning and rearrangement strategy of PsyNet [45] as well as the subsampler-based training set generation method utilized in single-image denoising methods [46,47].
PRS is a data augmentation technique that downsamples both the input and target images into four half-resolution images through pixel rearrangement. Specifically, an image $I \in \mathbb{R}^{H \times W \times 3}$ is divided into non-overlapping $2 \times 2$ cells, and for each cell, the pixel value at one fixed position among the four candidates is sampled. This procedure yields four half-resolution images $\{ I_k \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 3} \}_{k=1}^{4}$, each assembled from one sub-pixel position, as illustrated in Figure 3. PRS is applied simultaneously to both the low-light input $I$ and the ground-truth target $R$, and the resulting pairs are treated analogously to the original dataset for training.
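The subsampling itself reduces to strided indexing; a minimal sketch follows (the function name and tensor layout are ours).

```python
import torch

def pixel_reordering_subsample(img: torch.Tensor) -> list:
    """Split an image (C, H, W) with even H, W into four half-resolution images, each taking
    the pixel at one fixed position of every non-overlapping 2x2 cell. The ordering follows
    the relative pixel position: top-left, top-right, bottom-left, bottom-right."""
    return [
        img[:, 0::2, 0::2],  # top-left pixel of each 2x2 cell
        img[:, 0::2, 1::2],  # top-right
        img[:, 1::2, 0::2],  # bottom-left
        img[:, 1::2, 1::2],  # bottom-right
    ]

# Applied identically to the low-light input and its ground truth, yielding four extra
# half-resolution training pairs per image.
low, gt = torch.rand(3, 256, 256), torch.rand(3, 256, 256)
low_subs, gt_subs = pixel_reordering_subsample(low), pixel_reordering_subsample(gt)
print(low_subs[0].shape)  # torch.Size([3, 128, 128])
```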
Unlike PsyNet, where symmetries may allow semantically identical regions to be exploited, PRS arranges sub-images in a parallel manner without symmetry. Although training patches may cross sub-image boundaries, the presence of these seams increases local patch variation, which can positively influence the performance of vision models [48]. Moreover, PRS alleviates the reduced sampling frequency of near-boundary regions that commonly arises in random cropping. It is important to note that PRS is employed solely during training; inference proceeds without modification.
We fix the ordering of the four sub-images to match the relative pixel positions they were sampled from. In practice, because LLIE benchmark datasets are relatively small, PRS can be performed and saved before training begins. PRS therefore introduces no additional training time or computational overhead, while effectively doubling the dataset and preventing overfitting to the noise characteristics of a single scale.

3.4. Summary of Proposed Algorithm

The proposed LLIE network is trained from scratch. Before training, PRS can optionally be precomputed and stored as an additional augmentation of the training dataset. The network is then trained in a fully supervised manner using paired low-light and ground-truth well-lit images, enabling the network to predict the latter from the former. If PRS is not precomputed, it is applied on the fly to 50 % of the training pairs so that the model learns both scales consistently.
The training objective consists of the standard L1 loss, which enforces pixel-wise fidelity between the prediction and the ground truth, combined with the proposed CLIP-LLA loss that encourages perceptual alignment. After training, the network infers well-lit images directly from low-light inputs without requiring PRS during inference.
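A condensed sketch of one training step under this protocol is shown below. The 50% PRS probability and the λ = 1 weighting follow the text; sampling a single sub-image per step, rather than enumerating all four, is an assumption made to keep the sketch short.

```python
import random
import torch
import torch.nn.functional as F

def training_step(model, low, gt, clip_lla_loss, lam: float = 1.0, prs_prob: float = 0.5):
    """One supervised step (sketch). With probability prs_prob the pair is swapped for one of
    its PRS sub-images, so the model sees both scales; `clip_lla_loss` stands for the full
    CLIP-LLA pipeline (CLIP encoding of the prediction followed by the loss described above)."""
    if random.random() < prs_prob:
        k = random.randrange(4)                      # one of the four sub-pixel positions
        low = low[..., k // 2::2, k % 2::2]
        gt = gt[..., k // 2::2, k % 2::2]
    pred = model(low)
    loss = F.l1_loss(pred, gt) + lam * clip_lla_loss(pred)
    loss.backward()                                  # optimizer step / zero_grad omitted
    return loss.detach()
```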

4. Experiment

4.1. Dataset and Implementation Details

4.1.1. sRGB Datasets

We employ the widely used benchmark LOL dataset for both training and evaluation. The dataset is divided into three variants: v1 [14], v2-real, and v2-synthetic [49]. Following the protocol proposed in the original work, we adopt the official training–test splits of 485:15, 689:100, and 900:100, respectively.

4.1.2. RAW Datasets

We utilize four RAW datasets: SID, SMID, SDSD-indoor, and SDSD-outdoor. The SID dataset [50] is designed to address extreme low-light conditions and consists of 2697 paired samples captured with a Sony α 7S II camera. Following the standard protocol, the dataset is split into 2099 training pairs and 598 test pairs. The SMID dataset [51] provides real noisy–clean pairs acquired through dual smartphone captures, and it is widely recognized as a benchmark for image denoising. We use 15,763 pairs for training and 5063 pairs for testing. The SDSD dataset [52] is derived from video sequences recorded with a smartphone camera and decomposed into frame-level pairs. It contains realistic degradations including camera shake, light leakage, and sensor noise. Specifically, SDSD-indoor comprises 7778 training pairs and 360 test pairs, whereas SDSD-outdoor contains 13,690 training pairs and 390 test pairs.

4.1.3. Evaluation Metrics

For the evaluation, four metrics are employed, which can be divided into two groups. PSNR and SSIM [53] are the most commonly adopted indicators for assessing LLIE performance, although they often diverge considerably from human perception. To assess reconstruction fidelity, we employ the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM). PSNR quantifies the pixel-wise error between a reference image $I$ and a reconstructed image $\hat{I}$ as follows:

$$\mathrm{PSNR} = 10 \log_{10} \frac{L^2}{\mathrm{MSE}}, \qquad \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( I_i - \hat{I}_i \right)^2,$$

where $L$ denotes the maximum possible pixel value and $N$ the number of pixels. A higher PSNR indicates a smaller reconstruction error. SSIM, in contrast, measures perceptual similarity considering luminance, contrast, and structural information as follows:

$$\mathrm{SSIM}(I, \hat{I}) = \frac{(2 \mu_I \mu_{\hat{I}} + C_1)(2 \sigma_{I\hat{I}} + C_2)}{(\mu_I^2 + \mu_{\hat{I}}^2 + C_1)(\sigma_I^2 + \sigma_{\hat{I}}^2 + C_2)},$$

where $\mu$, $\sigma^2$, and $\sigma_{I\hat{I}}$ denote the mean, variance, and covariance of the image intensities, and $C_1$, $C_2$ are small stabilizing constants. Larger SSIM values (closer to 1) represent better perceptual quality.
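For reference, PSNR follows directly from the definition above; a short NumPy sketch is given below, with SSIM delegated to scikit-image in the comment (check the installed version for the exact argument names).

```python
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR = 10 * log10(L^2 / MSE), with L the maximum pixel value (1.0 for [0, 1] images)."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

# SSIM is usually taken from scikit-image; recent versions expose channel_axis for color images:
#   from skimage.metrics import structural_similarity as ssim
#   score = ssim(ref, rec, data_range=1.0, channel_axis=-1)
```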
Although SSIM measures perceptual structures and visual quality more faithfully, it is still lacking in terms of alignment to human perception, as it overlooks uniform regions and distortion near strong edges [54,55]. To complement this limitation, two additional perceptually oriented metrics are introduced. LPIPS [56] quantifies the distance between feature maps extracted from a pre-trained network, thereby yielding results that are more consistent with human visual perception. DISTS [57], on the other hand, explicitly separates the measurement of structural and textural information, leading to a reliable alignment with human preference when evaluating distortions.

4.1.4. Training and Evaluation Protocol

The models were independently trained and evaluated on each dataset. For patch extraction, we set the patch size to 256 × 256 for LOL-v2-real and LOL-v2-synthetic, and to 128 × 128 for the remaining datasets. In each iteration, the training pairs are augmented by PRS with a probability of 0.5. In experiments based on RAW datasets, the number of training iterations and the phase division of the cosine annealing schedule were kept identical to those used for the real-image datasets, while the batch size and learning rate were halved; all other configurations followed those of RetinexFormer. Notably, during reproduction, we observed that performance deviated substantially from the results reported in the original paper for certain datasets. Differences in the dataset resolutions used across benchmark reports may explain the PSNR discrepancies on the LOLv1 and LOLv2-real datasets. Consequently, we re-trained and evaluated each model on each dataset using the official code and configurations released by its authors, and report those results herein.

4.2. Low-Light Image Enhancement

The quantitative comparison of various state-of-the-art LLIE methods and the proposed approach is presented in Table 1. Our method outperforms other approaches, including KinD [15], Restormer [31], and MIRNet [64], on the majority of the benchmark datasets. A key advantage of the proposed method is that it retains the efficiency of the baseline model, Retinexformer, particularly its low parameter count (Param) and FLOPs during inference. Furthermore, all three variations of the CLIP-LLA loss, as detailed in the relevant section, demonstrated overall performance improvements compared to the baseline Retinexformer. Notably, the mixed variation, which leverages both textual and visual guidance, achieved significant gains: a PSNR increase of +0.38 dB on the LOL-v2-synthetic dataset and a more substantial increase of +2.73 dB on the SMID dataset.
Table 2 summarizes the extent to which the proposed method improves upon the baseline in terms of perceptually accurate evaluation metrics. In most cases, the proposed approach achieves significant improvements, including on the SDSD dataset where the pixel-wise low-level metric did not improve. The greatest effectiveness is observed in the mixed configuration, where both textual and visual guidance are employed simultaneously. Compared to the naive Retinexformer, our method outperforms in 13 out of 14 quantitative metrics across seven datasets.
Qualitative results are presented in Figure 4 and Figure 5. As shown in Figure 4, KinD suffers from reduced local saturation, resulting in tonal inconsistencies with the ground truth. SNR-Net, in contrast, tends to suppress high-frequency details, such as sharp object boundaries. Retinexformer, which serves as the most direct point of comparison with our approach, fails to preserve fine details and produces an overall reddish global tone that deviates from the reference. In comparison, the proposed method yields results that are most consistent with the ground truth, both in terms of tonal fidelity and textural detail.
Figure 5 and Figure 6 illustrate the experimental results on the LOLv2-synthetic and SMID datasets, respectively. The LOLv2-synthetic dataset presents a considerable challenge due to its wide dynamic range. On this dataset, the proposed method produces reconstructions that most closely approximate the ground truth, particularly with respect to sky coloration and image sharpness. The SMID dataset, by contrast, is characterized by severe noise-induced distortions, making it a highly demanding benchmark. Within this setting, the proposed method is uniquely capable of accurately restoring stripe-like structures, even within extremely dark regions, thereby demonstrating its robustness under adverse imaging conditions. Additional qualitative results are displayed in Appendix B.

4.3. Ablation Study

We further include an ablation study focusing on the design of the individual modules within our proposed framework. Table 3 presents the results when utilizing only one of the two key components: the proposed CLIP-LLA loss or pixel-reordering subsampling (PRS). The results show that employing the CLIP-based loss alone negatively impacts the pixel-wise metrics PSNR and SSIM. Conversely, incorporating multi-scale training plays a crucial role in preventing pixel-wise overfitting to the domain characteristics of the training dataset, leading to a significant improvement in the perceptually oriented metrics. When both the CLIP-LLA loss and the PRS module are integrated into the final model, we observe a considerable performance gain, in line with their complementary effects.

5. Discussion and Conclusions

In this paper, we propose a methodology that integrates the rich visual–semantic representations of the image–text foundation model CLIP into the training scheme of Retinexformer, while enhancing robustness through a multi-scale dataset augmentation strategy. By introducing CLIP-based perceptual supervision, the model effectively compensates for the information deficiency inherent in pixel-wise losses, which tend to favor flat, averaged predictions and fail to reconstruct fine details. Furthermore, the inherently ill-posed global tone prediction is guided more accurately through CLIP’s high-level semantic alignment between visual and linguistic domains. Complementarily, the proposed multi-scale training scheme prevents overfitting to the fixed-resolution noise characteristics of the dataset, and the corresponding results show improved generalization and perceptual coherence. As a result, the proposed approach achieves substantial gains in both global tone alignment and fine-detail restoration, consistently outperforming the baseline across seven benchmark datasets.
Currently, state-of-the-art (SotA) methods in image restoration, particularly in low-light image enhancement (LLIE), are primarily evaluated using pixel-wise comparison metrics such as PSNR and SSIM. When models are trained with a pixel-wise distance loss (e.g., L1) against the ground truth, they are effectively optimized directly for these metrics, which often conflicts with perceptually meaningful fidelity [53]. This trade-off in restoration models has been thoroughly analyzed and empirically demonstrated in prior literature [66,67]. Beyond LLIE models pursuing ever-higher PSNR, the PSNR metric itself is known to be overly sensitive to minor, perceptually negligible differences, such as global tone shifts or slight brightness inconsistencies. Future models are therefore expected to avoid overfitting to such metrics and instead aim for a more balanced optimization that aligns with human perception.
It should be noted that typical LLIE benchmarks span fewer than 500 scenes, which stands in contrast to the scalability that transformer-based architectures are designed to leverage. This limitation motivates LLIE networks that achieve competitive performance through more efficient architectures and training schemes. In principle, the proposed CLIP-LLA and PRS modules are not confined to Retinexformer, which opens their applicability to general end-to-end LLIE networks, potentially enhancing both robustness and perceptual quality.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the MSIP (No. RS-2025-00520207); and by an IITP grant funded by the Korea government (MSIT) and a KEIT grant funded by the Korea government (MOTIE) (No. 2022-0-01045).

Data Availability Statement

The datasets used in this study are publicly available. Specifically, the LOL dataset [14], LOL-v2 dataset [49], SID dataset [50], SMID dataset [51], and SDSD dataset [52] were utilized for experimental validation. No new data were created or analyzed in this study. All datasets cited above are publicly accessible and can be obtained from the corresponding references.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Detail on Network Framework and Algorithm

The detailed architecture of the proposed framework is displayed in Figure A1. Based on the baseline RetinexFormer architecture, the network consists of an illumination estimator and multiple stages of IGA (Illumination-Guided Attention) blocks that predict the residual of the enhanced image. Multiple IGA blocks form an encoder–bottleneck–decoder structure that downsamples the feature maps to one-quarter resolution before upsampling them again. Downsampling is performed using stride-2 4 × 4 convolution layers, while upsampling employs 2 × 2 stride-2 transposed convolutions, followed by a 1 × 1 convolution applied to the concatenated encoder features.
Figure A1. The detailed architecture of the proposed framework. Please refer to Figure 3 and Section 3.3 for details on pixel-reordering subsampling (PRS).
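A minimal PyTorch sketch of this down/upsampling pattern follows; only the kernel and stride choices are taken from the text above, while the channel widths (doubling on downsampling), bias settings, and module names are assumptions.

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    # Stride-2 4x4 convolution halves the spatial resolution; doubling the channels is an assumption.
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch * 2, kernel_size=4, stride=2, padding=1, bias=False)

    def forward(self, x):
        return self.conv(x)

class Upsample(nn.Module):
    # 2x2 stride-2 transposed convolution restores the resolution; the encoder (skip) features are
    # concatenated and fused back to `ch` channels with a 1x1 convolution.
    def __init__(self, ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(ch * 2, ch, kernel_size=2, stride=2, bias=False)
        self.fuse = nn.Conv2d(ch * 2, ch, kernel_size=1, bias=False)

    def forward(self, x, skip):
        x = self.up(x)
        return self.fuse(torch.cat([x, skip], dim=1))

# Quick shape check on a dummy feature map.
feat = torch.randn(1, 40, 64, 64)
down, up = Downsample(40), Upsample(40)
bottom = down(feat)              # (1, 80, 32, 32)
restored = up(bottom, feat)      # (1, 40, 64, 64)
print(bottom.shape, restored.shape)
```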
Compared to a standard transformer block followed by a feed-forward network, IGAB implements the following modifications to suit the LLIE task:
  • Utilization of illumination features. IG-MSA multiplies the value vectors by the illumination map from the estimator before attention; standard blocks operate solely on the token features.
  • Positional branch. IGAB adds a depthwise 3 × 3 convolutional positional stream to the attention output; transformers rely on fixed or learned positional embeddings without extra conv branches.
  • Channel-wise FFN. IGAB’s feed-forward path is fully convolutional (1 × 1 → depthwise 3 × 3 → 1 × 1, with GELU activations); standard transformers use linear layers. A sketch of this path follows the list.
  • PreNorm placement. IGAB wraps only the FFN with PreNorm; attention path skips LayerNorm to keep illumination modulation raw, whereas standard blocks apply LayerNorm before both attention and FFN.
  • Residual flow. IGAB applies residual adds after both attention + positional fusion and the conv MLP; transformers use two residuals but without the illumination gating step.
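To make the channel-wise FFN and PreNorm items above concrete, the following PyTorch sketch implements such a pre-normalized, fully convolutional feed-forward path with a residual connection; the expansion factor, bias settings, and module name are assumptions rather than the authors’ exact implementation.

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    # Channel-wise feed-forward path: 1x1 conv -> depthwise 3x3 conv -> 1x1 conv, with GELU,
    # wrapped in PreNorm and a residual add (expansion factor of 4 is an assumption).
    def __init__(self, ch, expansion=4):
        super().__init__()
        hidden = ch * expansion
        self.norm = nn.LayerNorm(ch)  # PreNorm applied only to the FFN branch
        self.net = nn.Sequential(
            nn.Conv2d(ch, hidden, kernel_size=1, bias=False),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),  # depthwise
            nn.GELU(),
            nn.Conv2d(hidden, ch, kernel_size=1, bias=False),
        )

    def forward(self, x):
        # x: (B, C, H, W); LayerNorm normalizes over channels, so permute to channels-last and back.
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x + self.net(y)  # residual add after the conv MLP

print(ConvFFN(40)(torch.randn(1, 40, 32, 32)).shape)  # torch.Size([1, 40, 32, 32])
```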

Appendix B. Additional Qualitative Results

We provide additional comparisons on different benchmark datasets in Figure A2, Figure A3 and Figure A4.
Figure A2. Qualitative results on the RAW dataset SDSD-outdoor. The proposed method preserves texture and prevents over-flattening, especially in the zoomed-in section. It also offers the closest match to the ground truth in terms of global brightness and tonal appearance.
Figure A3. Qualitative results on RAW dataset SDSD-indoor. The baseline RetinexFormer exhibits flickering blotchy artifacts on the white wall, and produces faint white halos around sharp edges in the zoomed-in region. In contrast, our method recovers the scene more faithfully, without such artifacts.
Figure A4. Qualitative results on real-world indoor dataset LOLv1. Considering the bright yellow tone of the plates, Ours—Mixed recovers the closest brightness and color, while perceptually outperforming other methods in overall noise suppression.

Appendix C. Influence of Loss Weight λ

The influence of the CLIP-LLA loss weight λ , used during training, is summarized in Table A1. Overall, larger values of λ tend to improve perceptual metrics, whereas excessively large values (e.g., λ = 4 ) lead to degradation across metrics. We observe that λ = 1 provides a favorable balance, maintaining both pixel-wise accuracy and perceptual similarity. Although the optimal value of λ may vary depending on the dataset characteristics, λ = 1 consistently produced a fine trade-off in our experiments. All qualitative and quantitative results presented in the main paper were obtained using this setting.
Table A1. Effect of the CLIP-LLA loss weight λ on each performance metric, conducted with the LOL-v2-synthetic dataset. The best result in each row is bolded.
λ | 0.25 | 0.5 | 1 | 2 | 4
PSNR↑ | 25.44 | 25.24 | 25.99 | 25.78 | 25.68
SSIM↑ | 0.9318 | 0.9332 | 0.9556 | 0.9335 | 0.9324
LPIPS↓ | 0.0567 | 0.0560 | 0.0546 | 0.0532 | 0.0544
DISTS↓ | 0.0647 | 0.0644 | 0.0639 | 0.0627 | 0.0645

Appendix D. Detailed Computational Efficiency

For comparison, representative networks with comparable performance, shown in the qualitative comparisons, were selected. Table A2 presents their computational efficiency measured under the same hardware setup. All experiments were performed on a single NVIDIA A6000 GPU with 600 × 400 RGB inputs, matching the configuration of LOL-v1 and LOL-v2-real. Relative to RetinexFormer, the proposed approach delivers improved performance with virtually no additional inference time or computational burden.
Table A2. Comparison of computational efficiency among different LLIE models. The best result in each column is bolded.
Model | Inference Time (s) | FLOPs (G) | Memory (MB) | #Params (M)
MIRNet | 0.51 | 785 | 154 | 31.8
KinD | 0.53 | 35.0 | 154 | 8.02
SNR-Net | 0.11 | 26.4 | 149 | 4.01
RetinexFormer | 0.33 | 15.6 | 572 | 1.61
Ours | 0.34 | 15.6 | 572 | 1.61

References

  1. Loh, Y.P.; Chan, C.S. Getting to know low-light images with the exclusively dark dataset. Comput. Vis. Image Underst. 2019, 178, 30–42. [Google Scholar] [CrossRef]
  2. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3496–3504. [Google Scholar]
  3. Li, J.; Li, B.; Tu, Z.; Liu, X.; Guo, Q.; Juefei-Xu, F.; Xu, R.; Yu, H. Light the night: A multi-condition diffusion framework for unpaired low-light enhancement in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 15205–15215. [Google Scholar]
  4. Xian, X.; Zhou, Q.; Qin, J.; Yang, X.; Tian, Y.; Shi, Y.; Tian, D. CROSE: Low-light enhancement by CROss-SEnsor interaction for nighttime driving scenes. Expert Syst. Appl. 2024, 248, 123470. [Google Scholar] [CrossRef]
  5. Zhao, X.; Huang, L.; Li, M.; Han, C.; Nie, T. Atmospheric Scattering Model and Non-Uniform Illumination Compensation for Low-Light Remote Sensing Image Enhancement. Remote Sens. 2025, 17, 2069. [Google Scholar] [CrossRef]
  6. Wu, J.; Ai, H.; Zhou, P.; Wang, H.; Zhang, H.; Zhang, G.; Chen, W. Low-Light Image Dehazing and Enhancement via Multi-Feature Domain Fusion. Remote Sens. 2025, 17, 2944. [Google Scholar] [CrossRef]
  7. Pizer, S.M.; Amburn, E.P.; Austin, J.D.; Cromartie, R.; Geselowitz, A.; Greer, T.; ter Haar Romeny, B.; Zimmerman, J.B.; Zuiderveld, K. Adaptive histogram equalization and its variations. Comput. Vis. Graph. Image Process. 1987, 39, 355–368. [Google Scholar] [CrossRef]
  8. Zeng, H.; Cai, J.; Li, L.; Cao, Z.; Zhang, L. Learning image-adaptive 3d lookup tables for high performance photo enhancement in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2058–2073. [Google Scholar] [CrossRef]
  9. Lore, K.G.; Akintayo, A.; Sarkar, S. LLNet: A deep autoencoder approach to natural low-light image enhancement. Pattern Recognit. 2017, 61, 650–662. [Google Scholar] [CrossRef]
  10. Lv, F.; Lu, F.; Wu, J.; Lim, C. MBLLEN: Low-light image/video enhancement using cnns. In Proceedings of the BMVC, Newcastle upon Tyne, UK, 3–6 September 2018; Volume 220, p. 4. [Google Scholar]
  11. Land, E.H.; McCann, J.J. Lightness and retinex theory. J. Opt. Soc. Am. 1971, 61, 1–11. [Google Scholar] [CrossRef]
  12. Guo, X.; Li, Y.; Ling, H. LIME: Low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 2016, 26, 982–993. [Google Scholar] [CrossRef]
  13. Li, M.; Liu, J.; Yang, W.; Sun, X.; Guo, Z. Structure-Revealing Low-Light Image Enhancement Via Robust Retinex Model. IEEE Trans. Image Process. 2018, 27, 2828–2841. [Google Scholar] [CrossRef]
  14. Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep retinex decomposition for low-light enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Zhang, J.; Guo, X. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1632–1640. [Google Scholar]
  16. Cai, Y.; Bian, H.; Lin, J.; Wang, H.; Timofte, R.; Zhang, Y. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12504–12513. [Google Scholar]
  17. Yi, X.; Xu, H.; Zhang, H.; Tang, L.; Ma, J. Diff-retinex: Rethinking low-light image enhancement with a generative diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12302–12311. [Google Scholar]
  18. Wang, T.; Zhang, K.; Zhang, Y.; Luo, W.; Stenger, B.; Lu, T.; Kim, T.K.; Liu, W. LLDiffusion: Learning degradation representations in diffusion models for low-light image enhancement. Pattern Recognit. 2025, 166, 111628. [Google Scholar] [CrossRef]
  19. Zhou, D.; Yang, Z.; Yang, Y. Pyramid diffusion models for low-light image enhancement. arXiv 2023, arXiv:2305.10028. [Google Scholar] [CrossRef]
  20. Jiang, H.; Luo, A.; Fan, H.; Han, S.; Liu, S. Low-light image enhancement with wavelet-based diffusion models. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  21. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  22. Poynton, C. Digital Video and HD: Algorithms and Interfaces; Elsevier: Amsterdam, The Netherlands, 2012. [Google Scholar]
  23. Jobson, D.J.; Rahman, Z.u.; Woodell, G.A. A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans. Image Process. 1997, 6, 965–976. [Google Scholar] [CrossRef] [PubMed]
  24. Wang, R.; Zhang, Q.; Fu, C.W.; Shen, X.; Zheng, W.S.; Jia, J. Underexposed photo enhancement using deep illumination estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6849–6857. [Google Scholar]
  25. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. Enlightengan: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef] [PubMed]
  26. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 14–19 June 2020; pp. 1780–1789. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Event, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  30. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 17683–17693. [Google Scholar]
  31. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5728–5739. [Google Scholar]
  32. Wang, T.; Zhang, K.; Shen, T.; Luo, W.; Stenger, B.; Lu, T. Ultra-high-definition low-light image enhancement: A benchmark and transformer-based method. AAAI Conf. Artif. Intell. 2023, 37, 2654–2662. [Google Scholar] [CrossRef]
  33. Liu, S.; Zhang, H.; Li, X.; Yang, X. Retinexformer+: Retinex-Based Dual-Channel Transformer for Low-Light Image Enhancement. Comput. Mater. Contin. 2025, 82, 1969–1984. [Google Scholar] [CrossRef]
  34. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the International Workshop on Deep Learning in Medical Image Analysis, Granada, Spain, 20 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
  35. Bai, J.; Yin, Y.; He, Q.; Li, Y.; Zhang, X. Retinexmamba: Retinex-based mamba for low-light image enhancement. In Proceedings of the International Conference on Neural Information Processing, Auckland, New Zealand, 2–6 December 2024; pp. 427–442. [Google Scholar]
  36. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
  37. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
  38. Wang, H.; Liang, X.; Han, J.; Geng, W. Spike-RetinexFormer: Rethinking Low-light Image Enhancement with Spiking Neural Networks. In Proceedings of the Thirty-Ninth Annual Conference on Neural Information Processing Systems, San Diego, CA, USA, 2–7 December 2025. [Google Scholar]
  39. Tavanaei, A.; Ghodrati, M.; Kheradpisheh, S.R.; Masquelier, T.; Maida, A. Deep learning in spiking neural networks. Neural Netw. 2019, 111, 47–63. [Google Scholar] [CrossRef]
  40. Uddin, S.; Hussain, B.; Fareed, S.; Arif, A.; Ali, B. Real-world adaptation of retinexformer for low-light image enhancement using unpaired data. Int. J. Ethical AI Appl. 2025, 1, 1–6. [Google Scholar] [CrossRef]
  41. Xu, S.; Chen, Q.; Hu, H.; Peng, L.; Tao, W. An unsupervised fine-tuning strategy for low-light image enhancement. J. Vis. Commun. Image Represent. 2025, 110, 104480. [Google Scholar] [CrossRef]
  42. Feijoo, D.; Benito, J.C.; Garcia, A.; Conde, M.V. Darkir: Robust low-light image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 10879–10889. [Google Scholar]
  43. Xu, R.; Niu, Y.; Li, Y.; Xu, H.; Liu, W.; Chen, Y. URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 21267–21276. [Google Scholar]
  44. Wang, J.; Chan, K.C.; Loy, C.C. Exploring clip for assessing the look and feel of images. AAAI Conf. Artif. Intell. 2023, 37, 2555–2563. [Google Scholar] [CrossRef]
  45. Baek, K.; Lee, M.; Shim, H. Psynet: Self-supervised approach to object localization using point symmetric transformation. AAAI Conf. Artif. Intell. 2020, 34, 10451–10459. [Google Scholar] [CrossRef]
  46. Huang, T.; Li, S.; Jia, X.; Lu, H.; Liu, J. Neighbor2neighbor: Self-supervised denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 19–25 June 2021; pp. 14781–14790. [Google Scholar]
  47. Mansour, Y.; Heckel, R. Zero-shot noise2noise: Efficient image denoising without any data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14018–14027. [Google Scholar]
  48. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032. [Google Scholar]
  49. Yang, W.; Wang, W.; Huang, H.; Wang, S.; Liu, J. Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Trans. Image Process. 2021, 30, 2072–2086. [Google Scholar] [CrossRef]
  50. Chen, C.; Chen, Q.; Do, M.N.; Koltun, V. Seeing motion in the dark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3185–3194. [Google Scholar]
  51. Chen, C.; Chen, Q.; Xu, J.; Koltun, V. Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3291–3300. [Google Scholar]
  52. Wang, R.; Xu, X.; Fu, C.W.; Lu, J.; Yu, B.; Jia, J. Seeing dynamic scene in the dark: A high-quality video dataset with mechatronic alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 9700–9709. [Google Scholar]
  53. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  54. Pambrun, J.F.; Noumeir, R. Limitations of the SSIM quality metric in the context of diagnostic imaging. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 2960–2963. [Google Scholar]
  55. Nilsson, J.; Akenine-Möller, T. Understanding ssim. arXiv 2020, arXiv:2006.13846. [Google Scholar] [CrossRef]
  56. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
  57. Ding, K.; Ma, K.; Wang, S.; Simoncelli, E.P. Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2567–2581. [Google Scholar] [CrossRef]
  58. Yang, W.; Wang, S.; Fang, Y.; Wang, Y.; Liu, J. Band representation-based semi-supervised low-light image enhancement: Bridging the gap between signal fidelity and perceptual quality. IEEE Trans. Image Process. 2021, 30, 3461–3473. [Google Scholar] [CrossRef] [PubMed]
  59. Kosugi, S.; Yamasaki, T. Unpaired image enhancement featuring reinforcement-learning-controlled image editing software. AAAI Conf. Artif. Intell. 2020, 34, 11296–11303. [Google Scholar] [CrossRef]
  60. Moran, S.; Marza, P.; McDonagh, S.; Parisot, S.; Slabaugh, G. Deeplpf: Deep local parametric filters for image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 12826–12835. [Google Scholar]
  61. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 12299–12310. [Google Scholar]
  62. Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 19–25 June 2021; pp. 11197–11206. [Google Scholar]
  63. Xu, K.; Yang, X.; Yin, B.; Lau, R.W. Learning to restore low-light images via decomposition-and-enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 2281–2290. [Google Scholar]
  64. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Learning enriched features for real image restoration and enhancement. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 492–511. [Google Scholar]
  65. Xu, X.; Wang, R.; Fu, C.W.; Jia, J. Snr-aware low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 17714–17724. [Google Scholar]
  66. Blau, Y.; Michaeli, T. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6228–6237. [Google Scholar]
  67. Cohen, R.; Kligvasser, I.; Rivlin, E.; Freedman, D. Looks too good to be true: An information-theoretic analysis of hallucinations in generative restoration models. Adv. Neural Inf. Process. Syst. 2024, 37, 22596–22623. [Google Scholar]
Figure 1. Based on the end-to-end LLIE framework, we propose (i) multi-scale training based on a pixel-reordering subsampler, together with guidance of the training procedure that exploits CLIP’s rich features in both (ii) textual and (iii) visual spaces.
Figure 2. Validation set performance changes of the original Retinexformer on the LOLv2-Real dataset, in terms of evaluation metrics during training. An upward arrow indicates that a higher value means better performance, and vice versa.
Figure 3. Pixel-reordering subsampling (PRS) for efficient multi-scale training. Pixels of both the input and target images are rearranged in a 4-color-checkerboard-like manner to form 4 half-scale images. The numbers are arbitrarily assigned to denote the pixel at the corresponding location. Best viewed in color.
Figure 4. Qualitative results on the real-world dataset LOLv2-real. The proposed method most faithfully recovers the global tone and detailed texture.
Figure 5. Qualitative results on the sRGB dataset LOLv2-synthetic. The proposed method accurately matches the color and sharpness of the target.
Figure 6. Qualitative results on RAW dataset SMID. The proposed method most accurately recovers the linear pattern in the dark area.
Table 1. Quantitative comparison results on the LOL, SID, SMID, and SDSD datasets and their variations. The best result in each column is noted in bold.
(Each dataset cell reports PSNR / SSIM.)
Method | FLOPs (G) | Params (M) | LOL-v1 | LOL-v2-real | LOL-v2-syn | SID | SMID | SDSD-in | SDSD-out
SID [50] | 13.73 | 7.76 | 14.35 / 0.436 | 13.24 / 0.442 | 15.04 / 0.610 | 16.97 / 0.591 | 24.78 / 0.718 | 23.29 / 0.703 | 24.90 / 0.693
3DLUT [8] | 0.075 | 0.59 | 14.35 / 0.445 | 17.59 / 0.721 | 18.04 / 0.800 | 20.11 / 0.592 | 23.86 / 0.678 | 21.66 / 0.655 | 21.89 / 0.649
DeepUPE [24] | 21.1 | 1.02 | 14.38 / 0.446 | 13.27 / 0.452 | 15.08 / 0.623 | 17.01 / 0.604 | 23.91 / 0.690 | 21.70 / 0.662 | 21.94 / 0.698
RF [59] | 46.23 | 21.54 | 15.23 / 0.452 | 14.05 / 0.458 | 15.97 / 0.632 | 16.44 / 0.596 | 23.11 / 0.681 | 20.97 / 0.655 | 21.21 / 0.689
DeepLPF [60] | 5.86 | 1.77 | 15.28 / 0.473 | 14.10 / 0.480 | 16.02 / 0.587 | 18.07 / 0.600 | 24.36 / 0.688 | 22.21 / 0.664 | 22.76 / 0.658
IPT [61] | 6887 | 115.31 | 16.27 / 0.504 | 19.80 / 0.813 | 18.30 / 0.811 | 20.53 / 0.561 | 27.03 / 0.783 | 26.11 / 0.831 | 27.55 / 0.850
UFormer [30] | 12 | 5.29 | 16.36 / 0.771 | 18.82 / 0.771 | 19.66 / 0.871 | 18.54 / 0.577 | 27.20 / 0.792 | 23.17 / 0.859 | 23.85 / 0.748
RetinexNet [14] | 587.47 | 0.84 | 16.77 / 0.560 | 15.47 / 0.567 | 17.13 / 0.798 | 16.48 / 0.578 | 22.83 / 0.684 | 20.84 / 0.617 | 20.96 / 0.629
SparseRN [49] | 53.26 | 2.33 | 17.20 / 0.640 | 20.06 / 0.816 | 22.05 / 0.905 | 18.68 / 0.606 | 25.48 / 0.766 | 23.25 / 0.863 | 25.28 / 0.804
EnGAN [25] | 61.01 | 114.35 | 17.48 / 0.650 | 18.23 / 0.617 | 16.57 / 0.734 | 17.23 / 0.543 | 22.62 / 0.674 | 20.02 / 0.604 | 20.10 / 0.616
RUAS [62] | 0.83 | 0.003 | 18.23 / 0.720 | 18.37 / 0.723 | 16.55 / 0.652 | 18.44 / 0.581 | 25.88 / 0.744 | 23.17 / 0.696 | 23.84 / 0.743
FIDE [63] | 28.51 | 8.62 | 18.27 / 0.665 | 16.85 / 0.678 | 15.20 / 0.612 | 18.34 / 0.578 | 24.42 / 0.692 | 22.41 / 0.659 | 22.20 / 0.629
DRBN [58] | 48.61 | 5.27 | 20.13 / 0.830 | 20.29 / 0.831 | 23.22 / 0.927 | 19.02 / 0.577 | 26.60 / 0.781 | 24.08 / 0.868 | 25.77 / 0.841
KinD [15] | 34.99 | 8.02 | 20.86 / 0.790 | 14.74 / 0.641 | 13.29 / 0.578 | 18.02 / 0.583 | 22.18 / 0.634 | 21.95 / 0.672 | 21.97 / 0.654
Restormer [31] | 144.25 | 26.13 | 22.43 / 0.823 | 19.94 / 0.827 | 21.41 / 0.830 | 22.27 / 0.649 | 26.97 / 0.758 | 25.67 / 0.827 | 24.79 / 0.802
MIRNet [64] | 785 | 31.76 | 24.14 / 0.830 | 20.02 / 0.820 | 21.94 / 0.876 | 20.84 / 0.605 | 25.66 / 0.762 | 24.38 / 0.864 | 27.13 / 0.837
SNR-Net [65] | 26.35 | 4.01 | 24.61 / 0.842 | 21.48 / 0.849 | 24.14 / 0.928 | 22.87 / 0.625 | 28.49 / 0.805 | 29.44 / 0.894 | 28.66 / 0.866
RetinexFormer [16] | 15.57 | 1.61 | 22.78 / 0.883 | 20.06 / 0.862 | 25.61 / 0.955 | 24.38 / 0.677 | 29.12 / 0.813 | 27.38 / 0.886 | 28.54 / 0.864
Ours—Textual | 15.57 | 1.61 | 22.71 / 0.876 | 20.98 / 0.851 | 24.70 / 0.919 | 24.12 / 0.668 | 28.60 / 0.802 | 27.84 / 0.876 | 28.45 / 0.856
Ours—Visual | | | 23.47 / 0.883 | 20.23 / 0.864 | 25.64 / 0.954 | 24.22 / 0.670 | 31.34 / 0.832 | 28.08 / 0.886 | 28.21 / 0.862
Ours—Mixed | | | 22.66 / 0.882 | 20.23 / 0.844 | 25.99 / 0.956 | 24.49 / 0.675 | 31.85 / 0.834 | 26.65 / 0.881 | 28.87 / 0.867
Table 2. Perceptual evaluation results (LPIPS [56] and DISTS [57]) comparing our proposed methods against the RetinexFormer baseline. Note that lower values indicate better performance for both LPIPS and DISTS. The best result in each column is bolded.
(Each cell reports LPIPS / DISTS.)
Method | LOL-v1 | LOL-v2-real | LOL-v2-syn | SID | SMID | SDSD-in | SDSD-out
RetinexFormer | 0.145 / 0.141 | 0.168 / 0.148 | 0.059 / 0.067 | 0.357 / 0.209 | 0.164 / 0.133 | 0.137 / 0.123 | 0.167 / 0.117
Ours—Textual | 0.160 / 0.139 | 0.195 / 0.156 | 0.124 / 0.134 | 0.358 / 0.210 | 0.179 / 0.136 | 0.152 / 0.120 | 0.165 / 0.106
Ours—Visual | 0.149 / 0.142 | 0.163 / 0.142 | 0.057 / 0.067 | 0.352 / 0.211 | 0.157 / 0.129 | 0.133 / 0.121 | 0.176 / 0.120
Ours—Mixed | 0.140 / 0.137 | 0.164 / 0.142 | 0.055 / 0.064 | 0.355 / 0.210 | 0.154 / 0.131 | 0.129 / 0.119 | 0.164 / 0.118
Table 3. Ablation study of the proposed method on the LOL-v2-synthetic dataset. The mixed variation is used for the proposed CLIP-LLA loss. The best result in each column is bolded.
Model | PSNR (dB)↑ | SSIM↑ | LPIPS↓ | DISTS↓
Baseline | 25.61 | 0.9545 | 0.0594 | 0.0673
w/o PRS | 25.30 | 0.9313 | 0.0603 | 0.0671
w/o CL | 25.10 | 0.9308 | 0.0584 | 0.0656
Ours—Mixed | 25.99 | 0.9556 | 0.0546 | 0.0639
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
