4.1. Datasets and Evaluation Metrics
We evaluate on nine commonly used LLIE benchmark datasets: LOLv1 [30], LOLv2 [31], DICM [32], LIME [33], MEF [34], NPE [35], VV [36], SICE [37], and Sony-Total-Dark. For paired datasets, we report the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) [38], and Learned Perceptual Image Patch Similarity (LPIPS) [39] metrics; for unpaired datasets, we report the Natural Image Quality Evaluator (NIQE) [40] metric.
LOL. The LOL dataset has two main variants: LOLv1 [30] and LOLv2 [31]. LOLv1 contains 485 paired training images and 15 test pairs. LOLv2 is further divided into LOLv2-real and LOLv2-synthetic: the former comprises 689 training pairs and 100 test pairs of real-captured images, while the latter contains 900 synthetic training pairs and 100 synthetic test pairs. During training, we process LOLv1 and LOLv2-real by extracting 400 × 400 patches from each image and optimize the model for 1500 epochs with a batch size of 8. For LOLv2-synthetic, we adopt a slightly different setup, cropping images into 384 × 384 patches and training for 500 epochs with a batch size of 1.
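The paired patch extraction described above can be sketched as follows. This is an illustrative NumPy stand-in, not the paper's code; the function and argument names are our own, and the key point is only that the same random window must be cut from the low-light input and its normal-light reference.

```python
import numpy as np

def random_paired_crop(low, normal, patch=400, rng=None):
    """Cut the same random patch from a low-light/normal-light pair.

    Illustrative sketch of paired patch extraction; names are ours,
    not from the paper's implementation.
    """
    rng = rng or np.random.default_rng()
    h, w = low.shape[:2]
    top = int(rng.integers(0, h - patch + 1))
    left = int(rng.integers(0, w - patch + 1))
    window = (slice(top, top + patch), slice(left, left + patch))
    return low[window], normal[window]

# Example with a dummy 400 x 600 pair (typical LOL resolution).
low = np.zeros((400, 600, 3), dtype=np.float32)
ref = np.ones((400, 600, 3), dtype=np.float32)
lp, rp = random_paired_crop(low, ref, patch=400)
print(lp.shape, rp.shape)  # (400, 400, 3) (400, 400, 3)
```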
Unpaired Datasets. For the unpaired datasets DICM [32], LIME [33], MEF [34], NPE [35], and VV [36], we test the model trained on the LOLv2-synthetic dataset to evaluate its generalization ability.
SICE. The SICE dataset [37] includes 4803 training images captured under varying lighting conditions, including low-light and overexposed scenarios. For testing, we employ two subsets, SICE-Mix and SICE-Grad [41], totaling 589 evaluation images. During training, images are divided into 160 × 160 patches, and the model is optimized for 1000 epochs with a batch size of 10.
Sony-Total-Dark. This dataset is constructed from a subset of SID [42]. The training set contains 1866 images captured by sensors under extremely low-light conditions, and the test set consists of 598 images captured at different exposure times, covering conditions ranging from short-exposure low-light to long-exposure normal-light. To increase the difficulty of the task, we converted the raw images to sRGB format without applying gamma correction, producing significantly darker outputs. We cropped the training images into 256 × 256 patches and trained the model for 1000 epochs with a batch size of 4.
4.3. Comparisons with the State of the Art
Results on LOL Datasets. As shown in Table 1, our method achieves the best results on the PSNR and LPIPS metrics. The MSGD module effectively suppresses noise in low-light images, significantly reducing pixel-level errors and yielding the best PSNR. Meanwhile, the LFIE module focuses on low-frequency information, avoiding local overexposure during enhancement and producing lighting adjustments more consistent with human perception, which explains the best LPIPS score. As shown in Figure 4, in the first row, our method handles low-light images with uneven lighting distribution, effectively avoiding local overexposure and the accompanying loss of detail during enhancement. In the second row, compared with other methods, our approach not only suppresses noise effectively but also better preserves key image details. In the third row, when facing interference from high-brightness areas in low-light images, our method avoids the overall brightness decrease seen in other methods. In the fourth row, our method avoids the local overexposure or insufficient enhancement exhibited by other methods. In the fifth row, in contrast to the color distortion introduced by other methods, our method maintains accurate color reproduction while enhancing detailed areas of the image.
Results on SICE and Sony-Total-Dark. To validate the performance of our model under extreme low-light conditions and mixed low-light/overexposure conditions, we tested it on the SICE and Sony-Total-Dark datasets. As shown in Table 2, the results on the SICE dataset indicate that our model adapts well to different lighting distributions, owing to the LFIE module's focused processing of low-frequency information. In addition, each image in the SICE-Mix and SICE-Grad subsets mixes three conditions: low-light, normal-light, and overexposure. The results on these two subsets further demonstrate the model's generalization ability and adaptability to extreme conditions. The results on the Sony-Total-Dark dataset show that our model not only performs best under extreme low-light conditions but also effectively suppresses the sensor noise generated in such environments.
Results on Unpaired Datasets. We use the NIQE metric [40] to assess the visual quality of enhanced unpaired images. As shown in Table 3, our method achieves the best performance on the DICM, LIME, and MEF datasets and the second-best performance on the NPE dataset. This indicates that our enhancement results have advantages in naturalness and visual realism, avoiding the local overexposure caused by excessive processing. As shown in Figure 5, our method exhibits clear advantages in low-light image enhancement, especially on images with uneven lighting distribution. In the first four rows of the figure, our method avoids local overexposure during enhancement. In the fifth row, compared with LLFlow, our method enhances the image while fully preserving the cloud details in the background, whereas the comparison method shows a significant loss of detail.
4.4. Ablation Studies
Comparison between Various Modules. We conducted four quantitative ablation experiments on the LOLv1 dataset to validate the contribution of each module. As shown in Table 4, Experiments 2 and 3 outperform Experiment 1 on all evaluation metrics, which directly reflects the effectiveness of each module. Experiment 2 achieves significant improvements in PSNR and SSIM over Experiment 1, mainly due to LFIE's focus on the low-frequency components of the intensity map. By optimizing low-frequency information, the model improves the overall brightness distribution of the image, so that darker areas receive more enhancement while brighter areas remain relatively stable. This brings the enhanced image closer to the reference in terms of brightness distribution, directly reducing pixel-level differences, which is the main reason for the PSNR improvement. The improved brightness distribution also strengthens structural similarity, so SSIM rises accordingly. Finally, Experiment 3 achieves a further PSNR improvement over Experiment 2, because MSGD acts directly on the HV color map and can therefore estimate the noise in the image more accurately and perform targeted denoising. As shown in Figure 6, we compare an image from the LOLv1 test set. Under the lighting optimization of the LFIE module, Experiment 2 restores the details outside the left window better than Experiments 1 and 3. Meanwhile, Experiment 3 shows superior noise suppression in the middle area of the image, significantly surpassing Experiments 1 and 2. In addition, as shown in Table 5, we report the computational complexity of each module to provide more detailed information.
Feasibility Study Focusing on Low-Frequency Processing. In the intensity map, illumination distribution information mainly resides in the low-frequency components. Based on this characteristic, we adopt a frequency-domain separation method that decomposes the image into high- and low-frequency components and processes the low-frequency information in a targeted manner, optimizing the global illumination distribution and alleviating local overexposure during enhancement. To verify this, we conducted four quantitative experiments: the first extracts high-frequency information for independent processing; the second fully separates high and low frequencies and processes them separately; the third processes high and low frequencies jointly without any separation; and the fourth isolates low-frequency information for dedicated processing. As shown in Table 6, the low-frequency emphasis method performs best on all evaluation metrics. It separates the low-frequency information and processes it specifically to adjust illumination uniformity, while using a reduction coefficient to control the contribution of the original features; this addresses local overexposure while avoiding both the loss of high-frequency information and the amplification of noise. In contrast, the high-frequency emphasis method has clear shortcomings: since the high-frequency region of the illumination intensity map contains little effective detail and is dominated by noise [17], enhancing the high-frequency component significantly amplifies noise and cannot correct uneven low-frequency illumination. Although the complete high/low-frequency separation method performs well on some metrics, it significantly degrades LPIPS because it destroys the natural correlation between frequency bands, reflecting a drop in perceptual quality. The method without separation is superior to the high-frequency emphasis method, but lacking band-specific processing, it cannot separately optimize low-frequency lighting and high-frequency details, and it still amplifies high-frequency noise. Overall, the low-frequency emphasis method exhibits the most comprehensive performance due to its targeted frequency-band processing strategy.
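A minimal frequency-separation sketch shows how the branches compared above can be obtained. This uses an ideal low-pass mask in NumPy's Fourier domain; the actual LFIE decomposition and cutoff are not specified here, so both are illustrative assumptions.

```python
import numpy as np

def split_frequencies(img, cutoff=0.1):
    """Decompose a single-channel intensity map into low- and
    high-frequency components with an ideal low-pass mask in the
    Fourier domain. A simplified stand-in for the separation used
    in the ablation; the cutoff and mask shape are our choices."""
    F = np.fft.fft2(img)
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    mask = (np.sqrt(fx**2 + fy**2) <= cutoff).astype(float)
    low = np.real(np.fft.ifft2(F * mask))
    high = np.real(np.fft.ifft2(F * (1 - mask)))
    return low, high

img = np.random.default_rng(0).random((32, 32))
low, high = split_frequencies(img, cutoff=0.1)
print(np.allclose(low + high, img))  # True: the split is lossless
```

Because the two masks sum to one everywhere, `low + high` reconstructs the input exactly, which is why the four experimental variants differ only in which band is processed, not in information content.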
Residual Contribution Ratio. As shown in Figure 2, in the LFIE module we set a reduction coefficient to control the contribution of the residual part; changing this value leads to different effects. As shown in Figure 7, as the residual contribution ratio increases, all evaluation metrics first improve and then degrade. This is because the residual contains abundant high-frequency information, including texture features, edge details, and noise in the intensity map [17]. Introducing the residual appropriately compensates for the loss of image detail after low-frequency separation. However, introducing it excessively both amplifies image noise and lets high-frequency information suppress the processed low-frequency information. Conversely, discarding the residual entirely loses part of the image's detail information, and the evaluation metrics also drop significantly. Weighing all metrics, we chose 0.1 as the reduction coefficient.
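The recombination step can be written in one line; the symbol names below are ours. The reduction coefficient scales the high-frequency residual before it is added back, so a coefficient of 0 drops all detail while a large one re-amplifies noise, matching the first-better-then-worse curve in Figure 7.

```python
import numpy as np

def lfie_recombine(enhanced_low, residual, alpha=0.1):
    """Add back a damped high-frequency residual (alpha = 0.1 per the
    ablation). alpha = 0 discards texture and edges entirely; too large
    an alpha re-injects noise and overwhelms the adjusted low-frequency
    band. Sketch only; names are not from the paper's code."""
    return enhanced_low + alpha * residual

base = np.full((4, 4), 0.5)   # processed low-frequency band
res = np.full((4, 4), 0.2)    # high-frequency residual
print(lfie_recombine(base, res)[0, 0])  # 0.5 + 0.1 * 0.2 = 0.52
```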
Comparison of Different Denoising Methods. We use a gate-controlled weighting mechanism to process the denoised features and then apply a residual connection with the original features, aiming to preserve the image's detail information during denoising. To verify the effectiveness of this design, we conducted three sets of quantitative experiments. The first group weights the estimated noise through the gating mechanism and subtracts it directly, which we call direct denoising. The second group balances the original features and the denoised features through the gating mechanism, which we call weak preservation denoising. The third group uses the gating mechanism to weight the denoised features before the residual connection, which we call strong preservation denoising. The specific formulas of the three methods are given below; the group numbers after the formulas correspond one-to-one with the group numbers in Table 7:
As shown in Table 7, the third group achieves the best performance on all four evaluation metrics. For direct denoising, subtracting the weighted noise from the original features suppresses noise effectively but over-smooths the image, causing loss of texture and edges. Compared with direct denoising, weak preservation denoising may appear to balance denoising and detail preservation; however, if the gating value is small, the original features dominate while the noise is insufficiently suppressed, and if the gating value is large, the method reduces to direct denoising and again loses detail. Finally, for strong preservation denoising, we model the original features as the sum of a clean component and noise, F = F_clean + N. When the gate value approaches 1, the estimated noise N̂ approaches the true noise, so the denoising formula becomes F_out = (F − N̂) + F ≈ 2F_clean + N. Although the residual connection reintroduces the noise from the original features, the clean image portion is doubled. Through the layer-by-layer denoising of the UNet encoder and decoder, the clean portion is gradually enlarged while the noise is gradually reduced, achieving denoising while preserving image details. As shown in Figure 8, under extremely low-light conditions, both direct denoising and weak preservation denoising smooth out detailed image features to some extent, leading to color distortion. In contrast, strong preservation denoising not only denoises effectively but also better preserves image details. Therefore, we choose this method to denoise the HV color map branch.
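Under the F = F_clean + N model, strong preservation denoising can be sketched as follows. This is one plausible reading in our own notation and a NumPy stand-in, not the paper's code: the gate weights the denoised features, and the original features are added back residually.

```python
import numpy as np

def strong_preservation_denoise(feat, noise_est, gate):
    """One plausible reading of strong preservation denoising:
    gate-weight the denoised features, then residually connect the
    original features. With gate -> 1 and noise_est -> true noise,
    the output approaches 2 * clean + noise, i.e., the clean part is
    doubled relative to the noise before the next UNet stage."""
    return gate * (feat - noise_est) + feat

clean = np.full(5, 0.8)
noise = np.full(5, 0.3)
feat = clean + noise  # model the original features as clean + noise
out = strong_preservation_denoise(feat, noise_est=noise, gate=1.0)
print(out - 2 * clean)  # what remains is the original noise, 0.3
```

With a perfect gate and noise estimate, the signal-to-noise ratio of the features is doubled at each such stage, which is the mechanism the paragraph above describes.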
Comparison of Different Dilation Rates. In the noise estimation module, we adopt a multi-scale approach with a three-level hierarchical perception structure at each scale to estimate noise in low-light images. We test various dilation rate combinations to evaluate their impact on model performance. As shown in Table 8, the progressive dilation rate combination performs best, benefiting from its ability to hierarchically capture local details, mid-range correlations, and globally distributed noise patterns, thereby achieving comprehensive noise modeling. In contrast, purely global or purely local dilation rates each have significant drawbacks: the global setting captures large-scale noise distributions but misses local details, while the local setting preserves high-frequency information but fails to capture global structure. This confirms the necessity of multi-scale feature fusion in noise estimation. Although the skip dilation rate combination attempts to expand the receptive field, the lack of continuity between dilation rates disrupts the correlation between local and global noise patterns, harming feature fusion. Without dilated convolutions, the model preserves high-frequency noise better but captures only pixel-level features, lacking the contextual information needed for accurate noise estimation. As shown in Figure 9, the heatmap analysis offers more intuitive insight into the performance differences among dilation rate combinations. For the global combination, the heatmap shows the model applying uniform attention across the image, failing to distinguish intensity variations in noisy regions and hindering accurate identification of critical noise areas. The large gaps between dilation rates in the skip combination weaken the correlation between local and global noise patterns, causing the model's focus to deviate from the actual noise distribution. Without dilation, the attention scope is overly limited, capturing only pixel-level local noise features. Finally, while both the local and progressive combinations focus effectively on noise-dense regions, the latter exhibits clearer multi-scale attention in the heatmap: small dilation rates capture fine details, medium rates cover regional patterns, and large rates establish global correlations. This hierarchical attention pattern aligns well with the actual distribution of noise, leading us to select the progressive dilation rate strategy.
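The contrast between progressive and skip dilation rates can be made concrete with a 1-D impulse-response sketch. The naive convolution below and the specific rates (1, 2, 3 for progressive; 1, 4 for skip) are our illustrative choices, not the paper's exact configuration; the point is only that consecutive rates tile the receptive field while skipped rates leave holes.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Naive 'same'-padded 1-D dilated convolution (NumPy stand-in
    for a conv layer; illustration only)."""
    pad = dilation * (len(kernel) - 1) // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j, w in enumerate(kernel):
            out[i] += w * xp[i + j * dilation]
    return out

impulse = np.zeros(21)
impulse[10] = 1.0
taps = np.ones(3)

# Progressive rates: 3-tap kernels at rates 1, 2, 3 stack into a
# receptive field of 1 + 2*(1+2+3) = 13 contiguous samples.
y = impulse
for rate in (1, 2, 3):
    y = dilated_conv1d(y, taps, rate)
print(int(np.count_nonzero(y)))  # 13 contiguous samples

# Skip-style rates: jumping from rate 1 to rate 4 leaves holes in the
# receptive field, breaking local/global continuity.
z = dilated_conv1d(dilated_conv1d(impulse, taps, 1), taps, 4)
print(z[8], z[12])  # 0.0 0.0 -- positions the stack never sees
```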
Comparison of Different Color Spaces. To justify the choice of the HVI color space, we applied multiple color space transformations to the RGB images, decoupled each into illumination and color components, and processed them with the same network architecture. As shown in Table 9, although the LAB color space is designed around the perceptual characteristics of human color vision, so that its color-difference computation matches human perception well, the a and b axes obtained from its decoupling are difficult for the model to adjust, and color bias is especially prone to appear in low-illumination regions. The YUV color space, originally designed for video transmission, adopts chroma downsampling that effectively reduces the data volume but inevitably loses color information. As shown in Figure 10, in the HSV color space the computation of Saturation depends on Value: as luminance decreases, the maximum effective saturation decays nonlinearly, leading to color distortion. Based on this analysis, HVI shows superior performance compared with the other color spaces.
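The Value-dependence of Saturation is easy to verify numerically with the standard library (the pixel values are our toy examples). Since HSV defines S = (max − min) / max, S is measured relative to V (= max), so the same small absolute perturbation barely moves S for a bright pixel but swings it sharply for a dark one.

```python
import colorsys

# Two pixels with identical saturation but very different brightness.
bright = colorsys.rgb_to_hsv(0.50, 0.40, 0.40)
dark = colorsys.rgb_to_hsv(0.05, 0.04, 0.04)
print(round(bright[1], 3), round(dark[1], 3))  # 0.2 0.2 -- same S

# The same +0.01 perturbation on the red channel:
bright_n = colorsys.rgb_to_hsv(0.51, 0.40, 0.40)
dark_n = colorsys.rgb_to_hsv(0.06, 0.04, 0.04)
print(round(bright_n[1], 3), round(dark_n[1], 3))  # ~0.216 vs ~0.333
```

For the bright pixel, saturation shifts by about 8%; for the dark pixel, the identical perturbation shifts it by about 67%. This instability at low Value is the distortion the HSV comparison above refers to.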