1. Introduction
Various algorithms have been proposed to enhance low-light images, particularly under nighttime conditions. These include methods based on single-scale Retinex, dehazing models, enhancement techniques for non-uniform illumination, and deep learning-based approaches. Liang et al. introduced a Retinex-based decomposition method using nonlinear diffusion filtering to accurately estimate the illumination component [
1]. This method incorporates surround suppression to effectively reduce unnecessary textures while preserving edge information, thereby minimizing halo artifacts often observed in traditional Retinex algorithms and enhancing local contrast. Lin and Lu proposed a novel Retinex-based enhancement model by combining Shrinkage Mapping with a Plug-and-Play ADMM framework [
2]. This model applies non-convex Lp regularization to the illumination component to suppress texture while preserving major edges. It also refines the illumination map using iterative self-guided filtering, which helps reduce edge blurring. This integrated strategy provides a balance between visual quality and computational efficiency.
In the context of dehazing-based methods, Yu and Zhu proposed a physics-inspired illumination model to explain the degradation process of low-light images through variables such as local ambient lighting and light scattering attenuation [
3]. Initial illumination is estimated based on Retinex theory and a Gaussian surround function, followed by iterative refinement of illumination and attenuation under constraints that minimize information loss. A weighted guided filter is used to suppress halo and block artifacts, while adjustments in brightness are performed in the HSV color space to reduce color distortion.
For handling non-uniform illumination, Pu and Zhu proposed a contrast/residual decomposition approach inspired by retinal visual mechanisms [
4]. Without relying on physical models or deep learning, this method achieves natural brightness enhancement and detail preservation by using multi-scale adjustment and a weighted guided filter to maintain local contrast and suppress halo artifacts. Wang and Luo proposed NPIE-MLLS [
5], which improves brightness imbalance and detail loss in low-light images by aligning the low-frequency components of multi-layer frequency decompositions with luminance statistics derived from high-quality natural images.
Among deep learning approaches, the recently proposed Zero-DCE offers a reference-free, curve estimation-based method for low-light enhancement [
6]. It predicts pixel-wise curve parameters from the input image and adjusts brightness iteratively. The model is trained using a set of non-reference loss functions—including spatial consistency, exposure control, color constancy, and illumination smoothness—and achieves real-time performance with a lightweight architecture. It effectively restores natural brightness and contrast and demonstrates competitive results compared to supervised methods.
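For reference, the light-enhancement curve at the core of Zero-DCE has the quadratic form LE_n(x) = LE_{n-1}(x) + A_n(x)·LE_{n-1}(x)·(1 - LE_{n-1}(x)), applied iteratively per pixel and channel. The following minimal NumPy sketch illustrates only this curve adjustment; in the actual method the per-pixel parameter maps A_n are predicted by the DCE-Net, so the constant maps used here are purely illustrative placeholders.

```python
import numpy as np

def zero_dce_curve(image, alpha_maps):
    """Apply the iterative Zero-DCE light-enhancement curve.

    image      : float array in [0, 1], shape (H, W, 3)
    alpha_maps : list of curve-parameter maps A_n in [-1, 1], each (H, W, 3)
                 (predicted by DCE-Net in the original method; constants here)
    """
    enhanced = image.copy()
    for alpha in alpha_maps:
        # LE_n = LE_{n-1} + A_n * LE_{n-1} * (1 - LE_{n-1})
        enhanced = enhanced + alpha * enhanced * (1.0 - enhanced)
    return np.clip(enhanced, 0.0, 1.0)

# Illustrative usage: eight iterations with a fixed brightening parameter.
low_light = np.random.rand(480, 640, 3) * 0.2             # synthetic dark image
maps = [np.full_like(low_light, 0.6) for _ in range(8)]   # hypothetical constant maps
result = zero_dce_curve(low_light, maps)
```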
While these methods significantly improve the visual quality of low-light images and enhance object visibility in dark regions, they face limitations when dealing with extreme overexposure caused by strong light sources, such as vehicle headlights or tunnel exits at night. As shown in
Figure 1, even after applying Zero-DCE, objects in overexposed regions remain difficult to recognize, and fine details are not sufficiently restored.
Moreover, the field of image-to-image translation has advanced significantly. Using deep learning, generative models are primarily applied for image translation; these models map input images to corresponding output images [
7]. This approach allows nighttime images to be transformed into daytime scenes, yielding clear visual outcomes even under low-light conditions.
One of the deep learning-based image translation methods, namely Pix2Pix, is a type of conditional generative adversarial network (CGAN), trained to map a conditional vector (or input image) to the corresponding image within the training dataset [
8]. The main objective of Pix2Pix is to minimize the pixel-wise difference between the generated and real images using CGAN and L1 losses. However, Pix2Pix has the disadvantage of requiring paired datasets, which can limit performance under restricted data availability. CycleGAN was introduced to overcome this limitation. This method enables image-to-image translation using unpaired datasets, eliminating the requirement for paired data [
9].
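For completeness, the Pix2Pix objective combines the conditional adversarial loss with the L1 reconstruction term mentioned above, where x is the conditioning input image, y the target image, z the noise input, and λ the L1 weight:

```latex
\begin{aligned}
\mathcal{L}_{cGAN}(G,D) &= \mathbb{E}_{x,y}\!\left[\log D(x,y)\right]
  + \mathbb{E}_{x,z}\!\left[\log\bigl(1 - D(x, G(x,z))\bigr)\right],\\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y,z}\!\left[\lVert y - G(x,z)\rVert_{1}\right],\\
G^{*} &= \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G,D) + \lambda\,\mathcal{L}_{L1}(G).
\end{aligned}
```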
Various image translation algorithms have been developed based on CycleGAN, one of which is ToDayGAN, a method for converting nighttime images to daytime images [
10]. ToDayGAN improves performance by modifying the ComboGAN network and applying three different discriminators, following the CycleGAN concept. The modified network enables transformation from day to night as well as from night to day: fake B images translated from domain A are fed back into the generator, which performs the reverse transformation to reconstruct the A images. This structure demonstrates excellent performance in converting nighttime images to daytime images. Whereas a traditional generative adversarial network compares fake and real images with a single discriminator, ToDayGAN uses three structurally identical discriminators that compare fake and real images in three representations (blurred RGB, grayscale, and xy-gradient), refining the discrimination process. Similar to CycleGAN, ToDayGAN can effectively preserve the sky area in nighttime images while generating images that resemble daytime scenes. However, the fine details of objects present in nighttime images may not be clearly transformed.
The base–detail paired training method proposed by Son et al. [
11] is another CycleGAN-based technique for converting nighttime images to daytime images. This method uses limited daytime and nighttime image data by generating a base nighttime image via a bilateral filter and applying the single luminance adaptation transform (SLAT) technique to the daytime image to set the base–detail learning direction. A newly defined sigmoid function is then applied to the outputs of the trained modules to generate a noise-reduction weight map from the detail images. This post-processing step removes noise from unnecessary areas and enhances the details of the image, converting nighttime images to daytime images. However, this conversion method presents limitations. Perfectly converting completely dark nighttime images to daytime images is difficult, and depending on the distribution of the training dataset, incorrect objects, such as trees or traffic lights, may be generated owing to the arbitrary generation of unseen information.
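As a rough illustration of the base-image generation step described above, a base layer of a nighttime image can be obtained with an edge-preserving bilateral filter and the detail layer taken as the residual. The filter parameters and input path below are placeholders rather than the values used by Son et al., and the SLAT processing of the daytime images is not reproduced here.

```python
import cv2
import numpy as np

def base_detail_split(image_bgr, d=9, sigma_color=75, sigma_space=75):
    """Split an image into a smooth base layer and a residual detail layer."""
    img = image_bgr.astype(np.float32) / 255.0
    base = cv2.bilateralFilter(img, d, sigma_color, sigma_space)  # edge-preserving smoothing
    detail = img - base                                           # residual high-frequency detail
    return base, detail

night = cv2.imread("night_frame.png")           # hypothetical input path
base_layer, detail_layer = base_detail_split(night)
```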
To effectively represent object information in nighttime or low-light images, various CycleGAN-based techniques for deep learning image translation have been applied. However, in test images that contain dark regions, object information is difficult to discern, possibly generating incorrect objects. Additionally, as shown in
Figure 1, images that simultaneously contain low-light and overexposed regions are challenging to process for image translation and enhancement using a single sensor. To address this issue, the information in the input test images must be accurately complemented to prevent the arbitrary generation of unnecessary objects during image translation.
This study aims to generate improved images for nighttime or low-light image conversion without such errors by acquiring visible-light and infrared (IR) images in multiple bands and synthesizing them, thereby accurately conveying the information contained in the input test images.
To capture IR images at night, an IR lamp is used as an artificial near-infrared (NIR) light source, as shown in
Figure 2, allowing NIR images to be acquired even in the absence of natural light [
12]. However, such existing imaging devices increase structural complexity and have limitations when capturing distant objects in NIR images. Therefore, a new device is required that reduces the complexity of the equipment while simultaneously capturing visible-light and IR images at the same location, even when objects are moving.
Conventional methods for synthesizing IR and visible-light images include deep learning- and algorithm-based methods. One deep learning-based method for IR and visible-light image synthesis is DenseFuse (implemented in this study with PyTorch 1.11.0) [
13]. This deep learning network consists of convolutional layers, fusion layers, and dense blocks. The dense blocks connect the output of each layer to all the other layers, allowing the information from the input layer to be passed to the next layer. The deep learning-based fusion process extracts additional useful features from the original image during the encoding phase. The input test images, including the visible-light and IR images, are fused using two strategies, namely, the additive and
L1-norm strategies, which are applied through the trained module. The advantage of the dense blocks is that they preserve existing information for as long as possible and reduce overfitting via regularization effects. However, the resulting images are not as sharp as those generated by conventional algorithm-based fusion methods.
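The two fusion strategies can be sketched as follows: the additive strategy sums the encoded feature maps of the two inputs, while the L1-norm strategy weights them by block-averaged L1 activity maps. This is a simplified NumPy illustration of those rules under the stated assumptions, not the released DenseFuse code.

```python
import numpy as np
from scipy.signal import convolve2d

def additive_fusion(feat_vis, feat_ir):
    """Additive strategy: element-wise sum of the encoder feature maps."""
    return feat_vis + feat_ir

def l1_norm_fusion(feat_vis, feat_ir, block=3):
    """L1-norm strategy: weight each source by its block-averaged channel-wise L1 activity."""
    def activity(feat):
        act = np.abs(feat).sum(axis=-1)                      # L1 norm across channels, (H, W)
        kernel = np.ones((block, block)) / (block * block)
        return convolve2d(act, kernel, mode="same")          # local block averaging
    a_vis, a_ir = activity(feat_vis), activity(feat_ir)
    w_vis = a_vis / (a_vis + a_ir + 1e-8)
    return w_vis[..., None] * feat_vis + (1.0 - w_vis)[..., None] * feat_ir

# Hypothetical encoder outputs of shape (H, W, C)
f_vis = np.random.rand(64, 64, 64)
f_ir = np.random.rand(64, 64, 64)
fused = l1_norm_fusion(f_vis, f_ir)
```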
Another deep learning approach for visible and IR image fusion is Swin fusion [
14]. To address the context information lost due to the content-independent nature of convolutional operations, the authors propose a Residual Swin Transformer Fusion Network. The method consists of three stages (global feature extraction, fusion layer, and feature reconstruction) and employs a pure transformer-based backbone together with an attention-based encoding architecture to effectively model the long-range dependencies that conventional CNNs overlook. An L1-norm-based fusion strategy is also introduced, measuring activity levels across both row and column dimensions to balance the preservation of infrared brightness and visible details. However, because extremely bright visible regions (e.g., vehicle headlights) are retained almost unchanged, the enhancement of LWIR details in scenarios such as tunnel entry/exit or under headlight illumination remains limited.
Next, the low-rank representation fusion algorithm, known as a traditional synthesis algorithm, decomposes the image into low-rank (global structure) and saliency (local structure) components [
15]. The low-rank components are fused using a weighted average strategy to preserve edge information, the saliency components are fused using a sum strategy, and the final fused image is obtained by summing the two results. However, this algorithm has limitations when capturing the fine details of IR images, reducing its effectiveness at representing edge information.
To extract the fine details of IR images and apply them to the fused image, the Laplacian–Gaussian pyramid [
16] is used to separate the base and detail components, followed by multi-resolution decomposition for image synthesis. The Laplacian–Gaussian pyramid and local entropy fusion algorithm are traditional methods used for visible-NIR image fusion, maximizing the information obtained from visible-light and IR images using local entropy [
17]. This algorithm decomposes the image into multiple resolutions using the Laplacian–Gaussian pyramid and subsequently smoothly fuses the images at each resolution level based on local entropy. Additionally, the multi-resolution fusion process generates a weight map to adjust local contrast and visibility to yield the final fused result. However, this method fails to sufficiently preserve the fine details of IR (NIR) images and has the limitation of producing unnatural color representations.
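The Laplacian–Gaussian pyramid decomposition referred to above can be sketched with OpenCV as follows; the entropy-based weighting of the cited method is omitted, so this only shows the multi-resolution split and reconstruction.

```python
import cv2
import numpy as np

def build_laplacian_pyramid(image, levels=4):
    """Decompose an image into Laplacian detail bands plus the final Gaussian base."""
    gaussian = [image.astype(np.float32)]
    for _ in range(levels):
        gaussian.append(cv2.pyrDown(gaussian[-1]))
    laplacian = []
    for i in range(levels):
        up = cv2.pyrUp(gaussian[i + 1])
        up = cv2.resize(up, (gaussian[i].shape[1], gaussian[i].shape[0]))
        laplacian.append(gaussian[i] - up)        # band-pass detail at level i
    laplacian.append(gaussian[-1])                # low-frequency base
    return laplacian

def reconstruct(pyramid):
    """Collapse the pyramid back into a full-resolution image."""
    image = pyramid[-1]
    for level in reversed(pyramid[:-1]):
        image = cv2.pyrUp(image)
        image = cv2.resize(image, (level.shape[1], level.shape[0]))
        image = image + level
    return image
```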
Another approach involves using the Laplacian–Gaussian pyramid to separate the base and detail layers and subsequently combining the visible and IR images using principal component analysis (PCA) to enhance image sharpness [
18]. Unlike traditional PCA computation, this method generates a weight map for specific detail areas and avoids PCA computation in unnecessary regions with many zero-pixel values. Afterward, the PCA-based fusion method generates a radiance map, and the Stevens effect is applied to the color representation model to enhance local contrast and details. Colors are corrected by calculating the brightness difference between the input visible-light image and the generated radiance map. This method reduces edge degradation in IR images to a greater extent than conventional methods, improving local contrast and detail and preserving the colors on visible-light images.
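The PCA-based weighting used by such pyramid fusion methods reduces to taking the dominant eigenvector of the 2 × 2 covariance matrix between the two source layers and normalizing it into fusion weights. Below is a minimal sketch of that core step; the weight-map generation and zero-pixel handling of the cited method are not reproduced.

```python
import numpy as np

def pca_fusion_weights(layer_a, layer_b):
    """Return (w_a, w_b) from the principal eigenvector of the 2x2 covariance matrix."""
    data = np.stack([layer_a.ravel(), layer_b.ravel()])   # 2 x N observation matrix
    cov = np.cov(data)
    eigvals, eigvecs = np.linalg.eigh(cov)                # eigenvalues in ascending order
    principal = np.abs(eigvecs[:, -1])                    # dominant component
    weights = principal / principal.sum()
    return weights[0], weights[1]

def pca_fuse(layer_a, layer_b):
    w_a, w_b = pca_fusion_weights(layer_a, layer_b)
    return w_a * layer_a + w_b * layer_b
```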
To effectively extract the fine details of IR images, a method uses the contourlet transform [
19] instead of the Laplacian–Gaussian pyramid to decompose the base and detail images and effectively synthesize the detail information. This method extracts directional detail information from both visible-light and IR images using the contourlet transform and calculates the optimal weights for combining the details of both images using the PCA fusion algorithm. Subsequently, the iCAM06 tone mapping method is applied to the base visible-light image to enhance the overall brightness and contrast of the fused image [
20]. This synthesis method yields sharper details than the traditional Laplacian–Gaussian pyramid fusion algorithm but has limitations in controlling halo artifacts and the bright light from vehicle headlights, which can complicate vehicle identification.
This study conducts multi-band image fusion using visible-light and long-wave infrared (LWIR) images. First, directional detail and base information are precisely extracted from IR and visible-light images through the contourlet transform. Instead of traditional PCA-based fusion, discrete cosine transform (DCT) [
21,
22] is used to effectively combine the base and detailed information of each band. Subsequently, iCAM06 tone mapping [
23] is applied to adjust the overall tone of the fused image naturally. The synthesized image is subsequently used as an input to the CycleGAN night-to-day translation module to enhance detail training. This approach minimizes the problem of generating incorrect objects during image translation and sets the training direction to effectively represent the details of IR images using single-scale Retinex (SSR) modules at multiple sigma scales [
24,
25], unlike existing methods.
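The overall flow of the proposed pipeline can be summarized by the following pseudocode-style sketch. Every function name here (contourlet_decompose, dct_fuse, contourlet_reconstruct, icam06_tone_map, cyclegan_night_to_day) is a hypothetical placeholder standing in for the corresponding stage described in this paper, not an actual released API.

```python
def enhance_night_scene(visible_img, lwir_img):
    """High-level sketch of the proposed fusion + translation pipeline (placeholder calls)."""
    # 1. Directional multi-resolution decomposition of both bands
    vis_base, vis_details = contourlet_decompose(visible_img)
    ir_base, ir_details = contourlet_decompose(lwir_img)

    # 2. DCT-based fusion of the base and directional detail subbands
    fused_base = dct_fuse(vis_base, ir_base)
    fused_details = [dct_fuse(v, i) for v, i in zip(vis_details, ir_details)]
    fused = contourlet_reconstruct(fused_base, fused_details)

    # 3. Natural tone adjustment of the fused image
    toned = icam06_tone_map(fused)

    # 4. Night-to-day translation with the SSR-guided CycleGAN module
    return cyclegan_night_to_day(toned)
```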
Conventional image fusion techniques, such as the algorithm-based Latent Low-Rank method and deep learning-based networks like DenseFuse and Swin fusion, have attempted to combine IR and visible images. However, these methods face limitations in effectively handling extreme nighttime lighting conditions, such as overexposure caused by vehicle headlights or underexposed backgrounds. In particular, DenseFuse and Swin fusion focus primarily on the simple integration of low-level features, while Latent Low-Rank and base–detail CycleGAN-based approaches rely on decomposing image details for synthesis and translation, yet fail to explicitly preserve directional or spectral detail information.
In contrast, the proposed method introduces a hybrid approach that integrates algorithmic and deep learning-based techniques to overcome these limitations. Specifically, it combines multi-band image fusion using the contourlet transform for directional decomposition, DCT-based fusion to preserve local details, and iCAM06 tone mapping for natural color representation. This is integrated with a CycleGAN-based night-to-day translation module to enhance visibility in low-light nighttime environments. Notably, the infrared image compensates for detail loss in overexposed regions that visible cameras cannot recover, while helping preserve the overall structure and natural appearance of the scene. This integrated strategy enhances object visibility and edge clarity in both bright and dark areas, demonstrating superior performance across multiple image quality evaluation metrics. Consequently, it is well-suited for real-world applications such as surveillance systems and autonomous driving.
3. Simulation Results
3.1. Visible and IR Image Alignment
This study confirmed that the positions of objects in the LWIR images captured using the equipment do not align with those in the corresponding visible-light images. When the positions of objects differ, obtaining accurate image fusion results becomes challenging, necessitating an image alignment process. To address this, the ASIFT algorithm was used to align the images, using the computed homography matrix in the process [
27]. LWIR images, which are thermal images, have significantly different characteristics from those of NIR and visible-light images; therefore, finding sufficient feature points for smooth alignment using the standard SIFT [
28] algorithm is difficult. Hence, the ASIFT algorithm was used to extract feature points, enabling highly accurate alignment with the visible-light images.
The ASIFT algorithm is an extended version of SIFT that simulates all possible image views to extract several keypoints. ASIFT is designed to maintain invariance to affine transformations and handles two camera angle parameters, namely latitude and longitude, which are disregarded in the original SIFT algorithm. By simulating changes in the camera viewpoint from the original image, ASIFT generates multiple views and subsequently applies the SIFT algorithm to extract keypoints. ASIFT effectively handles all six affine transformation parameters by applying the SIFT method to the generated images for comparison. Therefore, ASIFT provides mathematically complete affine invariance and can extract more keypoints than standard SIFT.
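A simplified sketch of this alignment procedure is shown below: SIFT keypoints are extracted from a few affine-simulated (tilted) views of each image to approximate the ASIFT idea, matched with a ratio test, and a RANSAC homography is then used to warp the LWIR image onto the visible-light frame. The full ASIFT sampling over latitude and longitude and the exact parameters used in this study are omitted; the tilt values are illustrative assumptions.

```python
import cv2
import numpy as np

def affine_simulated_sift(gray, tilts=(1.0, 1.5, 2.0)):
    """Approximate ASIFT: run SIFT on horizontally tilted views, return points and descriptors."""
    sift = cv2.SIFT_create()
    points, descriptors = [], []
    for t in tilts:
        A = np.array([[1.0 / t, 0.0, 0.0], [0.0, 1.0, 0.0]], dtype=np.float32)
        warped = cv2.warpAffine(gray, A, (max(1, int(gray.shape[1] / t)), gray.shape[0]))
        kps, desc = sift.detectAndCompute(warped, None)
        if desc is None:
            continue
        points.extend([(kp.pt[0] * t, kp.pt[1]) for kp in kps])  # map back to the untilted view
        descriptors.append(desc)
    return np.float32(points), np.vstack(descriptors)

def align_lwir_to_visible(lwir_gray, vis_gray):
    pts_ir, des_ir = affine_simulated_sift(lwir_gray)
    pts_vis, des_vis = affine_simulated_sift(vis_gray)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des_ir, des_vis, k=2)
            if m.distance < 0.75 * n.distance]                   # Lowe ratio test
    src = pts_ir[[m.queryIdx for m in good]].reshape(-1, 1, 2)
    dst = pts_vis[[m.trainIdx for m in good]].reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)         # robust homography estimation
    return cv2.warpPerspective(lwir_gray, H, (vis_gray.shape[1], vis_gray.shape[0]))
```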
Figure 17 displays the resulting images when matching using the SIFT and ASIFT algorithms.
The image alignment method used for visible-light and IR images in this study is summarized as follows. ASIFT is an extended version of SIFT that provides invariance to affine transformations and demonstrates strong image-matching performance across various camera viewpoints. In particular, for aligning LWIR with visible-light images, the use of the ASIFT algorithm, instead of the standard SIFT algorithm, is essential. Although ASIFT incurs a higher computational cost than SIFT, the improvement in image-matching performance justifies this cost; therefore, ASIFT is particularly useful for aligning visible-light and IR images.
3.2. Comparison of Image Quality Assessment with Conventional Methods
Objective evaluation metrics are essential for assessing the proposed image processing methods. Herein, the enhanced images were evaluated against those obtained from existing visible-light and IR image processing methods using image quality assessment metrics. The evaluation was based on image quality and sharpness metrics. Because the main objective of this study is to represent objects clearly within images, sharpness metrics are crucial. Additionally, brightness and contrast were considered important evaluation factors for enhancing object visibility, so the corresponding image quality metrics were included. A total of six evaluation metrics were used: four image quality metrics, comprising two related to image distortion and two related to image contrast, and two sharpness metrics.
BRISQUE is a metric based on Natural Scene Statistics that evaluates image distortion by measuring the differences in pixel intensity distributions to assess image quality [
29]. This metric is particularly sensitive to various image distortions and calculates the quality score using an SVM model. PIQE focuses on analyzing the pixel contrast and relative brightness of an image to evaluate the degree of distortion [
30]. This metric is advantageous for detecting sharpness and noise and assessing the structural consistency of the image.
CEIQ is a metric that evaluates contrast distortion in an image [
31]. CEIQ improves contrast using histogram equalization and subsequently calculates SSIM, entropy, and cross-entropy to assess quality. MCMA is a method that optimizes image contrast while minimizing artifacts, evaluating dynamic range usage, histogram shape distortion, and both global and local pixel diversity [
32]. This method naturally enhances contrast without excessive contrast adjustment.
LPC-SI is a no-reference metric that evaluates image sharpness by analyzing high-frequency components and measuring local phase coherence [
33,
34]. Sharp images exhibit strong phase coherence in the high-frequency region, which can be used to assess sharpness. S3 is a metric that evaluates sharpness by simultaneously considering both spectral and spatial changes in the image [
35]. It calculates these changes in each image block and combines them to derive the final sharpness score.
These evaluation metrics objectively assess the effectiveness of the proposed image processing method and improve image quality. Each metric uniquely evaluates image distortion and sharpness, enabling accurate quality analysis. A summary of the evaluation metrics is listed in
Table 2.
The visible-light and IR images obtained using the proposed method were compared with those obtained using the existing deep learning- and algorithm-based methods to evaluate the fusion effectiveness of the proposed method. The first comparison included the DenseFuse method, which uses dense blocks to ensure continuous information flow between layers, preserving the important features of the visible-light and IR input images.
Algorithm-based methods were included in the comparison. The Laplacian–Gaussian pyramid entropy fusion method applies the Laplacian pyramid to capture multi-scale details and subsequently combines the key features from both modalities via entropy-based fusion. This method effectively balances the details between IR and visible-light images. Similarly, the Laplacian pyramid PCA fusion method uses the Laplacian pyramid combined with PCA to reduce dimensionality, improving sharpness and contrast in the fused result while maintaining essential image features.
Another algorithm-based method, low-rank fusion, separates the image into low-rank (global structure) and saliency (local structure) components. The low-rank component is fused using a weighted average, whereas the saliency component is combined using a sum strategy, enhancing global consistency and local details.
Moreover, methods using the contourlet transform were evaluated. The contourlet-PCA method uses the contourlet transform to extract multi-directional and multi-scale features and subsequently fuses the features through PCA, providing an effective means to preserve edge and texture information. The method proposed in this study, contourlet-DCT, uses DCT fusion instead of PCA. This method particularly emphasizes high-frequency components, focusing on contrast preservation and detail enhancement.
The complete proposed method integrates the contourlet transform and DCT fusion and further enhances the fused results with a CycleGAN-based night-to-day translation module, which is highly effective in improving image visibility in low-light environments. Additionally, an SSR module is used to improve illumination handling and detail preservation. The combination of these techniques, particularly the integration of contourlet-DCT fusion and CycleGAN-SSR enhancement, provides a superior fusion method that considerably improves sharpness, contrast, and overall image clarity relative to existing methods. This method is particularly suitable for applications such as surveillance and navigation in low-visibility environments.
3.3. Specification and Processing Time
To perform image translation using the proposed fused images, a preprocessing step is required. With all steps processed in parallel on images below SD resolution (560 × 448), the most time-consuming preprocessing step, SSR processing, required approximately 0.993 s per image (2.98 s in total for the three SSR modules). The most time-consuming task in the hierarchical SSR module, the base–detail module, required 0.254 s. Additionally, post-processing for image reconstruction required approximately 0.223 s. The total time for image translation was approximately 1.47 s. The algorithm-based DCT and PCA syntheses required an average of 4.455 and 25.852 s per image, respectively, meaning PCA synthesis took nearly six times longer than DCT synthesis.
Summing the times required for image synthesis and image translation, the total processing time was 5.925 s with DCT synthesis and 27.322 s with PCA synthesis, a significant difference between the two synthesis algorithms.
The CycleGAN-based image-to-image translation method was implemented on a personal computer with the following specifications: Intel i9-10980XE 3.00 GHz CPU, 256 GB RAM, and NVIDIA GeForce RTX 3090 GPU. The proposed deep learning network was developed using the CycleGAN framework based on PyTorch 1.11.0. Optimization was conducted using the Adam optimizer, with β parameters set to 0.5 and 0.999. The batch size was set to 1, and the learning rate was initialized at 0.0002, decreasing linearly every 100 epochs. The total number of training epochs was set to 250, and the image crop size was fixed at 256 × 256 pixels.
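The optimizer and learning-rate schedule described above correspond to the standard CycleGAN training recipe; the following PyTorch sketch shows one way to express them, assuming (as in the common reference implementation) that the learning rate stays constant for the first 100 epochs and then decays linearly to zero over the remaining epochs. The stand-in generator network is a placeholder only.

```python
import torch

# Stand-in for the CycleGAN generator; the real network is a ResNet-based generator.
generator = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 7, padding=3),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 3, 7, padding=3),
)

optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def linear_decay(epoch, n_epochs=250, n_constant=100):
    """Constant learning rate for n_constant epochs, then linear decay toward zero."""
    if epoch < n_constant:
        return 1.0
    return max(0.0, 1.0 - (epoch - n_constant) / float(n_epochs - n_constant))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)

for epoch in range(250):
    # ... one training pass over the unpaired day/night images with batch size 1 ...
    scheduler.step()
```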
The unpaired dataset used to train the CycleGAN night-to-day translation module consisted of 6450 daytime and 7744 nighttime images. Of the daytime images, 1200 were collected from the proprietary dataset, and the remaining 5250 were obtained from the Dark Zurich [
36] and Adverse Conditions Dataset with Correspondences (ACDC) datasets [
37]. Of the nighttime images, 5400 were collected from the proprietary dataset, and the remaining 2344 were obtained from the Dark Zurich and ACDC datasets.
The dataset used to train the base–detail model and the three different sigma scale SSR models consisted of 4852 real daytime and synthesized nighttime images each. Of the real daytime images, 1005 were collected from the proprietary dataset, and the remaining 3847 were obtained from the Dark Zurich and ACDC datasets. Tests were conducted using real images captured via the proposed camera setup, and approximately 217 images were evaluated and compared. A summary of the times required and the datasets is listed in
Table 3.
3.4. Comparison of the Image Results Obtained from IR Fusion
To evaluate the individual contributions of each module in the proposed method, an ablation study was conducted focusing on four key components: (1) detail extraction method (Laplacian vs. contourlet), (2) fusion strategy (PCA vs. DCT), (3) the application of iCAM06 tone mapping, and (4) the number of sigma scales used in the SSR module during the image translation stage. The results are summarized in
Table 4.
First, when comparing Laplacian-PCA and contourlet-PCA, the contourlet-based method performed better on overall quality metrics such as BRISQUE (36.545 → 34.741) and PIQE (55.241 → 46.317). However, detail-specific indicators such as LPC-SI (0.876 → 0.861) and S3 (0.084 → 0.067) showed a slight decline. Although contourlet decomposition inherently captures multi-directional and multi-resolution information, the Laplacian-based approach may emphasize certain high-frequency components more strongly owing to its explicit multi-scale filtering mechanism. Nonetheless, since the contourlet transform effectively extracts directional structural features, its integration with iCAM06 tone mapping, particularly leveraging the Stevens effect, can further enhance detail representation in downstream stages.
Second, in the comparison between PCA-based fusion and DCT-based fusion strategies, the DCT approach yielded superior performance in detail preservation. Specifically, the LPC-SI score improved from 0.906 to 0.949, and the S3 score increased from 0.140 to 0.275. These results suggest that the block-based nature of DCT better retains localized structural and high-frequency information than the global projection used in PCA.
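To make the contrast with PCA concrete, a block-based DCT fusion can be sketched as below: each 8 × 8 block of the two detail layers is transformed with cv2.dct, and the fused block keeps, coefficient by coefficient, the value with the larger magnitude before the inverse transform. The exact fusion rule used in this paper may differ; the max-magnitude selection here is only an illustrative choice, and image sizes are assumed to be multiples of the block size.

```python
import cv2
import numpy as np

def dct_block_fuse(detail_a, detail_b, block=8):
    """Fuse two detail layers block-wise in the DCT domain (max-magnitude coefficient rule)."""
    h, w = detail_a.shape
    fused = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = cv2.dct(detail_a[y:y + block, x:x + block].astype(np.float32))
            b = cv2.dct(detail_b[y:y + block, x:x + block].astype(np.float32))
            coeffs = np.where(np.abs(a) >= np.abs(b), a, b)   # keep the stronger coefficient
            fused[y:y + block, x:x + block] = cv2.idct(coeffs)
    return fused
```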
Third, the effectiveness of the iCAM06 tone mapping module was validated by applying it to contourlet-PCA fusion. With iCAM06, BRISQUE decreased from 34.741 to 30.563, PIQE from 46.317 to 31.761, and CEIQ increased from 3.120 to 3.380. These results demonstrate that iCAM06 significantly enhances local contrast and restores perceptual color consistency, especially in regions dominated by visible light.
Fourth, the effect of varying the number of sigma scales in the SSR module within the CycleGAN-based image translation stage was analyzed using 1, 2, and 3 scales. The configuration using three scales achieved the best balance between detail enhancement and halo suppression. In this setting, S3 reached the highest score of 0.214, while LPC-SI and PIQE also achieved their best values at 0.961 and 22.186, respectively. This indicates that using three parallel sigma scales effectively reinforces structural details while minimizing artifacts.
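The parallel sigma-scale SSR branches analyzed above follow the standard single-scale Retinex formulation R_σ(x, y) = log I(x, y) − log(G_σ * I)(x, y). A minimal sketch is given below; the sigma values are placeholders and not necessarily those used in this study.

```python
import cv2
import numpy as np

def single_scale_retinex(image, sigma, eps=1e-6):
    """SSR: log(image) - log(Gaussian-blurred image) at one surround scale."""
    img = image.astype(np.float32) + eps
    surround = cv2.GaussianBlur(img, (0, 0), sigma) + eps
    return np.log(img) - np.log(surround)

def multi_sigma_ssr(image, sigmas=(15, 80, 250)):   # placeholder sigma scales
    """Return the parallel SSR responses used to guide detail training."""
    return [single_scale_retinex(image, s) for s in sigmas]
```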
The final proposed method achieved the best performance among all configurations in PIQE and LPC-SI and demonstrated competitive results in other quality metrics such as BRISQUE and CEIQ. This confirms the method’s strength in preserving perceptual quality and fine detail. The consistently high performance across various evaluation criteria empirically validates the synergy of the proposed framework, integrating contourlet-DCT-based fusion, iCAM06 tone mapping, and a 3-scale SSR module.
To objectively compare the proposed method, LWIR images were used as input with the proposed inverse difference map method applied, yielding results from both the conventional and proposed methods. The proposed method synthesized IR images that passed through the headlights, increasing the clarity of object separation. As shown in
Figure 18, the contourlet transform effectively enhances the contours of the vehicle and the road conditions; therefore, understanding the road situation while driving becomes easy.
Figure 19 depicts the improvement in visible-light images that are obscured by strong headlights. The proposed method successfully revealed the position of the road mirror behind the headlights and enhanced contrast in tree-covered areas, improving object detection capabilities. This highlights a significant improvement in detection performance compared with conventional visible-light methods.
Figure 20 illustrates the advantages of applying the proposed method to a scenario in which sunlight directly enters a tunnel. The characteristics of the IR image allowed for the identification of vehicles inside the tunnel, and the proposed method better represented the road situation, aiding the driver. Additionally, the details of the tunnel were captured more accurately than when using the previous methods.
Figure 21 illustrates that during nighttime driving, oncoming headlights can impair the driver’s field of view, making it difficult to accurately identify pedestrians or objects in poorly lit areas. LWIR imaging, however, can capture information in these regions. The proposed method effectively fuses the advantages of both input images, enabling simultaneous object detection in both dark regions and brightly illuminated areas (e.g., headlight zones). Nonetheless, increasing the brightness in the DCT transformation process and the night-to-day translation module results in noise in dark areas (e.g., regions with trees).
Figure 22 shows that in areas without streetlight illumination, pedestrians are present, and road cracks are visible. The proposed method accurately captures the contours of pedestrians as well as the road conditions, potentially providing significant support in driving scenarios.
Figure 23 demonstrates that, in the left section of the image, the proposed method most effectively captures the surroundings of the road in areas lacking illumination. In nighttime driving scenarios, object identification is important, and the resulting image from the proposed method clearly delineates the details of obscured trees and their surrounding environment.
Finally,
Table 5 compares the scores of the resulting images using the image quality metrics. The proposed method achieved the best BRISQUE and PIQE scores, which evaluate image distortion and naturalness, and ranked second in the brightness and contrast metrics CEIQ and MCMA, demonstrating significant image improvement. Additionally, the proposed method exhibited excellent performance in the sharpness metrics, confirming its ability to enhance object locations and contours.
To validate the statistical significance of performance differences among the eight fusion methods, ANOVA and Tukey's HSD tests were conducted using six image quality metrics: BRISQUE, PIQE, CEIQ, MCMA, LPC-SI, and S3. In all cases, the ANOVA p-values were well below 0.05 (e.g., BRISQUE: 1.83 × 10⁻²⁹²), indicating statistically significant differences.
The proposed method consistently demonstrated superior performance across most metrics. Tukey's HSD test results revealed that it achieved the lowest BRISQUE and PIQE scores, with non-overlapping confidence intervals confirming statistical significance over methods such as DenseFuse, Swin fusion, and the Laplacian-based approaches. For contrast-related metrics (CEIQ and MCMA), it performed comparably to the contourlet-based methods while significantly outperforming the other baselines. In the sharpness evaluation, the proposed method ranked first in LPC-SI and second in S3, demonstrating strong preservation of both spatial and spectral details. These results, validated through ANOVA and Tukey's HSD, confirm that the proposed approach provides statistically robust improvements in image quality compared with existing techniques. The Tukey's HSD graph is shown in
Figure 24.
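The statistical procedure above can be reproduced with standard scientific Python tools, as sketched below; the per-image metric scores are assumed to be stored in a dictionary keyed by method name, and random values are used here purely as placeholders for the 217 measured scores per method.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder: per-image BRISQUE scores for each fusion method (217 test images each).
rng = np.random.default_rng(0)
scores = {name: rng.normal(loc, 2.0, size=217)
          for name, loc in [("proposed", 25), ("densefuse", 35),
                            ("swin", 33), ("laplacian_pca", 36)]}

# One-way ANOVA across all methods.
f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA p-value: {p_value:.3e}")

# Tukey's HSD post-hoc test on the same data.
values = np.concatenate(list(scores.values()))
groups = np.concatenate([[name] * len(v) for name, v in scores.items()])
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```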
To further evaluate the effectiveness of the proposed method under diverse nighttime conditions, a comparative analysis was conducted, with a particular focus on foggy environments. As illustrated in
Figure 25 and
Figure 26, the evaluation was performed using the publicly available M3FD dataset [
38], rather than images captured with the proposed camera system. To ensure a fair comparison, no additional preprocessing was applied to the LWIR images, and the original conditions of the M3FD dataset were preserved. Methods incorporating the contourlet transform generally exhibited superior infrared (IR) representation, enabling more accurate recovery of fine details obscured by fog compared to other techniques. Notably, the proposed method surpassed the contourlet-PCA approach by more effectively suppressing noise in sky regions while retaining essential details in fog-covered areas that are critical for object identification.
4. Discussion
This study proposes an image processing technique that effectively enhances low-light regions in nighttime images, thereby improving object recognition and detection performance. The proposed method combines multi-band image fusion with deep learning-based image translation to enhance contrast, restore color, and recover fine details simultaneously, achieving superior image quality compared to conventional fusion methods.
In particular, quantitative evaluations using image quality metrics (BRISQUE, PIQE, CEIQ, MCMA, LPC-SI, S3), along with ANOVA-based statistical analysis, confirm that the proposed method minimizes distortion and maximizes sharpness across most indicators. These improvements are statistically significant, underscoring the effectiveness of the proposed approach.
Conventional deep learning-based fusion methods, such as DenseFuse and Swin fusion, primarily rely on low-level brightness information or single-scale features, which limits their ability to handle the detailed IR information, complex contrast variations, and lighting saturation commonly encountered in nighttime scenes. In contrast, the proposed method effectively separates directional and multi-resolution features via the contourlet transform and selectively retains high-frequency information using DCT-based fusion. DCT also offers lower computational complexity than PCA and better preserves local details through block-based processing. Additionally, in the image translation stage, a sigma-scale-based SSR training module enhances visibility in foggy or light-scattering regions more reliably than using CycleGAN alone.
Despite the strong performance of the proposed method, some limitations remain. SSR, used during CycleGAN training for detail enhancement, may cause halo artifacts in high-contrast areas. To address this, future work will consider halo-suppression strategies such as the contrast/residual decomposition approach. Additionally, applying low-light enhancement as a preprocessing step before fusion will be explored to improve image quality in extremely dark or unevenly illuminated conditions.
Practically, the proposed method demonstrates strong potential for applications such as driver assistance systems, CCTV surveillance, and intelligent traffic monitoring. However, combining image processing with deep learning increases overall processing time, which poses challenges for real-time deployment. To overcome this, future work will explore optimized parallel architectures and lightweight implementations using FPGA or AI SoC platforms.
Moreover, while this study focused on low-light nighttime conditions, image degradation may still occur in more complex weather scenarios such as rain or snow. Future research will aim to incorporate robust restoration modules tailored for adverse weather, such as raindrop removal or diffusion-based image translation, along with sensor upgrades that support stable multi-band imaging under such conditions.