RadiantVisions: Illuminating Low-Light Imagery with a Multi-Scale Branch Network

In the realms of the Internet of Things (IoT) and artificial intelligence (AI) security, ensuring the integrity and quality of visual data is paramount, especially under low-light conditions, which makes low-light image enhancement a crucial technology. However, current low-light enhancement methods still face challenging issues, including the inability to effectively handle uneven illumination distribution, suboptimal denoising performance, and insufficient correlation among network branches. To address these issues, the Multi-Scale Branch Network (MSBN) is proposed. It utilizes multi-scale feature extraction to handle uneven illumination distribution, introduces denoising functions to mitigate the noise that arises from image enhancement, and establishes correlations between network branches to improve information exchange. Additionally, our approach incorporates a vision transformer to enhance feature extraction and context understanding. The process begins with capturing raw RGB data, which are then optimized through image signal processor (ISP) techniques, resulting in a refined visual output. This method significantly improves image brightness and reduces noise compared to similar methods. Using the LOL-V2-real dataset, we achieved improvements of 0.255 in PSNR and 0.23 in SSIM, with decreases of 0.003 in MAE and 0.009 in LPIPS, compared to the state-of-the-art methods. Rigorous experimentation confirmed the reliability of this approach in enhancing image quality under low-light conditions.


Introduction
With the proliferation of IoT devices, the integration of AI to ensure the security of these interconnected systems becomes imperative [1][2][3][4]. One significant aspect of this intersection lies in the domain of visual data capture and transmission. In the field of AI security, especially in the construction of image databases, low-light image enhancement is a common image processing technique. In contemporary interconnected environments, security surveillance, autonomous driving, and various other IoT technologies permeate daily life. Ai and Kwon [5] introduced a novel convolutional network designed to address security concerns in surveillance cameras within smart city environments, relying on deep learning and demonstrating efficacy even under extremely low-light conditions. The emergence of these technologies has sparked an increasing demand for high-quality imagery, especially since these devices often operate under low-light conditions. In the construction or utilization of databases related to AI security, enhancing the quality of low-light images is crucial for augmenting the data. Whether facing low-light environments or other challenging lighting scenarios, the robust capture and processing of visual data are essential for maintaining the integrity and reliability of AI-driven security systems. In this context, a crucial aspect of AI security for visual data from IoT devices is the enhancement of low-light images from security cameras. Our focus is therefore on improving the quality of images captured under low-light conditions.
Low-light image enhancement has progressed significantly within deep learning, with the widespread adoption of convolutional neural networks (CNNs) [6] and transformers [7]. CNNs, known for their ability to capture intricate local features, are extensively applied to the challenges presented by low-light conditions. Transformer models, originally designed for sequential data processing, have emerged as a promising avenue for capturing global contextual information, thereby enhancing the representation of low-light images. Building upon these advancements, various network architectures, including both single-branch and multi-branch structures, have been explored for low-light image enhancement. Single-branch architectures, such as CNNs, have gained prominence because they capture intricate features well. Multi-branch architectures have also been investigated; however, a notable limitation is the lack of an inherent correlation between the branches, which hinders information exchange. Our approach aims to overcome this limitation of independent multi-branch structures, enhancing their effectiveness in capturing both local and global features for improved low-light image enhancement.
Multi-scale feature extraction has consistently been a focal point in computer vision and has proven highly beneficial in various image enhancement tasks. Researchers have sought to improve models' sensitivity to different scales through traditional approaches such as pyramid structures and spatial pyramid pooling, as well as deep learning techniques such as multi-scale convolutions and pyramid convolutions. Prior low-light image enhancement models, including RetinexNet [8], MBLLEN [9], 3D-LUT [10], and IAT [11], share a common attribute: the absence of multi-scale feature extraction in their architectures. Because they cannot capture information at different scales, these models may struggle to enhance dimly lit images effectively, particularly when illumination varies across different regions of the image. To alleviate this limitation, we introduce multi-scale feature extraction into the model architecture, enabling a more comprehensive capture of both local details and global structures in an image. Analyzing an image at various scales allows the model to adapt to diverse lighting conditions, thereby improving its performance and robustness. This integration is intended to make the enhancement model perform particularly well on images with non-uniform lighting.
Image denoising [12] plays a crucial role in enhancing visual quality by mitigating noise artifacts, such as graininess and the loss of fine details, that are prevalent in low-light photography. In low-light image enhancement tasks, pronounced noise after enhancement becomes particularly evident. Recognizing this challenge, our approach employs specialized denoising functions to alleviate the noise that arises from enhancing low-light images. This integration ensures that the enhancement process not only addresses low-light conditions but also maintains high visual quality by suppressing unwanted noise artifacts.
In recent years, researchers have made significant progress in low-light image enhancement. However, in complex low-light scenarios, image enhancement methods still face challenging issues. Restoring low-light images with uneven illumination distribution proves difficult, and low-light images often contain substantial noise. Additionally, the lack of inherent correlations between branch networks hampers effective information exchange. To address these issues, we propose a novel model named MSBN (Multi-Scale Branch Network) for low-light image enhancement. This model integrates multi-scale feature extraction, enabling a more comprehensive capture of both local details and global structures in images. Analyzing images at different scales allows the model to adapt to varying lighting conditions, thereby improving its performance and robustness. The model introduces denoising functions to alleviate noise issues and establishes correlations between network branches to overcome the inherent limitations of independent multi-branch structures, enhancing effectiveness in capturing both local and global features. This comprehensive method balances the enhancement of image quality with the preservation of natural details and the reduction of noise artifacts. The proposed general model can be applied to various safety domains, including night-time surveillance, autonomous driving, urban lighting, and other safety-related areas. Our contributions can be summarized as follows:
• We integrate multi-scale feature extraction into MSBN. This integration enables the model to enhance images with non-uniform lighting conditions, preserving the original uneven illumination and retaining details across different scales after enhancement.
• We introduce a custom denoising loss function tailored specifically for low-light conditions. This feature effectively alleviates the noise issues introduced in low-light images after enhancement, ensuring the clarity of the images.
• Our model combines inter-branch correlations, employing weighted feature fusion to enhance the extraction and integration of prominent features. It strengthens the correlation between color, brightness, and the image itself, resulting in a more realistic effect in the enhanced images.

Image Signal Processor
The ISP is a pivotal component in computer vision and digital imaging, significantly contributing to image quality enhancement and overall performance. The research landscape surrounding ISP is rich and diverse, with several noteworthy contributions. Liang [13] introduced a novel deep neural network addressing challenges like low-light image generation and RAW-to-RGB conversion, which outperformed existing methods with an efficient design. Park [14] proposed an ISP-focused pre-processing method for optimizing brightness and contrast, particularly benefiting edge detection in applications such as autonomous vehicles and defense. Wang [15] tackled exposure issues with a method leveraging local color distributions, introducing an LCDE module and a dual-illumination learning mechanism, and showing superior performance on a newly constructed dataset. These studies collectively underscore the role of ISPs in advancing image processing through algorithmic techniques, architectural designs, and diverse applications.

Multi-Scale Feature Extraction
Multi-scale feature extraction has been a pivotal aspect of computer vision and image processing, addressing the need to capture information at varying levels of granularity. Methods in the literature have approached this challenge through various strategies.
Wang [16] introduced a multi-scale feature extraction and normalized attention neural network for image denoising, surpassing state-of-the-art methods by achieving higher PSNR and SSIM values and better overall visual quality in restored images. Liu [17] proposed a deep network for infrared and visible image fusion, incorporating a unique feature learning module and an edge-guided attention mechanism. This method outperformed existing approaches across benchmarks and demonstrated robustness under challenging conditions. Qi [18] presented an underwater image enhancement network that leverages semantic region-wise enhancement modules for multi-scale perception. This approach effectively reduced color distortion and blurred details in underwater images. Collectively, these studies underscore the significance of multi-scale feature extraction in enhancing the robustness and discriminative power of computer vision systems.

Image Denoising
Image denoising, aimed at reducing noise while preserving details, has evolved significantly within image processing. Over the past years, many image priors have been proposed to restore clarity to noisy images, including sparsity, low rank, and self-similarity. Notable prior-based methods, such as BM3D [19] and WNNM [20], have achieved remarkable progress in image denoising. With the advent of deep learning, researchers have increasingly turned to deep neural networks for image denoising. For instance, DnCNN [21] employed a deep residual network and integrated batch normalization layers to accelerate the training process. In a different approach, CBDNet [22] addressed noise comprehensively throughout the imaging process, utilizing the U-Net architecture along with a sub-network to estimate noise levels, thus enhancing denoising performance.

Model Structure
Enhancing images under low-light conditions at the sensor level poses a multifaceted challenge in digital image processing. The conventional ISP approach encompasses various subprocesses, such as white balance calibration, demosaicking algorithms, and denoising techniques. However, these traditional methods often amplify noise artifacts and reduce color fidelity, significantly hindering the attainment of optimal image quality.
During image acquisition with a camera, an RGB image is influenced by the ambient lighting condition $L_i$. It is processed by the ISP and transformed into an sRGB image $I_i$. This conversion typically results in a substantial loss of information related to the original lighting and color. Our method aims to reverse the ISP workflow [23] by recovering the original RGB values from the sRGB image and adjusting them under a new lighting condition $L_t$ [24]. The objective is to generate an sRGB image $I_t$ that closely matches the one captured under the target lighting condition. The network encoder $f$ [25] represents the inverse operation of the ISP, and several individual decoders $g_t$ are added to the encoder $f$. The decoder maps $f(I_i)$ to the target $I_t$ under the target lighting condition:

$$I_t = g_t\left(f(I_i)\right) \tag{1}$$

In this process, the encoder $f$ maps $I_i$ to the corresponding original RGB data, and the independent decoders $g_t$ stacked on $f$ generate the target image $I_t$. The encoder takes the form $f(I_i) = I_i \odot M + A$, where a multiplication map $M$ and an addition map $A$ are predicted to achieve pixel-level adjustments [11]. The equation of our MSBN model was formulated as follows:

$$I_t = \left(\max\left(\sum_{c_j} W_{c_i,c_j}\left(I_i \odot M + A\right),\ \varepsilon\right)\right)^{\gamma}, \quad c_i, c_j \in \{r, g, b\} \tag{2}$$

where $W_{c_i,c_j}$ denotes a 3 × 3 joint color transformation matrix that combines the influences of white balance and color transformation. It is applied to each pixel of the input image to achieve white balance and color correction. Nine queries are employed to control the parameters of $W_{c_i,c_j}$, with each query potentially corresponding to an element or a set of elements in the matrix. This allows for the fine-tuning of the color transformation matrix to achieve the desired effects.
The symbol γ denotes the gamma correction parameter, which adjusts image brightness and contrast for a more natural appearance. Modifying γ influences the overall brightness of the output image. Additionally, ε is a small value that prevents numerical instability, averting division by zero or other numerical issues during computation.
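The pixel-level pipeline described above can be sketched for a single RGB pixel. This is a minimal illustration, assuming Equation (2) has the form used by IAT [11], that is, I_t = (max(W (I_i ⊙ M + A), ε))^γ; the function name `enhance_pixel` and all numeric values are hypothetical and are not taken from the MSBN implementation.

```python
from typing import List, Sequence

Pixel = Sequence[float]  # one RGB pixel with values in [0, 1]


def enhance_pixel(
    pixel: Pixel,
    mul: Pixel,
    add: Pixel,
    color_matrix: Sequence[Sequence[float]],
    gamma: float,
    eps: float = 1e-8,
) -> List[float]:
    """Apply M and A, then the 3x3 matrix W, then gamma (assumed form of Eq. 2)."""
    # Local adjustment f(I_i) = I_i * M + A, channel by channel.
    local = [p * m + a for p, m, a in zip(pixel, mul, add)]
    out = []
    for row in color_matrix:
        # Joint white balance / color transform: mix the three channels.
        mixed = sum(w * c for w, c in zip(row, local))
        # Floor at eps so the power never sees a non-positive base.
        out.append(max(mixed, eps) ** gamma)
    return out


# Hypothetical values: identity M, A and W, so only gamma changes the pixel.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dark = [0.04, 0.09, 0.16]
bright = enhance_pixel(dark, mul=[1.0, 1.0, 1.0], add=[0.0, 0.0, 0.0],
                       color_matrix=identity, gamma=0.5)
print(bright)  # roughly [0.2, 0.3, 0.4]: gamma < 1 brightens dark pixels
```

With γ below 1, dark values are lifted more strongly than bright ones, which is why a single predicted γ can shift the overall brightness of the output.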
Building upon the ISP theory outlined above, the MSBN model is proposed, emphasizing its multi-scale module, branch correlation module, and denoising loss module.
The MSBN model is illustrated in Figure 1; multi-scale feature extraction is its initial step. By utilizing feature information at different scales, our approach facilitates a comprehensive understanding of image content. Features at smaller scales excel at capturing intricate details and textures, while those at larger scales contribute to grasping global structures and contextual relationships. This integrated processing across multiple scales enhances the accuracy and robustness of the algorithm, effectively mitigating the impact of noise on image processing results. After this step, the network is split into two branches [26]: the R branch and the G branch. The R branch is designed to generate the $M$ and $A$ components, inversely transforming the input sRGB image $I_i$ to an RGB image. The G branch is designated for querying, decoding, and predicting $W$ and γ, subsequently producing the sRGB image $I_t$.
The R branch begins by expanding the channels through a 3 × 3 convolution for initial processing. Subsequently, a 1 × 1 convolution and a 5 × 5 depthwise separable convolution are employed for further feature extraction and integration. The output is refined via an additional normalization and 1 × 1 convolution layer. Feature capturing is enhanced through GELU activation and a coordinate attention layer. Ultimately, the channels are reduced to three using a 3 × 3 convolution, and the $M$ and $A$ components are generated using ReLU activation. Throughout this process, four skip connections [27] and the layer scale [28] method (multiplication by the factors $g_1$ and $g_2$) ensure rapid and stable convergence.
In the G branch, two convolutional layers serve as encoders to process and encode features. The encoded features are then forwarded to the prediction module. In this module, the component query (Q) is initialized to zero, and no additional multi-head self-attention mechanism is employed. Q is a learnable embedding associated with the keys (K) and values (V) generated from the encoded features. The position encoding of K and V is produced through deep convolution. After passing through a feed-forward network, which consists of two linear layers, the model incorporates two parameters with specific initialization. These parameters are utilized to output the color matrix and gamma values.
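One consequence of initializing the queries to zero can be shown with a small sketch. It assumes standard scaled dot-product cross-attention, which the text does not spell out; the helper names `softmax` and `attend` and the toy keys and values are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float],
           keys: Sequence[Sequence[float]],
           values: Sequence[Sequence[float]]) -> List[float]:
    """Single-query scaled dot-product attention (assumed form)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


# Toy keys/values standing in for the encoded image features.
keys = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.1]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero query gives identical scores, so attention is uniform at initialization.
out = attend([0.0, 0.0], keys, values)
print(out)  # the plain average of the value vectors
```

Under this assumption, each query starts as a global average over the encoded features and only becomes selective as the learnable embedding moves away from zero during training.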

Multi-scale module
The multi-scale module in MSBN incorporates the Res2Net [29] structure and the SE block [30], as illustrated in Figure 2. The Res2Net structure splits the input feature maps into groups, and each set of filters extracts features from a corresponding input group, also taking the output of the previous group as input. This iterative process continues until all feature maps are processed. The resulting features are then concatenated and fused using 1 × 1 filters. As input features are transformed into output features, the equivalent receptive field expands, creating multiple feature scales through the combined effects of the 3 × 3 filters. Simultaneously, the SE block adaptively recalibrates channel-wise feature responses by explicitly modeling inter-dependencies among channels. Together, these components enhance the model's capacity for multi-scale feature extraction and contribute to its robustness.
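The way the grouped, hierarchical connections enlarge the receptive field can be illustrated with a toy one-dimensional sketch. It follows the Res2Net rule (the first group passes through, the second is filtered once, and each later group is filtered after adding the previous group's output); the width-3 sum `box3` is only a hypothetical stand-in for a learned 3 × 3 filter, and the SE block and the final 1 × 1 fusion are omitted.

```python
from typing import Callable, List

Signal = List[float]


def box3(x: Signal) -> Signal:
    """Toy stand-in for a 3x3 filter: a 1-D width-3 sum with zero padding."""
    padded = [0.0] + x + [0.0]
    return [padded[i] + padded[i + 1] + padded[i + 2] for i in range(len(x))]


def res2net_groups(groups: List[Signal], op: Callable[[Signal], Signal]) -> List[Signal]:
    """Hierarchical connections across channel groups, Res2Net style."""
    outputs = [groups[0]]                      # y1 = x1 (no filtering)
    for i, x in enumerate(groups[1:], start=1):
        if i == 1:
            y = op(x)                          # y2 = K2(x2)
        else:
            prev = outputs[-1]
            y = op([a + b for a, b in zip(x, prev)])  # yi = Ki(xi + y(i-1))
        outputs.append(y)
    return outputs


# Four groups, each holding the same impulse, to expose the receptive field.
impulse = [0.0] * 4 + [1.0] + [0.0] * 4
outs = res2net_groups([impulse[:] for _ in range(4)], box3)
supports = [sum(1 for v in y if v != 0.0) for y in outs]
print(supports)  # [1, 3, 5, 7]
```

Each later group sees a wider neighborhood than the one before it, so concatenating the group outputs yields features at several scales from a single block.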

Branch correlation module
Introducing the branch correlation adjustment parameter $g_3$ in the low-light image enhancement network improves the performance and adaptability of the network. As shown in Equation (4), the value of this parameter is calculated through tensors $C_1$ and $C_2$, which are correlated with $g_1$ and $g_2$ in branch R. $g_1$ and $g_2$ denote the neural network parameters involved in the computation of $M$ and $A$ in Equation (2). The parameter $g_3$ is introduced to enhance the precision of calculating $W$ and γ in Equation (2).
The effect of introducing $g_3$ is manifested in two main aspects. Firstly, it introduces a degree of correlation between branch R and branch G, helping to coordinate the learning processes of the two branches. This coordination enhances the network's ability to extract and understand image features under low-light conditions. Secondly, because $g_3$ is learnable, the network becomes more flexible and can adjust the relationship between branches based on specific input data and task requirements. This flexibility is crucial for improving the network's generalization performance across various scenarios and lighting conditions.

Loss Function
In terms of loss function design, three main components are utilized for comprehensive optimization. The first part, smooth L1 loss, focuses on minimizing the pixel-level differences between the enhanced and high-quality images. The second part, perceptual loss, ensures that the enhanced images maintain feature representations similar to those of the original high-quality images in the high-dimensional feature space of a pre-trained VGG model [31]. The pre-trained VGG model possesses powerful feature extraction capabilities, facilitating the preservation of semantic information in images. Its training on large-scale datasets also enables it to capture features relevant to human perceptual similarity, thereby better maintaining consistency in image perception. Through transfer learning, the model can adapt more quickly to tasks under low-light conditions.
To further enhance image quality, we introduced a denoising loss function $L_{denoising}$ [32]. The primary objective of this loss function is to minimize the noise discrepancies between the generated image and the original image, allowing the model to better preserve key features of the image. By considering subtle differences between the denoised result and the source, as well as target images, $L_{denoising}$ effectively guides the training process, ensuring that the generated images are more realistic and clearer while maintaining the structural details of the original image. The formula is as follows:

This loss function is divided into two components. The first part is the residual loss, which quantifies the residuals between the generated image and the original image at different scales. The second part is the consistency loss, which measures the consistency between the generated image and the target image at different scales. $D$ and $T$ denote $I_t$ in Equations (1) and (2). The terms $D_1$ and $D_2$ denote the downsampled results of the enhanced image. Similarly, $S_1$ and $S_2$ denote the downsampled results of the normal image. $T_1$ and $T_2$ denote the twice-downsampled results of the enhanced image.
The entire loss function is a weighted combination of the three types of losses, as defined in Equation (5), which gives the total loss $Loss_{total}$. Here, $Loss_1$ denotes the smooth L1 loss, and $Loss_2$ denotes the perceptual loss computed on the high-dimensional features extracted via VGG.
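As a rough illustration of the loss structure, the sketch below implements the standard smooth L1 term and a weighted sum of three loss values. The weights `(1.0, 0.1, 0.1)` are hypothetical placeholders, since the actual weights of Equation (5) are not given here, and the perceptual and denoising terms are passed in as plain numbers rather than computed.

```python
from typing import Sequence, Tuple


def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Mean smooth L1 loss: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def total_loss(loss_1: float, loss_2: float, loss_denoising: float,
               weights: Tuple[float, float, float] = (1.0, 0.1, 0.1)) -> float:
    """Weighted sum of the three terms; the weights are illustrative only."""
    w1, w2, w3 = weights
    return w1 * loss_1 + w2 * loss_2 + w3 * loss_denoising


# Small errors are penalized quadratically, the large one only linearly.
pixel_term = smooth_l1([0.2, 0.5, 3.0], [0.0, 0.5, 1.0])
combined = total_loss(pixel_term, loss_2=0.4, loss_denoising=0.2)
print(pixel_term, combined)
```

The linear regime makes the pixel term less sensitive to a few badly exposed pixels than a plain squared error would be, which is a common reason for choosing smooth L1 in restoration tasks.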

Experimental Settings
We used the PyTorch framework on an NVIDIA 4090 GPU to train our model for 250 epochs on the LOL-V2-real dataset [33]. LOL-V2-real contains 689 low-/normal-light image pairs for training and 100 pairs for testing. The dataset comprises images from both indoor and outdoor scenes, each with a resolution of 600 × 400. To optimize MSBN, the Adam optimizer was chosen with an initial learning rate of 2 × 10^−4 and a weight decay of 5 × 10^−4. To enhance the model's generalization and reduce the risk of overfitting, a data augmentation strategy was employed: in addition to the original images, the model was fed horizontally and vertically flipped versions of them, enriching the dataset. This strategy significantly boosted the model's performance during both the validation and testing phases. Through experiments, remarkable improvements were observed in image quality and restoration accuracy for the low-light enhancement task. As depicted in Figure 3, the visual results demonstrate the effectiveness of our method in enhancing image details and overall image quality under low-light conditions.

In this experiment, SSIM (structural similarity index measure), PSNR (peak signal-to-noise ratio), MAE (mean absolute error), and LPIPS (learned perceptual image patch similarity) were employed as metrics to assess the quality of the enhanced images. SSIM measures the structural similarity between two images, considering brightness, contrast, and structure, and it produces a score in the [−1, 1] range, where 1 indicates perfect similarity. PSNR measures image quality through the peak signal-to-noise ratio, with higher values indicating better image fidelity. MAE quantifies the average absolute difference between corresponding pixel values in the reference and enhanced images; a lower MAE value signifies better accuracy. LPIPS measures the perceptual distance between images in a way that reflects human visual perception, and it is often used to assess perceived image quality. Together, these metrics provide a comprehensive evaluation of the enhanced image quality, considering both structural and perceptual aspects.
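The two pixel-wise metrics can be written out directly. The sketch below uses the standard definitions of MAE and PSNR on flat lists of pixel values scaled to [0, 1]; the function names and toy data are illustrative, and SSIM and LPIPS are left out because they require windowed statistics and a pre-trained network, respectively.

```python
import math
from typing import Sequence


def mae(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute error between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def psnr(a: Sequence[float], b: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; `peak` is the maximum pixel value."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy data: every pixel of the enhanced image is off by about 0.1.
reference = [0.0, 0.5, 1.0, 0.25]
enhanced = [0.1, 0.4, 0.9, 0.35]
print(mae(reference, enhanced))   # about 0.1
print(psnr(reference, enhanced))  # about 20 dB
```

Because PSNR is logarithmic in the mean squared error, halving the per-pixel error raises it by roughly 6 dB, which helps put sub-decibel gains between methods into perspective.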

Visual and Perceptual Comparisons
We compared the low-light enhancement results of the MSBN method with those of several other methods [8][9][10][11][34][35][36][37]. As depicted in Figure 4, Zero-DCE [37] faces significant challenges in low-light environments: its outputs frequently remain perceptually dark. RetinexNet [8], in contrast, exhibits a notable issue related to color restoration. It shows pronounced color discrepancies that may distort the authentic colors of objects in the image, thereby compromising image quality and fidelity. IAT [11] approximates the ground truth more closely than its predecessors; however, it suffers from pronounced noise artifacts. Our method not only reduces noise relative to IAT but also yields clearer and more visually distinct images. A comparison of the output images from each model shows that our method outperforms the other models in processing low-light images. It not only significantly improves the brightness and clarity of the pictures but also makes the enhanced images closer to real scenes.

As illustrated in Figure 5, to address the issue of uneven illumination distribution in low-light image enhancement, MSBN introduces multi-scale feature extraction and integrates denoising functions to reduce noise. In the ground truth image, the area in the red box, when contrasted with the area above it, reveals an uneven illumination distribution. Unlike MSBN, the other methods did not reproduce this uneven illumination distribution. Similarly, the green box illustrates that MSBN exhibits less image noise than the other methods. These results demonstrate the efficacy of MSBN in tackling these issues.

Table 1 presents the performance of various methods in the low-light image enhancement task on the LOL-V2-real dataset. Four metrics, MAE, LPIPS, PSNR, and SSIM, were employed to assess the efficacy of each method. Notably, conventional methods like LIME and Zero-DCE exhibited lower PSNR and SSIM values while having relatively higher MAE and LPIPS values, indicating their limited effectiveness in low-light image enhancement. In contrast, advanced models such as UFormer, KIND, IAT, and MSBN demonstrated significant improvements, with MSBN achieving the highest PSNR and SSIM scores, as well as the lowest MAE and LPIPS scores. It outperformed the second-best method, IAT, by 0.255 and 0.23 in PSNR and SSIM, and it was 0.003 and 0.009 lower in MAE and LPIPS, respectively, demonstrating its superior capability in enhancing low-light images on the LOL-V2-real dataset. In terms of parameters, MSBN ranked third with only 0.16M, highlighting that it combines quantitative performance with model efficiency.

Hyperparameter Details
Table 2 provides detailed information on the hyperparameter search and optimal configuration during the low-light image enhancement experiments on the LOL-V2-real dataset. We examined various hyperparameters, including batch size, display iteration, learning rate, number of epochs, and weight decay. The search range for each hyperparameter is specified, together with the configuration that yielded the best performance. Throughout this tuning process, we paid special attention to the relationship between experimental results and different hyperparameter values. Ultimately, we found that setting the batch size to 8, the display iteration to 10, the learning rate to 2 × 10^−4, the number of epochs to 250, and the weight decay to 5 × 10^−4 resulted in the best performance of our model on the LOL-V2-real dataset. This configuration was validated experimentally, supporting the robustness and effectiveness of our proposed low-light image enhancement method.
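The reported optimal configuration can be captured as a small configuration object. The class and field names below are hypothetical, and the iteration count assumes the 689 training pairs are counted once per epoch (flipped copies not counted separately) with the final partial batch kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Optimal values reported for LOL-V2-real; field names are illustrative."""
    batch_size: int = 8
    display_iter: int = 10       # log every 10 iterations
    learning_rate: float = 2e-4
    num_epochs: int = 250
    weight_decay: float = 5e-4


def iterations_per_epoch(num_train_pairs: int, batch_size: int) -> int:
    """Number of optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_train_pairs / batch_size)


config = TrainConfig()
steps = iterations_per_epoch(689, config.batch_size)
print(steps, steps * config.num_epochs)  # 87 steps per epoch, 21750 in total
```

In a PyTorch training script, the learning rate and weight decay would typically be passed to `torch.optim.Adam` through its `lr` and `weight_decay` arguments.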

Ablation Experiment
According to the ablation results for the low-light image enhancement framework in Table 3, performance improved noticeably when each module was incorporated individually. When the three modules were integrated simultaneously, MAE and LPIPS decreased by 0.026 and 0.024, respectively, while PSNR and SSIM increased by 1.791 and 0.036, respectively.
These results indicate that the three modules bring significant improvements in addressing uneven illumination distribution, suboptimal denoising performance, and insufficient correlation among network branches in low-light image enhancement.

Conclusions
We have proposed the MSBN model, specifically designed for low-light image enhancement tasks. By integrating multi-scale feature extraction, branch correlation, and denoising techniques, MSBN effectively addresses uneven illumination distribution, suboptimal denoising performance, and insufficient correlation among network branches. With a vision transformer incorporated for enhanced feature extraction, the optimization of raw RGB data through image signal processor techniques results in a refined visual output. The composite loss function enhances robustness, yielding significant gains in luminance and noise reduction compared to traditional methods. Our method shows notable improvements on the LOL-V2-real dataset, demonstrating its effectiveness in addressing low-light challenges. The proposed general approach is applicable to many aspects of AI security, such as enhancing the performance of intrusion detection systems and improving the accuracy of facial recognition under low-light conditions.
In future work, we will explore ways to refine the MSBN model so that it handles a broader range of low-light scenarios. Efforts will be directed toward leveraging transfer learning from related domains to boost the model's generalization capabilities. These efforts are intended to further improve the model's efficacy and its applicability in diverse real-world low-light imaging scenarios.

Figure 1. Diagram of the structure of our branch network MSBN. Different colors are used to distinguish various network processes.

Figure 2. Network architecture with multi-scale feature extraction. Blue is used to differentiate from the 1 × 1 filters.

Figure 3. Low-light enhancement results obtained with our method.

Figure 5. A detailed comparison of MSBN and other methods. Red is used to compare uneven illumination distribution, and green is used to compare image noise.

Funding: This research was supported by the National Natural Science Foundation of China (No. 52105167 and No. 62302539).

Table 1. Quantitative evaluation of various methods on the LOL-V2-real dataset. ↓ indicates that smaller values are better, and ↑ indicates that larger values are better.

Table 2. Hyperparameter search and optimal configuration.

Table 3. Results of the ablation experiments obtained by adding different modules, with the baseline network represented as O, the multi-scale module as A, the denoising loss module as B, and the branch correlation module as C. ↓ indicates that smaller values are better, and ↑ indicates that larger values are better.