Low-Light Image Enhancement Based on Multi-Path Interaction

Due to non-uniform illumination conditions, images captured by sensors often suffer from uneven brightness, low contrast and noise. To improve image quality, this paper proposes a multi-path interaction network that enhances the R, G, B channels separately; the three channels are then recombined into a color image whose details are further adjusted. In the multi-path interaction network, the feature maps in several encoding-decoding subnetworks exchange information across paths, while a high-resolution path is retained to enrich the feature representation. Meanwhile, to avoid the unnatural results that separate processing of the R, G, B channels may cause, the output of the multi-path interaction network is corrected in detail to obtain the final enhancement results. Experimental results show that the proposed method effectively improves the visual quality of low-light images and outperforms state-of-the-art methods.


Introduction
With the development of computer technology and camera sensors, computer vision has been applied in various engineering fields, for example, object detection in autonomous vehicles [1] and harvesting robots [2], detection and monitoring in civil engineering [3,4], video surveillance [5], 3D reconstruction [6] and so on. Since vision tasks play an important role in such a wide range of fields, reliable working performance is required. However, these tasks depend on scene illumination, and the performance of any camera-sensor-based perception task degrades considerably in poor illumination conditions such as low-light scenes [7]. In low-light scenes, when a camera cannot receive sufficient light or the camera sensor is not sufficiently sensitive, the captured images may suffer from poor visualization and low image quality, which corrupts the valid information in the image and limits its use in computer vision tasks [8,9]. The degradation of low-light images captured in a non-uniform illumination environment causes severe loss of object information and makes object detection more challenging [10]. A camera's night mode can sometimes suppress this degradation; however, a slight shake may introduce other problems such as blurring. Improving the illumination of the environment or upgrading the camera sensor is not feasible in some conditions [11]. Therefore, low-light image enhancement methods at the software end are needed.
At present, a large number of image enhancement methods have been proposed. Histogram equalization (HE)-based methods [12,13] redistribute pixel values according to the cumulative distribution function of the input image to expand the dynamic range. For example, Ibrahim et al. [14] smoothed the input histogram with a one-dimensional Gaussian filter and then partitioned the smoothed histogram based on its local maxima. After each partition was assigned to a new dynamic range, the histogram equalization process was applied independently to these partitions. The last step in this method was to normalize the output image to the input mean brightness. Ying et al. [15] and Ren et al. [16] utilized the input image and a camera response model to adjust the pixel values. Methods based on Retinex theory [17] adaptively adjust the illuminance and reflectance components of the image, where the reflectance component is considered an inherent attribute of the scene that remains unchanged under different lighting conditions [18,19]. Jobson et al. [20] extended a previously designed single-scale center/surround Retinex to a multiscale version that achieved simultaneous dynamic range compression, color consistency and lightness rendition; to correct the deficiency present in the extension, a method of color restoration was defined, at the cost of a modest dilution in color consistency. Fu et al. [21] derived two inputs that represented luminance-improved and contrast-enhanced versions of the decomposed illumination using the sigmoid function and adaptive histogram equalization, and then fused the derived inputs with corresponding weights in a multiscale fashion to adjust the illumination. The method combined the advantages of the sigmoid function and histogram equalization, and the final enhanced image was obtained by compensating the adjusted illumination back to the reflectance. Dong et al. [22] noticed that inverted low-light images intuitively resemble images acquired in hazy lighting conditions, so low-light image enhancement has much in common with video haze removal; they therefore applied an image de-hazing algorithm to the inverted image to enhance it. These methods are simple and effective. However, the results may exhibit undesirable illumination and amplified noise.
In recent years, with the improvement of computer performance and the establishment of publicly available datasets, image enhancement methods based on convolutional neural networks (CNNs) have been actively researched. CNN-based methods are data-driven and use paired images for end-to-end learning. Wei et al. [23] proposed Retinex-Net, learned on a real dataset, which includes a Decom-Net to decompose low-light images into illumination and reflectance components and an Enhance-Net to adjust the illumination component. Xu et al. [24] observed that noise exhibits different levels of contrast in different frequency layers, and it is much easier to detect noise in the low-frequency layer than in the high-frequency one. Therefore, they proposed a network that learns to recover image objects in the low-frequency layer and then enhances high-frequency details based on the recovered image objects. Chen et al. [25] used an exposure prediction network to generate under-/overexposed images and then fused them with the input image to obtain the enhanced image. Lv et al. [26] proposed a multi-branch network to extract rich features of different levels and then fused the multi-branch outputs to produce the output image. Wang et al. [27] considered low-light image enhancement as a residual learning problem. They proposed a deep lightening network consisting of several lightening back-projection blocks that perform lightening and darkening processes iteratively to learn the residual for normal-light estimation. Moreover, a feature aggregation block that adaptively fuses the results of different lightening back-projection blocks was designed to effectively utilize local and global features. Ma et al. [11] transformed the original low-light image from the RGB to the HSI color space and used a segmentation exponential method to process the saturation (S) component while applying a specially designed deep convolutional neural network to enhance the intensity (I) component. The final improved image was obtained by converting back to the RGB color space. Lore et al. [28] used a class of deep neural networks, the stacked sparse denoising autoencoder (SSDA), to enhance natural low-light images. They explored two types of deep architecture: learning contrast enhancement and denoising simultaneously, and learning them sequentially. CNN-based methods are effective in preserving details and denoising. Nevertheless, existing methods may not perform well on color.
Another data-driven approach is the generative adversarial network (GAN)-based method. Unlike CNN-based methods, GAN-based methods do not require strictly paired images, although they usually require careful selection of unpaired training data. Each GAN contains a generator that outputs enhanced images and a discriminator that determines whether the generator's output is satisfactory. Jiang et al. [29] proposed regularizing unpaired training using information extracted from the input itself and used a global-local discriminator structure to handle spatially varying light conditions in the input image, while adding the idea of self-regularization, implemented by both a self feature-preserving loss and a self-regularized attention mechanism. Chen et al. [30] augmented the U-Net with global features, improved the Wasserstein GAN (WGAN) with an adaptive weighting scheme, and then used individual batch normalization layers for the generators in two-way GANs to help the generators better adapt to their own input distributions; this design improves the stability of GAN training for the application. Liu et al. [31] proposed a perceptual-details GAN that utilizes Zero-DCE to initially recover illumination and combines a residual dense-block encoder-decoder structure to suppress noise while finely adjusting the illumination. In addition, details were enhanced using fractional differential gradient masks integrated into the discriminator. However, the generator may collapse when the discriminator fails to discriminate its output, and it is difficult to obtain the desired output from two models with opposite objectives trained simultaneously [27].
In order to effectively enhance the brightness of low-light images while restoring color and details, we propose an end-to-end learning method. The method consists of two cascaded subnetworks that first enhance the color channels and then adjust the details to obtain enhanced images with good color restoration. The enhanced images are expected to display improved visual quality and improved performance in computer vision tasks such as object detection and instance segmentation [32,33]; an example of text recognition is shown in Section 3.4. Overall, our contributions are as follows: (1) The low-light image enhancement task is simplified into three steps: the first step is the enhancement of the R, G, and B channels; the second is the reconstruction of the color image; and the last is the adjustment of details.
(2) We design a multi-path interaction network (MPI-net) to enhance the R, G, and B channels, in which the interaction across parallel paths makes the feature maps potentially more accurate.
(3) With the help of the exposure amplification loss in the detail correction network (DC-net) and other losses, the final enhanced images are more natural. The experimental results demonstrate that our method outperforms several state-of-the-art enhancement methods.

Proposed Method
Networks inspired by U-Net [34] typically follow a single path from high resolution to low resolution for encoding and from low resolution back to high resolution for decoding, where usually only skip connections directly concatenate the feature maps of a downsampling layer to the corresponding upsampling layer of the same spatial resolution to increase the amount of information available in the upsampling steps [35]. To enrich the information representation of the feature maps, we design a multi-path interaction network (MPI-net), which extends the U-Net structure to strengthen the information interaction between feature maps of different resolutions while retaining a high-resolution path, improving the utilization of information in the network. Retaining a high-resolution path, rather than recovering high resolution from low resolution by upsampling, potentially leads to more accurate feature maps [36].
We consider image enhancement as the enhancement of three channels. First, the R, G, and B channels of low-light images are trained separately to obtain the enhanced R, G, and B channels, which are then recombined into a color image. Since the three channels are trained separately, the correlation between the color channels is ignored, and the obtained images may have unnatural colors and overexposure. Therefore, a detail correction network (DC-net) is used after the multi-path interaction network (MPI-net) to further adjust the color images generated from the output of MPI-net. The DC-net consists of several convolutional layers, and the output of its last layer is a residual map. The enhanced images are obtained by subtracting the residual maps from the color images generated from the output of MPI-net. The overall architecture of the proposed method is shown in Figure 1. The enhancement process is divided into three steps: color channel enhancement, reconstruction, and detail adjustment. In the color channel enhancement step, the subnetwork MPI-net enhances the R, G, B channels. In the reconstruction step, the enhanced R, G, B channels are concatenated to generate the preliminary enhanced image $I_{mpi}$. In the detail adjustment step, we concatenate $I_{mpi}$ and the input image as the input of DC-net, whose output is a residual map. The final enhanced image $I_{enh}$ is obtained by subtracting the residual map from $I_{mpi}$.
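For illustration, the following is a minimal sketch of the three-step pipeline, written in TensorFlow 2/Keras style rather than the TensorFlow 1.14 code actually used; the names `enhance`, `mpi_nets`, and `dc_net` are hypothetical placeholders for the trained subnetworks, and the use of one MPI-net per channel reflects our reading of "trained separately":

```python
import tensorflow as tf

def enhance(low_light, mpi_nets, dc_net):
    """Three-step enhancement: channel enhancement, reconstruction, detail adjustment.

    low_light: (B, H, W, 3) tensor in [0, 1]; mpi_nets is a list of three
    single-channel enhancement models; dc_net predicts a residual map.
    """
    # Step 1: enhance the R, G, B channels independently (trained separately).
    channels = [mpi_nets[c](low_light[..., c:c + 1]) for c in range(3)]
    # Step 2: reconstruct the preliminary enhanced image I_mpi.
    i_mpi = tf.concat(channels, axis=-1)
    # Step 3: DC-net takes [I_mpi, input] and outputs a residual map I_res;
    # the final result is I_enh = I_mpi - I_res.
    i_res = dc_net(tf.concat([i_mpi, low_light], axis=-1))
    return i_mpi - i_res
```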

Multi-Path Interaction Network
The first path of the multi-path interaction network (MPI-net) is a high-to-low and low-to-high resolution network (HL-net), and HL-net paths are added one by one to form more paths until the last path contains only high-resolution feature maps. The paths are connected in parallel, and the feature maps for the parallel paths of a later stage consist of the feature maps from the previous stage and an extra lower-resolution one. Meanwhile, there is a high-resolution path in the network. The architecture of MPI-net is shown in Figure 2. MPI-net connects multiple paths to form a richer feature representation while retaining the capability of U-Net. At the same time, the existence of a high-resolution path and the interaction of information between feature maps of the same or different resolutions in different paths make the feature representation potentially more accurate [36]. In one path, each downsampling step is a convolution with stride 2, and each upsampling step uses bilinear interpolation to double the size of the feature map. Moreover, three cascaded convolutional layers are included between two operations with different spatial resolutions; each convolutional layer consists of a 3 × 3 convolution with padding, followed by a rectified linear unit (ReLU) activation function. In addition, skip connections directly concatenate the feature maps of a downsampling layer to the corresponding upsampling layer to increase the amount of information in the upsampling steps. The numbers of channels of the feature maps at the four resolutions of the first path are 32, 64, 128 and 256, respectively, and the other paths use the same channel numbers as the first path.
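Under these specifications, a single HL-net path might be sketched as follows. This is a non-authoritative Keras sketch: the channel width assigned to each stride-2 downsampling convolution and the exact ordering of operations are our assumptions, not details given in the text.

```python
from tensorflow.keras import layers

def conv_block(x, filters, n=3):
    # Three cascaded 3x3 conv + ReLU layers between resolution changes.
    for _ in range(n):
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return x

def hl_net(x):
    widths = [32, 64, 128, 256]  # channels at the four resolutions of the first path
    skips = []
    # Encoding (high-to-low): each downsampling step is a stride-2 convolution.
    for w in widths[:-1]:
        x = conv_block(x, w)
        skips.append(x)
        x = layers.Conv2D(w, 3, strides=2, padding='same')(x)
    x = conv_block(x, widths[-1])
    # Decoding (low-to-high): each upsampling step is bilinear interpolation.
    for w, skip in zip(reversed(widths[:-1]), reversed(skips)):
        x = layers.UpSampling2D(interpolation='bilinear')(x)
        x = layers.Concatenate()([x, skip])  # skip connection across resolutions
        x = conv_block(x, w)
    return x
```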
The exchange of information between feature maps of different resolutions leads to rich resolution representations [36]. Therefore, exchange units are introduced across the parallel paths of MPI-net; an example is shown in Figure 3. Since the paths are connected in parallel, each path repeatedly receives information from the other parallel paths. The feature maps that exchange information are at the same depth in the network and usually have different resolutions. In an exchange unit, the feature maps from the other paths are transformed to the resolution of the target path and concatenated onto it to complete the information exchange. Both upsampling and downsampling are used only once in one exchange unit.
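A minimal sketch of one exchange unit between a high-resolution and a low-resolution path, under the assumption that the resizing transforms mirror those used within the paths (bilinear upsampling and a stride-2 convolution), each applied once:

```python
from tensorflow.keras import layers

def exchange_unit(high, low, filters):
    """Exchange information between a high- and a low-resolution feature map.

    Each path receives the other path's features resized to its own
    resolution and concatenated channel-wise.
    """
    # Low-resolution features are upsampled once for the high-resolution path.
    low_to_high = layers.UpSampling2D(interpolation='bilinear')(low)
    # High-resolution features are downsampled once (stride-2 convolution).
    high_to_low = layers.Conv2D(filters, 3, strides=2, padding='same')(high)
    new_high = layers.Concatenate()([high, low_to_high])
    new_low = layers.Concatenate()([low, high_to_low])
    return new_high, new_low
```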

Loss Function
The loss function of MPI-net consists of two components, the mean square error loss $L^{mpi\text{-}net}_{mse}$ and the structural similarity loss $L^{mpi\text{-}net}_{ssim}$, expressed as follows:

$$L_{mpi\text{-}net} = L^{mpi\text{-}net}_{mse} + \lambda_1 L^{mpi\text{-}net}_{ssim}$$

where $\lambda_1$ is used to control the contribution of the image structure term.
The mean squared error (MSE) is the average of the squared pixel-wise errors between the enhanced channel and the reference channel, and is used to evaluate the overall difference between the two channels. A smaller MSE means a better result. Therefore, the mean square error loss $L^{mpi\text{-}net}_{mse}$ is defined as:

$$L^{mpi\text{-}net}_{mse} = \frac{1}{H \times W} \left\| I_{mpi\_c} - I_{ref\_c} \right\|_2^2$$

where $I_{mpi\_c}$ is the enhanced $c$ channel, $I_{ref\_c}$ is the $c$ channel of the reference image, $\|\cdot\|_2$ denotes the $L_2$ norm, and $H$ and $W$ are the height and width of the image.
The structural similarity (SSIM) [37] is used to evaluate the similarity of two channels in terms of luminance, contrast and structure. The value of SSIM ranges from 0 to 1, and a larger value indicates better similarity. The definition of SSIM is as follows:

$$\mathrm{SSIM}(mpi\_c, ref\_c) = \frac{\left(2\mu_{mpi\_c}\,\mu_{ref\_c} + C_1\right)\left(2\sigma_{mpi\_c,ref\_c} + C_2\right)}{\left(\mu_{mpi\_c}^2 + \mu_{ref\_c}^2 + C_1\right)\left(\sigma_{mpi\_c}^2 + \sigma_{ref\_c}^2 + C_2\right)}$$

where $mpi\_c$ and $ref\_c$ are simple representations of the enhanced $c$ channel and the reference $c$ channel, $\mu_{mpi\_c}$ and $\mu_{ref\_c}$ are the means of $I_{mpi\_c}$ and $I_{ref\_c}$, $\sigma_{mpi\_c}^2$ and $\sigma_{ref\_c}^2$ are the variances of $I_{mpi\_c}$ and $I_{ref\_c}$, $\sigma_{mpi\_c,ref\_c}$ is the covariance of $I_{mpi\_c}$ and $I_{ref\_c}$, and $C_1$ and $C_2$ are constants that take the default values. In order to alleviate the structural distortion that usually exists in low-light images [26], we introduce the structural similarity loss:

$$L^{mpi\text{-}net}_{ssim} = 1 - \mathrm{SSIM}(mpi\_c, ref\_c)$$

Detail Correction Network
The enhanced R, G, B channels are concatenated to generate the preliminary enhanced image $I_{mpi}$. In order to avoid the loss of details caused by enhancing the color channels separately, the preliminary enhanced image and the low-light image are concatenated as the input of the detail correction network (DC-net) to adjust the details. As shown in Figure 4, the DC-net contains six 64-channel convolutional layers, and the 3-channel feature map obtained from the last convolutional layer is a residual map. The final enhanced image is obtained by subtracting the residual map from the preliminary enhanced image. The last convolutional layers of MPI-net and DC-net have no activation function.
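A sketch of DC-net as described, assuming 3 × 3 kernels and ReLU activations for the six 64-channel intermediate layers (the text specifies only the channel counts and that the last layer has no activation):

```python
from tensorflow.keras import layers, Model

def build_dc_net(h=None, w=None):
    # Input: preliminary enhanced image I_mpi concatenated with the low-light
    # input along the channel axis (3 + 3 = 6 channels).
    inp = layers.Input((h, w, 6))
    x = inp
    for _ in range(6):  # six 64-channel convolutional layers
        x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    # Last layer outputs the 3-channel residual map I_res; no activation.
    res = layers.Conv2D(3, 3, padding='same', activation=None)(x)
    return Model(inp, res)
```

The final enhanced image is then obtained as $I_{enh} = I_{mpi} - I_{res}$.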

Loss Function
We introduce DC-net and design an extra exposure amplification loss $L_{ea}$ and smoothing loss $L_{smooth}$ to suppress overexposure and make the enhanced images more natural. The total loss function of DC-net is expressed as:

$$L_{dc\text{-}net} = L^{dc\text{-}net}_{mse} + L^{dc\text{-}net}_{ssim} + \lambda_2 L_{ea} + \lambda_3 L_{smooth}$$

where $\lambda_2$ and $\lambda_3$ are used to control the degree of overexposure suppression and smoothing, respectively. The $L^{dc\text{-}net}_{mse}$ is expressed as:

$$L^{dc\text{-}net}_{mse} = \frac{1}{H \times W} \left\| I_{enh} - I_{ref} \right\|_2^2$$

where $I_{enh}$ is the final enhanced image, obtained by subtracting the residual map $I_{res}$ of DC-net from the preliminary enhanced image $I_{mpi}$, i.e., $I_{enh} = I_{mpi} - I_{res}$, and $I_{ref}$ is the reference image. The $L^{dc\text{-}net}_{ssim}$ is expressed as:

$$L^{dc\text{-}net}_{ssim} = 1 - \mathrm{SSIM}(enh, ref)$$

where $enh$ and $ref$ are simple representations of $I_{enh}$ and $I_{ref}$. Through a gamma transformation, the pixel difference between the bright areas of the enhanced image and the reference image can be greatly increased, while the pixel difference between the dark areas is only slightly increased. Therefore, using the average pixel difference between the gamma-transformed enhanced image and reference image as a loss places more emphasis on the bright areas and suppresses overexposure. The exposure amplification loss $L_{ea}$ is defined as follows:

$$L_{ea} = \frac{1}{H \times W} \left\| I_{enh}^{\,\gamma} - I_{ref}^{\,\gamma} \right\|_1$$

where $\|\cdot\|_1$ denotes the $L_1$ norm, and $\gamma$ is used to control the increase in the relative difference.
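A one-function sketch of $L_{ea}$, assuming images normalized to [0, 1] and the experimental value $\gamma = 5$ reported later:

```python
import tensorflow as tf

def exposure_amplification_loss(i_enh, i_ref, gamma=5.0):
    # Raising pixels in [0, 1] to a power gamma > 1 stretches differences in
    # bright regions and compresses them in dark regions, so the L1 mean
    # penalizes overexposed bright areas more heavily.
    return tf.reduce_mean(tf.abs(i_enh ** gamma - i_ref ** gamma))
```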
To smooth the enhanced image and make it more natural, we introduce the smoothing loss $L_{smooth}$ to minimize the difference between the horizontal and vertical gradients of the enhanced image and the reference image in the color channels, as sketched below. The smoothing loss $L_{smooth}$ is defined as follows:

$$L_{smooth} = \frac{1}{H \times W} \sum_{c \in \{R,G,B\}} \left( \left\| \nabla_x I^{c}_{enh} - \nabla_x I^{c}_{ref} \right\|_1 + \left\| \nabla_y I^{c}_{enh} - \nabla_y I^{c}_{ref} \right\|_1 \right)$$

where $\nabla_x I^{c}_{enh}$ and $\nabla_y I^{c}_{enh}$ are the horizontal and vertical gradients of the enhanced image $I_{enh}$ in channel $c$, and $\nabla_x I^{c}_{ref}$ and $\nabla_y I^{c}_{ref}$ are the horizontal and vertical gradients of the reference image $I_{ref}$ in channel $c$. Figure 5 shows an example of the images used and generated in the proposed method, including the input image and its R, G, and B channels, the R, G, and B channels enhanced by MPI-net and the preliminary enhanced image $I_{mpi}$, the residual map $I_{res}$ generated by DC-net, and the obtained enhanced image $I_{enh}$.
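A corresponding sketch of $L_{smooth}$ using TensorFlow's built-in image gradients; `tf.image.image_gradients` returns vertical and horizontal finite differences, and averaging over channels rather than summing is a normalization assumption on our part:

```python
import tensorflow as tf

def smoothing_loss(i_enh, i_ref):
    # Match the horizontal (dx) and vertical (dy) gradients of the enhanced
    # image to those of the reference image in each color channel.
    # Inputs: (B, H, W, 3) tensors in [0, 1].
    dy_enh, dx_enh = tf.image.image_gradients(i_enh)
    dy_ref, dx_ref = tf.image.image_gradients(i_ref)
    return (tf.reduce_mean(tf.abs(dx_enh - dx_ref)) +
            tf.reduce_mean(tf.abs(dy_enh - dy_ref)))
```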

Training Details and Dataset
The experiments were carried out using TensorFlow 1.14.0 on a workstation with an Intel(R) Xeon(R) E5-2186 CPU @ 3.80 GHz, an Nvidia GeForce RTX 2080 Ti and 64 GB of RAM. The parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\gamma$ were set to 3, 1.3, 5, and 5, respectively, based on experiments. The training images were normalized to [0, 1] and randomly cropped to patches of size 48 × 48. The Adam optimizer was used with default parameters, and the number of training epochs for both subnetworks was set to 30. The learning rate was initialized to 0.001 and reduced by a factor of 10 every 10 epochs. The training could be completed within 5 min.
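The optimizer and learning-rate schedule described above could be restated as follows in modern Keras (a sketch, not the authors' TensorFlow 1.14 code; `random_patch` is a hypothetical helper for the 48 × 48 cropping):

```python
import tensorflow as tf

# Initial learning rate 0.001, reduced by a factor of 10 every 10 epochs.
def lr_for_epoch(epoch):
    return 1e-3 * (0.1 ** (epoch // 10))

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_for_epoch(0))
lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_for_epoch)

# Training patches: images normalized to [0, 1], randomly cropped to 48x48.
def random_patch(low, ref):
    stacked = tf.concat([low, ref], axis=-1)           # crop both consistently
    patch = tf.image.random_crop(stacked, [48, 48, 6])
    return patch[..., :3], patch[..., 3:]
```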
The training dataset was drawn from the LOL dataset, a real dataset containing a training set with 485 image pairs and a testing set with 15 image pairs. The scenes in the dataset are rich, and the image resolution is 600 × 400. We selected 234 image pairs of different scenes from the training set of the LOL dataset as our new training dataset, and used the 15 images from the testing set of the LOL dataset plus another 8 images from the LOL dataset (outside our training dataset) and SICE [38] as our new testing dataset. In addition, images from LIME [19,39] were also selected to further demonstrate the effectiveness of the proposed model.

Visual Quality Comparison
We performed experiments on images with different lighting conditions. The images were from the LOL dataset (outside our training dataset), LIME, and SICE [38,39], and the results are shown in Figures 6 and 7. As shown in Figure 6, although some methods such as LIME achieve good brightness in local areas, they amplify noise at the same time. More details can be seen in the last row of Figure 6, where the images enhanced by the proposed method are smooth and noise-free. The brightness of the images enhanced by BIMEF and LECARM is insufficient, and MSRCR performs unsatisfactorily in terms of image contrast. In Figure 7, the results of LIME are unnatural in bright areas, and MF and Retinex-Net over-enhance the input image and distort its colors. In comparison, the proposed method effectively enhances the brightness, and its enhanced images are the most natural.

Evaluation
For a fair comparison, we use the representative metrics PSNR (peak signal-to-noise ratio), SSIM (structural similarity), and NIQE (natural image quality evaluator) to evaluate the quality of the enhanced images. PSNR measures the degree of distortion between an image and its reference. SSIM measures image similarity in terms of luminance, contrast, and structure. NIQE is a no-reference image quality evaluation method. We report the average over the images in the testing dataset, and the results are shown in Table 1. The larger the values of PSNR and SSIM, and the smaller the value of NIQE, the better the result; the best results for PSNR, SSIM, and NIQE are bolded in this paper. It can be seen that the proposed method outperforms the other methods in all three metrics. Compared to the best results of the state-of-the-art methods, the proposed method offers a 12.682% improvement in PSNR (over LIME), a 22.930% improvement in SSIM (over BIMEF), and a 27.701% improvement in NIQE (over BIMEF). This means that the images obtained by the proposed method have the best quality.
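For reference, PSNR and SSIM can be computed with built-in TensorFlow ops as sketched below; NIQE has no TensorFlow built-in and requires a separate implementation, so it is omitted from this sketch:

```python
import tensorflow as tf

def evaluate(enhanced, reference):
    # Both tensors: (B, H, W, 3), values in [0, 1]. Larger PSNR/SSIM is better.
    psnr = tf.image.psnr(enhanced, reference, max_val=1.0)
    ssim = tf.image.ssim(enhanced, reference, max_val=1.0)
    # Report the average over the testing dataset.
    return tf.reduce_mean(psnr), tf.reduce_mean(ssim)
```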

Ablation Study
To demonstrate the effectiveness of MPI-net, DC-net, and the loss function $L_{ea}$, we conducted an ablation study and analyzed the experimental results. Specifically, we designed two experiments: (a) removing the loss function $L_{ea}$; (b) comparing with the preliminary enhanced image $I_{mpi}$. The visual comparison results are presented in Figure 8. As can be seen, without the loss function $L_{ea}$, the bright areas are easily overexposed, resulting in a loss of content. Moreover, the preliminary enhanced images $I_{mpi}$ have unsatisfactory performance in bright areas and details. In contrast, the proposed method enhances the dark areas while suppressing overexposure, and the details are natural. Table 2 shows the comparison results in terms of PSNR, SSIM, and NIQE. We find that the loss function $L_{ea}$ effectively improves the quality of the enhanced image and that DC-net is necessary. Because the color channels are first enhanced separately, DC-net may occasionally perform poorly and introduce color distortion; nevertheless, the proposed method performs satisfactorily in most scenarios.

Application
To further illustrate the effectiveness of the proposed method in improving the accuracy of computer vision tasks, we tested our output on the Google Vision API (https://cloud.google.com/vision/, accessed on 12 June 2021). The results are shown in Figure 9. As can be seen, the Google Vision API accurately recognizes the text from the enhanced image, while it makes recognition errors on the low-light image. The original image is from SICE [38].

Conclusions
In this paper, a multi-path interaction network (MPI-net) is designed to enhance the R, G, and B channels separately, and a detail correction network (DC-net) with corresponding loss functions is then used to adjust the details. Thanks to the information interaction between different paths in MPI-net, the feature maps are potentially more accurate. Moreover, the enhanced images are more natural after the adjustment of DC-net. We compared our method with state-of-the-art methods, and the experimental results, both visual and in terms of the evaluation metrics, show that the proposed method performs better. With the wide application of computer vision, it is becoming increasingly important to improve the performance of computer vision tasks in low-light conditions. Our future work will focus on improving the generalization ability of the enhancement model and the enhancement effect in extreme environments, as well as building a complete enhancement and object detection system for nighttime autonomous driving and video surveillance.