Contextual Information Aided Generative Adversarial Network for Low-Light Image Enhancement

Abstract: Low-light image enhancement has gradually become a hot research topic in recent years due to its wide usage as an important pre-processing step in computer vision tasks. Although numerous methods have achieved promising results, some of them still generate results with detail loss and local distortion. In this paper, we propose an improved generative adversarial network based on contextual information. Specifically, residual dense blocks are adopted in the generator to promote hierarchical feature interaction across multiple layers and enhance features at multiple depths in the network. Then, an attention module integrating multi-scale contextual information is introduced to refine and highlight discriminative features. A hybrid loss function containing perceptual and color components is utilized in the training phase to ensure overall visual quality. Qualitative and quantitative experimental results on several benchmark datasets demonstrate that our model achieves relatively good results and has good generalization capacity compared to other state-of-the-art low-light enhancement algorithms.


Introduction
Nowadays, in a world where multimedia equipment is more easily accessible than ever, images and videos have become the most ubiquitous ways to convey and record information. However, imperfect capturing conditions, such as insufficient lighting and short exposure time, are inevitable, leaving the captured images with detail loss and low brightness and limiting their practical value. Manual operations such as increasing exposure time and setting a high ISO can mitigate the degradation, but typical users may not have the skills required to operate the photographic device. Hence, enhancing the contrast and details of low-light images using software algorithms is highly desirable and can benefit both low-level and high-level real-world image applications such as night imaging and video surveillance.
Low-light image enhancement has gradually become an active research topic in computer vision field in recent years. Current methods can be roughly classified into three categories: histogram-based algorithms, Retinex-based algorithms and data-driven algorithms.
An image histogram counts all pixels in an image and depicts the frequency distribution of pixel values. Histogram-based algorithms directly manipulate the image histogram and stretch it to obey a uniform distribution, which widens the dynamic range of the low-light image. Some of them [1][2][3][4] adopted the global histogram of the input image to estimate the pixel transformation function, but they ignored unevenly distributed darkness and might introduce over-exposure distortion in some areas of the enhanced result. To alleviate this problem, local histogram-based methods were proposed [5,6]. They used the local image histogram to make the entire transformation function more adaptive. Later on, some methods [7,8] adopted different constraints on the image histogram to further consider the correlation of adjacent regions and reduce over-stretching distortion. Data-driven algorithms, by contrast, leverage large volumes of training data to automatically learn a non-linear mapping from a low-light image to its normal-light counterpart. More powerful feature representation capacity usually requires a more complicated CNN structure, resulting in a large computational burden. While all the above methods have achieved promising performance, some results still suffer from detail loss and noise amplification when images are taken under unbalanced lightness distribution.
In this paper, we propose an improved GAN for low-light image enhancement. We use U-Net [26] as our backbone. U-Net has been applied to many image restoration tasks due to its encoder-decoder structure, which is beneficial for propagating semantic information across layers. To ease the gradient flow and boost feature propagation across different hierarchical layers, we build our encoder-decoder U-Net-like network using residual dense blocks. Previous works adopt attention mechanisms [27,28] to make the enhancement model focus on important regions, but they all focus on the spatial or channel dimension and ignore multiscale context information. Context information is important for image restoration tasks, where a pixel-to-pixel correspondence is learned from the input image to the output image. In image restoration tasks, removing degraded image content while preserving desired spatial details can be realized by enlarging the receptive field, leading to richer context information [29]. Thus, to reduce unwanted distortion in the enhanced results and integrate more context information at multiple scales, we introduce a multiscale context attention module (MSCAM) to better refine feature maps and capture more salient features. To mitigate color deviation in the enhanced result, we further use a hybrid loss incorporating a perceptual loss in the feature domain and a color-related component computed in the HSV color space during the training phase. Compared to traditional techniques, our method can directly learn mappings from low-light images to normal-light images without heavy regularization design or manually selected parameters, and can quickly produce robust, visually pleasing results. Additionally, like previous works [30,31] using GAN in practical applications, our method can also be helpful in realistic visual monitoring systems, such as enhancing visibility for face detection in the dark [25].
The main contributions of this paper are listed as follows: (1) We design a GAN framework built upon an encoder-decoder structure using residual dense blocks for better gradient back-propagation and feature fusion across different layers, which plays an important role in image enhancement. (2) We introduce a novel and lightweight MSCAM module to further refine feature maps using context information at multiple scales. The MSCAM enlarges receptive fields and highlights important features, which reduces enhancement distortion. We train our GAN framework with a joint perceptual and color-related loss to mitigate color deviation and detail loss in the enhanced results. (3) We validate our method on several benchmark datasets. Experimental results demonstrate its superiority against many state-of-the-art methods qualitatively and quantitatively.
The remainder of this paper is organized as follows. Some related topics of this work are introduced in Section 2. We describe our proposed method in detail in Section 3. Experimental results and related analysis are presented in Section 4. Section 5 concludes this work.

Generative Adversarial Network
GAN, proposed by Goodfellow et al. [32], has attracted much attention and become one of the hottest research topics in the deep learning community. A GAN is comprised of two feed-forward networks, i.e., a discriminator and a generator. The generator is trained to produce fake images indistinguishable from real images to fool the discriminator, while the discriminator is trained to differentiate fake images from real images as well as possible. The two networks have completely opposite training targets, and their relationship can be viewed as a min-max game in which they compete. To minimize (maximize) the training loss for the generator (discriminator), the adversarial loss can be formulated as follows:

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

where $D$ and $G$ denote the discriminator and generator, and $z$ and $x$ denote the random noise vector and the real image, respectively. $p_z(z)$ represents sampling from random noise, and $p_{data}(x)$ represents sampling from the real-world data distribution.
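As a concrete illustration of this objective, the following PyTorch sketch implements the standard alternating losses, assuming a discriminator that ends with a sigmoid; it is a minimal didactic example, not the training code used in this paper:

```python
# Minimal sketch of the standard GAN objective (not this paper's loss).
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, x_real, z):
    """max_D E[log D(x)] + E[log(1 - D(G(z)))], written as a minimization."""
    pred_real = D(x_real)
    pred_fake = D(G(z).detach())        # detach: no gradients flow into G here
    return (F.binary_cross_entropy(pred_real, torch.ones_like(pred_real)) +
            F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake)))

def generator_loss(D, G, z):
    """min_G E[log(1 - D(G(z)))], in the usual non-saturating form."""
    pred_fake = D(G(z))
    return F.binary_cross_entropy(pred_fake, torch.ones_like(pred_fake))
```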
Recently, GAN has been widely applied in many low-level image restoration applications. Ledig et al. [33] utilized GAN to recover fine texture details and infer photo-realistic natural images using a novel perceptual loss during the training process. Jiang et al. [21] constructed an unsupervised GAN framework to perform low-light image enhancement without paired training samples. In addition, they used a self-regularized attention mechanism and double discriminators, i.e., local and global discriminators, to handle unevenly distributed lighting in the input image. All these works demonstrated that GAN has great potential in low-level image processing, and that the encoder-decoder structure in the generator plays a pivotal role in feature representation.

Attention Mechanism
The attention mechanism is inspired by the fact that human eyes can quickly scan the global image content to obtain target areas that need to be focused on and pay more attention to them [34]. It has been widely applied in various computer vision tasks, such as image classification [35] and image restoration [36]. Wang et al. [37] introduced non-local operation into CNN to capture distant information for spatial attention. Hu et al. [38] proposed a squeeze-and-excitation block to adaptively recalibrate important information along channel axes. In the field of low-light image enhancement, Atoum et al. [27] introduced color-wise attention map to provide auxiliary information for image enhancement. Lv et al. [28] proposed a fully CNN containing four subnets to perform brightness enhancement and denoising tasks with the guidance of two attention maps. While these works have demonstrated relatively good performance, they all capture spatial information within the same scale. In contrast, we sequentially apply the attention mechanism along both spatial and channel axes to refine feature maps, and also incorporate multiscale context information to alleviate distortion.

Dilated Convolution
Dilated convolution was first proposed for image semantic segmentation [39]. It inserts holes between adjacent locations in standard convolutional kernels, which increases the size of the receptive field without additional parameters or computing costs. Dilated convolution has also been widely applied in the field of object detection [40]. In this paper, we further exploit the multi-scale information hidden in feature maps by using multiple dilated convolutions with different dilation rates, which can adaptively aggregate contextual information without losing resolution.
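A small PyTorch sketch makes the receptive-field effect concrete; the channel width and dilation rates here are illustrative assumptions:

```python
# Dilated 3x3 convolutions: the weights stay 3x3, but the effective kernel
# span grows as k + (k - 1) * (rate - 1) = 2 * rate + 1 for k = 3.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
for rate in (1, 2, 4, 8):
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=rate, padding=rate)
    y = conv(x)  # padding=rate preserves the 32x32 spatial resolution
    print(f"rate={rate}: output {tuple(y.shape)}, effective span {2 * rate + 1}")
```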

Proposed Method
In this section, we elaborate on the detailed architecture and loss function of our proposed GAN for low-light image enhancement.

Overall Network Architecture
The overall architecture of our proposed generator is shown in Figure 1a. Our generator adopts the encoder-decoder structure that has been proven effective in many image restoration tasks. The encoder extracts local and global information hidden in feature maps at different levels, such as image contrast and texture, with increasing sizes of receptive fields. The decoder utilizes the relevant information and up-samples the feature maps to produce the final enhanced result. ResNet, proposed by He et al. [15], has achieved great breakthroughs in many low- and high-level vision problems. Its identity mapping based on short connections can effectively solve the gradient vanishing problem that occurs when training deep networks. Later on, Huang et al. [41] proposed DenseNet, which directly connects the current layer with all preceding layers using short connections, facilitating feature reuse and information flow. Inspired by the structures of ResNet and DenseNet, we insert local residual dense blocks into the generator of our proposed GAN. The detailed structure of the local residual dense block is illustrated in Figure 2. It consists of five convolutional layers with batch normalization and Leaky rectified linear unit (LReLU) activation, and each latter layer directly connects with all preceding layers. Let $X_I$ and $X_O$ denote the input and output of the local residual dense block; the intermediate output $X_k$ after the $k$-th convolutional layer can be formulated as:

$X_k = \Phi_k([X_0, X_1, \ldots, X_{k-1}])$

where $\Phi_k$ denotes the composite function (convolution, batch normalization, and LReLU) of the $k$-th convolutional layer, and $[X_0, X_1, \ldots, X_{k-1}]$ refers to the concatenation of the feature maps produced by all preceding layers and the input (with $X_0 = X_I$).
To preserve the feed-forward nature and ease gradient back-propagation, a local residual connection is added within each dense block. Hence, the final output of each local residual dense block is as follows:

$X_O = X_I + X_5$

where $X_5$ is the output of the last convolutional layer. Concretely, in the encoder part, we insert one local residual dense block between the convolutional layer and the max pooling layer. Similarly, in the decoder counterpart, we insert one residual dense block after the up-sampling and convolutional layer. Hence, in our proposed generator, hierarchical features can be fully exploited and fused across multiple convolutional layers. At the same spatial resolution level, we introduce a short-path connection and insert our proposed MSCAM between the encoder and decoder parts in order to remedy the information loss in the down-sampling process and further refine feature maps for better representation. Apart from the local residual and dense connections, inspired by DnCNN [42], we also adopt the global residual learning strategy to ease the training process.
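A possible PyTorch implementation of this block is sketched below; the growth rate `growth` and the exact channel bookkeeping are our assumptions for illustration, since the paper only specifies five convolutional layers with batch normalization and LReLU:

```python
# Sketch of a local residual dense block (Figure 2): dense concatenation
# across five conv-BN-LReLU layers plus a local residual connection.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    def __init__(self, channels: int, growth: int = 32, num_layers: int = 5):
        super().__init__()
        self.layers = nn.ModuleList()
        for k in range(num_layers):
            in_ch = channels + k * growth        # input + all preceding outputs
            out_ch = channels if k == num_layers - 1 else growth
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2, inplace=True)))

    def forward(self, x):
        feats = [x]                                           # X_0 = X_I
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))      # X_k = Phi_k([X_0..X_{k-1}])
        return x + feats[-1]                                  # X_O = X_I + X_5
```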
The discriminator of our proposed GAN aims to judge whether the generated results can be distinguished from real normal-light images. We adopt the structure of PatchGAN [43] without batch normalization as our discriminator, as shown in Figure 1b. The discriminator is fully convolutional, mapping the input to an $N \times N$ matrix. Each element $X_{ij}$ represents the probability that the corresponding patch within a receptive field of the input image is real. The final output of the discriminator is the average of all $X_{ij}$, representing whether the enhanced result is close to a real image.
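A minimal sketch of such a PatchGAN-style discriminator without batch normalization is shown below; the channel widths and the number of downsampling stages are assumptions, since the paper does not specify them here:

```python
# Sketch of a PatchGAN-style discriminator (no batch normalization):
# a stack of strided convolutions mapping the input to an N x N score map.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch: int = 3, base: int = 64):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        ch = base
        for _ in range(3):                        # assumed depth
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        layers.append(nn.Conv2d(ch, 1, 4, stride=1, padding=1))  # patch scores
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        scores = self.net(x)                      # each X_ij rates one patch
        return scores.mean(dim=(1, 2, 3))         # average over all patches
```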

Multi-Scale Context Attention Module
Previous works [21,27,28] adopt the attention mechanism to guide the enhancement process. However, local distortion may appear in their enhancement results. To reduce local distortion, non-local context information can be integrated into the model. Context information has been successfully applied in many works [40,44] due to the auxiliary information provided by surrounding content. The down-sampling operation in our generator inevitably causes information loss as the spatial size of feature maps becomes smaller. To fully exploit hidden information and further refine feature maps in the generator, motivated by [40,45], we introduce the MSCAM embedded at each spatial resolution level for better feature representation. As depicted in Figure 3, our MSCAM consists of two main parts: a multiscale context spatial attention submodule (MCSA) and a channel attention submodule.

Figure 3. Detailed structure of the multiscale context attention module.
Let $X_i \in \mathbb{R}^{C \times H \times W}$ denote the feature map after each residual dense block at layer $i$ ($i = 1, 2, 3, 4$), where $C$ is the channel number, and $H$ and $W$ represent the height and width of the feature map, respectively. Firstly, our MSCAM concatenates $X_i$ with the output feature $X_O^{i+1}$ from the MSCAM located at the next layer, and merges the two feature maps using a $1 \times 1$ convolution to form a new feature map $X$. Then, the following two attention submodules further exploit the feature map and generate a 2D spatially refined feature map and a channel-wise refined feature map separately, which encourages the generator to capture more relevant information along the spatial and channel dimensions. The whole process can be expressed as follows:

$X_{MC} = F_{MC}(X), \quad X_C = F_C(X_{MC})$

where $F_{MC}$ denotes the multiscale context spatial attention submodule, $X_{MC} \in \mathbb{R}^{C \times H \times W}$ represents the spatially refined feature maps, $F_C$ represents the channel attention submodule, and $X_C \in \mathbb{R}^{C \times H \times W}$ represents the channel-wise refined feature maps. To facilitate convergence during the training process, we add a residual connection between the input feature maps and the refined feature maps:

$X_O^i = X + X_C$

We also duplicate the output feature maps, denoted as $X_O^i$, and transmit them to the MSCAM at the previous shallower layer for cross-layer feature interaction. Note that for the deepest layer, only feature maps from the current layer are needed.
The detailed structure of the MCSA submodule is inspired by the fact that local details are more easily noticed by human eyes than global content, and that different viewing distances make human eyes focus on different ranges. Therefore, we adopt four convolutional layers with different dilation rates to excavate multiscale context information hidden in feature maps, aggregating non-local information of different scales into the generator. As shown in Figure 4, the four dilated convolutional layers are arranged in four branches, inspired by the structure proposed in [40]. In each branch, we use a spatial attention module to highlight important regions. Concretely, we apply both average pooling and max pooling operations along the channel dimension on the feature maps after the dilated convolutional operators, producing two feature descriptors with the size of $H \times W \times 1$. After concatenating them along the channel dimension, we use a convolutional layer with a kernel size of $7 \times 7$ to squeeze the result back to $H \times W \times 1$. Finally, the sigmoid function is employed to produce the attention map $M_i$ ($i = 1, 2, \ldots, 4$), which is then multiplied element-wise with the input feature map $X_i$ ($i = 1, 2, \ldots, 4$) of each branch. The results from the four branches are concatenated to produce the final refined feature maps $X_{MC}$.
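The following sketch shows one way to realize the MCSA submodule in PyTorch; the dilation rates (1, 2, 4, 8) and the per-branch channel split are our assumptions, while the avg/max pooling, 7 × 7 convolution, and sigmoid follow the description above:

```python
# Sketch of the MCSA submodule: four dilated branches, each reweighted by a
# spatial attention map built from channel-wise avg/max pooling.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)         # H x W x 1 descriptor
        mx, _ = x.max(dim=1, keepdim=True)        # H x W x 1 descriptor
        m = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * m                              # element-wise reweighting

class MCSA(nn.Module):
    def __init__(self, channels: int, rates=(1, 2, 4, 8)):
        super().__init__()
        branch_ch = channels // len(rates)        # assumes channels % 4 == 0
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, branch_ch, 3, padding=r, dilation=r)
            for r in rates)
        self.attn = nn.ModuleList(SpatialAttention() for _ in rates)

    def forward(self, x):
        outs = [a(b(x)) for b, a in zip(self.branches, self.attn)]
        return torch.cat(outs, dim=1)             # X_MC with C channels again
```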
Apart from the spatial attention mechanism that captures informative regions, feature channels also contain important cues. Different channels carry key information with different levels of importance. Inspired by previous works [38,45], we adopt a channel attention submodule to model channel interdependencies along the channel dimension. We present the channel attention submodule in Figure 5. Feature maps after the MCSA submodule are processed by global max pooling and global average pooling operations along the spatial axes to obtain two channel descriptors with the shape of $C \times 1 \times 1$. Then, both descriptors are, respectively, sent to a multi-layer perceptron consisting of two shared fully connected (FC) layers with a ReLU activation function. The length of the features after the first FC layer is set to $(C/r) \times 1 \times 1$ to reduce complexity, where $r$ is the reduction ratio, while the length of the output after the second FC layer remains unchanged. After the shared network is applied to both descriptors, we adopt an element-wise summation operation to merge the two output features. A sigmoid activation function then turns the merged feature into the final channel attention map $M_C$. Finally, the input feature map $X_{MC}$ to the channel attention submodule is multiplied with the channel attention map in an element-wise manner to obtain the channel-wise refined feature map $X_C$.
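A compact sketch of this channel attention submodule is given below; it follows the two-descriptor, shared-MLP design described above, with the reduction ratio r = 16 as an assumed default:

```python
# Sketch of the channel attention submodule: max/avg descriptors through a
# shared two-layer MLP, merged by summation and squashed by a sigmoid.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                  # shared FC layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):                          # x: (B, C, H, W), i.e., X_MC
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))         # global average descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))          # global max descriptor
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * m_c                             # channel-wise refined X_C
```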

Loss Function
Loss function plays a pivotal role in training deep neural networks. An appropriate loss function can stabilize the training process for better convergence and contribute to producing visually pleasing enhanced results. In this paper, we adopt a compound loss function $L_{total}$ consisting of three components: GAN loss $L_{gan}$, perceptual loss $L_p$, and color loss $L_c$, which can be calculated as follows:

$L_{total} = L_{gan} + \lambda_1 L_p + \lambda_2 L_c$

where $\lambda_1$ and $\lambda_2$ represent the corresponding weight coefficients for the perceptual loss and color loss to balance the relative importance of each component. We detail each loss component below.

GAN aims at generating fake images that are similar enough to real high-quality images to deceive the discriminator. The standard GAN loss guides the generator to produce images whose distribution matches that of real natural images. Here, we adopt the relativistic average discriminator [46], which predicts the relative probability that real images are more realistic than fake images and leads the generator to produce more realistic results. The relativistic average discriminator can be written as:

$D_{Ra}(x_r, x_f) = \sigma\left(C(x_r) - \mathbb{E}_{x_f \sim P_{fake}}[C(x_f)]\right)$

where $C$ represents the discriminator, $x_r$ and $x_f$ represent samples from the real distribution $P_{real}$ and the fake distribution $P_{fake}$, respectively, $\sigma$ denotes the sigmoid function, and $\mathbb{E}$ represents taking the average value over all samples. We use the least squares GAN loss instead of the sigmoid function in the discriminator. Therefore, the loss functions of our generator and discriminator can be formulated as:

$L_D = \mathbb{E}_{x_r \sim P_{real}}\left[\left(C(x_r) - \mathbb{E}_{x_f}[C(x_f)] - 1\right)^2\right] + \mathbb{E}_{x_f \sim P_{fake}}\left[\left(C(x_f) - \mathbb{E}_{x_r}[C(x_r)] + 1\right)^2\right]$

$L_G = \mathbb{E}_{x_f \sim P_{fake}}\left[\left(C(x_f) - \mathbb{E}_{x_r}[C(x_r)] - 1\right)^2\right] + \mathbb{E}_{x_r \sim P_{real}}\left[\left(C(x_r) - \mathbb{E}_{x_f}[C(x_f)] + 1\right)^2\right]$

where $L_G$ and $L_D$ represent the respective losses of the generator and discriminator. During the training phase, the generator and the discriminator are optimized alternately.

The perceptual loss proposed in [47] has been successfully applied in many low-level visual applications. It calculates and minimizes the distances between generated images and ground-truth images in the feature domain by feeding images into a pretrained VGG-19 network, guiding the results to have visual appearances similar to the target. Unlike some previous works that calculate feature distances after only one layer of the VGG-19 model, we choose feature maps after multiple convolutional layers so that information within multiscale receptive fields is involved. Hence, the perceptual loss is defined as the $l_2$ norm of feature differences:

$L_p = \sum_j \frac{1}{H_j W_j} \left\| IN(\phi_j(\hat{y})) - IN(\phi_j(y)) \right\|_2$

where $\hat{y}$ and $y$ denote the ground-truth image and the generated result, $\phi_j(\cdot)$ denotes feature extraction after the $j$-th convolutional layer of the pretrained VGG-19 model, $IN(\cdot)$ denotes instance normalization [48], and $H_j$ and $W_j$ denote the spatial height and width of the feature map.

Color deviation inevitably occurs during the training process when enhancing low-light images. To mitigate color deviation and further improve image naturalness as much as possible, motivated by [49], we utilize a color loss computed in the HSV color space, which is closer to human perception and has been widely applied in image processing applications. The color loss is formulated as:

$L_c = \left\| H - \hat{H} \right\|_1 + \left\| S - \hat{S} \right\|_1 + \left\| V - \hat{V} \right\|_1$

where $H$, $S$, and $V$ represent the hue, saturation, and value components of the generated image, respectively, and $\hat{H}$, $\hat{S}$, and $\hat{V}$ represent those of the target image.
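The non-adversarial part of this hybrid loss could be sketched in PyTorch as follows; the chosen VGG-19 layer indices, the l1 distance for the HSV terms, and the use of kornia for the RGB-to-HSV conversion are our assumptions, not the paper's exact configuration:

```python
# Sketch of the perceptual + color loss terms (assumed layer choices/norms).
import torch
import torch.nn.functional as F
import kornia
from torchvision.models import vgg19, VGG19_Weights

_vgg = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)
_LAYERS = {3, 8, 17, 26}          # assumed conv layers used for distances

def perceptual_loss(fake, real):
    loss, x, y = 0.0, fake, real
    for j, layer in enumerate(_vgg):
        x, y = layer(x), layer(y)
        if j in _LAYERS:          # instance-normalize features, then compare
            loss = loss + F.mse_loss(F.instance_norm(x), F.instance_norm(y))
        if j == max(_LAYERS):
            break
    return loss

def color_loss(fake, real):
    h1, s1, v1 = kornia.color.rgb_to_hsv(fake).unbind(dim=1)
    h2, s2, v2 = kornia.color.rgb_to_hsv(real).unbind(dim=1)
    return F.l1_loss(h1, h2) + F.l1_loss(s1, s2) + F.l1_loss(v1, v2)

def total_loss(l_gan, fake, real, lam1=5.0, lam2=5.0):
    return l_gan + lam1 * perceptual_loss(fake, real) + lam2 * color_loss(fake, real)
```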

Dataset Description and Evaluation Metrics
We conduct extensive experiments to evaluate our proposed method systematically. LOL [19] and SICE [50] are two common paired datasets in the low-light image enhancement research community. Following many previous works [25,51], we use them as our benchmark datasets to validate our model. Some previous works [52,53] also evaluate their methods on the MIT-Adobe FiveK [54] dataset. Nevertheless, this dataset was originally constructed for enhancing the aesthetic quality of images, and only a small portion of its images were taken under low-light conditions. Thus, we do not use it as a benchmark dataset.
The original LOL dataset contains 500 real-captured low-/normal-light image pairs with the size of 400 × 600, which were captured in real scenes by changing the exposure time and ISO of the camera. To further enrich content diversity, 1000 more low-/normal-light image pairs with the size of 384 × 384 were synthesized as supplements by analyzing the illumination distribution of real captured low-light images. The constructors of the LOL dataset divided the 500 real-captured image pairs into a training set of 485 pairs and a testing set of 15 pairs. However, this original splitting scheme cannot fully evaluate the performance of comparative methods, since the testing set is too small and the contents of the training and testing sets are repetitive. Hence, we adopt the modified version of the LOL dataset proposed in DRBN [55], which consists of 789 low-/normal-light image pairs; 689 pairs serve as training samples, while the other 100 pairs form the testing set. We also use the 1000 synthesized image pairs in the LOL dataset as supplementary training data, so our training data contain 1689 pairs including real-captured and synthesized images. In addition, the contents of the testing set are unseen during the training phase, which is important for fair comparison.
The original SICE dataset contains 589 sequences of different scenes, and each sequence includes multiple images with different exposure levels and a high-quality reference image selected by multi-exposure fusion algorithms. We filter out SICE images containing misalignment caused by moving objects or image distortion introduced during the collection process. Additionally, we reduce the spatial size of SICE images due to limited computing resources: the resized images have the same aspect ratio as the originals, with the longer side containing 800 pixels. Finally, we choose 1300 under-exposed images with their corresponding references as the training set, and another 120 images as the testing set. The contents of the testing set do not overlap with the training set.
In addition, we also evaluate the generalization capacity of our method on several widely used benchmark low-light image datasets, i.e., LIME [12], MEF, VV, and MFUSION, among others. These datasets have no corresponding normal-light images as ground truths. Their image contents are much more diverse and are not included in the LOL and SICE datasets. All the datasets were built from natural scenes, and their images contain no null data. In addition, we deleted completely dark images before the experiments, rather than changing and re-scaling them [60], since they do not contain any useful information.
We use four common image quality metrics to measure the perceptual quality of enhanced results, i.e., peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) index [61], the natural image quality evaluator (NIQE) [62], and the integrated local NIQE (IL-NIQE) [63]. PSNR is the ratio between the maximum possible power of the normal-light image and the power of the background noise that degrades image fidelity. SSIM measures image quality from the perspective of image structure, since human eyes pay more attention to image edges and details, and image distortion degrades structural information. These two metrics require corresponding high-quality images as references when evaluating perceptual quality. NIQE extracts relevant features from high-quality natural images to build a pristine quality model, and measures image quality by calculating the distance from it. IL-NIQE further extends NIQE by taking local patches into consideration. These two metrics obtain perceptual quality scores directly from the input images without any references.
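As a reference point, PSNR for images normalized to [0, 1] reduces to a one-line computation; SSIM, NIQE, and IL-NIQE require their published reference implementations:

```python
# PSNR in dB for tensors scaled to [0, 1].
import torch

def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```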

Implementation Detail
Our method is implemented using the PyTorch framework and trained on a PC with an Nvidia 2080Ti GPU and an i7-8700k CPU. We adopt the Adam optimization algorithm [64] with a batch size of 4, and the default parameters of Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$) are fixed. The initial learning rate is set to $10^{-5}$. During the training process, we crop patches at random locations in all training images with the unified size of 320 × 320 × 3. The loss weights $\lambda_1$ and $\lambda_2$ are both set to 5. For low-/normal-light image pairs, the cropping position in coupled images is kept the same to avoid unwanted pixel misalignment. For data augmentation, we randomly flip training images horizontally. The training images are normalized into the range of [0, 1]. All image batches are packed into 4-D tensors to serve as the CNN inputs. Batch normalization is adopted after each convolutional layer for better convergence during the training process. All trainable parameters in convolutional layers are initialized using the method introduced in [65]. We train our network for 120 epochs in total, and the learning rate is decayed by a factor of 0.5 every 30 epochs. During the testing stage, the batch size is set to 1; hence, our network can process images with arbitrary spatial size.
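The optimizer and schedule described above can be summarized in the following sketch; `generator`, `discriminator`, and `loader` are placeholders, and the alternating update step itself is omitted:

```python
# Sketch of the training configuration: Adam (lr 1e-5), batch size 4,
# learning rate halved every 30 epochs, 120 epochs in total.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(generator: nn.Module, discriminator: nn.Module, loader: DataLoader,
          epochs: int = 120, lr: float = 1e-5):
    g_opt = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.9, 0.999))
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.9, 0.999))
    g_sched = torch.optim.lr_scheduler.StepLR(g_opt, step_size=30, gamma=0.5)
    d_sched = torch.optim.lr_scheduler.StepLR(d_opt, step_size=30, gamma=0.5)
    for _ in range(epochs):
        for low, normal in loader:   # aligned 320x320 crops, flipped jointly
            ...                      # alternate D / G updates (omitted)
        g_sched.step()
        d_sched.step()
```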

Comparison with State-of-the-Art Methods
We evaluate our method on benchmark datasets and compare it with several state-of-the-art methods. The competing methods include: (1) Retinex theory-based methods: LIME [12], BIMEF [66], NPE [13], and SRIE [14]; (2) deep-learning-based methods: RetinexNet [19], GLADNet [67], Zero-DCE [25], RRDNet [24], and EnlightenGAN [21]. We choose these comparative methods, like many previous works [27,28,55] did, since they are representative in the field of low-light image enhancement. For learning-based methods, we use the source codes provided by their corresponding authors, and we retrain and test them on the same training and testing datasets as the proposed method for fair comparison. For methods that do not need a training process, we directly test them on the same testing set. Note that when evaluating on LOL and SICE, we use their respective training and testing samples.
We first list the quantitative results of all competing methods on the LOL and SICE datasets. PSNR and SSIM are chosen as evaluation metrics since both LOL and SICE have reference ground truths. In addition, we use the NIQE score as an extra no-reference index. For PSNR and SSIM, a larger score means better perceptual quality, while for NIQE, a lower score means better perceptual quality. The average scores of all methods on each dataset are listed in Table 1. As shown in Table 1, our method outperforms all state-of-the-art methods on the LOL dataset in terms of PSNR and SSIM, even though we do not directly adopt MSE or SSIM losses in the optimization process, demonstrating the superiority of our method. For the NIQE score on the LOL dataset, our result ranks in the top two and is inferior only to GLADNet. For the SICE dataset, our method achieves good results except for SSIM, which is only slightly lower than GLADNet. Nearly all methods' results degrade on the SICE dataset compared to the LOL dataset. From Table 1, we can also find that methods without any training data, such as RRDNet and SRIE, perform poorly, indicating the importance of the data-driven routine. Overall, learning-based methods perform better than traditional Retinex-based methods, attributable to the strong representation power of CNNs.

We then show some qualitative results of all competing methods. Figure 6 displays visual comparisons with state-of-the-art methods on the LOL dataset. To make image details more visible, we also zoom in on some details in the bounding boxes and show them in the bottom left corner. Specifically, in Figure 6b-e, traditional Retinex-based methods fail to generate enhanced results with sufficient brightness, and their results contain intense noise. The zero-shot learning-based method RRDNet cannot enhance the low-light input, as shown in Figure 6f, indicating the importance of training data. In Figure 6g-i, these three learning-based methods also produce images with severe noise, such as the area surrounding the wheel under the ping-pong table shown in the red rectangles. On the contrary, in Figure 6j, GLADNet generates a relatively good enhanced result compared to RetinexNet, Zero-DCE, and EnlightenGAN.
However, it still introduces some noise and color deviation, making its results not as close to the ground-truth image as ours. Figure 7 displays a visual comparison against state-of-the-art methods on the SICE dataset. We can see clearly that LIME improves the illumination but introduces color deviation. NPE, SRIE, BIMEF, and RRDNet yield results without sufficient brightness. RetinexNet and Zero-DCE yield results with much noise and color distortion. In Figure 7i, the result of EnlightenGAN has obvious over-saturation and under-exposure, such as in the "sky" and "building". The results of GLADNet and our method look very similar, which corresponds to the results in Table 1. However, in Figure 7k, our result contains a little more texture, such as the "tree" in the red rectangle.

Apart from the above two benchmark datasets, we also evaluate our method on five real-world datasets widely used in many low-light enhancement algorithms to test its generalization capacity. Since these datasets have no ground-truth reference normal-light images, we adopt NIQE and IL-NIQE as the evaluation metrics, where a lower quality score means better perceptual quality. Before testing our model on these datasets, we train it using the LOL training samples. The quantitative results are listed in Table 2. We also list weighted average indexes for all methods, where the weight is calculated according to the number of images in each dataset. As can be seen from Table 2, our method obtains relatively good performance on all five datasets. Even though our method does not achieve the best NIQE and IL-NIQE scores on some datasets such as MEF and LIME, it is still among the top three results, and its scores are only slightly lower than the best. Our method achieves the best average indexes, which indicates that it has relatively good generalization capacity. Some visual results on these datasets are shown in Figures 8-11.

Figures 8 and 9 show representative enhancement results of different methods on two real-world datasets, i.e., MEF and VV. As can be seen, our model can effectively enhance real-world low-light images. Concretely, LIME can enhance visual visibility, but it produces results with over-exposure and color shift, such as the "balloon" in Figure 8b and the "plant" in Figure 9b. Obviously, NPE, SRIE, BIMEF, and RRDNet fail to improve visual visibility and weaken image contrast. RetinexNet and Zero-DCE can restore image brightness but amplify intense noise in some areas, such as the top part of the "balloon" in Figure 8 and the "plant" in Figure 9. EnlightenGAN yields dark results near the characters on the "balloon" in Figure 8 and also produces halo effects near the "woman" contour in Figure 9. GLADNet also amplifies noise and produces over-exposure in the "woman face" area. Contrarily, our method can produce results with rich texture, sufficient brightness, and less distortion.

Figures 10 and 11 show another two representative enhancement results from two real-world datasets, i.e., MFUSION and LIME. In Figure 10, LIME and GLADNet can enhance image lightness, but they tend to over-enhance the "tree" region. NPE and RetinexNet can produce results with suitable exposure, but they introduce more color distortion on the pavement. The result of Zero-DCE looks unnatural, and SRIE, BIMEF, RRDNet, and EnlightenGAN cannot improve image lightness and recover details such as the tree texture as well as our method does. In Figure 11, the results of NPE, SRIE, BIMEF, and RRDNet contain under-exposed regions. LIME, RetinexNet, and EnlightenGAN show obvious color deviation on the pavement near the car. Compared to GLADNet, our result has a clearer blue traffic sign and a less blurred pattern on the pavement.

Ablation Analysis
To evaluate the effectiveness of each component in our model, we conduct several ablation experiments. Specifically, we remove one component of our model at a time and re-train the model using the same parameter settings and the LOL dataset. To verify the role of the proposed MSCAM, we replace it with short connections that directly link the relevant layers at the same spatial level. To evaluate the effectiveness of the multiple residual dense blocks (RDBs), we replace them with plain convolutional layers with the same channel numbers, the same activation function, and batch normalization layers. Moreover, to demonstrate the performance gain introduced by different parts of the loss function, we set the weights $\lambda_1$ and $\lambda_2$ of the perceptual loss and color loss to 0, respectively. Our model is a GAN-based method; hence, the adversarial loss cannot be omitted. Each model variant is evaluated on the LOL testing set. All the quantitative results are listed in Table 3.
As can be seen from Table 3, our model with all components achieves the best results, demonstrating that each component contributes to the final results. Obviously, the model without RDBs or MSCAM degrades performance by a large margin. Besides, each component of the loss function plays a critical role during the training process. To demonstrate the results in a more intuitive way, we also display some qualitative results in Figure 12. All model variants can improve the brightness, but they still contain some distortion. Without RDBs or MSCAM, the enhanced results contain color shift, such as the "back wall area" in Figure 12b, and noise amplification, such as the "door" region in Figure 12c. Without perceptual loss, the enhanced result in Figure 12d shows a severe color shift. The result in Figure 12e contains less color deviation, but without color loss, it still includes a slight color shift in the "back wall area near the chair", degrading the enhancement result. In contrast, our result looks more visually pleasing and contains vivid color and less noise, validating the effectiveness of each component.

We also investigate the impact of the number of convolutional layers in the RDBs. We gradually change the number of convolutional layers from two to five; each time, we re-train and evaluate our model on the LOL dataset with the other configurations fixed. Using more than five layers would exceed the GPU resources at our disposal. The quantitative results are shown in Table 4. We can see that using five convolutional layers in each RDB improves performance. In addition, we study the impact of the two weight coefficients $\lambda_1$ and $\lambda_2$ in the training loss function. Each time, we vary one of them and fix the other to 1, then re-train and test our model on the LOL dataset. We plot the performance curves with respect to $\lambda_1$ and $\lambda_2$ in Figure 13. We can see that, in general, SSIM and NIQE are stable with respect to $\lambda_1$ and $\lambda_2$ over a wide range. PSNR shows larger fluctuations than SSIM and NIQE since it calculates the pixel-wise difference between two images. To balance the three metrics and intuitively observe the loss curve during training, we empirically set both $\lambda_1$ and $\lambda_2$ to 5.

User Study
To subjectively validate the visual similarity between the enhanced results and the ground-truth normal-light images, we invited 20 volunteer college students without special image processing knowledge to judge whether the enhanced results are similar to the ground truths. We used ten images chosen from the LOL dataset and asked all volunteers to rate the similarity level according to their visual perception. The ratings consisted of five discrete scores, where 5 denotes the best visual similarity to the ground-truth image and 1 denotes the worst. All volunteers sat at a viewing distance of about 2.5 times the monitor's height to assign scores to every enhanced image. Each time, we presented an enhanced result of one comparative method together with its corresponding ground-truth image to one subject. The display order of the methods was random, and the subjects did not know which images were generated by our method. For each competing method, we averaged the scores of all images given by all subjects as the final similarity score. Here, we only conducted subjective tests on the learning-based methods. The results are listed in Table 5. It can be observed that our method obtains a superior mean opinion score (MOS) compared to the others, meaning our enhanced results are the most similar to the ground-truth normal-light images. RetinexNet and RRDNet obtain the lowest scores among all methods, indicating that their enhanced results have distinct differences from the ground truths. EnlightenGAN and GLADNet are slightly inferior to our method, which is consistent with the previous analyses.

Computational Complexity
An ideal enhancement algorithm is expected to produce excellent enhancement results while having low computational complexity. We also test the running speed of each comparative method. We choose an image from the LOL dataset with a fixed spatial resolution of 400 × 600 × 3 for all methods. All comparative methods are run on a desktop computer with a 3.7 GHz CPU and 32 GB of internal memory. For learning-based methods, we also use an NVIDIA GTX1080ti GPU to accelerate the testing phase. We list the running time of each method in Table 6. Our method ranks third among all methods. Clearly, Zero-DCE achieves the fastest inference speed since it adopts a lightweight architecture to estimate the enhancement curve. Although our method incurs more computational overhead than EnlightenGAN, it still provides a relatively good trade-off between efficiency and effectiveness.

Table 6. Runtime comparisons of different methods. The best result is highlighted in boldface.

Conclusions
In this paper, we propose an improved GAN for low-light image enhancement. To promote hierarchical feature fusion across multiple layers at different depths and ease information flow, we insert local residual dense blocks into the generator. Then, to alleviate local distortion and reduce noise, we introduce a multiscale context attention module that integrates contextual information from multiple scales and enables feature interaction across layers. Spatial and channel-wise dependencies are also modeled to adaptively refine and dynamically recalibrate feature maps. A hybrid loss function containing perceptual and color components is utilized to train the proposed model. Quantitative and qualitative results demonstrate that our proposed method can generate visually pleasing enhanced results. However, our method still has some limitations. For instance, it cannot process images taken in extremely dark scenes, and the number of parameters in our model could be further reduced. In future work, we will explore the use of transfer learning to reduce complexity and further improve performance.