LighterGAN: An Illumination Enhancement Method for Urban UAV Imagery

In unmanned aerial vehicle based urban observation and monitoring, the performance of computer vision algorithms is inevitably limited by the low illumination and light pollution caused degradation, therefore, the application image enhancement is a considerable prerequisite for the performance of subsequent image processing algorithms. Therefore, we proposed a deep learning and generative adversarial network based model for UAV low illumination image enhancement, named LighterGAN. The design of LighterGAN refers to the CycleGAN model with two improvements—attention mechanism and semantic consistency loss—having been proposed to the original structure. Additionally, an unpaired dataset that was captured by urban UAV aerial photography has been used to train this unsupervised learning model. Furthermore, in order to explore the advantages of the improvements, both the performance in the illumination enhancement task and the generalization ability improvement of LighterGAN were proven in the comparative experiments combining subjective and objective evaluations. In the experiments with five cutting edge image enhancement algorithms, in the test set, LighterGAN achieved the best results in both visual perception and PIQE (perception based image quality evaluator, a MATLAB build-in function, the lower the score, the higher the image quality) score of enhanced images, scores were 4.91 and 11.75 respectively, better than EnlightenGAN the state-of-the-art. In the enhancement of low illumination sub-dataset Y (containing 2000 images), LighterGAN also achieved the lowest PIQE score of 12.37, 2.85 points lower than second place. Moreover, compared with the CycleGAN, the improvement of generalization ability was also demonstrated. In the test set generated images, LighterGAN was 6.66 percent higher than CycleGAN in subjective authenticity assessment and 3.84 lower in PIQE score, meanwhile, in the whole dataset generated images, the PIQE score of LighterGAN is 11.67, 4.86 lower than CycleGAN.


Introduction
With the development of unmanned aerial vehicle (UAV) technologies, more important and complex tasks are performed by low or ultra-low altitude UAVs which embedded powerful functions, especially in the field of urban remote sensing [1]. Meanwhile, in the last decades, with the development of computer science and the popularization of its applications, intelligence has become the development tendency of modern cities [2]. In GAN [14]), and apply two evaluation criterion, visual subjective perception and objective no-reference image quality assessment. In the test set experiment, our algorithm achieves the best results in both evaluations. Subjectively, LighterGAN is 0.42 higher than state-ofthe-art EnlightenGAN which ranked the second place, meanwhile, the PIQE (perception based image quality evaluator) [15] score was 2.73 points lower than the second place. In the enhancement experiment of sub-dataset Y, LighterGAN also achieves the lowest PIQE score of 12.37, 2.85 points lower than second place. In terms of model structure improvement and generalization ability enhancement of LighterGAN, as an improved model based on CycleGAN, the promotion of its two improvements, attention mechanism and semantic consistency loss, has also been demonstrated by the combination of subjective and objective evaluation. Because the two sub-datasets used for training are all collected in reality, the authenticity of the generated image can be used as a criterion for evaluating the generalization ability of the model. After using the two mappings to generate test images, LighterGAN is 10 percent higher than CycleGAN in subjective authenticity assessment and 3.84 lower in PIQE score. Moreover, after using the two mappings to generate images for the whole dataset, PIQE score of LighterGAN is 11.67, 4.86 lower than CycleGAN. As the improved model is proven to be able to generate images with higher authenticity and higher quality, the benefits of improvements can be demonstrated.

Limitations of Size-Limited Paired Dataset in Urban-Wise Image Illumination Enhancement
In terms of implementation details of LigherGAN, a low illumination UAV image will be mapped to the feature space by the CNN based encoder and become feature maps, this process could be described as either feature extraction or down-sampling. Whereafter, in the up-sampling, the feature maps will be translated into illumination enhanced image by the decoder, items in the generated image will be characterized by sufficient illumination. In order to enhance an image under the premise of ensuring authenticity, theoretically, the larger the number of images in a dataset, the greater the probability of implementation. However, the image semantic feature information provided by the artificially generated paired dataset is not sufficient. In the urban-wise low illumination images, many scenes include interference light, affecting image readability, which is caused by tamps, orange light reflected by water and building surface, etc. but these scenes will not appear under the case of sufficient illumination environment. By using fixed-parameters, enhanced images will also result in the neural network mapping being inflexible and unable to cope with the diversified application environments where the illumination varies with the environment -even the images in an artificially generated paired dataset can be created simply by using the brightness linear adjustment algorithms. On the other hand, the GAN based model needs enough data to support the convergence in training [16]. In the case of training when only a small amount of data is available, it is difficult to achieve a satisfactory Nash equilibrium [17], and it is easy to fall into model collapse, in which case images generated by the model are not easily recognized by the discriminator, like meaningless noise or even things do not exist in reality. As such, for our study, an unpaired dataset is used. In the design of the model, we consider that when image illumination has enhanced, items in the image is expected to show some necessary sufficient illumination related characteristics, e.g., road and surface of the water will not reflect the light of a streetlamp significantly, resulting in the feature distributions changing. However, the transformation of these features is based on the model being trained on a more reasonable dataset, therefore, our unlabeled and unpaired dataset is more advantageous.

Image Quality Assessment (NR-IQA)
IQA (image quality assessment), the necessary evaluation for image enhancement algorithms, is divided into subjective assessment and objective assessment. The former evaluates image quality from subjective visual perception, while the latter uses mathematical models to give quantitative values with the goal of making objective judgments consistent with the subjective perception of visual perception. Subjective evaluation will inevitably be affected by display equipment, lighting conditions, visual ability and emotion of observers, etc., and it is difficult to handle a large number of tasks. Therefore, automatic and accurate prediction of the objective quality evaluation is necessary. IQA can be divided into 3 types according to the amount of information provided by the original reference image: FR-IQA (full reference-IQA), RR-IQA (reduced reference-IQA) and NR-IQA (no reference-IQA). FR-IQA has both the original image and the enhanced image, the core of which is to compare the information amount or feature similarity. The information possessed by NR-IQA is only the enhanced image itself without any other reference. RR-IQA, the type of method between FR-IQA and NR-IQA, possesses partial features extracted from the original image [18].
For FR-IQA, the most authoritative algorithms are PSNR (peak signal to noise ratio) and SSIM (structural similarity index) [19], both of which are based on the premise that there is a mapping relationship between the original and enhanced images. In this article, unsupervised learning and unpaired dataset are used, which means that each enhanced image does not have a unique correct solution. Therefore, PSNR and SSIM are not suitable for evaluating the quality of images enhanced by the LighterGAN.
Because people are fully capable of judging the quality of a specific image without any reference, NR-IQA with the same preset conditions is more practical in the evaluation of the performance of LighterGAN. In this article, perception based image quality evaluator (PIQE), a representative evaluation method, is applied. As a built-in function in the image processing toolbox of MATLAB, the quality evaluation ability and impartiality of PIQE have been proven, and it has been widely used in the evaluation tasks of images generated by deep learning model, especially in generative adversarial networks, such as [20][21][22] et al. For urban-wise UAV applications, high authenticity and avoidance of distortion and degradation are required by the subsequent intelligence image algorithms. Meanwhile, in each patch of urban UAV aerial photography captured image, items are greatly complex and various. Therefore, as a block-wise algorithm which uses the method of dividing the image into non-overlapping blocks for subsequent calculations, the PIQE is obviously appropriate. More specifically, there are three masks in PIQE, the activityMask can detect compression artifacts and noise in the input image and estimate the spatial quality, the noticeableArtifactsMask focuses on the blocking artifacts and sudden distortions in the activityMask, the noiseMask can locate and estimate the blocks contain Gaussian noise. In addition, the PIQE image quality level is based on the LIVE Image Quality Assessment Database Release 2 [23]. For an enhancement algorithm based on convolutional autoencoder, the performance of LighterGAN highly relies on the quality of image reconstruction, meanwhile the model generalization ability can also be evaluated by the quality of the generated image [24]. Considering comprehensively, in this article the PIQE score is used in image enhancement quality objective assessment and the quantitative evaluation of model generalization ability.

Generative Adversarial Network
Generally, there are two types of deep learning models in terms of tasks: the generative model and the discriminative model [25]. The discriminative model allowed a machine to learn the decision function or conditional probability distribution to output the prediction results. The generative model allowed the machine to learn the joint probability density function and made the conditional probability distribution as the output; GAN is the combination of two models. In 2014, the generative adversarial network was designed by Goodfellow, which allowed researchers to mix the spurious images with the genuine by using unsupervised learning [26]. In GAN, the generator learns feature distributions of real images to make the generated images more realistic to deceive the discriminator. The discriminator has to become more intelligent to discriminate the authenticity between both types of images with model training. After reaching the Nash equilibrium, the generated images achieve high quality and are hard to be identified.
In recent years, with the development of the GAN family, more effective GAN such as WGAN [27], DCGAN [28], and CycleGAN have been designed to improve the performance in the fields of dataset augmentation, image enhancement, image denoising, image classification, super-resolution, semantic segmentation, etc. [29]. Among these applications, image translation is the most eye-catching one in recent years, as GAN has shown its advantages in the feature transformation field. Initial image translation started with a CGAN-based supervised image translation model, the Pix2Pix, which was proposed by Isola [30]. 'Supervised' means it uses a paired dataset for training, where each image has a mapping with a certain image in another subset, shown in Figure 1a. It uses an end-to-end architecture based on GAN, where the input is the original image x, and the output is the translated image T(x). The original images and the translated images are combined to train the discriminator. Paired datasets are costly in certain cases and have to be well-labeled; therefore, more and more unsupervised image translation models have been proposed recently [31], the most classic one is the CycleGAN. CycleGAN uses an unpaired dataset for training and has a cycle consistency loss to replace the previous reconstruction loss in the achievement of image translation. In this article, the principle of image illumination enhancement of LigherGAN is also based on image translation, the enhancement will be achieved by translating images with low illumination features into sufficient illumination features after LigherGAN has been well-trained by the unpaired dataset, shown in Figure  1b.

Unpaired Dataset
As above-mentioned, in the illumination enhancement task, especially when researchers want to collect images in reality instead of adjusting brightness through algorithms, paired data is hard to collect [32]. So, in this paper, unpaired dataset is applied, two UAV imagery sub-datasets are collected, sufficient illumination sub-dataset . It is worth noting that there is no labeling information in both sub-datasets. Objects found in images included nearly everything in common urban areas, such as buildings, plants, streets, people, cars, grass, water, etc. The images in the dataset are collected by ultra-low altitude (flight altitude is 0-100 m) or low altitude (flight altitude is 100-500 m) UAV, the spectral range of the sensor is between 450 and 650 nm, and the radiometric depth is 8-bits, meanwhile, the dataset also contains images collected from websites [33][34][35].
Considering the authenticity of the enhanced image, ideally, the neural network may have sufficient comprehensive ability concerning what the objects in the image should be like in the reality and, using intelligible information to perform the illumination enhancement with image translation. Furthermore, the underlying relationship between two kinds of scenes (low illumination and sufficient illumination) has to be found out carefully. In LighterGAN, a wiser discriminator is needed to judge the authenticity of the enhanced image, as well as a more cunning generator to make an enhanced image as realistic as possible. Hence, details of an image have to give the neural networks as much information as possible. Aiming to guarantee the diversity of dataset content, the multiformity of scene depth in the collecting of dataset is considered. There are three image types, close shot, medium shot, and panoramic shot. Furthermore, for the low illumination images collecting, various exposure time has set from 0 to 2 s, to make sure that the image semantics are identifiable.
In the design of both sub-datasets, because the quantity of images in our dataset is significantly larger than other illumination enhancement datasets, traditional data augmentation methods such as image reversal and image stretch are not applied. This also guarantees the authenticity of the enhancement to a certain extent.

Normalizations
To prevent the influence of affine transformation, geometric transformation, and also speed up the gradient descent, normalization methods have to be used. There are two common normalizations, instance normalization and batch normalization. Each image has several channels in a convolution layer of a convolutional neural network (CNN), batch normalization refers to the same channel for each of feature maps to perform the normalization operation together. Instance normalization means that a single channel of a single image is individually subjected to achieve normalization operation.
Batch normalization is suitable for discriminative models, such as image classification [36] because it focuses on normalizing each batch to ensure the consistency of the data distribution. The result of the discriminant model depends on the overall distribution of the data; however, batch normalization is not suitable for all applications for two reasons:

•
It is sensitive to the batch-size. Because mean and variance are calculated on one batch, if the batch size is too small, the calculated mean and variance will not be enough to represent the entire data distribution.

•
The statistics involved in the calculation of the normalization output of a particular sample will be affected by other samples in the batch.
Instance normalization is suitable for generative models, particularly image translation which will be used in our image enhancement algorithm [37]. Because the results of image generation mainly depend on an image instance, normalizing the entire batch is not suitable for image translation tasks. Using instance normalization in image translation could not only accelerate model convergence but also maintain independence between each image instance.
where x is the sample image, n is the number of samples, c represents the channel, h is height, w is weight, µ nc is the means of x in one channel, σ nc (x) 2 is the variance of a certain channel, is a constant added to increase training stability, and y nchw is the normalization result. For the reasons above, instance normalization was used in the 2nd, 3rd, and 4th convolutional layers in the discriminator, the 1st and 2nd deconvolutional layers in the decoder, and also used in the residual block.

Network Structure
After the creation of the generative adversarial network, the original structure has been modified by many scholars and used to implement their own applications. Generally, when the GAN structure reaches a reasonable dynamic equilibrium, a stricter discriminator will compel the generator to provide generated images close to the real, making it difficult for the discriminator to make a judgment. The image enhancement is the basis of subsequent algorithms, and the authenticity of the enhanced image will also directly affect the performance of subsequent image processing algorithms. So, the advantages of better-performing generators and discriminators are obvious.
Generally, the factors that need to be referenced in the initialization of a neural network structure are: The function of the neural network; • Strength of computing resources; • Existing network structure with reasonable or excellent performance that could be modified for aiming application; Experimentally, the performance of the initial network structure has evaluated. When a reasonable structure has found, it optimized manually to fine-tune the depth of channels, transfer functions, normalization methods, etc.

Markovian Discriminator (PatchGAN)
Generally, a discriminator of a GAN will output a certain number (0 for fake, 1 for real) for an image; however, features in an image depend on pixels and their neighboring area, i.e., pixels far away have less relevance [38]. For an image enhancement model pursuing authenticity, it has to make a certain judgment for each part of the image, so a patchbased discriminator is applied. This discriminator takes advantage of a convolutive neural network and can process each image block independently in the same way; therefore, each result in the judgment matrix that is output by the discriminator is the result of performing individual judgment of each patch-this design makes a more reasonable judgment.
In the structure of discriminator, there are 3 down-sampling convolutional layers with the rectified linear units (ReLU) transferring a 256 × 256 × 3 input image x into a 32 × 32 × 1 feature map. For the high-frequency information, only the local patches are the focus of the image. The instance normalization is used to map data to the range { 0 ∼ 1}, which optimizes the efficiency of model training while avoiding gradient vanishing and exploding. At the last CNN layer, a 1 × 1 convolution is used to achieve dimensionality reduction [39], from 32 × 32 × 512 to 32 × 32 × 1, and achieve cross-channel information interaction and nonlinearity increase. An attention model is added at the 2nd and 3rd convolutional layers.

Autoencoder
To perform feature extraction of an image and efficient illumination enhancement mapping, hourglass-like convolutional autoencoder architecture (shown in Figure 2) is applied. As a typical unsupervised learning model, the autoencoder contains two major parts: an encoder and a decoder [40]. The encoder maps the input low-illuminance sample x to the feature space z, the decoder maps z (the abstract features) to the sufficient illumination feature space to obtain the sample y (illumination enhanced image). Convolutional neural network and deconvolutional neural network are used in the structure of the encoder and the decoder, respectively. A CNN (convolutional neural network) is a multilayer Hubel-Wiesel structure constructed by imitating the information processing process of the human optic nerve system [41]. One of the advantages of CNN is that it can avoid the complicated pre-processing of images. A CNN can also simulate the multi-level structure of the visual signal processing system in a human brain to establish a mapping from low-level features to high-level semantics, so as to achieve data hierarchical feature extraction.
When the classical neural network model is adopted (fully connected neural networks), the computer needs to read the whole image as the input of the neural network model. When the size of the image is large, its parameter scale and training computation will become huge. It is worth noting that human cognition of the world is generally to recognize certain parts first and then gradually recognize the whole [42]. Similarly, in the image spatial connection, pixels within a local range are closely related to each other, while those at a distance are weakly related. Each neuron in a neural network does not need to perceive the global image. It only needs to perceive one certain part and the integration of the local information can obtain global information. This is the receptive field mechanism of convolutional neural networks.
In the autoencoder structure, ReLU (the rectified linear unit) is used as the activation function to perform nonlinear mapping on the output of the convolutional layer [43], it can improve the neural network's perception of item features in sufficiently illuminated natural light images. In this study, one encoder was trained for both types (low/sufficient illumination) of images and decoders were trained for Mapping A and Mapping B, respectively. Through this end-to-end structure, low-to-sufficient and sufficient -to-low image illumination feature transform will be realized.

Encoder
The encoder is designed to generate multi-channel feature maps for a low illumination image and increases the nonlinear relationship between pixels and regions and realizes the extraction of image semantics. The input image is 256 × 256 × 3, and after a padding process, the input image will be transformed to 64 × 64 × 256 by using three convolution layers. To ensure the integrity of the image information and increase the depth of the neural network, 9 residual blocks are added after the convolution layers. This skip-connecting structure solves the problem of information loss caused by the increase of network depth and also helps to solve the problem of gradient vanishing and gradient exploding. This structure makes our model deeper in layers while ensuring reasonable performance. Finally, the input image was mapped to a high-dimensional space by the encoder, where the output feature map was 64 × 64 × 256.

Decoder
After using the encoder to map the input low-light image to the high-dimensional feature space, the decoder will reformat the 64 × 64 × 256 feature maps produced by the encoder into an enhanced sufficient illumination image. This up-sampling operation is completed by the deconvolution layer. In the operation, each point in the feature map (low resolution, high-level descriptions) will be used to reconstruct a region in the output image-it can be described as translation. Two deconvolution layers are designed to fix the output size (256 × 256 × 3). An attention module is used before the first deconvolution layer to make key semantic translation more prominent. Instance normalization is also used to improve translation results.

Attention Mechanism
In LighterGAN, the input for mapping is an image whose information is abstract and complex; therefore, a suitable mechanism is needed to allocate computing resources to something more important. There are 4 types of attention mechanisms: item-wise soft attention, item-wise hard attention, location-wise soft attention, and location-wise hard attention [44]. Item-wise attention focuses on list input (e.g., NLP (natural language processing)), whereas the location-wise attention is designed for feature map input. Location-wise attention is designed by mimicking human visual system characteristics (scan image globally to obtain the target area that needs to be focused, and then invest more attention in this area to obtain more details), Similarly, location-wise soft attention transforms the entire feature map so that the part of interest can be highlighted. Compared with the locationwise hard attention (non-differentiable), which discretely selects a sub-region from the input feature map as the final feature, the location-wise soft attention (differentiable) [45] can calculate the gradient through the neural network and learn the weight of attention through forward and backward propagation, which is more suitable for our requirements.
The weight of the feature map has been re-adjusted, and the important part became more prominent by letting the deep neural network learn the areas that need attention in the picture via network training. This design of the attention mechanism is inspired by the non-local neural network [10]. The attention mechanism is implemented by an additional neural network connected to the original neural network. The entire model is still end-to-end so the attention module can be trained simultaneously with the original model. The principle formula is shown as: where i is the index of an output position (in space, time, or spacetime) whose response is computed and j is the index that enumerates all possible positions. x is the input signal (image, sequence, video; often their features) and y is the output signal of the same size as x. A pairwise function f computes a scalar (representing relationship such as affinity) between i and all of j. The unary function g computes a representation of the input signal at the position j. The response is normalized by a factor C(x). For the presentation of the similarity between x i and x j , the embedding Gaussian function is used: In this formula, θ(x i ) = W θ x i and ϕ x j = W ϕ x j are two embeddings, W θ and W ϕ are weight matrices that to be learned. Then the nonlocal block could be defined as: where y i is given in Formula (4), W z is initialized as zero. Significantly, x i is used as a residual connection, to keep information transmission. In this paper, the attention block is shown in Figure 3. In the block, to achieve the attentional behavior, first, 3 feature maps copies were created, f , g, and h (shape of their tensors are shown in Figure 3), f and g reduced the number of channels to one-eighth of the original to reduce the computation cost, then f multiplied the g transpose, then the SoftMax operation is used to generate the output attention map. Then, the map is multiplied with h. In the end, the input feature map is added to generate the attention block output, which obtained the shape as same as the input. In order to better explore the improvement of attention mechanism module on enhanced image visual perception and no-reference image quality assessment, a comparative experiment is designed in Section 4.

Model Outline
In the outline of LighterGAN, with the purpose of extracting particular characteristics in both unpaired sub-datasets and to know how these characteristics could be used to enhance the brightness of urban-wise low illumination images while ensuring authenticity. Two major mappings inside two circular structures are designed, shown in Figures 4 and 5:   x comes from sufficient illumination sub-dataset X and y comes from low illumination sub-dataset Y, A(x) and B(y) are the translated low/sufficient illumination image produced by the Mapping A and Mapping B, both are used in the training of sufficient illumination image discriminator D A and low illumination image discriminator D B . Then, A(x) and B(y) are used in the Mapping B and Mapping A, respectively, the image generated in this step could be described asx = B(A(x)) andŷ = A(B(y)), which is then produced by the decoder for the calculation of cycle-consistency loss.
In our task, Mapping B will be used to achieve illumination enhancement, discriminator D A will judge the authenticity. However, it is worth emphasizing that, as a part of the circular structure, the prerequisite for the excellent performance of Mapping B is that each part of the entire structure is well-performing, and it cannot be separated from the overall structure. Therefore, in the training phase, careful training was conducted on Mapping A and discriminator D B . In the experimental section, in order to prove the generalization ability of the LighterGAN, Mapping A will be used to generate low illumination images for visual subjective evaluation and the evaluation of NR-IQA.

Adversarial Loss
The role of the discriminator D A was to make judgement between the sufficient illumination image B(y) enhanced by Mapping B and the real sufficient illumination images in sub-dataset X, discriminator D B aims to make judgement between the low illumination image A(x) which is generated by Mapping A and the real low illumination images in sub-dataset Y, the adversarial loss which content discriminator can be described as: where the data distributions are denoted as x ∼ pdata(x) and y ∼ pdata(y). In the training of the neural networks, to distinguish real one from generated one, the discriminator is expected to approach infinitely close to M ones when recognizing a real image, and to approach infinitely close to M zeros when recognizing a generated image. Furthermore, a more stringent discriminator also benefits from the performance of the generator (to produce the enhanced images realistically).
In the loss function, the L2 distance between the discriminator output and the 32 × 32 × 1 matrix M ones was calculated (each value in the matrix is 1) for the real image, while also calculating the L2 distance between the 32 × 32 × 1 matrix M zeros for translated images. The total loss function of the discriminator is represented by the addition of the loss functions above. It is expected that the value of the loss function will continually minimize and converge to a reasonable value approaching 0 by the end of the training.

Cycle-Consistency Loss
First, the original cycle-consistency loss is calculated referencing to the CycleGAN. Two types of cycle (sufficient illumination and Low illumination) generated imageŝ x = B(A(x)) andŷ = A(B(y)), which are used to calculate the L1 distance between two kinds of real images. In the sufficient illumination cycle, the Loss cycx (X, B(A(x))) is calculated, while Loss cycy (Y, A(B(y))) in a low illumination cycle. Then, the cycle-consistency loss Loss cyc (A, B) could be described as the addition of the two losses above.

Semantic Consistency Loss and Model Overall Loss
In LighterGAN, it has to achieve semantic consistency in the enhancement process, which means translating images while ensuring that the scenes in the enhanced images still maintain their original characteristics in reality (e.g., a car in a low illumination image will not be wished to be translated into a boat in the sufficient illumination). Therefore, in LighterGAN, we propose semantic consistency loss Loss c (A, B). As known, a CNN can extract image semantics and this work has been achieved by the down-sampling convolutional layers in the encoder. So, the L2 distance between two kinds of encoder results Loss cx (X, B(A(x))) was calculated for x andx, Loss cy (Y, A(B(y))) for y andŷ, then, semantic-consistency loss Loss c (A, B) could be described as the addition of the two losses above.
Loss c (A, B) = Loss cx (X, B(A(x))) + Loss cy (Y, A(B(y)) In order to demonstrate the improvement of the illumination enhanced image quality by adding semantic consistency loss function, as well as proving the improvement of attention mechanism, we used no-reference image quality assessment (NR-IQA) to compare the images before and after adding semantic consistency loss in the comparative experiment in Section 4.
In the training of the generator, the loss function Loss cd (A, B) is also added to determine whether the images were realistic enough. With this purpose, D B (A(x)) and D A (B(y)) are used to calculate the L2 distance with matrix M ones respectively, this loss could be described as follows: Then, Loss G sum (A, B, D A , D B ), the overall loss of this model, could be expressed as the addition of three losses motioned above.

Training Details
In the training strategy and environment, the most significant part has to be the selection of optimization methods and the learning rate. An excessively low rate will consume a large amount of time and an excessively high learning rate would lead to high fluctuation in the convergence curve, even for nonconvergence [46]. We applied adaptive moment estimation by using AdamOptimizer [47], with an initial learning rate set to λ = 0.0002 and the exponential decay rate for the 1st moment estimation set to 0.5. The model was built and trained by using TensorFlow v1.8 [48], the training equipment was a single-core GeForce RTX2080ti with 11 Gb of graphics memory, the environment was ubuntu 16.04 with Intel Xeon E5-2678 v3. The number of epochs was set as 140.

Visual Subjective Evaluation and NR-IQA
As mentioned in Section 2, the enhancement of low illumination UAV aerial photography captured image should not only focus on the enhancement performance, but also on image authenticity and quality. In this comparative experiment, to evaluate the quality of images which were enhancement by different enhancement algorithms, 44 images were randomly selected from the test set, and then, a combination of subjective and objective evaluation was used. In terms of subjective evaluation, statistical analysis of the human visual perception evaluation was used. In terms of objective evaluation, because traditional methods such as PSNR and SSIM were not suitable for the evaluation of LighterGAN, the score of perception based image quality evaluator was used as an objective evaluation standard. In both evaluation methods, the performance of LighterGAN was compared with LIME, DUAL, Retinex, the CycleGAN (trained by same dataset as LighterGAN) and the EnlightenGAN. LIME and DUAL are two traditional computer vision methods that have performed well recently, the former is based on the illumination map, the latter is based on exposure correction. Retinex is an instructive classic method and the EnlightenGAN is a cutting-edge GAN based deep learning illumination enhancement method. CycleGAN is the reference model, moreover, in image enhancement performance and generalization capabilities estimation, CycleGAN can be used to reflect the benefits of the improvements which are proposed in this article.
In detail, CycleGAN was trained on the same dataset. We trained the model and generated the images by using the sequential and simultaneous adoption of improvements. In reproductions of LIME, DUAL, Retinex and EnlightenGAN, the default settings suggested by the authors were used. In order to observe the performance of image enhancement more intuitively, partial comparison experiment results are shown in Figure 6. In the statistical analysis of the human subjective evaluation, 135 participants ranked the results of 6 different image enhancement algorithms in a total of 44 ranking questions based on brightness and authenticity, furthermore, we scored the different algorithms in each question by weighting the ranks, as shown in the following formula: Here i was the ranking, W i was the weight corresponding to the ranks, W i ∈ {1, 2, 3, 4, 5, 6}, Num i was the number of persons who given this rank, N was the number of participants. For 44 randomly selected urban-wise low illumination images, scores of each enhancement algorithms in 44 ranking question were calculated according to the criterion that the higher the rank, the greater the weight. With this method, the average score of each algorithm in visual subjective evaluation were calculated, the results were shown in Table 1. In the NR-IQA evaluation, to demonstrate quantitative comparisons of mentioned enhancement algorithms, the PIQE was used to score the 44 test images, the results were shown in Table 2. In this evaluation, it is worth emphasizing that, the lower the PIQE score, the better the image quality. It could be found that among the mentioned algorithms, the rank of PIQE scores and the rank of visual evaluation results were almost the same, especially in the ranking of the three algorithms with lower scores (LighterGAN, EnlightenGAN and CycleGAN). In the test set images comparative experiments, LighterGAN has achieved the best results in visual perception subjective judgment and NR-IQA objective quantitative judgment, furthermore, the results could illustrate that PIQE could objectively evaluate the performance of the illumination enhancement algorithms.

NR-IQA Evaluation of Sub-Dataset Y
Because LighterGAN used unsupervised learning and an unpaired dataset, meanwhile, it could be found from the test images that the model did not show the phenomenon of model collapse and overfitting, there were no unique correct results for the enhancements in the low illumination sub-dataset Y, moreover, the result of each low illumination image in the sub-dataset was in an unknown state before being enhanced. Therefore, in order to evaluate the performance of image enhancement more comprehensively and objectively, we have further expanded the scale of the number of images and designed a comparative experiment for all images of the sub-dataset Y for this purpose.
In Section 4.1, the objectivity of PIQE to the quantitative evaluation of the enhancement algorithm has been proven through the test set images subjective and objective comparison experiments results. Therefore, PIQE was also used to evaluate the improvement benefit from the attention mechanism in illumination enhancement. In this comparative experiment, LighterGAN without attention mechanism was trained and joined the quantitative evaluation.
From the PIQE evaluation of all 2000 images in the sub-dataset Y, it could be found that the average PIQE score of LighterGAN was 12.37, it still achieved the best result, the following two were still EnlightenGAN and CycleGAN. Comparing with the comparative experimental results using the test set, it could be found that three deep learning based algorithms (LighterGAN, EnlightenGAN and CycleGAN) still achieved better results, and the rank of these three algorithms were also consistent with the previous experiment. Moreover, from the results, it was worth to highlight that the attention mechanism has indeed achieved an improvement in image quality, in Table 3, compared with the LighterGAN without attention, the average PIQE score of LighterGAN has been reduced by 0.44, and the improvements brought by semantic consistency loss was also show by the difference between the scores of LighterGAN without attention and CycleGAN in the table.

Comparison of Generalization Ability with CycleGAN
As the predictive ability of a model for unknown data, generalization ability was an important criterion for evaluating deep learning models. Therefore, as an improved CycleGAN model, the rationality of the improved strategies in LighterGAN should be evaluated by using generalization ability. Because LighterGAN used an unpaired dataset, in the model training, the enhanced image was not provided with the only correct image. On the contrary, the model needed to provide a conjectural result after fully learning the features in the two sub-datasets which were collected from real scenarios (sufficient illumination subset X and low illumination subset Y). Therefore, in terms of the visual subjective evaluation, the authenticity of the enhanced image could be used as the standard, in addition, in terms of objective evaluation, PIQE was also used as NR-IQA method.
In the generalization ability contrast experiment, 99 images (including 55 sufficient illumination images and 44 low illumination images) in the test set were used. The same as the operation of CycleGAN, Mapping A and Mapping B in the cycle structure of Lighter-GAN were used to generate images respectively. In order to observe the performance of image enhancement more intuitively, partial comparison experiment results are shown in Figure 7. In the visual subjective evaluation, the average probability of the images generated by the two models being selected as the more authentic one was calculated and recorded separately. In the objective evaluation, PIQE was used to calculate the scores of the two models in 99 test set images and all images in the unpaired dataset (contained sub-dataset X and sub-dataset Y) and list them in Table 4.  In the experimental results, LighterGAN was 6.66 percentage points higher than CycleGAN in subjective evaluation. In the objective evaluation, the PIQE scores of test set images and full dataset images were 3.81 and 4.68 less than CycleGAN respectively.

Discussions
As an improvement to the original CycleGAN structure, the performance of the attention mechanism in improving the quality of illumination-enhanced images has been quantified from the experimental results using PIQE scores. In order to further explain the reasons for the decrease in PIQE scores in Table 4, representative images were selected and listed in Figure 8. In terms of prediction performance, as a GAN based image translation model, authenticity and quality of the image after translation is benefits from the predictive ability of the model. For LighterGAN, the predictive ability can be reflected in the objective evaluation score of the image and visual intuitive experience. From the perspective of PIQE, shown in Table 3, the score of LighterGAN decreased by 0.44 benefit from the attention mechanism. Although the gap of score was not significant, it has improved the visual performance of image quality indeed, especially in the edge area near the scene. In examples in Figure 8, it could be found that after adding attention mechanism, the puzzling ripple phenomenon which should not exist in reality was alleviated to a large extent, therefore, the authenticity of enhanced image could be considered as being improved.
Compared with the attention mechanism which has little effect on PIQE scores, it could be seen from Table 3 that LighterGAN (without attention) is 4.56 lower than CycleGAN after adding semantic-consistency loss, which could be objectively identified as the most critical factor to improve image quality. Therefore, the loss calculation in the feature space is more beneficial to improve the quality of the LighterGAN generated image and the generalization ability of the model.
In the urban-wise UAV low illumination images, there are many interferences caused by lamps and reflected light from water or road surface which are affecting the image readability. From the results of the visual perception, it could be observed that LighterGAN can more effectively alleviate the influence of the light which reduces the readability of the image, this advantage could improve the potential of the image in subsequent processing. In terms of generalization ability, the advantage of LighterGAN was proved by both subjective evaluation and objective evaluation, which can be considered as an improvement brought by the combination of attention mechanism and semantic consistency loss.
In terms of disadvantages, LighterGAN could currently only process UAV images with limited resolution and due to the limitation of processing efficiency, the images cannot be processed real-timely. With the rapid development of image sensors carried by UAVs, real-time image enhancement technology for ultra-high resolution images will be a challenge. In the future, we plan to further optimize the network to achieve real-time image processing with higher resolution and to try to use the circular structure to achieve data augmentation.

Conclusions
Images captured by UAV aerial photography can be used by various computer vision algorithms to accomplish intelligent and complex urban observation or monitoring tasks, the premise of their performance is the satisfaction of image readability. In this article, we proposed LighterGAN, an image illumination enhancement method based on generative adversarial network, which could be used to relieve the frequently encountered image degeneration (the insufficient illumination and light pollution). Compared with five classic or deep learning based image enhancement algorithms, LighterGAN was found to have the best performance for visual perception and PIQE score. In the comparative experiment using the test set and the sub-dataset Y, the illumination enhancement performance of LighterGAN has exceeded the state-of-the-art EnlightenGAN and CycleGAN in both subjective and objective evaluations. Moreover, in the performance evaluation of structural improvements, the combination of authenticity subjective evaluation and objective PIQE assessments also demonstrates the enhanced generalization ability of LighterGAN over CycleGAN by using the images generated by two mappings.