LAE-GAN-Based Face Image Restoration for Low-Light Age Estimation

Age estimation is applicable in various fields, and among them, research on age estimation using human facial images, which are the easiest to acquire, is being actively conducted. Since the emergence of deep learning, studies on age estimation using various types of convolutional neural networks (CNN) have been conducted, and they have achieved good performance, as clear images with high illumination were typically used in these studies. In reality, however, human facial images are often captured in low-light environments. Age information can be lost in facial images captured in low-illumination environments, where noise and blur generated by the camera further reduce age estimation performance. No study has yet been conducted on age estimation using facial images captured under low light. In order to overcome this problem, this study proposes a new generative adversarial network for low-light age estimation (LAE-GAN), which compensates for the brightness of human facial images captured in low-light environments, and a CNN-based age estimation method in which the compensated images are used as input. When the experiment was conducted using the MORPH, AFAD, and FG-NET databases, which are open databases, the proposed method exhibited more accurate age estimation performance and brightness compensation in low-light images compared to state-of-the-art methods.


Introduction
A human face contains biological information showing various attributes, such as identity, age, gender, emotions, and expressions. Numerous researchers have studied face recognition [1,2], facial expression recognition [3], gender classification [4], facial skin assessment [5], and age estimation [6] by analyzing such information. Specifically, age estimation has a wide range of applications in commercial areas, such as customer prediction and preference surveys according to age, security for controlling access based on age, and statistical fields such as age surveys of an audience [6]. However, age estimation using human facial images entails several problems, including the uncontrollable, natural aging process, individual aging patterns, and large inter-class similarity and intra-class variation of subjects' images within age classes [7]. For overcoming these drawbacks, image representation techniques such as the active appearance model (AAM) [8], the active shape model (ASM) [9], and the aging pattern subspace model (AGES) [10], as well as feature extraction techniques such as Gabor filters [11], linear discriminant analysis (LDA) [12], and local binary patterns (LBP) [13], have been used in the past. Multi-class classification, regression, and hierarchical approaches are then applied to the represented images and extracted features for age estimation [14]. However, since the emergence of deep learning, where feature extraction and learning are both involved in the process, using a convolutional neural network (CNN) has become popular in age estimation.
Previous studies on age estimation used clear facial images taken during the daytime with high illumination. However, in reality, most images are captured in low-light environments. The main contributions of this study are as follows:

• It is the first study on age estimation considering low light;
• Without separately applying pre-processing to low-light facial images, images are enhanced using LAE-GAN, which is proposed in this study;
• In LAE-GAN, the identity information of the input data is preserved by removing the input random noise vector used in a conventional conditional GAN and adding an L2 loss function to the generator. Furthermore, the high frequency information of the input image delivered through skip connections is reinforced using a leaky rectified linear unit (ReLU) in the 6th and 7th decoder blocks of the generator, and a ReLU is used in the 4th convolution layer of the discriminator;
• Through [20], the trained LAE-GAN and CNN for age estimation are disclosed so that their performance can be fairly evaluated by other researchers.
This paper is organized as follows. In Section 2, previous studies on low-light image enhancement and facial image age estimation are analyzed and compared with the proposed method. In Section 3, the LAE-GAN proposed for low-illumination facial image enhancement and CNNs for facial image age estimation are explained. In Section 4, the results of the experiment conducted using the method proposed in Section 3 are comparatively analyzed and discussed. Lastly, Section 5 proposes conclusions.

Related Works
Age estimation using human facial images is performed by extracting features based on the length, depth, and number of wrinkles, which change over time due to aging and skin condition [21]. Therefore, age estimation involves a feature extraction step and an age learning step for learning ages based on the extracted representation. For feature extraction in previous studies, image representation techniques such as AAM [8], ASM [9], and AGES [10], as well as Gabor filters [11], LDA [12], and LBP [13], were applied; multi-class classification, regression, and hierarchical approaches were taken for age learning. Recently, however, methods using a CNN, where feature extraction and age learning proceed end-to-end, are more commonly used. Table 1 presents previous studies on age estimation in which deep learning was employed. In previous studies, the mean absolute error (MAE) was used to evaluate the accuracy of age estimation. MAE is the mean absolute error between the estimated and ground-truth ages, and a detailed description of MAE can be found in Equation (14) of Section 4.3.
A study [22] proposed a simple CNN consisting of six layers: three convolutional layers, two pooling layers, and one fully connected layer. The dimension of the extracted features was compressed with principal component analysis (PCA), and age learning was performed using a support vector machine (SVM). A study [23] proposed a CNN consisting of three convolutional layers, two fully connected layers, and one output layer. Most age estimation using a CNN involves a shallow CNN; subsequently, a study [24] improved the performance by fine-tuning deep networks such as visual geometry group (VGG)-16 [25] on large databases such as IMDB-WIKI and ImageNet. In a study [26], a network consisting of the following three steps was proposed for mitigating the learning restricted by a dataset during age estimation: first, data are classified into age groups using an age group classifier; second, age is estimated using the mean value within the age groups; third, errors are revised using the predicted age. In one study [27], a method was proposed for predicting age based on rankings across an ensemble of sub-networks, each predicting a single age label as a binary output. Likewise, most previous studies conducted age estimation based on various types of databases and networks.
However, no study has examined age estimation considering low light, which frequently occurs in practice. Noise and blur are generated when acquiring low-illumination images in general, which leads to performance degradation in various computer vision fields that rely on human facial images. For enhancing these low-illumination images, the methods employed in previous research can be classified into image processing-based techniques such as histogram equalization methods and Retinex filtering, machine learning-based techniques, and deep learning-based techniques [19]. Well-known cases of image processing-based techniques for improving low illumination in facial images are as follows. In a study [43], a method was proposed for enhancing illumination imbalance using discrete cosine transform (DCT) low frequency coefficients after applying histogram equalization to facial images. In a study [44], adaptive region-based image processing was suggested for compensating low-illumination images that appear differently depending on various lighting conditions. After partitioning an image into various regions according to lighting conditions, contrast and edges were used in adaptive region-based histogram equalization. In a study [45], a selective illumination enhancement technique (SIET) was proposed for enhancing low-illumination facial images. SIET was utilized for improving changes in facial images due to the effects of non-uniform illumination; dark regions were isolated and compensated with a correction factor determined based on an energy function to enhance illumination. Image processing-based techniques were more commonly used for conventional low-illumination image enhancement than any other techniques. Several studies have been conducted in recent years, as interest in machine learning-based and deep learning-based techniques is on the rise [46][47][48]. In a study [46], enhancement networks are proposed for preventing performance degradation of facial images being used for the mobile face unlock feature in low-light environments.
These networks typically consist of a decomposition part for partitioning input low-illumination facial images into face normals and face albedos, and a reconstruction part for enhancing and reconstructing images using spherical harmonic lighting coefficients. In a study [47], a feature reconstruction network was proposed in which raw face images and illumination-enhanced face images were all used in deep learning-based techniques for face recognition in low illumination. A study [48] proposed REGDet, in which a recurrent exposure generation (REG) module for low-illumination enhancement is combined with a multi-exposure detection (MED) module for face detection in low-light environments. These studies on improving low-light conditions are utilized in various fields, not only for facial images. In recent years, GAN-based methods have been actively researched, where the data distribution of low-light input images is converted into the data distribution of high-light target images [49][50][51].
However, no study has examined age estimation considering low-light conditions. This study, therefore, proposes an LAE-GAN-based age estimation method in which low-illuminated face images are enhanced and subsequently used as input to a CNN. Table 2 presents the comparison between the proposed method and previous studies in which low-light facial images were enhanced.

Overview of the Proposed Method
The age estimation method proposed in this study, which is effective for low-illumination facial images, proceeds according to the four steps shown in Figure 1. The first and second steps are pre-processing for age estimation using facial images under low light. In the first step, the face and eye positions are detected in facial images using an adaptive boosting (Adaboost) algorithm [57]. The detected positions become the reference points for aligning facial images in the second step, in order to compensate through in-plane rotation and redefine the face region of interest (ROI). The pre-processing step is explained in detail in Section 3.2. In the third step, the pre-processed facial images are input to LAE-GAN, which has been trained with pairs of low- and high-illumination facial images for low-illumination image enhancement. Finally, the enhanced facial images are input to the trained CNN for age estimation.

Pre-Processing
In general, the facial region is not aligned in captured human facial images, which contain parts without age information, such as the background. Misalignment in facial images affects the age estimation performance [58]. Therefore, pre-processing, as shown in Figure 2, was performed in this study. First, the Adaboost algorithm [57] is used to detect the face region in the image. Within the detected face region, the exact eye positions are detected by designating an exploratory region where the eyes may be located. The detected positions of the face and eyes are shown in Figure 2b, and they are used for the redefinition of the ROI and in-plane rotation compensation. Here, Equation (1) is used to estimate the in-plane rotation angle, compensation proceeds based on this angle and bilinear interpolation, and the ROI of the human facial image is then redefined with respect to the center of both eyes to remove the background. In Equation (1), R_x and R_y are the horizontal and vertical positions of the right eye, while L_x and L_y are the horizontal and vertical positions of the left eye. The pre-processed image has a size of 256 × 256 × 3, as shown in Figure 2c.
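Equation (1) itself is not reproduced above; the sketch below is a minimal illustration of this pre-processing step, assuming the standard eye-angle formulation θ = arctan((R_y − L_y)/(R_x − L_x)) and an illustrative ROI margin (the function name is hypothetical):

```python
import cv2
import numpy as np

def align_face(img, right_eye, left_eye, out_size=256):
    """In-plane rotation compensation and face ROI redefinition.

    right_eye and left_eye are (x, y) coordinates from the Adaboost-based
    face/eye detector; the rotation angle below is the standard
    formulation assumed here for Equation (1).
    """
    (Rx, Ry), (Lx, Ly) = right_eye, left_eye
    # Estimated in-plane rotation angle between the two eye centers.
    theta = np.degrees(np.arctan2(Ry - Ly, Rx - Lx))
    # Center of both eyes: the reference point for the redefined ROI.
    cx, cy = (Rx + Lx) / 2.0, (Ry + Ly) / 2.0
    # Rotation compensation using bilinear interpolation.
    M = cv2.getRotationMatrix2D((cx, cy), theta, 1.0)
    rotated = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]),
                             flags=cv2.INTER_LINEAR)
    # Crop a square ROI around the eye center to remove the background
    # (the margin factor is illustrative) and resize to 256 x 256 x 3.
    half = int(1.2 * max(abs(Rx - Lx), 1))
    y0, x0 = max(int(cy) - half, 0), max(int(cx) - half, 0)
    roi = rotated[y0:y0 + 2 * half, x0:x0 + 2 * half]
    return cv2.resize(roi, (out_size, out_size),
                      interpolation=cv2.INTER_LINEAR)
```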

Enhancement of Low-Illuminated Face Image by LAE-GAN
This study proposes a method for compensating a low-illuminated face image using LAE-GAN for age estimation, which is effective under low-light conditions. A conventional conditional GAN [59] performs adversarial learning on pairs of input and target images. It consists of a generator, which outputs a generated image I_Out by receiving a random noise vector z and an input image I_In, and a discriminator, which distinguishes between real and fake images by receiving I_In together with either I_Out or the target image I_Target as input. In adversarial learning, the generator tries to deceive the discriminator by generating a realistic image I_Out, while the discriminator tries to distinguish the generated image I_Out from the target image I_Target. The generator has an encoder-decoder structure: the encoder extracts the features of the input image I_In, and the decoder maps the patches corresponding to the extracted features. Such learning converts the data distribution of I_In to the distribution of I_Target using the loss function shown in Equation (2) below, where G is the generator, D is the discriminator, log is the decimal logarithm, and E is an expected value (mean value).
L_cGAN(G, D) = E_(I_In, I_Target)[log D(I_In, I_Target)] + E_(I_In, z)[log(1 − D(I_In, G(I_In, z)))]   (2)

This study proposes LAE-GAN for compensating a low-illumination facial image to a corresponding high-illumination facial image. In the study [59], the random noise vector z allows image transformation to be easier and more diverse. In this study, however, the random noise vector z simply acts as noise when compensating the low-illumination facial image I_In to the high-illumination facial image I_Out. Therefore, the loss function after removing the random noise vector z is as shown in Equation (3):

L_cGAN(G, D) = E_(I_In, I_Target)[log D(I_In, I_Target)] + E_(I_In)[log(1 − D(I_In, G(I_In)))]   (3)

Due to the nature of the adversarial learning between the generator and discriminator explained above, the generator aims to deceive the discriminator by generating from I_In an image I_Out with a distribution similar to that of I_Target. This tendency can cause training that merely deceives the discriminator rather than following the data distribution of I_Target. Hence, this study adds a new L2 loss function to the generator, as shown in Equation (4), for maintaining the identity of the I_Target image:

L_L2(G) = E_(I_In, I_Target)[ ||I_Target − G(I_In)||_2 ]   (4)
Ultimately, the final loss function used in this study is as shown in Equation (5) below, where λ is the regularization term. The optimal λ was experimentally determined as 0.9, which showed the highest accuracy of age estimation with the training data. arg min_G max_D represents the arguments of the generator and discriminator, which minimize and maximize the loss function, respectively:

G* = arg min_G max_D [ L_cGAN(G, D) + λ·L_L2(G) ]   (5)
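A minimal TensorFlow sketch of the objective in Equations (3)-(5), assuming binary cross entropy for the adversarial terms and λ = 0.9 as stated above (variable names are illustrative, not the authors' released code):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 0.9  # regularization term reported above

def generator_loss(disc_fake_logits, i_out, i_target):
    # Adversarial term: the generator tries to make D label I_Out as real.
    adv = bce(tf.ones_like(disc_fake_logits), disc_fake_logits)
    # L2 identity term of Equation (4): keep I_Out close to I_Target.
    l2 = tf.reduce_mean(tf.square(i_target - i_out))
    return adv + LAMBDA * l2

def discriminator_loss(disc_real_logits, disc_fake_logits):
    # Equation (3): D labels (I_In, I_Target) as real, (I_In, I_Out) as fake.
    real = bce(tf.ones_like(disc_real_logits), disc_real_logits)
    fake = bce(tf.zeros_like(disc_fake_logits), disc_fake_logits)
    return real + fake
```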

Generator
The encoder-decoder structure is one of the networks used for generating images [60,61]. U-net [62] is one of the most commonly used such networks and consists of an encoder for extracting features and a decoder for mapping the patches corresponding to the extracted features; in addition, it has skip connections for preserving the high frequency information of the input image. A skip connection is present between the i-th layer and the (n − i)-th layer of U-net and concatenates the features extracted in the i-th layer to the (n − i)-th layer. Therefore, it preserves the high frequency information of the input image as well as its original shape and detail. A U-net generator was used in this study, and its detailed structure is presented in Table 3 below and Figure 3a.
Each encoder consists of blocks comprised of a convolution layer, a batch normalization layer, and a leaky ReLU layer, excluding the first encoder, since the first encoder block does not include a batch normalization layer. Each decoder consists of decoder blocks comprised of a deconvolution layer, a batch normalization layer, and a ReLU layer, excluding the sixth, seventh, and last decoders. Concatenation occurs from the skip connection after batch normalization. The sixth and seventh decoder blocks emphasize the features of the high frequency information delivered through the skip connection by using a leaky ReLU layer. The deconvolution layer uses transposed convolution, and the last decoder block consists of a tanh function.
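The encoder and decoder blocks described above can be sketched as follows; the 4 × 4 stride-2 (de)convolutions are an assumption in the style of pix2pix, and the exact filter counts of Table 3 are not reproduced:

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, filters, use_bn=True):
    # Convolution -> batch normalization -> leaky ReLU
    # (the first encoder block omits batch normalization).
    x = layers.Conv2D(filters, 4, strides=2, padding='same')(x)
    if use_bn:
        x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(0.2)(x)

def decoder_block(x, skip, filters, leaky=False):
    # Deconvolution (transposed convolution) -> batch normalization,
    # then concatenation with the skip-connected encoder features.
    x = layers.Conv2DTranspose(filters, 4, strides=2, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Concatenate()([x, skip])
    # Leaky ReLU in the 6th and 7th decoder blocks to emphasize the
    # high frequency information carried by the skip connection.
    return layers.LeakyReLU(0.2)(x) if leaky else layers.ReLU()(x)
```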

Discriminator
The discriminator in this study concatenates I_In with either I_Target or I_Out, which are randomly input, and performs feature extraction through convolution layers to generate a feature map of 30 × 30 × 1 in the last layer. The generated feature map can be considered a set of 1 × 1 × 1 grids. Each grid analyzes the local information of a 70 × 70 receptive field instead of global information, so local information that may be lost in a global view is utilized to adequately express the detail and shape of the image. Therefore, such learning can reduce blurry results compared with applying an L1 or L2 loss over the entire feature map; further, the information of the original image can be preserved as much as possible. For maintaining the disposition of the original image and discerning the authenticity of the input image, the discriminator consistently receives I_In as input. The features extracted from I_In express the information that the image must consistently maintain and thus prevent improper learning of the generator during adversarial learning. The detailed structure of the discriminator is presented in Table 4 and Figure 3b.
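A minimal sketch of such a discriminator, assuming a pix2pix-style layout that yields the stated 70 × 70 receptive field and 30 × 30 × 1 output; the filter counts are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator():
    # I_In is concatenated with either I_Target (real) or I_Out (fake).
    i_in = layers.Input(shape=(256, 256, 3))
    i_pair = layers.Input(shape=(256, 256, 3))
    x = layers.Concatenate()([i_in, i_pair])
    # Three stride-2 blocks; each 1 x 1 x 1 output grid ends up
    # covering a 70 x 70 receptive field of the input.
    for k, filters in enumerate([64, 128, 256]):
        x = layers.Conv2D(filters, 4, strides=2, padding='same')(x)
        if k > 0:
            x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.ZeroPadding2D()(x)                 # 32 -> 34
    x = layers.Conv2D(512, 4, strides=1)(x)       # 4th convolution layer
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)                          # plain ReLU here, per the text
    x = layers.ZeroPadding2D()(x)                 # 31 -> 33
    out = layers.Conv2D(1, 4, strides=1)(x)       # 30 x 30 x 1 patch map
    return tf.keras.Model([i_in, i_pair], out)
```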

Differences from the Conditional GAN
The LAE-GAN proposed in this study has the following differences from the conventional conditional GAN [59]:
• A random noise vector was used in the conventional conditional GAN for inducing image transformation, but it has been removed in this study, as it has a stronger negative effect than noise in a 1:1 mapping structure between input data and target data for low-illumination image compensation;
• An L2 loss function was used in the generator to preserve the identifiable information of the input data;
• Leaky ReLU was used in the 6th and 7th decoder blocks of the generator to strengthen the high frequency information of the input image delivered through skip connections;
• ReLU was used in the 4th convolution layer of the discriminator.

Age Estimation
In this study, age estimation was performed by training various CNNs using facial images enhanced by LAE-GAN: VGG [25], which achieved high accuracy in conventional image classification, the residual network (ResNet) [63], and other networks that produced good accuracy in age estimation [25,29,63,64]. Their age estimation performance was compared according to the compensation of low-illumination facial images.

VGG
VGG [25] is a well-known classification network that has achieved high performance on ImageNet and is used or adapted in various age estimation studies [29,64]. In general, classification performance tends to improve in deep learning networks as the depth increases, and the performance of VGG was compared by implementing CNNs of different depths. Filters of size 5 × 5 and 7 × 7 can be replaced with stacked 3 × 3 filters while reducing computational complexity, and the non-linearity of the network was secured by using 1 × 1 convolutions. In this study, age estimation performance was evaluated using VGG-16, which is the most widely known among the various VGG networks.

DEX
In a study [64], a VGG-16-based network (DEX) produced good performance in the age estimation field in the ChaLearn competition. DEX is a VGG-16 network pre-trained on the ImageNet database and fine-tuned using an extensive number of face images from the IMDB-WIKI database. Moreover, instead of estimating age from the probability value of a single class, age was estimated as the sum of the products of each class label and the probability of the respective label, as shown in Equation (6):

Age(X) = Σ_(i=1)^n c_i · p_i   (6)

where X is the input image and n is the number of classes (the age range). Accordingly, c_i and p_i are the label and probability of the i-th class, respectively. As described above, DEX [64] is a VGG-based network with 13 convolution layers and 3 fully connected layers. Like DEX, we used the categorical cross entropy loss [65], as shown in Equations (7) and (8):

f(s)_i = e^(s_i) / Σ_(j=1)^C e^(s_j)   (7)

CE = −Σ_(i=1)^C t_i · log(f(s)_i)   (8)
In Equations (7) and (8), f(·) is the softmax activation function, e represents the exponential function, t is the ground-truth age (as a one-hot class vector), and s is the estimated class score. In addition, C is the number of classes, i denotes the i-th class, and log is the decimal logarithm. An adaptive moment estimation (Adam) optimizer [66] was used in our experiments, whereas DEX adopts a stochastic gradient descent (SGD) optimizer.
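Equations (6)-(8) can be illustrated with a short NumPy sketch; the 0-100 year class range is hypothetical:

```python
import numpy as np

def softmax(s):
    """Equation (7): softmax over the class scores s."""
    e = np.exp(s - s.max())
    return e / e.sum()

def expected_age(s, class_labels):
    """Equation (6): age as the probability-weighted sum of class labels."""
    return float(np.sum(class_labels * softmax(s)))

def cross_entropy(s, t):
    """Equation (8): categorical cross entropy with one-hot ground truth t."""
    return float(-np.sum(t * np.log(softmax(s) + 1e-12)))

# Example with a hypothetical 0-100 year class range.
labels = np.arange(101)
scores = np.random.randn(101)
print(expected_age(scores, labels))
```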

ResNet
ResNet [63] is a prototypical classification network that has achieved high performance on ImageNet. Furthermore, it has been widely used in various age estimation studies, particularly for its unique residual blocks and skip connections. It consists of continuous filters of 1 × 1, 3 × 3, and 1 × 1 sizes, and has a bottleneck structure that reduces and then expands the dimension of a feature map. A weighted sum is applied to the feature maps before and after the residual block to resolve the vanishing gradient problem, and a skip connection is also present for maintaining the identity of the input image. ResNet is a network with various depths depending on the number of residual blocks; in this study, ResNet-50 and ResNet-152 pre-trained with the ImageNet database are used in the experiment.

Age-Net
In a study [29], VGG and Age-Net were used for age estimation, which resulted in excellent age estimation performance in the ChaLearn competition. Training consists of a first step involving VGG and a second step involving Age-Net. First, VGG, pre-trained with ImageNet, is fine-tuned using the MORPH database [67]; then, various open databases are mixed and classified into two types to be trained using the Kullback-Leibler (KL) divergence loss and the softmax loss function. This process creates four fine-tuned models, and a concatenated feature map is generated in the last layer of each model using a distance-based voting ensemble method. Second, Age-Net is trained with various open databases using the KL divergence loss function. VGG and Age-Net have the same output dimension; the average of the two networks was taken as the predicted age if the difference between the two networks was 11 or below, and if the difference was greater than 11, the result of the first network (VGG) was taken as the predicted age.
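As a plain illustration of this fusion rule (the function name is hypothetical; the threshold of 11 follows the description above):

```python
def fuse_predictions(vgg_age, agenet_age, threshold=11):
    # If the two networks agree within the threshold, average them;
    # otherwise fall back to the first network (VGG).
    if abs(vgg_age - agenet_age) <= threshold:
        return (vgg_age + agenet_age) / 2.0
    return vgg_age
```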

Inception with Random Forest
In a study [68], the Inception v2 network [69] was applied with the random forest (RF) for age estimation. Inception v2 is a network that extracts features using convolution filters of various sizes and concatenates the extracted features to ensure the balance between a sparse nature and a dense nature of network training. Features are extracted using Inception v2 pre-trained with various databases as a feature extractor, and RF is used to perform age learning.
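A minimal sketch of this pipeline, with InceptionV3 standing in for Inception v2 (which tf.keras does not bundle) and a random forest regressor assumed for the age learning step:

```python
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestRegressor

# Pooled deep features from the pre-trained backbone feed a random forest.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights='imagenet', pooling='avg')

def extract_features(images):
    # images: (N, 299, 299, 3) float array in the 0-255 range.
    x = tf.keras.applications.inception_v3.preprocess_input(images.copy())
    return backbone.predict(x, verbose=0)

# Dummy stand-ins for the enhanced training images and age labels.
images = np.random.rand(8, 299, 299, 3).astype('float32') * 255.0
ages = np.random.randint(16, 70, size=8)
rf = RandomForestRegressor(n_estimators=100).fit(extract_features(images), ages)
```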

Experimental Data and Environment
In this study, the experiment was conducted using the MORPH [67], FG-NET [70], and AFAD [71] databases. Since open facial databases acquired in low-light environments and containing age information do not exist, the aforementioned open databases were transformed into low-illumination images to proceed with training and testing in this study. The same pre-processing explained in Section 3.2 was applied to the training images to redefine the ROI of the facial images. The pre-processed low-light images and the original images are used as input images and target images for training, respectively. When illumination decreases in the actual environment, pixels with a large brightness value experience significant changes, while pixels with a small brightness value experience relatively smaller changes. For representing such a non-linear nature, a gamma correction [72] technique was applied in this study to generate low-illumination facial images. Original RGB images were converted to HSV images, which consist of hue, saturation, and value channels, expressed as the H, S, and V channels, respectively. Gamma correction was applied to the V channel to decrease the brightness value non-linearly. Blurry images are generated due to the exposure time of a camera in low-light environments, in which noise due to the camera sensor is also generated. For applying these elements, a Gaussian blur was applied to generate a blurry image, while Gaussian and Poisson noises were applied to generate noise in this experiment. Equation (9) below shows the effects used for generating low-illumination facial images.
In Equation (9), I_v is the V channel value of the HSV image, while I_o is the V channel value of the low-illumination image generated as above. S and γ are gamma correction parameters, for which S is 0.06 and γ is 2.5. B_G is the Gaussian blur kernel, for which the standard deviation σ was randomly applied between 1.5 and 2. We selected these values based on previous studies [73,74]. Lastly, N_G and N_P are Gaussian and Poisson noise, respectively. Figure 5 shows examples of the original facial images and low-illumination facial images generated for the experiment. Figure 5c shows the corresponding histogram-equalized images of the low-illumination facial images of Figure 5b. Although the low-illumination images of Figure 5b are difficult to discriminate via the human eye, we can confirm that they retain rough information of the face images, as shown in Figure 5c. Therefore, the algorithm does not estimate age from non-usable/non-visible images.
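A minimal sketch of this degradation pipeline, assuming the stated parameters (S = 0.06, γ = 2.5, σ ∈ [1.5, 2.0]); the composition order and the Gaussian noise magnitude are assumptions:

```python
import cv2
import numpy as np

def synthesize_low_light(img_bgr, S=0.06, gamma=2.5):
    """Generate a low-illumination image as described for Equation (9):
    gamma correction on the V channel, Gaussian blur, and
    Gaussian + Poisson noise."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    v = hsv[..., 2] / 255.0
    # Non-linear darkening: large brightness values change the most.
    hsv[..., 2] = np.clip(S * np.power(v, gamma) * 255.0, 0, 255)
    out = cv2.cvtColor(hsv.astype(np.uint8),
                       cv2.COLOR_HSV2BGR).astype(np.float32)
    # Camera blur: Gaussian kernel with sigma drawn from [1.5, 2.0].
    sigma = np.random.uniform(1.5, 2.0)
    out = cv2.GaussianBlur(out, (0, 0), sigma)
    # Sensor noise: additive Gaussian plus Poisson shot noise.
    out = out + np.random.normal(0.0, 2.0, out.shape)
    out = np.random.poisson(np.clip(out, 0, 255)).astype(np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)
```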
For the experiment, we used a desktop computer, which was equipped with a 3.5 GHz CPU (Intel Core™ i7-3770K) and 24 GB RAM. Windows TensorFlow (version 2.2.0) [75] was utilized for the training and testing procedure. We used an NVIDIA graphics processing unit (GPU) card including 1920 compute unified device architecture (CUDA) cores and 8 GB memory (Nvidia GeForce GTX 1070 [76]). To extract the face ROI, we used the Python program (version 3.5.2) [77] and the OpenCV (version 4.2.0) library [78].

Training of LAE-GAN for Image Enhancement of Low Illumination and CNN for Age Estimation
LAE-GAN, explained in Section 3.3, was used to enhance low-illumination images into high-illumination images, and the various age estimation networks explained in Section 3.4 were used to estimate ages. LAE-GAN was trained with low-illumination images as input images and high-illumination images as target images. As explained in Section 4.1, pre-processed training data were resized to 286 × 286 × 3 and then randomly cropped to 256 × 256 × 3 through online augmentation for training. An Adam optimizer [66] was used during training. The learning rate, beta_1, and beta_2 were set to 0.0002, 0.5, and 0.999, respectively, and training was conducted over 100 epochs. The optimal values of the learning rate, beta_1, beta_2, and the number of epochs were experimentally determined with the training data, which showed the highest accuracy of age estimation.

Figure 6 shows the training loss graphs of the generator and discriminator when LAE-GAN was trained using the MORPH database. Figure 6a shows the loss graph of the generator, and Figure 6b shows the loss graph of the discriminator. In general, when the loss function converges to 0, the training can be regarded as progressing well. The discriminator solves a binary classification problem that discriminates real and fake images, and its network is simple. On the other hand, the generator that enhances the image has a deep network. Therefore, the discriminator has a lower learning complexity than the generator. Consequently, the discriminator loss converges relatively quickly compared to the generator loss, and the converged loss value of the discriminator is usually lower than that of the generator. In this study, by adding the L2 loss, the loss of the discriminator temporarily increases; however, the discriminator loss converges at a similar time to the generator loss. As shown in Figure 6a,b, both the generator and discriminator losses converged, which indicates that LAE-GAN was properly trained.

Subsequently, the CNN was trained for age estimation using the facial images enhanced with the trained LAE-GAN. The various age estimation networks explained in Section 3.4 were used for training. Previously trained networks were fine-tuned, and the training was conducted for 200 epochs. Figure 7 shows the training loss and accuracy graphs of DEX [64], which exhibited the highest age estimation performance. The convergence of the loss function means that the error is reduced, so the accuracy should improve. In Figure 7, as the training loss stably converged and the accuracy stably increased, the network can be considered adequately trained.
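The reported optimizer and augmentation settings can be sketched as follows; this is a minimal illustration under the stated hyper-parameters, not the authors' training loop:

```python
import tensorflow as tf

# Adam settings reported for LAE-GAN training: learning rate 0.0002,
# beta_1 = 0.5, beta_2 = 0.999, trained for 100 epochs.
gen_opt = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
disc_opt = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)

def random_crop_augment(image):
    # Online augmentation: resize to 286 x 286, then randomly crop
    # back to the 256 x 256 x 3 network input size.
    image = tf.image.resize(image, (286, 286))
    return tf.image.random_crop(image, size=(256, 256, 3))
```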


Testing with the MORPH Database
In the first experiment, the image enhancement performances of the LAE-GAN proposed in this study and other state-of-the-art networks were compared. CycleGAN [79], Attention GAN [80], Attention cGAN [81], and conditional GAN [59] were used to compare the illumination enhancement performance with LAE-GAN; the signal-to-noise ratio (SNR) [82], peak signal-to-noise ratio (PSNR) [83], and structural similarity (SSIM) [84] were used for comparing the similarity between the original image and the generated enhanced image. Equations (10)-(13) represent the equations for MSE, SNR, PSNR, and SSIM, respectively. SNR, PSNR, and SSIM values tend to be higher if the similarity between two images is higher.
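Assuming the standard definitions of these metrics, Equations (10)-(13) can be written as follows:

```latex
\mathrm{MSE} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\bigl(I_o(i,j) - I_e(i,j)\bigr)^2 \quad (10)

\mathrm{SNR} = 10\log_{10}\!\left(\frac{\sum_{i,j} I_o(i,j)^2}{\sum_{i,j}\bigl(I_o(i,j)-I_e(i,j)\bigr)^2}\right) \quad (11)

\mathrm{PSNR} = 10\log_{10}\!\left(\frac{255^2}{\mathrm{MSE}}\right) \quad (12)

\mathrm{SSIM} = \frac{(2\mu_o\mu_e + C_1)(2\sigma_{eo} + C_2)}{(\mu_o^2 + \mu_e^2 + C_1)(\sigma_o^2 + \sigma_e^2 + C_2)} \quad (13)
```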
In Equations (10)-(13), I_o is an original image of high illumination and I_e is the generated image, while m and n denote the width and height of the image, respectively. µ_o and σ_o denote the mean and standard deviation of the pixel values of the original high-illumination image, respectively, and µ_e and σ_e denote the mean and standard deviation of the pixel values of the generated image; σ_eo is the covariance of the two images. C_1 and C_2 are positive constants, which make the denominator non-zero.
As shown in Table 5, other methods exhibited better performance than LAE-GAN in SNR and PSNR, whereas LAE-GAN resulted in the best performance in SSIM. However, PSNR and SNR cannot accurately evaluate similarities and differences in human visual perception [85,86]. SSIM, on the other hand, is more suitable for evaluating perceptual similarity, since it is a measurement designed to improve on PSNR and SNR [84]. Accordingly, it can be confirmed that the proposed method resulted in the highest accuracy.

Figure 8 illustrates the images enhanced by the various networks presented in Table 5. Figure 8c shows the corresponding histogram-equalized images of the low-illumination facial images of Figure 8b. Although the low-illumination images of Figure 8b are difficult to discriminate by the human eye, we can confirm that they retain rough information of the face images, as shown in Figure 8c. Therefore, the algorithms are not producing better images from completely random/black inputs. In addition, as shown in Figure 8h, the proposed LAE-GAN successfully transforms the low-illumination facial images of Figure 8b. The LAE-GAN proposed in this study has a more outstanding image enhancement effect compared to the other networks, as shown in Figure 8.
For the next experiment, age estimation accuracy was compared using the various networks explained in Section 3.4 on the images enhanced by LAE-GAN, as shown in Table 6. For evaluating the age estimation accuracy, MAE, the most frequently used measure, is adopted, as shown in Equation (14). A lower MAE value indicates higher age estimation accuracy:

MAE = (1/n) Σ_(i=1)^n |p_i − y_i|   (14)

In the equation, n is the number of images, p_i is the estimated age, and y_i is the ground-truth age.
The experiment results showed that DEX had the best performance in age estimation. The age estimation performance of each network was better than the corresponding performance based on unenhanced low-illumination facial images, as shown in Table 6. Therefore, it can be concluded that the LAE-GAN used in this study is effective in enhancing low-illumination facial images for age estimation.
In Table 6, age estimation performance, or baseline performance, was measured on original images of high illumination and low-illuminated images with or without LAE-GAN using DEX, which had the best performance in Table 6. In each case, DEX was fine-tuned using the training data, and accuracy was evaluated using the testing data.


Table 6. Age estimation accuracy (MAE, years).

Method                                                      MAE
Age estimation using various age estimators with LAE-GAN
  VGG-16 [25]                                               13.99
  ResNet-50 [63]                                            12.83
  ResNet-152 [63]                                           12.76
  DEX [64]                                                  12.46
  AgeNet [29]                                               15.33
  Inception with RF [68]                                    15.01
Age estimation using original facial images or low-illuminated facial images without or with LAE-GAN
  Original                                                  5.8
  Low illumination (without LAE-GAN)                        19.02
  Enhanced by LAE-GAN (proposed)                            12.46

As shown in Table 6, a MAE of 5.8 years was found in the original images of high illumination, whereas a MAE of 19.02 years was found in the low-illuminated images without LAE-GAN. However, the MAE was significantly reduced to 12.46 years when LAE-GAN was used. In Table 6, in the case of "Original" images, we trained and tested with the original dataset. In the case of "Low illumination (without LAE-GAN)", we trained and tested with the low-illumination dataset. In the case of "Enhanced by LAE-GAN (proposed)", we trained and tested with the image dataset enhanced by LAE-GAN. Therefore, these were fair comparisons, since each model was trained on one set of images and its performance was also evaluated on the same set.
For the next experiment, the age estimation performance of LAE-GAN and other state-of-the-art networks was compared. For a fair evaluation, DEX was used as the age estimator in all cases. As shown in Table 6, LAE-GAN had the greatest effect on low-illumination facial image enhancement and age estimation performance improvement. Figures 9 and 10 show good cases and bad cases, respectively, of age estimation when age is estimated using DEX with LAE-GAN. The first and second rows of Figures 9 and 10 are original images and low-illumination images, respectively, and the third rows are facial images enhanced using LAE-GAN. In Figure 9, the images were enhanced to be very similar to the original images, unlike in Figure 10, where high frequency information such as wrinkles and detailed information such as the skin texture of the original images were not adequately restored in the enhanced images. Consequently, a higher portion of bad cases was found when low-illumination images of older individuals were enhanced to appear as images of younger individuals, which resulted in less accurate age estimation.

Testing with the AFAD Database
For verifying the generality of the proposed method, an experiment was conducted using a different open database, the AFAD database. For the first experiment, age estimation accuracy was compared using the various networks explained in Section 3.4 on the images enhanced by LAE-GAN, as shown in Table 7. The experiment results showed that the best performance was exhibited by Inception with RF, unlike on the MORPH database. In Table 7, age estimation performance, or baseline performance, was measured on original images of high illumination and low-illuminated images with or without LAE-GAN using Inception with RF, which had the best performance in Table 7. In each case, Inception with RF was fine-tuned using the training data, and accuracy was evaluated using the testing data.
As shown in Table 7, a MAE of 7.08 years was found in the original images of high illumination, whereas a MAE of 16.10 years was found in the low-illuminated images without LAE-GAN. However, the MAE was reduced to 13.81 years when LAE-GAN was used. Figures 11 and 12 show good cases and bad cases, respectively, of age estimation when age is estimated using Inception with RF with LAE-GAN. The first and second rows of Figures 11 and 12 are original images and low-illumination images, respectively, and the third rows are facial images enhanced using LAE-GAN.

Table 7. Age estimation accuracy (MAE, years).

Method                                                      MAE
Age estimation using various age estimators with LAE-GAN
  VGG-16 [25]                                               14.10
  ResNet-50 [63]                                            16.31
  ResNet-152 [63]                                           14.35
  DEX [64]                                                  14.12
  AgeNet [29]                                               15.17
  Inception with RF [68]                                    13.81
Age estimation using original facial images or low-illuminated facial images without or with LAE-GAN
  Original                                                  7.08
  Low illumination (without LAE-GAN)                        16.10
  Enhanced by LAE-GAN (proposed)                            13.81

Figure 11. Good cases of age estimation by the proposed method. The 1st, 2nd, and 3rd rows show the original images, the low-illuminated images, and the images enhanced by LAE-GAN, respectively.

As shown in Figures 11 and 12, blur appears in the enhanced facial images. However, the blur is more severe in the bad cases than in the good cases of Figure 11, and many enhanced images with severe noise were observed, which ultimately led to degradation in age estimation performance.

Testing with the FG-NET Database
For verifying the generality of the proposed method, an experiment was conducted using another open database, the FG-NET database. For the first experiment, age estimation accuracy was compared using the various networks explained in Section 3.4 on the images enhanced by LAE-GAN, as shown in Table 8. The experiment results showed that the best performance was exhibited by DEX, similar to the MORPH database.

Table 8. Age estimation accuracy (MAE, years).

Method                                                      MAE
Age estimation using various age estimators with LAE-GAN
  ResNet-50 [63]                                            11.00
  ResNet-152 [63]                                           9.74
  DEX [64]                                                  9.55
  AgeNet [29]                                               10.40
  Inception with RF [68]                                    10.14
Age estimation using original facial images or low-illuminated facial images without or with LAE-GAN
  Original                                                  6.42
  Low illumination (without LAE-GAN)                        11.31
  Enhanced by LAE-GAN (proposed)                            9.55

In Table 8, age estimation performance, or baseline performance, was measured on original images of high illumination and low-illuminated images with or without LAE-GAN using DEX, which had the best performance in Table 8. In each case, DEX was fine-tuned using the training data, and accuracy was evaluated using the testing data.

As shown in Table 8, a MAE of 6.42 years was found in the original images of high illumination, whereas a MAE of 11.31 years was found in the low-illuminated images without LAE-GAN. However, the MAE was reduced to 9.55 years when LAE-GAN was used. Figures 13 and 14 show good cases and bad cases, respectively, of age estimation when age is estimated using DEX with LAE-GAN. The first and second rows of Figures 13 and 14 are original images and low-illumination images, respectively, and the third rows are facial images enhanced using LAE-GAN.

As shown in Figures 13 and 14, when LAE-GAN was trained using the FG-NET database, the overall color of the images changed, but detailed information and overall shape were expressed adequately in the good cases compared to the bad cases. In some bad cases, an enhanced image different from the original image was generated, which increased errors in age estimation.

Discussion and Analysis of Grad-CAM
In our experiments, we used the AFAD database, which already includes images with severe slant angles (in-plane and out-of-plane rotations) and illumination variations, as shown in Figure 15a. The number of images with these severe slant angles and illumination variations is almost 20% of the total number of images in the AFAD database. However, our LAE-GAN successfully transformed the low-illumination images (Figure 15b) with these severe slant angles and illumination variations into enhanced ones, as shown in Figure 15c, and our method shows higher age estimation accuracy than the state-of-the-art methods, as shown in Table 7.

In addition, gradient-weighted class activation mapping (Grad-CAM) [87] images extracted from each layer of DEX, with the images enhanced using LAE-GAN as input, were analyzed. Figure 16a is the original facial image, while the pictures on the left and right sides of Figure 16b are the low-illumination image and the image enhanced by LAE-GAN, respectively. Figure 16c through Figure 16g are Grad-CAM images extracted from the first, fourth, eighth, and eleventh convolutional layers and the last max pooling layer. The pictures on the left in Figure 16c-g are Grad-CAM images, while the pictures on the right are the LAE-GAN-enhanced images overlapped with the Grad-CAM images. As shown in Figure 16c,d, high activation areas, mostly in high frequency areas such as the eyes, nose, mouth, and facial lines, are extracted from the front convolutional layers of DEX. As convolution proceeds, it can be observed in Figure 16e-g that activation areas are found in more global areas of the face, including the eyes, nose, and mouth. As shown in Figure 16g, features effective for age estimation are adequately extracted from the eye, nose, and mouth areas of the face using the proposed method.
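A minimal Grad-CAM sketch of the kind used for Figure 16, assuming a single-output Keras classifier; the layer name argument is illustrative:

```python
import tensorflow as tf

def grad_cam(model, image, layer_name):
    """Grad-CAM heatmap for the predicted age class; layer_name would be
    one of the DEX convolutional layers inspected in Figure 16."""
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        class_idx = int(tf.argmax(preds[0]))
        top_score = preds[:, class_idx]
    grads = tape.gradient(top_score, conv_out)
    # Channel importance: global average of the gradients (Grad-CAM weights).
    weights = tf.reduce_mean(grads, axis=(1, 2))
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)
    # Keep positive evidence only and normalize to [0, 1] for overlay.
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()
```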

Conclusions
Human facial images acquired in low-illumination environments lose the information required for age estimation because various kinds of noise and blur are generated. Therefore, to overcome the degradation of age estimation performance on human facial images captured in low-light environments, this study proposed a new LAE-GAN for enhancing low-illumination images and performed CNN-based age estimation on the enhanced images. The results of the experiments conducted using open databases, including the MORPH, FG-NET, and AFAD databases, showed that low-illumination images enhanced with the proposed LAE-GAN produced better age estimation performance compared to state-of-the-art enhancement networks. However, in the case of enhancement by LAE-GAN, high-frequency information such as wrinkles and detailed information such as the skin texture of the original image were not fully restored in the enhanced image, or images different from the original image were generated. In addition, the restored images are a little blurred and include additional noise introduced by the LAE-GAN transformation.
For solving these issues in the future, adding a loss function to fully restore skin texture, or strengthening an identity loss to prevent different enhanced images from being generated, will be investigated further. Moreover, more research will be conducted on age estimation and image enhancement using images of various illuminations and angles, in addition to facial image compensation that is more robust to the various environments found in the real world. Although the proposed method shows high performance, the processing time is increased by operating the two models of LAE-GAN and an age estimator. In future work, we intend to investigate a method to combine these two models into one, which can enhance the processing speed without reducing the accuracy of age estimation.