NDSRGAN: A Novel Dense Generative Adversarial Network for Real Aerial Imagery Super-Resolution Reconstruction

: In recent years, more and more researchers have used deep learning methods for super-resolution reconstruction and have made good progress. However, most of the existing super-resolution reconstruction models generate low-resolution images for training by downsampling high-resolution images through bicubic interpolation, and the models trained from these data have poor reconstruction results on real-world low-resolution images. In the ﬁeld of unmanned aerial vehicle (UAV) aerial photography, the use of existing super-resolution reconstruction models in reconstructing real-world low-resolution aerial images captured by UAVs is prone to producing some artifacts, texture detail distortion and other problems, due to compression and fusion processing of the aerial images, thereby resulting in serious loss of texture detail in the obtained low-resolution aerial images. To address this problem, this paper proposes a novel dense generative adversarial network for real aerial imagery super-resolution reconstruction (NDSRGAN), and we produce image datasets with paired high- and low-resolution real aerial remote sensing images. In the generative network, we use a multilevel dense network to connect the dense connections in a residual dense block. In the discriminative network, we use a matrix mean discriminator that can discriminate the generated images locally, no longer discriminating the whole input image using a single value but instead in chunks of regions. We also use smooth L 1 loss instead of the L 1 loss used in most existing super-resolution models, to accelerate the model convergence and reach the global optimum faster. Compared with traditional models, our model can better utilise the feature information in the original image and discriminate the image in patches. A series of experiments is conducted with real aerial imagery datasets, and the results show that our model achieves good performance on quantitative metrics and visual perception.


Introduction
Unmanned aerial vehicle (UAV) aerial remote sensing images can provide information about the Earth's surface quickly, but medium-and low-resolution UAV aerial imagery has limitations regarding extraction of high-precision features, map updates and target detection. The development of high-resolution UAV aerial imagery makes the in-depth application of aerial imagery possible, thus providing favourable conditions for GIS data updates [1] and GIS applications [2]. It is also important for map updates [3], semantic segmentation [4] and target detection [5]. Computer vision technology has been widely applied in many fields such as multi-target recognition [6] and seismic performance evaluation [7].

•
We produce a new image dataset with paired high-and low-resolution real aerial remote sensing images, which are both obtained by actual photography.

•
We propose a novel dense generative adversarial network for real aerial imagery super-resolution reconstruction (NDSRGAN). In the generative network, we use a multilevel dense network to connect the dense connections in a residual dense block. In the discriminative network, we use a matrix mean discriminator that can discriminate the generated images locally. We also use smooth L 1 loss to accelerate the model convergence and reach the global optimum faster.
The rest of the paper is organised as follows. Related studies on super-resolution are introduced in Section 2. In Section 3, we detail our NDSRGAN. The datasets and experimental results are given in Section 4, and we discuss our method on an open-source dataset in Section 5. Finally, we conclude our work and discuss future research directions in Section 6.

Related Work
The main SR methods are divided into three categories, which are discussed in the following subsections.

Interpolation-Based Methods
Interpolation-based methods calculate the value of a point according to a certain formula by using the values of several known points around the point and the relationship between the surrounding points and the location of this point. Interpolation-based methods are simple, efficient and easy to understand, but they are also prone to blurring and jaggedness when recovering images, and they cannot recover image details because no new information is generated during the interpolation process. Common interpolation-based methods include nearest neighbour [12], bilinear [13] and bicubic [14] methods.

Reconstruction-Based Methods
Reconstruction-based methods establish an observation model for the process of converting high-resolution (HR) images into low-resolution (LR) images, and then achieve SR reconstruction by solving the inverse problem of the observation model. Typical reconstruction-based methods are nonuniform interpolation [15], iterative inverse projection [16], maximum posterior probability [17] and convex set projection [18] methods.

Learning-Based Methods
A learning-based image SR algorithm learns the mapping relationship between LR and HR images by training the image dataset to predict the missing details of feature information in LR images to reconstruct high-quality images. Learning-based SR methods are the mainstream research direction at present, and they achieve the best results. Commonly used learning-based SR methods include the neighbour embedding method [19], sparse representation [20] and deep learning [21].
In recent years, with the rapid development of artificial intelligence technology [22], the use of deep learning methods to achieve SR reconstruction has become mainstream, and many deep-neural-network-based models have made good progress in the field of SR reconstruction. The methods using deep learning for single-image SR reconstruction [23] can be further divided into two categories: convolutional neural network (CNN)-based SR reconstruction and generative adversarial network (GAN)-based SR reconstruction.

CNN-Based SR Reconstruction Methods
A simple SRCNN was proposed by Dong et al. [24], which used a simple threelayer CNN [25] to achieve SR reconstruction and obtained better reconstruction results than traditional methods. However, due to the large size of the convolutional kernel and the shallow depth of the network, the reconstructed image was too smooth, and the reconstruction details were lost.
To overcome this shortcoming, Dong et al., proposed a fast SR reconstruction CNN (FSRCNN) [26], which uses smaller convolutional kernels than SRCNN to complete feature extraction and nonlinear mapping of images, and abandons the initial interpolation of low-resolution images for enlargement, instead directly inputting low-resolution images and finally performing deconvolution [27] to amplify the image. After this series of optimisations, FSRCNN obtained a higher evaluation score and better reconstruction effect than SRCNN. However, the reconstruction details still require improvement because the network structure is too simple.
Following the wide application of residual networks (ResNet) [28] in the field of target detection and target recognition, Lim et al. applied ResNet to the field of image SR reconstruction and proposed an enhanced SR reconstruction residual network (EDSR) [29]. In this network, the authors removed the batch normalisation (BN) layer from the ResNet [30] and increased the number of residual layers from 16 to 32. The BN layer consumes the same amount of memory as its preceding convolutional layers. Thus, after this operation step is removed, EDSR can stack more network layers, so that more features are extracted per layer with the same computational resources, resulting in a better performance.
These CNN-based SR network models have made good progress in continuous optimisation and improvement. However, a common loss function for CNN-based SR network models is the mean squared error (MSE), which causes the generated images to have a high signal-to-noise ratio but lack high-frequency detail and have an over-smoothed texture.

GAN-Based SR Reconstruction Methods
Compared with traditional CNNs, the biggest difference and success of GANs is the discriminator, which can be trained to discriminate the input generated images. In the field of aerial photography, the image complexity in aerial images is greater than that in natural images [31]. Conventional discriminators are prone to local discriminative errors in discriminating images. They cannot make correct discriminations based on the correlation of surrounding features, thus seriously affecting the quality of generated high-resolution images.
The main structure of a GAN consists of a generator and a discriminator. The generator generates fake images that deceive the discriminator, and the discriminator discriminates whether the input image is from a real image or a fake image generated by the generator, to achieve Nash equilibrium [32] by alternately training the generator and the discriminator, finally determining the network model parameters [33].
Ledig proposed a single-image SR reconstruction method using GANs (SRGAN) [34], which was the first application of GANs in image SR reconstruction, to solve the shortcomings of MSE in CNN-based SR reconstruction models [35]. SRGAN uses perceptual loss as an optimisation objective. Perceptual loss was proposed by Johnson et al., inspired by the studies in [36,37], and consists of adversarial loss [33] and content loss. Adversarial loss maps the image to a high-dimensional feature space and uses a discriminative network to discriminate the reconstructed image from the original image, and content loss is based on receptive similarity rather than pixel similarity [38].
An activated 19-layer VGG network [39] is used to obtain the feature maps of the generated image and the original high-resolution image, and then to improve the quality of the generated image by minimising the error between the generated image and the feature maps of the original high-resolution image. However, because the generative network of SRGAN partly uses ResNet, the generated image is usually not sufficiently natural and contains some noise.
To address the shortcomings of SRGAN, Wang et al., proposed an enhanced SRGAN (ESRGAN) [40]. This uses a multilevel network structure unit RRDB instead of ResNet in the generative network of the SRGAN and removes the BN layer. Inspired by a relativistic GAN, it allows the discriminator to predict the probability that the generated image is more false, relative to the high-resolution image, rather than just the probability that the generated image is false [41]. It also uses feature maps with stronger supervisory information before activation, to constrain the perceptual loss function. After a series of optimisations, the ESRGAN model achieved better visual quality in its reconstructed images but did not reconstruct more desirable results for the edge contours of some features. Therefore, Ma et al., proposed a gradient-guided SR reconstruction network model (SPSR) [42], which adds a gradient branch and gradient loss to the ESRGAN network and can recover the contour details of the original image to a great extent.
However, the overall visual effect of the image is still different from that of the original image. Although the above research on GAN SR models has achieved great improvements in the SR effect, the mainstream models usually use some more basic residual networks and L1 loss, resulting in the inability to fully learn the features of the real aerial images, while most models ignore the optimisation of the discriminator.

Method
This section focuses on the design of the generative network, discriminative network and loss function of NDSRGAN.

Generative Network
The generative network architecture is shown in Figure 1. Firstly, we use one convolutional layer to extract the original features in the LR image to obtain the original feature Remote Sens. 2022, 14, 1574 5 of 23 maps I f . I f is further input into multiple DCRDBs connected by a dense network. Then, the input and output of the u-th DCRDB can be represented as follows: This section focuses on the design of the generative network, discriminative network and loss function of NDSRGAN.

Generative Network
The generative network architecture is shown in Figure 1. Firstly, we use one convolutional layer to extract the original features in the LR image to obtain the original feature maps .
is further input into multiple DCRDBs connected by a dense network. Then, the input and output of the u-th DCRDB can be represented as follows: Generative network Discriminative network represents LR images and represents the image generated by the generative network.
represents the convolutional layer. , represents the u-th DCRDB module. denotes the upsampling network. represents the original HR image and represents the output image of the generative network. In addition, k represents the convolution kernel size, c represents the number of convolution kernels and s represents stride. The arrows in the figure represent the flow of the feature map. Figure 1. Network structure of NDSRGAN. I LR represents LR images and I SR represents the image generated by the generative network. F conv represents the convolutional layer. F DCRDB,u represents the u-th DCRDB module. F upnet denotes the upsampling network. I HR represents the original HR image and I SR represents the output image of the generative network. In addition, k represents the convolution kernel size, c represents the number of convolution kernels and s represents stride. The arrows in the figure represent the flow of the feature map.
In Equation (1), I f represents the original feature maps of the LR image I LR after convolution and α represents the residual scaling factor [43]. In Equation (2), Q u represents the input of the u-th DCRDB, I u represents the output of the u-th DCRDB and F DCRDB,u (·) represents the composite network formed by the u-th DCRDB through dense connections.
The specific network structure of the DCRDB is shown in Figure 2a. Each DCRDB consists of σ dense blocks (DB) and a convolutional layer, connected using a dense network. These dense connections can ensure that the output feature maps of each dense block module can be reused in multiple levels, which can maximise the use of feature information on LR images. The v-th DCRDB internal calculation formula can be described as follows:

connections.
The specific network structure of the DCRDB is shown in Figure 2a. Each DCRDB consists of dense blocks (DB) and a convolutional layer, connected using a dense network. These dense connections can ensure that the output feature maps of each dense block module can be reused in multiple levels, which can maximise the use of feature information on LR images. The v-th DCRDB internal calculation formula can be described as follows:  In Equation (4), , is the input of the v-th layer of the DB in the u-th DCRDB, , is the output of the v-th layer of the DB in the u-th DCRDB and , represents the v-th layer of the DB in the u-th DCRDB. After this, we sum the outputs of the DBs through the dense connection and input them into the convolutional layer, which is calculated as follows: In Equation (5), ⋅ represents the last convolutional operation in the DCRDB and represents the output of the convolutional layer, immediately after which we access a residual connection, calculated as follows: In Equation (6), represents the input after the u-th DCRDB and represents the output after the u-th DCRDB. In Equation (4), Q u,v is the input of the v-th layer of the DB in the u-th DCRDB, I u,v is the output of the v-th layer of the DB in the u-th DCRDB and F DB,v represents the v-th layer of the DB in the u-th DCRDB. After this, we sum the outputs of the σ DBs through the dense connection and input them into the convolutional layer, which is calculated as follows: In Equation (5), F conv (·) represents the last convolutional operation in the DCRDB and I DBConv represents the output of the convolutional layer, immediately after which we access a residual connection, calculated as follows: In Equation (6), Q u represents the input after the u-th DCRDB and I u represents the output after the u-th DCRDB.
The DB structure is shown in Figure 2b. Each DB is composed of a series of CL modules and a convolutional layer, with dense connection. The CL module consists of a convolutional layer and a leaky ReLU (LReLU) layer [44,45]. The internal computational formula of the v-th DB in the u-th DCRDB can be described as In Equation (8), Q u,v,w is the input of the w-th CL module in the v-th dense block in the u-th DCRDB, I u,v,w is the output of the w-th CL module in the v-th dense block in the u-th DCRDB, F CL,w represents the w-th CL composite network in the v-th DB of the u-th Remote Sens. 2022, 14, 1574 7 of 23 CRDB and T concat (·) represents the concatenate operation that sums the channels of the input feature maps. Suppose we use ϕ CL modules, then: In Equation (9), I CL is the output of F conv (·) and F conv (·) is the last layer of the convolution operation of the DB. Following this, we access a residual connection In Equation (10), Q u,v is the input of the v-th DB in the u-th DCRDB and I u,v is the output of the v-th dense block in the u-th DCRDB.
After introducing the basic components of the DCRDB, we now introduce the generative network. Suppose we use ε DCRDBs. The output I ε−1 is connected with the output of each previous layer of DCRDBs and fed into the next convolutional layer: In Equation (11), F conv (·) represents the final layer of the convolution operation of the generative network and I y represents the output of F conv (·). Then, we access a residual connection In Equation (12), I z represents the output of I y connected to the I f residuals. Finally, I z is fed to the upsampling network, set up as follows: In Equation (13), F upnet (·) represents our upsampling network, I z represents the input of the upsampling network and I SR represents the output of the upsampling network, which is also the input of the final discriminator. The upsampling network of ESRGAN [40] is exploited in our network. It includes two magnification steps. Each step uses a nearest neighbour algorithm to magnify the image and connect a convolutional layer and an LReLU [44,45] layer.

Discriminative Network
The role of the discriminator is to discriminate the feature map distribution difference between I SR and I HR , and to discriminate whether I SR is real or fake. The output of the basic discriminator of a GAN is often a classification after a series of convolutions and a full concatenation score to indicate whether the overall category of the image is real or fake. The receptive field for classification is the whole image, which is why the network is insensitive to the local information of the image and cannot achieve a higher-fidelity image reconstruction in the case of rich feature details. Therefore, local feature extraction and discrimination of the image are needed to generate an SR reconstructed image that is closer to the real aerial image.
Inspired by [46], we used the idea of a matrix mean discriminator capable of local discrimination on the generated images, to construct our discriminative network. The network structure of the discriminative network is shown in Figure 1, set up as a fully convolutional network [47]. The first layer consists of a convolutional layer and an LReLU [44,45]. The second, third and fourth layers consist of a convolution layer, a BN layer and an LReLU [44,45]. The fifth layer consists of one convolutional layer. The size of the output matrix in each layer is shown in Table 1. Table 1. The size of the output matrix of each layer of the network, where h represents the size of the input image, k 1 − k 5 represent the sizes of each convolution kernel from the first layer to the fifth layer, p 1 − p 5 represent the amounts of padding from the first layer to the fifth layer, s 1 − s 5 represent the sizes of each convolutional step from the first layer to the fifth layer, n 1 − n 4 represent the sizes of the output matrix from the first layer to the fourth layer and n represents the size of the final output matrix from the discriminator.

Network Layers
Calculation Formula Size of the Output Matrix the first layer The discriminative network finally yields an n × n discriminative matrix. Each element in the matrix represents a receptive field in the original image and carries out a real or fake discrimination for each value in the n × n matrix to complete the local discrimination of the input image. The experimental results shown in Section 4 demonstrate that the image quality is significantly improved by our discriminator-optimised model.

Loss Function Design
Here, we introduce the design of the loss function of the generative network, which consists of three main parts: pixel loss, perceptual loss [38] and adversarial loss [33]. Pixel loss is calculated as follows: In which Is a more powerful objective function. In fact, in the interval of |x| ≥ 1, L 1 loss actually solves the problem of a too-large gradient and unstable training caused by a too-large difference between I SR and I HR at the early stage of training. In the |x| < 1 interval, which is actually L 2 loss, the derivative of the L 2 loss is relatively small when the difference between I SR and I HR is small in the late training period, thereby making the loss convergence more stable and aiding convergence to the global optimum [11]. The perceptual loss is calculated as follows: We use the smooth L 1 loss to design our perceptual loss. In Equation (16), VGG(·) represents a 19-layer VGG network [39], and we use this network before activation to extract the feature maps from the generated and real images. The goal of the perceptual loss function is to minimise the error between the feature maps. The addition of this loss function allows our model to generate images with a more realistic texture. We can now introduce the adversarial loss for the generator, which is calculated as follows: In Equation (17), D HS = σ(D(I HR ) − E(D(I SR ))) represents the difference between the matrix D(I HR ) of the real image I HR output through the discriminator D(·) and the element mean of the output matrix D(I SR ) of I SR after the discriminator D(·). Then, the resultant matrix is mapped to a probability matrix between 0 and 1 by the sigmoid function, and the D HS matrix has an optimisation target of 0 for each element. Each element represents an optimisation target of 0 for each receptive field. Here, σ(·) is the Remote Sens. 2022, 14, 1574 9 of 23 sigmoid function and D SH = σ(D(I SR ) − E(D(I HR ))) represents the difference between the matrix D(I SR ) of I SR output through the discriminator D(·) and the element mean of the output matrix D(I HR ) of I HR after the discriminator D(·). Afterwards, the resultant matrix is mapped to a probability matrix between 0 and 1 by the sigmoid function, and the optimisation target of each element in the D SH matrix is 1. Each element also represents an optimisation target of 1 for each receptive field. The optimisation target of the receptive field is 1. The final objective function of the generator can be defined as In Equation (18), β and γ represent the weight coefficients of the pixel loss and the generator adversarial loss.
Next, we introduce the discriminator loss function construction. The discriminator adversarial loss calculation formula can be expressed as The design of the discriminator adversarial loss differs from that of the generator in which each element of the D HS = σ(D(I HR ) − E(D(I SR ))) output matrix has an optimisation objective of 1, in that each element of the D SH = σ(D(I SR ) − E(D(I HR ))) output matrix has an optimisation objective of 0. The similarity lies in the fact that each element of the matrix represents each receptive field. The loss function of the discriminator can be expressed as In Equation (20), η represents the weight coefficient of L DA . We optimise L G and L D by alternate training to update the parameters of our generative and discriminative networks and obtain our SR reconstruction model.

Experiments and Results
In this section, we introduce the production of the SR reconstruction dataset based on real HR and LR aerial imagery (RHLAI), the details of training parameters, the analysis of changes in the reconstructed image quality metrics during NDSRGAN training and the analysis of the reconstruction results of the RHLAI dataset.

RHLAI Dataset
Most SR reconstruction methods use LR images obtained by bicubic interpolation of HR images. A difference exists between the LR images obtained in this way and the actual LR images. As a result, the trained SR models often do not have the SR reconstruction capability of the real LR images. To solve this problem, in this study we discarded the bicubic interpolation method of generating LR images from HR images used in most previous studies to construct paired HR and LR datasets. Instead, we used real LR aerial imagery and the corresponding real HR aerial imagery to produce the datasets, to enable the model to learn more complex and more realistic mapping relationships instead of simply learning the inverse process of bicubic interpolation.
To ensure that the HR and LR images corresponded to each other, we took HR and LR images by aerial flight at 100 m and 400 m, respectively, in the same place. These images were taken on 9 January 2021, at 2:00 p.m., at Yichang City, Hubei Province, China. We obtained real HR aerial imagery with 0.05 m resolution and real LR aerial imagery with 0.2 m resolution.
We used the georeferencing tool in ArcGIS to align the HR images and LR images by manually setting the control points, in order to make the features in the aligned HR images and LR images correspond to each other as much as possible. We used OpenCV to crop the 0.05 m HR images into 256 × 256 images and the corresponding 0.2 m LR images into 64 × 64 images, resulting in a dataset of 9288 pairs of HR and LR images. We selected 8385 pairs of HR and LR images as the training dataset and 903 pairs of HR and LR images as the test dataset. We discuss the performed cross-validation on the training dataset in Section 4.3.
We named this dataset RHLAI. This dataset can avoid the problem of the model simply learning the inverse process of bicubic interpolation and can truly reflect the reconstruction ability of the SR model.

Training Details
We set the batch size to 16 for the input images. All the input data were randomly cropped and data-augmented. We randomly cropped the HR images to 128 × 128 and the LR images to 32 × 32. NDSRGAN uses 23 DCRDBs and 3 DBs in each DCRDB; each DB in turn uses 4 CR modules.
The residual scaling factor α was set to 0.2, β was set to e −2 , γ was set to 2.5e −3 and η was set to 0.5. The learning rate at the beginning of training was set to 2e −4 . The number of training iterations was set to e 5 and the learning rate was reduced by half at 2.5e 4 , 5e 4 and 7.5e 4 iterations.
For the optimiser settings, NDSRGAN used the Adam optimiser [48], with β 1 = 0.9 and β 2 = 0.99. The generative and discriminative networks were trained alternately during the training process until Nash equilibrium was achieved.
We developed our model code using the PyTorch framework and completed the model training on two GTX1080Ti GPUs with 12 GB of global memory.

Cross-Validation
In this experiment, we used three accuracy metrics to evaluate the model: peak signalto-noise ratio (PSNR) [49], structural similarity (SSIM) [50] and learned perceptual image patch similarity (LPIPS) [51]. LPIPS is calculated according to the L 2 distance d(x, x 0 ), which was defined in [51] as The widely used perceptual metrics (PSNR, SSIM) are very simple functions that cannot reflect human perception well. Therefore, in this experiment, we added the LPIPS metric to evaluate the model. LPIPS is a type of computational evaluation metric for image depth features, which is closer to human perception in visual similarity judgment and performs better. LPIPS is a metric calculated by mapping the evaluated image and the original image to a high-dimensional space using a deep learning model, and then calculating the distances between the feature images of the evaluated image and the original image. It can be used to evaluate the quality of the image. The smaller the distance between two feature images, the closer the evaluated image is to the original image. A lower LPIPS value represents better image quality [51].
We conducted nine-fold cross-validation experiments on the training dataset. A total of 8385 images were divided into nine sub-datasets. We extracted one sub-dataset each time as the validation dataset and input the remaining eight sub-datasets into the model for training. Table 2 shows three metrics of the model for nine experiments. After observing that the fourth experiment had the highest performance, we chose the sub-datasets of the fourth experiment to obtain the final parameters of our model and selected them for conducting subsequent model comparison tests.

Image Quality Metrics Analysis during NDSRGAN Training
We saved the mean values of PSNR, SSIM and LPIPS for the validation dataset every 500 iterations while the model was trained and plotted them into three graphs for visual analysis.  Figure 3a shows the PSNR values during the training process. The horizontal coordinates are the number of iterations, and the vertical coordinates represent the PSNR values, from which we can observe that the oscillation amplitude of the PSNR values was very large at the beginning of the training period, and the difference between the highest and lowest peaks was more than 2. Our model gradually stabilised, and the amplitude stayed at around 0.2 when the number of iterations approached 100,000.  Figure 3b shows the curve of the SSIM value change during training, from which we can observe that the oscillation amplitude of the SSIM value was still very large at the beginning of training, and the difference between the highest peak and the lowest peak was more than 0.05. As the number of training iterations increased, the overall trend of  Figure 3b shows the curve of the SSIM value change during training, from which we can observe that the oscillation amplitude of the SSIM value was still very large at the beginning of training, and the difference between the highest peak and the lowest peak was more than 0.05. As the number of training iterations increased, the overall trend of the SSIM values gradually increased, the amplitude of oscillation gradually stabilised and the final amplitude remained around 0.01, which also indicates that our model gradually stabilises in the late training period. Figure 3c shows the curve of the LPIPS value change during the training process. It shows that the quality of our model improved continuously as the number of training iterations increased, and the model stabilised and converged when the number of iterations approached 100,000.

Reconstruction Results Analysis of RHLAI Dataset
We conducted experiments using SRCNN [25], EDSR [29], SRGAN [34], ESRGAN [40], SPSR [42], Real-ESRGAN [52] and NDSRGAN on the RHLAI dataset, under the conditions where the training iterations, initial learning rates and training hardware environments were guaranteed to be the same. Table 3 shows the PSNR, SSIM and LPIPS values of the reconstructed images on the test dataset. Higher PSNR and SSIM values and lower LPIPS values represent better image quality. Through experimental comparison, we found that the bicubic interpolation methods and the CNN-based SR reconstruction methods tended to obtain higher PSNR and SSIM values compared with the GAN-based methods, but a large gap still exists between the real reconstruction visual effect and the GAN-based methods' reconstruction effect, and the image visual effect is blurred (see Figure 4). Therefore, we believe that the PSNR and SSIM metrics do not represent the real image reconstruction quality well. The authors who developed ESRGAN [40] and SPSR [42] presented the same insights in their papers. For this reason, we calculated a metric closer to human visual perception, to continue the accuracy evaluation, namely LPIPS [51].Our model achieved the highest accuracy score, significantly exceeding the LPIPS value of the second-ranked SPSR model. It can be seen from Figure 4 that this metric is consistent with human visual perception. The lower the LPIPS value othe reconstructed image, the higher the reconstruction quality. We also performed a group of qualitative and quantitative analyses of the reconstructed images in terms of human visual perception. Figure 4 shows that the reconstructed images from our model have richer texture details than the reconstructed images from other models, and the reconstructed images are closer to the real HR images.
The figure also shows that the images reconstructed using bicubic interpolation and CNN-based methods are obviously blurred, and the images reconstructed by the GANbased methods are clearer. and SSIM metrics do not represent the real image reconstruction quality well. The authors who developed ESRGAN [40] and SPSR [42] presented the same insights in their papers. For this reason, we calculated a metric closer to human visual perception, to continue the accuracy evaluation, namely LPIPS [51].Our model achieved the highest accuracy score, significantly exceeding the LPIPS value of the second-ranked SPSR model. It can be seen from Figure 4 that this metric is consistent with human visual perception. The lower the LPIPS value othe reconstructed image, the higher the reconstruction quality.  As can be seen in the first and second rows of Figure 4, our model is able to reconstruct a clearer car outline. In the third row of Figure 4, our model is able to restore more of the white texture of the roof and produce fewer artifacts and in the fourth row of Figure 4, our model is able to generate a clearer gutter that is closer to the real HR image. In the fifth row of Figure 4, our model generates more regular and more detailed parking lines.
These results indicate that our model is able to take full advantage of the features of the original image and that the feature detail contours in the generated image are closer to the real HR image, due to the matrix mean discriminator utilised by NDSRGAN. The performance of NDSRGAN will be further discussed in Appendix A.

Discussion
To further demonstrate the good performance of NDSRGAN, we conducted a series of experiments using SRCNN, EDSR, SRGAN, ESGAN, SPSR, Real-ESRGAN and NDSRGAN on the DIV2K dataset [53], under conditions where the training iterations, initial learning rates and training hardware environments were guaranteed to be the same. Table 4 shows the PSNR, SSIM and LPIPS values of the reconstructed images in Set5 [54], Set14 [55] and Urban100 [56]. As can be seen in Table 4, the EDSR model obtained the highest PSNR and SSIM metrics in all three test sets. As described in Section 4.5, CNN-based models tended to obtain higher PSNR and SSIM values, but the image reconstruction visual results were not as good as those of the GAN-based models. The LPIPS metric is more consistent with human visual evaluation results, so the ranking of LPIPS values is more convincing regarding the superiority of the model. Our NDSRGAN model obtained the best LPIPS metrics in three test sets, further proving the superiority of our model's reconstruction ability.

Conclusions
High-resolution aerial images are important in aerial photography applications, but they have a high acquisition cost and a long processing period. Moreover, the current models do not achieve a good reconstruction effect on real aerial image super-resolution reconstruction. Therefore, reconstructing high-quality high-resolution aerial images from real low-resolution aerial images is a great challenge.
To address the difficulty of reconstructing real aerial images by using existing superresolution reconstruction models, we produced the RHLAI dataset with real aerial highand low-resolution pairing for aerial image super-resolution reconstruction and proposed a new aerial image super-resolution reconstruction model called NDSRGAN.
We used multilevel dense connection to connect multiple dense connections in the residual dense block, so that the designed generative network could maximise the use of the features of real aerial images, and we used a matrix mean discriminator to optimise our discriminative network, which does not discriminate the whole image of the input but instead discriminates the input image locally in chunks.
The experimental results demonstrated that the reconstructed images obtained by image local discrimination were of higher quality and closer to the real high-resolution images. In our model, smooth L 1 loss was used instead of the L 1 loss used in the mainstream models (ESRGAN [40], SPSR [42]), and this optimisation accelerated the convergence of the model, making it reach the global optimum faster.
The previous super-resolution reconstruction models and NDSRGAN were trained on the real aerial image dataset we produced, and the experiments showed that our proposed super-resolution reconstruction model for aerial images could reconstruct the real aerial images with better results, and that the reconstructed images achieved better results than previous methods regarding both quantitative metrics and qualitative evaluation.
Although the proposed method could reconstruct real aerial images with better results, the quality of the reconstructed images was not good when the model reconstructed images taken by different sensors. The lack of reconstruction generalisation ability of the superresolution model on images taken by different sensors remains a problem that needs to be studied and addressed in the future.

Institutional Review Board Statement:
The study did not involve humans or animals.

Informed Consent Statement:
The study did not involve humans.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
In this section, we report the results of performing experimental analyses of model ablation, residual scaling factor size, discriminative matrix size of discriminator and image size of the input during model training.

Appendix A.1. Ablation Study
We performed stepwise improvement and optimisation on the base network of the ESRGAN model to prove that each part of our model optimisation was effective. We designed three comparative experiments.
The first was to add a multilevel densely connected network to the base generative network part to achieve multilevel information utilisation and feature fusion between multilevel DBs and DCRDBs. The second was to use a matrix mean discriminator instead of the original discriminator, in addition to the MDC. The third was our final network model, which uses smooth L 1 in addition to MDC and MMD to compute the pixel loss and perceptual loss.
From the experimental results in Section 4.5, we concluded that PSNR and SSIM are not suitable for evaluating image quality, so we used only the LPIPS metric as the image quality assessment metric in the experiments discussed in the Appendix. The LPIPS values of the reconstructed images are shown in Table A1. After the MDC optimisation was added to the base network, the LPIPS value improved by 0.0065 compared with the base network, thereby proving that the MDC can effectively improve the image reconstruction ability of the network. Table A1. Mean value of LPIPS for different ablation experiments. Base represents the base network, MDC represents multilevel dense connection, MMD represents matrix mean discriminator and smooth L 1 means that the pixel loss and perceptual loss are computed by smooth L 1 loss. The best performance in bold. Meanwhile, after MMD was added, the LPIPS value of the reconstructed images significantly improved by 0.0236, which shows that MMD plays a crucial role in the optimisation of the model.

Methods LPIPS
Finally, after the smooth L 1 loss was added to the model, giving our final proposed SR reconstruction model NDSRGAN, the LPIPS value of the reconstructed images improved by 0.0184. From this result, we can conclude that our optimisation of the loss is also essential.
The reconstruction effects of each model in the above ablation experiments are shown in Figure A1. The first row of Figure A1 shows that the car parking lines in the reconstructed images of the base model are very blurred, which is directly related to the fact that the base model generative network does not make full use of the image features. After we added MDC optimisation to the generative network, the car parking lines in the reconstructed images became much clearer compared with those in base model, but the contours and colours of the parking lines still had some distortion. After we added MMD optimisation to the generative network optimisation, the contour of the parking line was significantly improved, and the colour of the parking line was closer to the original HR image. Finally, after we added smooth L 1 loss optimisation to the model, the reconstructed parking lines became smoother, and the contour shape was closer to the original HR image. From the second row of Figure A1, we can observe that the reconstructed images from the base model had serious colour distortion and obvious artifacts in the local zoomed images. After using MDC to optimise the generative network, we can see that the reconstructed image exhibited improved colour and fewer artifacts. When we optimised the discriminator with MMD, the colour of the reconstructed image was further corrected, and the overall image quality was improved. Finally, after we added ℎ loss optimisation, the final model reconstructed an image with the closest colour to the original HR image, the fewest artifacts and the best visual effect. The third row of Figure A1 shows that the roof lines in the reconstructed images of the base model were rather blurred, and there was a serious loss of features. When the MDC-optimised generative network was used, the roof lines in the reconstructed images became clearer. After we used MMD to optimise the discriminator, the roof lines in the reconstructed image became clearer, but some colour distortion occurred. Finally, after we added the ℎ loss optimisation, the final reconstructed image had the sharpest roof line, and the colour was closer to the original HR image.

Appendix A2. Residual Scaling Factor
In the MDC of the generative network, we used a residual scaling factor to set the feature information weights and we conducted group experiments by setting different From the second row of Figure A1, we can observe that the reconstructed images from the base model had serious colour distortion and obvious artifacts in the local zoomed images. After using MDC to optimise the generative network, we can see that the reconstructed image exhibited improved colour and fewer artifacts. When we optimised the discriminator with MMD, the colour of the reconstructed image was further corrected, and the overall image quality was improved. Finally, after we added smooth L 1 loss optimisation, the final model reconstructed an image with the closest colour to the original HR image, the fewest artifacts and the best visual effect.
The third row of Figure A1 shows that the roof lines in the reconstructed images of the base model were rather blurred, and there was a serious loss of features. When the MDC-optimised generative network was used, the roof lines in the reconstructed images became clearer. After we used MMD to optimise the discriminator, the roof lines in the reconstructed image became clearer, but some colour distortion occurred. Finally, after we added the smooth L 1 loss optimisation, the final reconstructed image had the sharpest roof line, and the colour was closer to the original HR image.

Appendix A.2. Residual Scaling Factor
In the MDC of the generative network, we used a residual scaling factor to set the feature information weights and we conducted group experiments by setting different residual scaling factors to select the most suitable residual scaling factor for the SR reconstruction of real aerial images. Preliminary experiments proved that when the residual scaling factor was larger than 0.4, the fusion of feature information was too redundant, and the model reconstructed the image poorly. Therefore, we chose the four residual scaling factors 0.1, 0.2, 0.3 and 0.4 to conduct the experiments.
The mean values of LPIPS on the test dataset are shown in Table A2. We can conclude from observation that when the residual scaling factor was 0.1, the LPIPS value of the reconstructed images was not the best, due to the loss of feature information in the process of MDC caused by a too-small residual scaling factor. When the residual scaling factor was 0.3 or 0.4, the LPIPS values of the reconstructed images became worse, compared with the value for a residual scaling factor of 0.2, because the residual scaling factor was too large. As a result, the feature information in the MDC process was too redundant, thereby negatively affecting the SR reconstruction of the image. When the residual scaling factor was 0.2, the LPIPS value mostly ranked first. This also shows that the reconstructed network can make the best use of the feature information in the multilevel densely connected network. Table A2. Mean value of LPIPS for models with different residual scaling factors, where α represents the residual scaling factor. The best performance in bold.

Residual Factor (α) LPIPS
We also present the reconstructed images of the model trained with different residual scaling factors ( Figure A2). From the first row of Figure A2, we can see that when the residual scaling factor was equal to 0.1, the overall visual effect of the red car was blurred due to the insufficient utilisation of feature information with a small residual scaling factor. When it was equal to 0.3, the reconstruction of the red car was too dark due to the feature redundancy caused by the large residual scaling factor. When it was equal to 0.4, the red car's overall visual effect became worse and more blurred. When it was equal to 0.2, the reconstruction of the red car had the highest quality and was closest to the original HR image.
From the second row of Figure A2, we can observe that the reconstructed image of the roof surface was too smooth, and the features of the roof surface gaps were lost when the residual scaling factor was equal to 0.1. When the residual scaling factor was equal to 0.3, the roof surface was reconstructed with more artifacts. When the residual scaling factor was equal to 0.4, the overall reconstructed image was too smooth, and there was a serious loss of texture detail. When the residual scaling factor was equal to 0.2, the roof surface had the fewest artifacts, and the features were most fully preserved.
The third row of Figure A2 shows that the roof reconstruction effect was very poor when the residual scaling factor was equal to 0.1, and the features of the original HR image were not learned. When the residual scaling factor was equal to 0.3, the overall reconstructed image was blurred due to the redundancy of feature stacking. When the residual scaling factor was equal to 0.4, the overall roof reconstruction effect was even worse, and the features of the white lines of the original image were not learned. When the residual scaling factor was equal to 0.2, the roof white lines were better reconstructed, and the overall visual effect was closer to the original HR image. residual scaling factor was equal to 0.1, the overall visual effect of the red car was blurred due to the insufficient utilisation of feature information with a small residual scaling factor. When it was equal to 0.3, the reconstruction of the red car was too dark due to the feature redundancy caused by the large residual scaling factor. When it was equal to 0.4, the red car's overall visual effect became worse and more blurred. When it was equal to 0.2, the reconstruction of the red car had the highest quality and was closest to the original HR image. Figure A2. Experimental results of the model corresponding to different residual scaling factors, where represents the residual scaling factor. The best performance in bold. Figure A2. Experimental results of the model corresponding to different residual scaling factors, where α represents the residual scaling factor. The best performance in bold.

Appendix A.3. Discriminative Matrix Size
To improve the discriminator's ability to discriminate local features of an image, we used an MMD. The MMD aims to finally output a discriminative matrix to discriminate the image. Different discriminative matrices correspond to different receptive field sizes in the original image, and a large output discriminative matrix corresponds to a small receptive field for each value in the matrix in relation to the original image. We used different sizes of the discriminative matrix to perform group experiments to select the most suitable discriminative matrix size for SR reconstruction of aerial images. We set up three sets of comparison experiments with discriminative matrix sizes of 8 × 8, 14 × 14 and 19 × 19.
The mean values of the accuracy metric LPIPS for the reconstructed images are shown in Table A3, from which we find that the LPIPS value for the reconstructed images was not good when the discriminative matrix size was 8 × 8. When the size of the discriminative matrix was 19 × 19, the LPIPS of the reconstructed images ranked second, which shows that the discriminative matrix of size 19 × 19 was not the most suitable for the SR reconstruction of aerial images. When the discriminative matrix size was 14 × 14, the reconstructed images ranked first for the LPIPS metric, and therefore the experiments indicate that a discriminative matrix of size 14 × 14 was the most suitable choice for our SR reconstruction of real aerial images. We also present the reconstructed images of the model trained with different discriminative matrix sizes ( Figure A3). In the first row of Figure A3, when the discriminative matrix size was 8 × 8, the edges of the red car in the reconstructed image showed artifacts. When the discriminative matrix size was 19 × 19, the edges of the red car were very blurred and showed colour distortion. When the discriminative matrix size was 14 × 14, the edges of the red car were the most complete and showed the least distortion and fewest artifacts. In the second row of Figure A3, when the discriminative matrix was of size 8 × 8, the overall colour of the roofline was dark and distorted. When the discriminative matrix was of size 19 × 19, the overall colour of the roofline was still dark, and some obvious white artifacts were observed. When the discriminative matrix was of size 14 × 14, the reconstructed image was closest to the original HR image.
In the third row of Figure A3, when the discriminative matrix was of size 8 × 8, the roof in the reconstructed image appears to show many artifacts. When the discriminative matrix was of size 19 × 19, the roof in the reconstructed image still appears to show more artifacts. When the discriminative matrix was of size 14 × 14, the roof in the reconstructed image produced the fewest artifacts and the best visual effect.

Appendix A4. Random Crop Size of Training Input Images
In this section, we discuss the effect of the random crop size of the HR input images during model training on the reconstruction results of the model. We used three random crop sizes of 64 × 64, 128 × 128 and 192 × 192 for group comparison experiments to select the most suitable random crop size for SR reconstruction of aerial images.
The mean values of LPIPS for the reconstructed images are shown in Table A4, from which we can see that the reconstructed images did not obtain good LPIPS metric values at the random crop size of 64 × 64. At the random crop size of 192 × 192, the LPIPS metric on the reconstructed image ranked second, thus indicating that a random crop size of the Figure A3. Experimental results of models using different discriminative matrix sizes. The best performance in bold.
In the second row of Figure A3, when the discriminative matrix was of size 8 × 8, the overall colour of the roofline was dark and distorted. When the discriminative matrix was of size 19 × 19, the overall colour of the roofline was still dark, and some obvious white artifacts were observed. When the discriminative matrix was of size 14 × 14, the reconstructed image was closest to the original HR image.
In the third row of Figure A3, when the discriminative matrix was of size 8 × 8, the roof in the reconstructed image appears to show many artifacts. When the discriminative matrix was of size 19 × 19, the roof in the reconstructed image still appears to show more artifacts. When the discriminative matrix was of size 14 × 14, the roof in the reconstructed image produced the fewest artifacts and the best visual effect.

Appendix A.4. Random Crop Size of Training Input Images
In this section, we discuss the effect of the random crop size of the HR input images during model training on the reconstruction results of the model. We used three random crop sizes of 64 × 64, 128 × 128 and 192 × 192 for group comparison experiments to select the most suitable random crop size for SR reconstruction of aerial images.
The mean values of LPIPS for the reconstructed images are shown in Table A4, from which we can see that the reconstructed images did not obtain good LPIPS metric values at the random crop size of 64 × 64. At the random crop size of 192 × 192, the LPIPS metric on the reconstructed image ranked second, thus indicating that a random crop size of the training input image of 192 × 192 was not suitable for SR reconstruction of aerial images. For the random crop size of 128 × 128, the reconstructed images ranked first for the LPIPS metric. Therefore, we used a random crop size of 128 × 128 for SR reconstruction of real aerial images. We present the reconstructed images of the model with different random crop sizes ( Figure A4). The first row of Figure A4 shows that, for a crop size of 64 × 64, the red car in the reconstructed image was very blurred, and the reconstruction effect was poor. For a crop size of 192 × 192, the red car in the reconstructed image had some distortion. For a crop size of 128 × 128, the outline of the red car was the clearest, and the visual effect was closer to the original HR image. metric. Therefore, we used a random crop size of 128 × 128 for SR reconstruction of real aerial images. We present the reconstructed images of the model with different random crop sizes ( Figure A4). The first row of Figure A4 shows that, for a crop size of 64 × 64, the red car in the reconstructed image was very blurred, and the reconstruction effect was poor. For a crop size of 192 × 192, the red car in the reconstructed image had some distortion. For a crop size of 128 × 128, the outline of the red car was the clearest, and the visual effect was closer to the original HR image. In the second row of Figure A4, when the crop size was 64 × 64, more reconstruction artifacts were found on the roof surface in the reconstructed image. When the crop size was 192 × 192, many reconstruction artifacts remained on the roof surface in the reconstructed image. When the crop size was 128 × 128, the fewest reconstruction artifacts were In the second row of Figure A4, when the crop size was 64 × 64, more reconstruction artifacts were found on the roof surface in the reconstructed image. When the crop size was 192 × 192, many reconstruction artifacts remained on the roof surface in the reconstructed image. When the crop size was 128 × 128, the fewest reconstruction artifacts were found on the roof surface, and the best visual effect was achieved.
In the third row of Figure A4, when the crop size was 64 × 64, the colour of the blue roof in the reconstructed image was dark, and serious colour distortion occurred. For a crop size of 192 × 192, the colour of the blue roof in the reconstructed image was still dark, and some black artifact patches were found. When the crop size was 128 × 128, the colour and detail texture of the blue roof in the reconstructed image were closest to the original HR image, and the visual effect was the best.