Deep Learning-Based Single Image Super-Resolution: An Investigation for Dense Scene Reconstruction with UAS Photogrammetry

Abstract: The deep convolutional neural network (DCNN) has recently been applied to the highly challenging and ill-posed problem of single image super-resolution (SISR), which aims to predict high-resolution (HR) images from their corresponding low-resolution (LR) images. In many remote sensing (RS) applications, the spatial resolution of the aerial or satellite imagery has a great impact on the accuracy and reliability of information extracted from the images. In this study, the potential of a DCNN-based SISR model, called enhanced super-resolution generative adversarial network (ESRGAN), to predict the spatial information degraded or lost in a hyper-spatial resolution unmanned aircraft system (UAS) RGB image set is investigated. The ESRGAN model is trained over a limited number of original HR images (50 out of 450 total images) and virtually generated LR UAS images, obtained by downsampling the original HR images using a bicubic kernel with a factor of ×4. Quantitative and qualitative assessments of super-resolved images using standard image quality measures (IQMs) confirm that the DCNN-based SISR approach can be successfully applied to LR UAS imagery for spatial resolution enhancement. The performance of the DCNN-based SISR approach for the UAS image set closely approximates performances reported on standard SISR image sets, with mean peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) index values of around 28 dB and 0.85, respectively. Furthermore, by exploiting the rigorous Structure-from-Motion (SfM) photogrammetry procedure, an accurate task-based IQM for evaluating the quality of the super-resolved images is carried out. Results verify that the interior and exterior imaging geometry, which are extremely important for extracting highly accurate spatial information from UAS imagery in photogrammetric applications, can be accurately retrieved from a super-resolved image set.
The number of corresponding keypoints and dense points generated from the SfM photogrammetry process are about 6 and 17 times more than those extracted from the corresponding LR image set, respectively.


Introduction
In most remote sensing (RS) applications, high-resolution (HR) images are in greater demand across a wide range of image analysis tasks because they lead to more precise and accurate RS-derived products [1–3]. HR imagery is more desirable in nearly all applications, including RS, because improved pictorial information makes visual interpretation easier for a human and yields a cleaner representation for automatic machine perception [4]. In RS applications, the resolution of a
1. An overview of the SR problem and DCNN approaches for SISR is provided with emphasis on the generative adversarial network (GAN) architecture. GAN-based models are fully reviewed, including their specific loss functions. Additionally, different learning strategies and image quality measures (IQMs) typically employed for SISR tasks are reviewed.

2. A high-performance DCNN-based SISR model based on the GAN architecture [31], known as enhanced SRGAN (ESRGAN) [32], is adopted and trained on a set of LR UAS images virtually generated by downsampling the original HR image set by a factor of ×4. Additive white Gaussian noise is applied to the LR imagery to make the SISR task more challenging. Such noise can appear in any digital imaging and image transmission system due to the electronics, imaging sensor quality, and the interaction of the digital imaging system with the natural environment, such as the level of illumination, temperature, etc. [33]. Model performance in recovering the degraded or lost image details and reducing noise in the predicted super-resolved images is then evaluated using standard IQMs. In this experiment, IQMs include peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) index, and a qualitative analysis through visual inspection of the resulting SR images.

3. A task-based IQM using Structure-from-Motion (SfM) photogrammetry is carried out on the predicted SR image set.

4. A comprehensive comparative analysis of SfM-derived photogrammetric data products, resulting from processing of the LR, HR, and SR UAS image sets, is carried out. Those products include the camera calibration and camera pose information, densified 3D point clouds, and digital surface models (DSMs).
In regard to the UAS-SfM task-based evaluation for SR described above, the primary objectives of the experiment are summarized as follows:
1. The performance of the adopted DCNN-based SISR model in retrieving both the interior and exterior geometry of the UAS imagery is investigated. In SfM photogrammetry, the accuracy and reliability of all parameters derived within the robust bundle adjustment (BA) computations are closely related to the accuracy and reliability of the keypoint features extracted from the raw images. Any image distortions and artefacts introduced by adding noise or upsampling images can dramatically affect the reliability of the parameters derived within the BA computations.
2. The potential of the employed DCNN-based SISR model to reduce the level of inherent and additional noise introduced to the original HR images is investigated. In most image-based 3D reconstruction algorithms, including SfM photogrammetry, a lower level of noise in the underlying image set results in estimating the imaging and scene geometry with higher accuracy. This is because the feature detection operators, using sophisticated image processing algorithms, extract keypoint features with higher accuracy and lower uncertainty across multiple images in a UAS image set. To do this, the naive pre-trained ESRGAN model, with upscaling factor ×1, is taken as an image restoration network. The idea is to explore the effectiveness of the ESRGAN model, trained on a large number of images within several standard image sets, in reducing the inherent noise and restoring the original UAS HR images.
The remainder of this paper is organized as follows. Section 2 briefly describes image SR as an image upscaling technique to recover the degraded or lost image details in LR images. Section 3 introduces some of the pioneering DCNN-based SISR architectures. The GAN-based architecture and its specific cost function for the SISR task are described later in Section 3. Section 4, on learning strategies, introduces different cost functions that are usually used in DCNN-based SISR models. Different metrics developed for evaluating the quality of the resulting SR images are explained in Section 5. Section 6 explains the experiment, including the employed DCNN-based SISR model. Section 7 reports the qualitative and quantitative results showing the performance of the ESRGAN model on virtually generated LR UAS images based on standard IQMs and a task-based IQM using SfM photogrammetry. Section 8 discusses the results in detail. Lastly, Section 9 provides a conclusion and future perspective.

Image Super-Resolution
Image SR refers to techniques that aim to restore an HR image from its LR counterpart(s). Their main goal is to recover the high-frequency details lost in LR images and remove the degradation caused by the imaging device and/or environment [34,35]. SR is a topic of great interest in digital image processing and many computer vision related applications, including HDTV [36], medical imaging [37,38], satellite imaging [39], face recognition [40], and security and surveillance [41]. The basic idea in most SR techniques is to extract the non-redundant image content in multiple LR images and combine it to generate an HR image [5]. Single image interpolation is an easy approach among the many available SR techniques and can be used to increase the image size [4]. However, several works have shown that it does not provide any additional information and can dramatically degrade the details of the image [4,24,42].
Generally, the SR problem assumes the LR image represents a downsampled, noisy, and blurred (by an unknown low-pass filter) version of the HR data. Due to the non-invertibility of the degradation process, the SR problem is inherently ill-posed [43]. In other words, it is an under-determined inverse problem whose solution is not unique. In the typical SR framework, as depicted in Figure 1, the LR image $I_x$ is modeled as follows [44]:
$$I_x = D(I_y; \delta) \tag{1}$$
where $I_y$ is the corresponding HR image, $D$ represents a degradation function, and $\delta$ is a set of parameters, e.g., the parameters of the unknown convolutional kernel, the scaling factor, and some noise related factors, contributing to the degradation process. Under general conditions, the degradation process $D$ is unknown and only the LR image $I_x$ is provided. Thus, the SR operation, the reverse path in Figure 1, is an extremely challenging task, which effectively results in a one-to-many mapping from LR to HR image space [25]. Researchers are required to recover an HR approximation $\hat{I}_y$ from the LR image $I_x$, so that $\hat{I}_y$ is identical to the ground truth HR image $I_y$, as follows [44]:
$$\hat{I}_y = F(I_x; \theta) \tag{2}$$
where $F$ is the super-resolution model and $\theta$ represents the parameters of $F$. Generally, degradation models combine several operations as follows [44]:
$$D(I_y; \delta) = (I_y \otimes k)\downarrow_s + n_\zeta, \quad \{k, s, \zeta\} \subset \delta \tag{3}$$
where $(I_y \otimes k)$ represents the convolution between a blur kernel $k$ and the HR image $I_y$, $\downarrow_s$ represents a downsampling process with factor $s$, and $n_\zeta$ is some additive white Gaussian noise with standard deviation $\zeta$. SR techniques typically assume that high-frequency image contents are redundant and can be reconstructed from low-frequency contents, making the SR technique an inference problem [43]. Some SR techniques assume that, for reconstructing an HR image of a certain scene, multiple LR instances of the same scene with different perspectives are available. These techniques are categorized as multi-image SR (MISR) approaches [16].
Such methods attempt to invert the downsampling process by exploiting the explicit redundancy and constraining the ill-posed problem with additional information. However, MISR methods are usually computationally expensive because they require complex image registration and fusion in LR image space, where the accuracy of those processes directly affects the quality of the resulting super-resolved images [43]. An alternative approach is single image super-resolution (SISR) [45]. These techniques attempt to exploit the implicit redundancy available in the LR images, in the form of local spatial correlation in an image or additional temporal correlations in a video, and recover lost or deteriorated high-frequency content from a single LR instance. In SISR techniques, prior information is usually required to constrain the solution space [46].
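The degradation model described above (blur, downsampling by a factor s, additive white Gaussian noise) can be illustrated with a minimal NumPy sketch. The function names `gaussian_kernel` and `degrade`, the Gaussian blur kernel, and the plain decimation step are assumptions made for illustration only; the experiment in this paper uses a bicubic kernel, which this sketch does not reproduce.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalized 2D Gaussian blur kernel k (odd size assumed)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def degrade(hr, kernel, scale=4, noise_std=0.0, rng=None):
    """Blur a 2D grayscale image, decimate it by `scale`, and add Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    hr = np.asarray(hr, dtype=np.float64)
    pad = kernel.shape[0] // 2
    padded = np.pad(hr, pad, mode="reflect")
    blurred = np.zeros_like(hr)
    # Direct 2D filtering; the kernel is symmetric, so correlation equals convolution.
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            window = padded[dy:dy + hr.shape[0], dx:dx + hr.shape[1]]
            blurred += kernel[dy, dx] * window
    lr = blurred[::scale, ::scale]                   # keep every `scale`-th pixel
    return lr + rng.normal(0.0, noise_std, lr.shape)  # additive white Gaussian noise

hr_image = np.ones((16, 16))
lr_image = degrade(hr_image, gaussian_kernel(5, 1.0), scale=4, noise_std=0.0)
print(lr_image.shape)
```

Because the kernel is normalized, a constant image stays constant after blurring, which gives a quick sanity check of the sketch.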

Deep Learning for SISR
Learning-based methods, also known as example-based methods [4,47–49], aim to estimate an effective mapping from LR to HR images and are favored for their fast computation and superior performance relative to many traditional techniques [25]. These methods usually exploit machine learning (ML) algorithms to learn the statistical relationships between the HR and corresponding LR images from a substantial number of training samples [25]. Traditional methods for SISR suffer from a few drawbacks [25,43]: (1) an unclear and potentially very complex definition of the mapping between the LR and HR image spaces; (2) a sub-optimal established high-dimensional mapping; and (3) a reliance on handcrafted features that require expert domain knowledge. Recently, deep learning-based SISR methods have achieved remarkable improvements over traditional and ML approaches [23–25]. These methods take advantage of the huge capacity of DL models to provide an extremely nonlinear mapping from the input space to the solution space in a very high-dimensional space, and to efficiently explore that space to find the best solution. These methods usually adopt a DCNN architecture for low- to high-level feature encoding and nonlinear feature mapping.

DCNN Architectures for SISR
A variety of super-resolution models based on DCNN architectures have been proposed so far. Most of those models focus on supervised super-resolution, requiring both LR images and corresponding HR images, usually as ground truth (GT). These approaches are mostly composed of a set of major components and processing strategies including the model's main framework, upsampling method, network architecture, and learning strategy.
The super-resolution convolutional neural network (SRCNN) by Dong et al. [24,50], shown in Figure 2, is a pioneering work in the DCNN-based SISR approach. Despite its striking success, the SRCNN model suffers from the following issues [25]. (1) Inputs to SRCNN are LR images upsampled to coarse HR images at a desired size using traditional methods (e.g., bicubic interpolation). Introducing interpolated images as inputs to the network has three main drawbacks: (a) severe over-smoothing and noise amplification effects introduced to interpolated inputs can result in further inaccurate estimations of the image content; (b) employing interpolated versions of images, instead of the original LR image, as input is very time-consuming and increases computational complexity almost quadratically [51]; and (c) assuming an unknown kernel in the downsampling process makes adopting a specific interpolated input, as an estimation of the output, unjustified. (2) As mentioned previously, most SR techniques make the assumption that the high-frequency content is redundant and can be accurately predicted from the low-frequency data [52]. Thus, exploring more contextual information within large regions of LR images, to capture sufficient information for retrieving high-frequency details in predicted HR images, seems inevitable. Theoretical work in DL shows that more contextual information can be captured by designing very deep architectures with larger receptive fields, which can result in expanding the final solution space [19,53–56]. In some situations, more hierarchical representations can be attained effectively by increasing the DL network depth [53]. In recent years, many different CNN-based architectures have been developed that exploit a very deep and sophisticated architecture, including residual and/or dense feature mapping [19,56], to solve complex problems more efficiently [25,44].

GAN for SISR
The introduction of recent innovative and deeper CNN-based architectures for SISR has already led to breakthroughs in accuracy and speed. Photo-realistic SISR GAN (SRGAN) [23], illustrated in Figure 3, was introduced for recovering the finer texture details when resolving at large upscaling factors. Those recovered fine details in SR images not only make predicted HR images more appealing to a human, but also have a great impact on the accuracy and reliability of imaging geometry and scene details when they are retrieved by the SfM photogrammetry process. The basic SRGAN model is built upon residual blocks [19] and trained under the perceptual loss in a GAN framework, which makes it capable of predicting photo-realistic images for a ×4 upscaling factor [23]. The SRGAN model has shown significant improvement in the overall visual quality of SR images over all previously introduced PSNR-oriented methods [23,32].
The GAN [31], introduced by Goodfellow et al., solves the adversarial min-max problem [23]:
$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I^{HR} \sim p_{train}(I^{HR})}\left[\log D_{\theta_D}(I^{HR})\right] + \mathbb{E}_{I^{LR} \sim p_G(I^{LR})}\left[\log\left(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))\right)\right] \tag{4}$$
This formulation allows the network to train a generative model $G$ with the purpose of fooling a discriminator $D$ that is simultaneously trained to discriminate the SR images from the original HR images. The formulated perceptual loss $L^{SR}$ consists of a weighted sum of a content loss ($L^{SR}_X$) and an adversarial loss component ($L^{SR}_{Gen}$) as follows [23]:
$$L^{SR} = L^{SR}_X + 10^{-3} L^{SR}_{Gen} \tag{5}$$
Content loss motivated by perceptual similarity chooses the solution from the high-dimensional solution space based on perceptual similarity [23]. Instead of relying on pixel-wise losses, Ledig et al. define the VGG loss based on the ReLU activation layers of the 19-layer VGG network [53], where the VGG loss is computed as the Euclidean distance between the feature representations of a reconstructed image $G_{\theta_G}(I^{LR})$ and the ground truth image $I^{HR}$ as follows [23]:
$$L^{SR}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y} \right)^2 \tag{6}$$
where $\phi_{i,j}$ represents the feature map obtained by the $j$-th convolution (after activation) before the $i$-th maxpooling layer within the VGG-19 network. $W_{i,j}$ and $H_{i,j}$ describe the dimensions of the respective feature maps within the VGG network.
Adversarial loss, which adds the generative component of SRGAN to the perceptual loss, encourages the network to favor solutions residing on the natural image manifold [23]. The generative loss ($L^{SR}_{Gen}$) is evaluated, in a probabilistic framework, based on the performance of the discriminator $D_{\theta_D}(\cdot)$ over a set of $N$ training samples as [23]:
$$L^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}(G_{\theta_G}(I^{LR})) \tag{7}$$
where $D_{\theta_D}(G_{\theta_G}(I^{LR}))$ represents the probability that the generated image $G_{\theta_G}(I^{LR})$ is a natural HR image. As a consequence of exploiting adversarial loss, the discriminator network, trained to distinguish SR images from HR images, pushes SISR solutions toward the natural image manifold.

Learning Strategies
Learning the end-to-end mapping function $F$ that maps an LR image $I^{LR}$ to the corresponding reconstructed SR image $I^{SR} = \hat{I}^{HR}$, which is an approximation of the real HR image $I^{HR}$, requires the estimation of the network parameters $\theta$. This is attained by minimizing the loss between the super-resolved images $I^{SR} = F(I^{LR}; \theta)$ and the corresponding HR images $I^{HR}$. In this section, different loss functions that are widely used in SISR techniques are introduced. For the sake of brevity, the subscript $y$ is dropped from the ground truth (target) HR image $I_y$ and the reconstructed HR image $\hat{I}_y$ in the rest of this section.

Pixel Loss
Pixel loss evaluates the pixel-wise difference between two images, mainly in the form of the $L_1$ distance, i.e., mean absolute error (MAE), or the $L_2$ distance, i.e., mean square error (MSE). In so doing, it attempts to capture and solve the inherent uncertainty in retrieving lost high-frequency components by minimizing the related loss functions as follows [44]:
$$L_{pixel-L_1}(I^{SR}, I^{HR}) = \frac{1}{hwc} \sum_{i,j,k} \left| I^{SR}_{i,j,k} - I^{HR}_{i,j,k} \right| \tag{8}$$
$$L_{pixel-L_2}(I^{SR}, I^{HR}) = \frac{1}{hwc} \sum_{i,j,k} \left( I^{SR}_{i,j,k} - I^{HR}_{i,j,k} \right)^2 \tag{9}$$
where $h$, $w$ and $c$ are the height, width and number of channels of the reconstructed images, respectively. The Charbonnier loss [57,58] is a variant of the $L_1$ loss, given by [44]:
$$L_{pixel-Cha}(I^{SR}, I^{HR}) = \frac{1}{hwc} \sum_{i,j,k} \sqrt{\left( I^{SR}_{i,j,k} - I^{HR}_{i,j,k} \right)^2 + \epsilon^2} \tag{10}$$
where $\epsilon$ is a small constant (e.g., $10^{-3}$) for numerical stability. The pixel loss constraint results in a super-resolved image $I^{SR}$ that is close to the ground truth HR image $I^{HR}$ in pixel values. In comparison with the $L_2$ loss, the $L_1$ loss shows higher performance and better convergence [44,59]. Using pixel loss as the loss function favors a high peak signal-to-noise ratio (PSNR). According to its definition, PSNR is heavily correlated with pixel-wise deviation, so minimizing pixel loss directly maximizes PSNR [23]. Moreover, it is partially related to the perceptual quality of the image. Thus, pixel loss has become the most widely used loss function in the SR field.
Minimizing the pixel loss encourages finding plausible solutions, based on pixel-wise average, in the high dimensional solution space. In return, such solutions can be overly-smooth with poor perceptual quality [23,60,61]. Thus, in order to capture the reconstruction error and image quality more efficiently, a variety of other loss functions, such as content loss [61] and adversarial loss [23], were introduced to the SR field.
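As a minimal sketch of the pixel losses discussed in this subsection, the following NumPy functions compute the L1, L2, and Charbonnier losses as means over all pixels and channels. The function names are hypothetical, and the default `eps=1e-3` simply follows the example value mentioned in the text.

```python
import numpy as np

def l1_pixel_loss(sr, hr):
    """Mean absolute error over all pixels and channels."""
    return float(np.mean(np.abs(sr - hr)))

def l2_pixel_loss(sr, hr):
    """Mean squared error over all pixels and channels."""
    return float(np.mean((sr - hr) ** 2))

def charbonnier_loss(sr, hr, eps=1e-3):
    """Smooth variant of the L1 loss; eps keeps the square root stable near zero."""
    return float(np.mean(np.sqrt((sr - hr) ** 2 + eps ** 2)))

# Toy h x w x c arrays standing in for the HR target and the SR prediction.
hr = np.zeros((2, 2, 3))
sr = np.full((2, 2, 3), 0.5)
print(l1_pixel_loss(sr, hr), l2_pixel_loss(sr, hr), charbonnier_loss(sr, hr))
```

For a constant error of 0.5 the L1 loss is 0.5, the L2 loss is 0.25, and the Charbonnier loss is marginally above the L1 value, which shows how closely it tracks L1 away from zero.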

Perceptual/Content Loss
To evaluate image quality based on perceptual similarity, perceptual-driven approaches have also been proposed [62,63]. Methods in this category offer more convincing results from the perceptual point of view for both SR and artistic style-transfer tasks [23,63,64]. By minimizing the error in the feature space instead of the pixel space, the perceptual loss, or content loss, attempts to improve the visual quality of the image. Denoting the feature maps computed within the $l$-th layer of the network as $\phi^{(l)}(\cdot)$, the content loss is evaluated using the Euclidean distance between corresponding feature maps from the original and super-resolved images as follows [44]:
$$L_{content}(I^{SR}, I^{HR}; \phi, l) = \frac{1}{h_l w_l c_l} \sqrt{\sum_{i,j,k} \left( \phi^{(l)}_{i,j,k}(I^{SR}) - \phi^{(l)}_{i,j,k}(I^{HR}) \right)^2} \tag{11}$$
where $h_l$, $w_l$ and $c_l$ represent the height, width and number of channels of the extracted feature maps in layer $l$, respectively. Content loss encourages transferring the learned knowledge of hierarchical image features from a pre-trained classification network, usually VGG or ResNet, to the SR task [12,23,32,65].

Adversarial Loss
Adversarial learning [31] is adopted for the SR task in a straightforward way, in which the SR model is considered as a generator, and a discriminator network is added to the model to discriminate the generated image $I^{SR}$ from the real image $I^{HR}$. The adversarial loss for SRGAN [23] is as follows [44]:
$$L_{gan\_G}(I^{SR}; D_{\theta_D}) = -\log D_{\theta_D}(I^{SR}) \tag{12}$$
$$L_{gan\_D}(I^{SR}, I^{HR}; D_{\theta_D}) = -\log D_{\theta_D}(I^{HR}) - \log\left(1 - D_{\theta_D}(I^{SR})\right) \tag{13}$$
where $L_{gan\_G}$ and $L_{gan\_D}$ denote the adversarial loss of the generator $G_{\theta_G}$, which is the SR model, and the discriminator $D_{\theta_D}$, which is a deep CNN model for binary classification, respectively. $\theta_G$ and $\theta_D$ are the parameters of the generator and discriminator, and $I^{SR} = G_{\theta_G}(I^{LR})$ is the generated image approximating the corresponding ground truth HR image. In practice, some researchers employ a combination of multiple loss functions in their DCNN-based SISR architectures for more efficient learning and to better constrain different aspects of SR image reconstruction [12,23,57,66,67]. However, how to efficiently combine multiple loss functions, with effective weights emphasizing their contribution to the learning process, remains an active area of SR research.
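The cross-entropy form of the adversarial loss can be sketched numerically as below. This is only an illustration under stated assumptions: the discriminator is replaced by arrays of probabilities it is assumed to have already produced (`d_hr` for real HR images, `d_sr` for generated SR images), and the function names and the clipping constant `eps` are hypothetical.

```python
import numpy as np

def generator_adv_loss(d_sr, eps=1e-12):
    """Generator loss: -log D(I_SR), averaged over the batch."""
    d_sr = np.clip(d_sr, eps, 1.0)
    return float(np.mean(-np.log(d_sr)))

def discriminator_adv_loss(d_hr, d_sr, eps=1e-12):
    """Discriminator loss: -log D(I_HR) - log(1 - D(I_SR)), averaged over the batch."""
    d_hr = np.clip(d_hr, eps, 1.0)
    d_sr = np.clip(d_sr, 0.0, 1.0 - eps)
    return float(np.mean(-np.log(d_hr) - np.log(1.0 - d_sr)))

# An undecided discriminator outputs 0.5 for every image.
undecided = np.array([0.5, 0.5])
print(generator_adv_loss(undecided), discriminator_adv_loss(undecided, undecided))
```

The generator loss falls as the discriminator assigns higher probability to generated images, while the discriminator loss falls as it separates the two sets, which is the opposing behaviour the adversarial game relies on.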

Image Quality Metrics
Image quality metrics, usually referred to as image quality measures (IQMs), are measures focusing on significant visual attributes of images where they attempt to quantify the perceptual assessments of an image when it is evaluated in a certain image quality assessment (IQA) approach [60]. IQA approaches are categorized into subjective methods, which focus on quantifying human perception, and objective methods, which are based on some computational models [60]. The subjective methods can be more accurate but they are usually inconvenient, time-consuming, and expensive to implement [60]. As a result, objective methods are currently considered the mainstream among IQMs. Since the objective methods cannot efficiently capture the human visual perception, the metrics evaluated under these methods may show some inconsistency with those from subjective methods [60].
Objective IQA methods are divided into three types [60] including: (1) full-reference methods requiring corresponding images with perfect or high quality image content; (2) reduced-reference methods, which apply IQMs on the extracted features from both images and their corresponding high quality counterparts; (3) no-reference methods, which try to evaluate image quality in a blind way without any reference images. In supervised SISR, high quality HR images are usually available for evaluating different IQMs. This section introduces some of the most commonly used IQMs, covering both subjective IQA methods and objective IQA methods.

Peak Signal-to-Noise Ratio (PSNR)
The PSNR measure refers to the ratio between a signal's maximum power and the power of the signal's noise, which affects the quality of the signal's representation. Due to the very wide dynamic range (i.e., ratio of highest and lowest values) of most signals, the PSNR is usually expressed in the logarithmic decibel scale. PSNR is used to measure the reconstruction quality of lossy transformations, including image compression and inpainting. For the image SR task, PSNR is defined using the maximum possible pixel value in the underlying image and the mean squared error (MSE) between two corresponding images. Given the high quality image $I$ and the corresponding reconstructed (super-resolved) image $\hat{I}$, both of which include $N$ pixels, the MSE and the PSNR measures are defined as follows [25]:
$$MSE = \frac{1}{N} \sum_{i=1}^{N} \left( I(i) - \hat{I}(i) \right)^2 \tag{14}$$
$$PSNR = 10 \log_{10} \left( \frac{L^2}{MSE} \right) \tag{15}$$
where $L$ denotes the maximum possible pixel value in the image. For 8-bit image representations, for example, $L$ equals 255 and typical PSNR values vary from 20 to 40 dB, where a higher PSNR value indicates a better reconstructed image, i.e., a smaller MSE between the images relative to the maximum pixel value. When $L$ is fixed, PSNR is only related to the pixel-wise distances between two images represented by the MSE. The ability of MSE, and consequently PSNR, to capture perceptually relevant differences, such as high texture detail, is very limited, meaning that PSNR does not account for human visual perception and the photo-realistic characteristics of the image. This often leads to poor performance of PSNR when used to assess the quality of super-resolved images of natural scenes. However, due to the lack of an efficient and comprehensive IQM that considers image quality from all perspectives, PSNR remains the most widely used metric for evaluating image quality in SR tasks.
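A minimal NumPy sketch of the MSE and PSNR computation described above is given here for illustration. The function names are hypothetical, and returning infinity for identical images is a convention chosen for this sketch rather than something specified in the text.

```python
import numpy as np

def mse(reference, reconstructed):
    """Mean squared error between two images of equal shape."""
    diff = reference.astype(np.float64) - reconstructed.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(reference, reconstructed, max_value=255.0):
    """PSNR in dB; max_value is the largest possible pixel value (255 for 8-bit)."""
    err = mse(reference, reconstructed)
    if err == 0.0:
        return float("inf")  # identical images: no noise power to compare against
    return float(10.0 * np.log10(max_value ** 2 / err))

reference = np.zeros((4, 4), dtype=np.uint8)
reconstructed = np.full((4, 4), 16, dtype=np.uint8)
print(psnr(reference, reconstructed))
```

Casting to floating point before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise introduce.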

Structural Similarity (SSIM) Index
Similar to the human visual system, which is highly adapted for extracting structural information from the viewing scene, SSIM index provides a perceptual metric that quantifies image quality degradation based on perceived image quality [68]. Made up of three relatively independent terms, luminance, contrast, and structure, SSIM index estimates the visual impact of those factors when they are modified in the reconstructed image. Those modifications may comprise shifts in image luminance, alterations in image contrast, and any other remaining deviations collectively identified as structural changes [60].
For an original high quality image $I$ and its reconstructed counterpart $\hat{I}$, the SSIM index is defined as follows [69]:
$$SSIM(I, \hat{I}) = \left[ C_l(I, \hat{I}) \right]^{\alpha} \left[ C_c(I, \hat{I}) \right]^{\beta} \left[ C_s(I, \hat{I}) \right]^{\gamma} \tag{16}$$
where $\alpha > 0$, $\beta > 0$, and $\gamma > 0$ control the relative significance of each of the three terms of the index. In some implementations, $\alpha = \beta = \gamma = 1$ [60]. The luminance, $C_l$, contrast, $C_c$, and structural, $C_s$, components of the SSIM index are defined as follows [69]:
$$C_l(I, \hat{I}) = \frac{2 \mu_I \mu_{\hat{I}} + C_1}{\mu_I^2 + \mu_{\hat{I}}^2 + C_1} \tag{17}$$
$$C_c(I, \hat{I}) = \frac{2 \sigma_I \sigma_{\hat{I}} + C_2}{\sigma_I^2 + \sigma_{\hat{I}}^2 + C_2} \tag{18}$$
$$C_s(I, \hat{I}) = \frac{\sigma_{I\hat{I}} + C_3}{\sigma_I \sigma_{\hat{I}} + C_3} \tag{19}$$
where $\mu_I$, $\sigma_I$ and $\mu_{\hat{I}}$, $\sigma_{\hat{I}}$ represent the means and standard deviations of the original high quality image and the corresponding reconstructed image, respectively, and $\sigma_{I\hat{I}}$ is the covariance of the two images. The constants $C_1$, $C_2$, and $C_3$ in Equations (17)–(19) help to avoid instability when the denominators are close to zero. The formulation given in Equation (16) is simplified by setting $\alpha = \beta = \gamma = 1$ and $C_3 = C_2/2$, with $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$, where $L$ is the maximum possible pixel value and $k_1 \ll 1$ and $k_2 \ll 1$ are very small constants for avoiding instability. According to the above formulas, SSIM can be represented as follows [69]:
$$SSIM(I, \hat{I}) = \frac{(2 \mu_I \mu_{\hat{I}} + C_1)(2 \sigma_{I\hat{I}} + C_2)}{(\mu_I^2 + \mu_{\hat{I}}^2 + C_1)(\sigma_I^2 + \sigma_{\hat{I}}^2 + C_2)} \tag{20}$$
In addition, to deal with the uneven distribution of image statistical features or distortions, it is more reliable to perform image quality assessment locally rather than globally. Thus, mean structural similarity (mSSIM) [60] is proposed for locally assessing SSIM. This technique splits the images into multiple windows, evaluates the SSIM of each window, and finally averages the values over all windows across the image. Because it evaluates the image reconstruction quality from the perspective of the human visual system, the SSIM index better meets the requirements of perceptual assessment. SSIM-based IQMs outperform those based on MSE and the related PSNR over natural images with a wide variety of image distortions [69]. Those properties make the SSIM index a widely used IQM in most SR tasks [70,71]. However, in some cases, the SSIM index may lead to evaluation results similar to those of the PSNR metric [60].
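The simplified SSIM and its windowed average can be sketched in NumPy as follows. Several choices here are assumptions for illustration and are not taken from this paper: the function names, the commonly used constants `k1=0.01` and `k2=0.03`, and the use of non-overlapping square windows without Gaussian weighting (the image is assumed to be at least one window in size).

```python
import numpy as np

def ssim_global(reference, reconstructed, max_value=255.0, k1=0.01, k2=0.03):
    """Simplified SSIM with unit exponents and C3 = C2 / 2, over the whole array."""
    x = reference.astype(np.float64)
    y = reconstructed.astype(np.float64)
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

def mean_ssim(reference, reconstructed, window=8, **kwargs):
    """Average SSIM over non-overlapping window x window blocks of a 2D image."""
    height, width = reference.shape
    scores = []
    for top in range(0, height - window + 1, window):
        for left in range(0, width - window + 1, window):
            block = (slice(top, top + window), slice(left, left + window))
            scores.append(ssim_global(reference[block], reconstructed[block], **kwargs))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 16)).astype(np.uint8)
print(mean_ssim(image, image))
```

Comparing an image with itself should give a value of 1 up to floating-point rounding, while an intensity-inverted copy scores lower because the covariance term becomes negative.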

Task-Based Evaluation
Evaluating image reconstruction performance via other image analysis tasks is also an effective IQM [11–13,72]. Specifically, this technique feeds the original high quality image and the corresponding reconstructed image into a trained model for a specific vision task, and evaluates the reconstruction quality by comparing the prediction performance obtained from the reconstructed images with that obtained from the original high quality HR images. The vision tasks used for this evaluation technique include face recognition [73,74], face alignment and parsing [65,75], and object recognition [12,76]. However, certain vision tasks may focus on specific image attributes that are more favorable to the task, and may be insensitive to the perceptual quality of the image. For example, most object recognition models mainly focus on high-level semantics while ignoring image contrast and noise. On the other hand, in some domain-specific applications, such as super-resolving surveillance video for face recognition, a task-based IQM may reflect the performance of the SR models.

Methodology
In this SISR experiment, the enhanced SRGAN (ESRGAN) [32] model is employed, which improves the original SRGAN model in three aspects. First, ESRGAN improves the network by designing a Residual-in-Residual Dense Block (RRDB), illustrated in Figure 4, which offers higher capacity and easier training. Second, the Relativistic average GAN (RaGAN) [77], which learns to distinguish a more realistic image from a corresponding less realistic image, replaces the original discriminator in SRGAN, which simply judges whether an image is real or fake. According to [77], this improvement allows the ESRGAN generator to recover more realistic texture details. Third, ESRGAN adjusts the perceptual loss in the original SRGAN model by using VGG features before activation, rather than features after activation. This empirically leads to sharper edges and more visually pleasing results. Some properties of the ESRGAN model are discussed below in more detail.
Network Architecture: ESRGAN employs the basic architecture of SRResNet [23] for feature learning in the LR feature space. ESRGAN introduces two modifications to the architecture of the SRGAN generator $G$ to improve the quality of the super-resolved images: (1) it removes all batch normalization (BN) layers; (2) it replaces the original basic residual block (RB) in SRGAN with a more compact RRDB architecture. As shown in Figure 4, by optimally combining multi-level residual blocks, the RRDB design improves the perceptual quality of super-resolved images [32]. When the statistics of the image batches for training and testing differ significantly, BN layers tend to introduce unpleasant artefacts, limiting the generalization ability [32]. Removing BN layers, especially under the GAN framework, which is more prone to artefact generation, leads to consistently higher performance, lower computational complexity, and better generalization in the network [32,59]. In addition to the architectural improvement, to facilitate training a very deep network, ESRGAN exploits the residual scaling technique [55,59] to prevent instability in training by scaling down the residuals using a scaling factor between 0 and 1 before adding them to the main path. Moreover, ESRGAN employs an initialization with smaller parameter variance, which has empirically been shown to make training easier [32].
Relativistic Discriminator: The standard discriminator in SRGAN is defined as $D(I) = \sigma(C(I))$, where $\sigma$ is the sigmoid function and $C(I)$ is the non-transformed discriminator output [32]. This definition estimates the probability that the input image $I$ is the original HR (real) image rather than a super-resolved (fake) image. In contrast, a relativistic discriminator predicts the probability that the original HR image $I^{HR}$ is relatively more realistic than the super-resolved image $I^{SR}$, as shown in Figure 5. The Relativistic average Discriminator (RaD) [77] is formulated as:
$$D_{Ra}(x_r, x_f) = \sigma\left( C(x_r) - \mathbb{E}_{x_f}\left[ C(x_f) \right] \right)$$
where $D_{Ra}$ is the RaD function, $x_r$ and $x_f$ are the real (original HR) and fake (super-resolved) images, respectively, and $\mathbb{E}_{x_f}[\cdot]$ represents the average over all generated (fake) images in each individual mini-batch. The discriminator loss, $L^{Ra}_D$, is defined as follows [32]:
$$L^{Ra}_D = -\mathbb{E}_{I^{HR}}\left[ \log D_{Ra}(I^{HR}, I^{SR}) \right] - \mathbb{E}_{I^{SR}}\left[ \log\left( 1 - D_{Ra}(I^{SR}, I^{HR}) \right) \right]$$
The adversarial loss for the generator, $L^{Ra}_G$, takes a symmetrical form [32]:
$$L^{Ra}_G = -\mathbb{E}_{I^{HR}}\left[ \log\left( 1 - D_{Ra}(I^{HR}, I^{SR}) \right) \right] - \mathbb{E}_{I^{SR}}\left[ \log D_{Ra}(I^{SR}, I^{HR}) \right] \tag{22}$$
where $I^{LR}$ and $I^{SR} = G(I^{LR})$ stand for the input LR image and the predicted super-resolved image, respectively. In contrast to the adversarial loss for the generator in the original SRGAN model, $L^{SR}_{Gen}$ in Equation (7), in which only gradients from the generated images take part in adversarial training, the adversarial loss for the generator in ESRGAN, $L^{Ra}_G$ in Equation (22), contains both $I^{SR}$ and $I^{HR}$. This property causes the gradients from both real images and generated images to participate in adversarial training [32].
Perceptual Loss: ESRGAN suggests a more effective perceptual loss $L_{percep}$ by computing distances between corresponding feature maps before activation rather than after activation, as practiced in the original SRGAN model. Employing features before the activation layers overcomes two drawbacks of the original design: extreme sparsity in the activated feature maps, and inconsistent brightness reconstruction compared with the original HR image. Especially within a very deep network, sparsity within feature maps leads to weak supervision and inferior performance. The loss function for the generator in the ESRGAN model is as follows [32]:
$$L_G = L_{percep} + \lambda L^{Ra}_G + \eta L_1$$
where $L_1 = \mathbb{E}_{I^{LR}} \left\| G(I^{LR}) - I^{HR} \right\|_1$ is the content loss that evaluates the $L_1$ distance between the super-resolved image $G(I^{LR})$ and the original HR image $I^{HR}$, and $\lambda$ and $\eta$ are coefficients that balance the different loss terms.
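The relativistic average discriminator and the two adversarial losses built on it can be sketched numerically as below. This is a toy illustration, not the ESRGAN implementation: the discriminator network is replaced by arrays of raw (pre-sigmoid) scores, named `c_real` and `c_fake` here as an assumption, and the function names and the small constant `eps` inside the logarithms are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping raw scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relativistic_average_losses(c_real, c_fake, eps=1e-12):
    """Return (discriminator loss, generator loss) for one mini-batch of raw scores."""
    # D_Ra(x_r, x_f): how much more realistic real images look than the average fake.
    d_real = sigmoid(c_real - c_fake.mean())
    # D_Ra(x_f, x_r): how much more realistic fake images look than the average real.
    d_fake = sigmoid(c_fake - c_real.mean())
    loss_d = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    loss_g = -np.mean(np.log(1.0 - d_real + eps)) - np.mean(np.log(d_fake + eps))
    return float(loss_d), float(loss_g)

# A discriminator that clearly separates real from fake images.
loss_d, loss_g = relativistic_average_losses(np.full(4, 5.0), np.full(4, -5.0))
print(loss_d, loss_g)
```

Both real and fake scores enter each loss, which mirrors the point made above that gradients from real and generated images both take part in adversarial training.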

IQMs for SR Images
In this experiment, a comprehensive quantitative and qualitative assessment is performed on the resulting SR images using standard IQMs that are frequently used for assessing the performance of different SISR models. Furthermore, a task-based IQM based on the SfM photogrammetry [78] procedure is carried out. Applying any type of image processing algorithm to a raw aerial image set can dramatically affect the precision and accuracy with which the interior and exterior geometry of the camera at image acquisition time is retrieved. That, consequently, may lead to a significant decrease in the quality and final accuracy of the main SfM photogrammetry products, such as point clouds, DSMs, and orthoimages. The authors believe that the chosen task-based IQM can more accurately exhibit the effectiveness and performance of DCNN-based SISR for enhancing the spatial resolution of LR imagery in RS applications, specifically where highly accurate spatial products from processing RS images are required.

Standard IQM Methods
PSNR and the SSIM index are evaluated as standard IQMs for quantitative assessment of the predicted SR images. Choosing these two IQMs enables the performance of DCNN-based SISR to be compared across two different categories of images (general images and aerial RS images).
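As an illustration of the two measures, the sketch below computes PSNR and a simplified SSIM in NumPy. It is a sketch under stated assumptions: images are 8-bit (peak value 255), and the SSIM shown here uses global image statistics in a single window, whereas the standard SSIM index averages the same expression over local windows. The function names are hypothetical and this is not the evaluation code used in the study.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def global_ssim(ref, test, max_val=255.0):
    """Single-window SSIM from global means, variances, and covariance.

    Simplification (assumption): the standard index evaluates this
    expression in local sliding windows and averages the results.
    """
    x = np.asarray(ref, dtype=np.float64)
    y = np.asarray(test, dtype=np.float64)
    c1 = (0.01 * max_val) ** 2  # stabilizing constants from the usual SSIM definition
    c2 = (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

PSNR is expressed in decibels, while SSIM is a unitless index that equals 1 for identical images.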

SfM Photogrammetry for Task-Based IQM
The SfM photogrammetry procedure, as illustrated in Figure 6, is applied to all available image sets, including the HR ground truth, LR, and predicted SR image sets. SfM photogrammetry is a low-cost method, based on stereoscopic photogrammetry, for highly accurate topographic reconstruction using a series of overlapping images acquired from multiple viewpoints [78]. In contrast to traditional photogrammetry, SfM photogrammetry automatically resolves three groups of unknowns: the interior geometry of the camera, usually referred to as interior orientation (IO) parameters; the position and orientation of each camera station with respect to the scene's global coordinate system, commonly called exterior orientation (EO) parameters; and the geometry of the scene, i.e., the 3D coordinates of each point of the 3D scene. All required parameters are calculated simultaneously by highly redundant and iterative bundle adjustment (BA) computations using a rich database of corresponding image features automatically extracted from a set of multiple overlapping images [79]. SfM photogrammetry addresses the key problem of determining the 3D locations of a large number of corresponding features extracted from multiple overlapping images, taken from different positions and angles with respect to the 3D scene. Most image-based 3D reconstruction software packages that work on the SfM photogrammetry principle first solve for the camera IO and EO parameters and then apply a multi-view stereo (MVS) algorithm to increase the density of the sparse point cloud generated by the SfM algorithm [78]. In the first step, several overlapping images are imported into the software, and a keypoint detection algorithm, usually the popular scale invariant feature transform (SIFT) algorithm [80], is applied to detect keypoints and, using a keypoint descriptor, to find keypoint correspondences across all images.
In the SIFT algorithm, for example, the keypoint descriptor is determined by computing local image gradients and transforming them into a representation substantially insensitive to some image feature variations, including illumination, orientation, and scale [80]. These descriptors are unique enough to allow features to be matched in large image datasets. The BA technique is performed to minimize the errors in the phase of finding point correspondences [78].
In addition to solving for the IO and EO parameters, which represent the camera calibration and pose parameters, respectively, the SfM algorithm generates a sparse point cloud using the image coordinates of all corresponding keypoints and the IO and EO parameters of the camera at all imaging stations. The coordinate system of the generated point cloud is arbitrary. In order to transform the point cloud coordinate system to any local or global coordinate system, a georeferencing phase should be adopted. That phase typically requires either a few ground control points (GCPs) with known 3D coordinates in a local or global coordinate reference frame, obtained by land surveying, or initial camera positions, e.g., from a global navigation satellite system (GNSS). In this experiment, it is not necessary to perform the georeferencing step since all images are processed in the same reference frame. The IO and EO parameters for each camera are used as the input to the MVS algorithm. Leveraging the known IO and EO parameters for each individual camera, MVS initiates an intensive search to find more correspondences along all existing epipolar lines in all overlapping images. The accuracy of the MVS algorithm and the quality of the dense point cloud it generates are highly dependent on the reliability of the IO and EO parameters calculated from the initial BA computations [81].
Images captured at high spatial resolutions, in general, return the most keypoints and keypoint correspondences in overlapping images. In addition to the major contribution of the natural texture in the 3D scene, the quality of the generated point cloud highly depends on several other factors including the density, sharpness, contrast, and resolution of the image content within the image set [78]. Moreover, decreasing the image acquisition distance, or flight height above ground, leads to an increase in the image spatial resolution, i.e., a finer GSD. This further enhances the spatial density and spatial resolution of the resulting point cloud [78]. However, the uncertainty in keypoint extraction and matching, which is a typical issue in all low quality LR images, may result in poor estimation of a camera's IO and EO parameters, leading to a very inaccurate and erroneous 3D point cloud.

Study Site and Dataset
Port Aransas is a town located on Mustang Island along the southern Texas Gulf of Mexico coastline, USA (Figure 7). Hurricane Harvey, a category 4 hurricane, made landfall to the north of Port Aransas along San Jose Island on the night of 25 August 2017. The southern portion of the eye wall passed within close proximity to Port Aransas, causing extensive damage, primarily due to extreme winds but also due to surge coming from the bay side of the island. A few days after the landfall of Harvey, a small UAS photogrammetric survey was conducted over a section of the town directly bordering the Gulf-facing shoreline (Figure 7). The purpose was to inspect and evaluate structural damage to residential and commercial properties caused by the catastrophic storm. The flight mission covered almost 0.275 km² of Port Aransas. A Phantom 4 Pro multi-rotor UAS (SZ DJI Technology Co., Ltd., Shenzhen, China) was employed to conduct the survey. The platform was equipped with a 1-inch CMOS RGB sensor that captures 20 megapixel imagery at a resolution of 5472 × 3648 pixels. The flight altitude was designed to achieve a GSD of 2.5 cm, resulting in a flying height above ground level of about 90 m, with forward lap and side lap of around 80% and 70%, respectively. A total of 450 HR images were acquired over the study site. These images are used for the purposes of this study.
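The relation between flying height and GSD stated above can be checked with a short calculation. The sketch below assumes a nominal sensor width of 13.2 mm and a focal length of 8.8 mm for the 1-inch sensor; these two values are not given in the text and are used here only as plausible nominal specifications. The function names are illustrative.

```python
def ground_sample_distance(sensor_width_mm, image_width_px, focal_length_mm, height_m):
    """Nadir GSD in metres per pixel for a frame camera over flat terrain."""
    pixel_size_mm = sensor_width_mm / image_width_px
    # (pixel size / focal length) is dimensionless; multiplying by height gives metres
    return pixel_size_mm * height_m / focal_length_mm

def image_footprint(gsd_m, width_px, height_px):
    """Ground footprint (across, along) in metres of one nadir image."""
    return gsd_m * width_px, gsd_m * height_px

# Assumed nominal camera values (not stated in the paper): 13.2 mm sensor width, 8.8 mm lens
gsd = ground_sample_distance(13.2, 5472, 8.8, 90.0)
footprint = image_footprint(gsd, 5472, 3648)
lr_gsd = 4 * gsd  # nominal GSD after the x4 downsampling used to create the LR set
```

Under these assumptions the computed GSD is close to the 2.5 cm design value, and the ×4 downsampled LR images correspond to a nominal GSD of roughly 10 cm.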

Data Preparation and Model Training
In order to fine-tune the pre-trained ESRGAN parameters with the existing dataset, 50 non-overlapping images were chosen from the original HR dataset as ground truth for the training phase. A scaling factor of ×4 was set between the LR and HR images. LR training images were obtained by down-sampling the corresponding HR images with the MATLAB bicubic kernel function, with its scale factor set to 0.25. To make the SISR problem more complicated and realistic, additive white Gaussian noise with mean 0 and a standard deviation equal to one-tenth of the standard deviation of each channel in the RGB image was then added to the LR image set. Due to the high resolution of the original imagery, feeding the full-size images into the DCNN model quickly exhausts the GPU memory. However, in the training phase, large image patches help very deep convolutional networks with wider receptive fields to capture more semantic information from the training samples. Therefore, this experiment was performed by extracting 1500 random image patches of size 1000 × 1000 pixels from the original HR images. Figure 8 illustrates an LR image and the corresponding ground truth HR image for a training sample. The model is trained on the RGB channels, and data augmentation with random horizontal flips and 90 degree rotations is applied to the training image set. Testing and evaluation of model performance are then done on 1000 image patches randomly extracted from the remaining 400 images in the original HR and corresponding LR image sets.
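The noise, patch extraction, and augmentation steps described above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' pipeline: the bicubic down-sampling itself is not reproduced (it was done with MATLAB in the study), images are assumed to be 8-bit RGB arrays of shape (rows, columns, 3), and all function names are hypothetical.

```python
import numpy as np

def add_channel_noise(img, ratio=0.1, rng=None):
    """Add zero-mean Gaussian noise whose SD is `ratio` times each channel's SD."""
    rng = np.random.default_rng() if rng is None else rng
    img = np.asarray(img, dtype=np.float64)
    sigma = ratio * img.std(axis=(0, 1))  # one standard deviation per RGB channel
    noisy = img + rng.normal(0.0, 1.0, img.shape) * sigma
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)

def random_patch_pair(hr, lr, hr_size=1000, scale=4, rng=None):
    """Cut an HR patch and the spatially corresponding LR patch."""
    rng = np.random.default_rng() if rng is None else rng
    lr_size = hr_size // scale
    top = int(rng.integers(0, lr.shape[0] - lr_size + 1))
    left = int(rng.integers(0, lr.shape[1] - lr_size + 1))
    lr_patch = lr[top:top + lr_size, left:left + lr_size]
    hr_patch = hr[top * scale:(top + lr_size) * scale,
                  left * scale:(left + lr_size) * scale]
    return hr_patch, lr_patch

def augment(hr_patch, lr_patch, rng=None):
    """Apply the same random horizontal flip and 90-degree rotation to both patches."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < 0.5:
        hr_patch, lr_patch = hr_patch[:, ::-1], lr_patch[:, ::-1]
    k = int(rng.integers(0, 4))  # number of 90-degree rotations
    return np.rot90(hr_patch, k), np.rot90(lr_patch, k)
```

Sampling the patch origin on the LR grid and scaling it by 4 keeps each HR patch aligned with its LR counterpart, which matters because the content loss compares the two pixel by pixel.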
It should be emphasized here that, due to the large overlap between the employed UAS images, objects are sometimes captured by multiple images, resulting in the appearance of the same object in the training and testing image sets. However, it should also be noted that such objects are captured from different viewing angles, causing different perspective and radiometric distortions for each specific object, or portion of the object, appearing in multiple images. Furthermore, the presence of such similar scenes within the training image set is necessary for performing transfer learning effectively, in which the weight parameters from a DCNN model pre-trained on a large dataset are applied to leverage the complex mappings learned by very deep CNN models for performing a downstream task [82]. The weight parameters taken from the pre-trained model are then fine-tuned by training the model on a new dataset specific to the prediction task. In fact, one of the main motivations for transfer learning is to help the DCNN model effectively capture a priori information related to the new task by fine-tuning the parameters of the underlying model using a new dataset for a different but related task. In the SISR technique, such a priori information can be provided to the SISR model by introducing information related to objects that are present in the acquired scene. Furthermore, the main goal of this study is to show the effectiveness of the SISR technique for recovering degraded or lost image details in the LR UAS images by fine-tuning a DCNN-based SISR model on a very limited set of HR UAS images. The original ESRGAN model, before fine-tuning, is also employed to investigate the capability of the pre-trained ESRGAN to enhance the image content and reduce the inherent noise in the original HR images.
The idea is that such a pre-trained model, trained on some standard datasets, may be capable of capturing the behavior of some types of noise that might be common in many imaging systems. For this experiment, the original HR image set is fed to the original pre-trained ESRGAN with a scaling factor of ×1. The PyTorch [83] implementation of the ESRGAN model was chosen for training on the UAS dataset. The training process starts by initializing the ESRGAN model with weights from the network pre-trained on some of the well-known benchmarks in SISR, such as the DIV2K dataset [84], the Flickr2K dataset [85], and the OutdoorSceneTraining (OST) dataset [66], which include thousands of high quality HR images with a broad diversity in texture and contextual information. The performance of the trained model has already been tested on widely used SR benchmarks such as Set5 [47], Set14 [49], BSD100 [86], Urban100 [87], and the PIRM self-validation dataset [88]. Table 1 summarizes the ESRGAN model setup and optimization settings for training the model on the UAS image set. According to the table, the dense block architecture for the generator was set to 64 × 5 × 5, i.e., 64 kernels of size 5 × 5. The generator comprises 23 residual-in-residual dense blocks (RRDBs). The learning rate α was set to 0.0001, and the Adam optimizer was chosen for updating weights during training. The two exponential decay rate parameters of the Adam optimizer, β 1 and β 2 , were set to 0.9 and 0.999, respectively. The ε parameter in the optimization algorithm was set to 1 × 10⁻⁷ to avoid any division by zero. The experiment was carried out for 100 epochs on Google Colab, Google's free cloud service, with one Intel(R) Xeon(R) CPU at 2.30 GHz and one Tesla K80 GPU with 2496 CUDA cores and 12 GB of GDDR5 VRAM. Fine-tuning the network took around 48 hours, and the inference time for predicting a super-resolved image was about 10 s per image.
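To show how the optimizer settings quoted above enter the weight update, the following is a minimal NumPy sketch of a single Adam step using the stated learning rate, decay rates, and ε. It illustrates the standard Adam update rule only; the function name is hypothetical, and the actual training relied on the optimizer provided by the deep learning framework.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; `t` is the 1-based iteration counter."""
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    # eps keeps the division finite when the second-moment estimate is zero
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

With a zero gradient the second-moment estimate is zero, and ε is what keeps the update well defined, which is the role described for that parameter above.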

Results
This section provides comprehensive qualitative and quantitative experimental results on the super-resolved images, SR pre , predicted from LR images that were virtually downsampled from the original (ground truth) HR UAS image set, HR gt , with additive white Gaussian noise. The result of applying the ESRGAN model to HR gt with scale factor ×1, as an image enhancement network, to generate enhanced HR images, HR enh , is also investigated. Furthermore, the results of the task-based IQM using the SfM photogrammetry procedure implemented with the original and super-resolved imagery are reported. Figure 9 illustrates the qualitative assessment of the SISR performance using the ESRGAN model on two different test samples. As observed in Figure 9, the ESRGAN model is able to upscale the LR images by a factor of 4 and predict SR images with high similarity in perceptual and visual quality to the corresponding HR counterparts. A closer look at the qualitative results in this experiment reveals some noise removal properties learned within the SISR model trained on a sufficient number of LR and corresponding HR images.
Figure 9. Illustration of the qualitative comparison between the predicted SR image and corresponding LR and ground truth HR images for two test images.

Quantitative Results
For quantitative evaluation of the SISR performance in this experiment with the ESRGAN model, the PSNR value and SSIM index were calculated for the test image set and the enhanced HR (HR enh ) image set. Table 2 lists the lowest, highest, and average PSNR values and SSIM indices for both image sets. The range of values for both PSNR and SSIM index in Table 2, resulting from evaluating ESRGAN performance on the SR pre image set, is comparable to the values reported for those IQMs when ESRGAN, or any other high-performance DCNN-based SISR model, is applied to standard SISR image sets [23,25,32]. The values of the standard IQMs in Table 2 confirm that SISR can be effectively applied for recovering lost or degraded details in LR UAS imagery, and potentially in a wide range of imagery in RS applications, including aerial and satellite imagery, with comparable performance.

Task-Based IQM and Related Results
Further investigation of ESRGAN model performance in a task-based image quality evaluation using SfM photogrammetry reveals more about the impact of image super-resolving on the internal and external camera imaging geometry and the geometry of the reconstructed 3D scene. All available UAS image sets including the downsampled noisy LR image set (LR), the original ground truth HR image set (HR gt ), the predicted super-resolved image set (SR pre ), and enhanced HR image set (HR enh ) were separately imported to Agisoft Metashape software [89] for SfM photogrammetric processing. Each image set was processed using the exact same settings and workflow procedure to ensure a fair comparative evaluation could be made on the impact of SR imagery to the BA computations and 3D reconstruction (i.e., point cloud).
BA computations, using keypoints extracted from each individual image in each image set, also result in an accurate estimation of the camera calibration (IO) parameters in a self-calibration procedure using a pre-defined camera calibration model. Camera parameters evaluated within the BA computations include the focal distance f , principal point coordinates (C x , C y ), radial distortion coefficients (K 1 , K 2 , K 3 , K 4 ), decentering distortion coefficients (P 1 , P 2 , P 3 , P 4 ), and affinity and skew transformation coefficients (B 1 , B 2 ), which represent a specific distortion in digital imaging sensors accounting for scale distortion and non-orthogonality of pixel elements in the x and y directions of the digital sensor [90]. Table 3 lists the camera calibration results for the LR, HR gt , SR pre , and HR enh UAS image sets. According to Table 3, the estimated values of the IO parameters for the SR pre image set, especially the sensor element (or pixel) size, focal distance f , principal point offset (C x , C y ), and the first coefficient of radial lens distortion, K 1 , which are among the most critical camera calibration parameters, closely approximate the reference values derived from the HR gt image set. Referring to Table 3, the calibrated IO parameters for the LR image set differ from the IO parameters for HR gt , SR pre , and HR enh , meaning that the parameters defining the internal imaging geometry in the LR UAS image set are different from those in the other HR UAS image sets. It should be emphasized here that the number of selected keypoints and the level of certainty in finding their correspondences in multiple images within an image set can have a significant impact on the stability of the BA computations and the accuracy of the estimated IO and EO parameters. Figure 10 displays plots representing the average reprojection error vectors from the BA computations across the image space for the LR, SR pre , HR enh , and HR gt UAS image sets.
This error quantifies the distance between a certain keypoint location on an image and the location of the corresponding 3D point reprojected on that image. The magnitude of the reprojection error in the image space depends on the quality of the estimated camera calibration parameters and pose parameters, as well as on the quality of the extracted keypoints on each individual image [89]. The maximum and RMS of the reprojection errors across the image space, and the average camera location errors with respect to the 3D scene, are listed in Table 4 for the LR, HR gt , SR pre , and HR enh image sets. According to the table, both the maximum and RMS of the reprojection errors in the SR pre image space are closely comparable with those derived from the HR gt image set. The errors related to the quality of the 3D space reconstructed from the SR pre image set confirm the same quality in scene reconstruction as when the HR gt image set is employed. In addition, Figure 11 gives a graphical view of the camera locations and their errors, represented by error ellipsoids, for all UAS image sets.
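The role of the calibration parameters in the reprojection error can be illustrated with a short NumPy sketch. It projects 3D points given in the camera frame through a Brown-type distortion model with radial (K1 to K3), decentering (P1, P2), and affinity/skew (B1, B2) terms, then computes the RMS and maximum reprojection error in pixels. The exact functional form used by the SfM software is an assumption here, the higher-order terms K4, P3, and P4 are omitted, and all function names are hypothetical.

```python
import numpy as np

def project(points_cam, f, cx, cy, width, height,
            k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0, b1=0.0, b2=0.0):
    """Project camera-frame 3D points (N x 3) to pixel coordinates (N x 2).

    f is the focal length in pixels; cx, cy are principal point offsets
    from the image centre (assumed convention).
    """
    X, Y, Z = np.asarray(points_cam, dtype=np.float64).T
    x, y = X / Z, Y / Z                      # normalized image coordinates
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    xd = x * radial + p1 * (r2 + 2 * x * x) + 2 * p2 * x * y
    yd = y * radial + p2 * (r2 + 2 * y * y) + 2 * p1 * x * y
    u = 0.5 * width + cx + xd * f + xd * b1 + yd * b2
    v = 0.5 * height + cy + yd * f
    return np.column_stack([u, v])

def reprojection_stats(observed_px, reprojected_px):
    """RMS and maximum Euclidean distance (pixels) between keypoints and reprojections."""
    diff = np.asarray(observed_px, dtype=np.float64) - np.asarray(reprojected_px, dtype=np.float64)
    err = np.linalg.norm(diff, axis=1)
    return float(np.sqrt(np.mean(err ** 2))), float(err.max())
```

Any bias in the estimated focal length, principal point, or distortion coefficients shifts the reprojected locations systematically, which is why the reprojection error summarizes both calibration quality and keypoint quality.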
Point cloud densification was carried out on each individual UAS image set after the BA computations, and digital surface models (DSMs) were then generated from the 3D point cloud data by post-processing within the SfM photogrammetry software. Figure 12 displays the dense point cloud over a small area of the study site for all UAS image sets. Moreover, Table 5 summarizes the processing report from SfM photogrammetry for each individual image set. According to Figure 12 and Table 5, visual and quantitative inspections of the density of the resulting dense point clouds, measured as the average number of points per square meter, demonstrate that the dense point clouds generated from HR gt , SR pre , and HR enh are about 17 times denser than the dense point cloud generated from the LR image set.
To investigate how closely the DSM generated from the SR pre image set approximates the corresponding DSM generated from the HR gt image set, the DSM from SR pre was subtracted from the DSM generated from the HR gt image set. Figure 13 displays the resulting differential surface. Referring to Figure 13, the average height difference between the two DSMs is about −0.5 cm. However, there are some areas showing large height differences. These areas are mostly related to the edges of tall man-made and natural objects. Areas lacking texture, such as water bodies, also contribute to the large height differences observed in Figure 13. The histogram in Figure 14 shows the frequency of occurrence of the pixel-wise height differences in the differential DSM after filtering blunders.
Figure 11. Camera locations and related uncertainties for image data sets. Ellipse color represents Z error. Errors in X and Y directions are represented by ellipse shape. Black dot within each individual ellipse represents estimated camera locations.
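The DSM differencing described above can be sketched in NumPy as follows. The sketch assumes the two DSMs are co-registered rasters of identical shape with heights in metres and NaN for no-data cells. The blunder filter shown, a simple absolute threshold, is a hypothetical stand-in, since the filtering criterion used in the study is not specified.

```python
import numpy as np

def dsm_difference_stats(dsm_hr, dsm_sr, blunder_threshold=None):
    """Pixel-wise height difference (HR minus SR) and summary statistics.

    blunder_threshold: optional absolute limit in metres; cells with a
    larger difference are excluded (illustrative filter, an assumption).
    """
    diff = np.asarray(dsm_hr, dtype=np.float64) - np.asarray(dsm_sr, dtype=np.float64)
    valid = np.isfinite(diff)  # drop no-data cells
    if blunder_threshold is not None:
        valid &= np.abs(diff) <= blunder_threshold
    d = diff[valid]
    return {
        "n": int(d.size),
        "min": float(d.min()),
        "max": float(d.max()),
        "mean": float(d.mean()),
        "sd": float(d.std()),
    }

def difference_histogram(dsm_hr, dsm_sr, bins=50, blunder_threshold=None):
    """Histogram counts and bin edges of the filtered height differences."""
    diff = np.asarray(dsm_hr, dtype=np.float64) - np.asarray(dsm_sr, dtype=np.float64)
    valid = np.isfinite(diff)
    if blunder_threshold is not None:
        valid &= np.abs(diff) <= blunder_threshold
    return np.histogram(diff[valid], bins=bins)
```

Running the statistics with and without the threshold separates the bulk agreement between the two surfaces from the large offsets that occur at object edges and over texture-poor areas.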

Discussion
Visual inspection of image samples in the SR pre and corresponding HR gt image sets confirms that the ESRGAN model performs much better over man-made objects and natural objects with definite boundaries than over other targets, as shown in Figure 9. One reason may be that natural objects usually comprise extremely intricate structures and severely random patterns with very fine details. In addition, natural objects, such as vegetation, may be moving due to the wind during image acquisition in an outdoor environment, inducing dynamic image motion in the recorded images. Closer visual inspection of the SR pre images demonstrates that the model is able to predict super-resolved images with a lower level of noise and blur than the corresponding HR gt images. This noise reduction property of the model, however, may result in the removal of unpleasing pseudo-noise patterns within some natural targets, such as vegetated areas. The noise reduction capability of the ESRGAN model is more evident over man-made structures and surfaces, as illustrated in the right example of Figure 9.
Such image enhancement and noise removal characteristics can also be observed on both natural and man-made objects that appear in the HR enh image set, where the HR gt images were used as input and the naive pre-trained SISR model, with scale factor ×1, was used as an image restoration network. This observation demonstrates that ESRGAN, pre-trained on several standard image sets for SISR, has been able to capture, to some extent, the behavior of some types of noise that are common in almost all digital imaging systems. Considering that this model has already been trained to predict SR images with scale factors ×2 and ×4, the observations with scale factor ×1 suggest that there might be some types of noise that commonly appear at different image scales and that the pre-trained network has been able to differentiate from the real signal.
The high IQM values reported for the HR enh image set in Table 2 are due to the high degree of similarity in image content and quality between corresponding images in the HR enh and HR gt image sets. This observation demonstrates that pre-trained ESRGAN can be used as an image restoration network when it is employed with scale factor ×1.
It is worth mentioning that employing pre-trained ESRGAN without fine-tuning the parameters on the LR and corresponding HR gt UAS image sets to predict the super-resolved images (SR pre ) decreases the model performance by around 15% for both PSNR and SSIM index in this experiment. The relatively high values of those standard image quality metrics on the SR pre UAS image set, whose contents are intrinsically different from those on which the vanilla ESRGAN model has been trained, verify that the transfer learning technique and fine-tuning of the pre-trained parameters significantly help the DCNN-SISR model to extract more relevant semantic information from the UAS images. This information is encoded as abstract information within multiple layers of a DCNN-SISR model. Interestingly, according to Table 2, the vanilla ESRGAN model trained on standard image sets resulted in high values for PSNR and SSIM index when it was employed on the HR gt image set as an image restoration network. This is despite the fact that the model had not previously seen the UAS images on which it was employed in this experiment.
Results of the task-based IQM using SfM photogrammetry add more to the previous findings. Referring to Table 3, the calibrated sensor element size, or image pixel size, for the LR images is about 4 times larger than that for images in the other image sets, which is consistent with the ×4 downsampling used in this experiment. The calibrated focal lengths in the SR pre and HR enh image sets closely approximate the reference focal length evaluated in the HR gt ground truth image set. The differences between the calibrated focal lengths for the LR, SR pre , and HR enh image sets and the calibrated focal length for the HR gt image set are −0.010 mm, −0.030 mm, and 0.020 mm, respectively. Furthermore, the calibrated C x and C y values show an accurate estimation of the principal point location in the SR pre images with respect to the HR gt images. For the LR images, however, those calibrated parameters show a very different location for the principal point in the LR image space.
Referring again to Table 3, the remaining calibration parameters, including radial and decentering lens distortion coefficients, affinity, and skew transformation parameters in SR pre and HR enh image sets show a high degree of compatibility with HR gt parameters confirming that lens distortion parameters and other sensor related distortions can be accurately estimated in both super-resolved SR pre images and restored HR enh images. However, interpreting the values of those coefficients, especially between LR and HR gt images, is not very meaningful because some of them are usually highly correlated with other parameters, especially the focal length, principal point location, and the first coefficient of radial lens distortion [90,91].
Referring to Figure 10, the behavior of the average reprojection error in SR pre image space accurately approximates that in the original HR gt image space. This finding can be supported further by our above findings when referring to the calibrated camera parameters, where results showed that the internal geometry of the sensor can be accurately recovered in the SR pre images. The plot related to the average reprojection error in LR image space represents less similarity with the error behavior in HR gt and SR pre image space, especially in the center of the image space. On the other hand, the average reprojection error plot for HR enh image space (Figure 10d) is very similar to the reprojection error plot for the HR gt image space (Figure 10b). This observation demonstrates that image restoration processing carried out on the HR gt images within the pre-trained ESRGAN has not meaningfully changed the IO parameters of the camera derived from the SfM analytical self-calibration procedure.
According to Table 4, the maximum reprojection error and its RMS in the SR pre and HR enh image spaces closely approximate those values in the HR gt image space, with sub-pixel magnitudes. However, the RMS of the reprojection error in the HR enh image space is about 20% less than it is in the HR gt image space. Part of this decrease in reprojection error might be due to the noise reduction in the HR enh image space with respect to the original HR gt image space. The average camera location errors in Table 4 for the SR pre and HR enh image sets closely approximate those in the original HR gt image set. This suggests that the SISR process employed with factor ×4 on the LR image set, and the image restoration process employed on HR gt , preserve the external imaging geometry with respect to the 3D scene. As listed in Table 4, the pre-trained ESRGAN model with scaling factor ×1, used as an image restoration network, resulted in a 3% improvement in the total error in camera positions for the HR enh image set. There is also a 2% improvement in that error for the SR pre dataset. Figure 11 shows that camera locations and their positional errors in the HR UAS imagery can be accurately retrieved in the predicted SR image set. Furthermore, it shows that image enhancement performed with the employed pre-trained ESRGAN model does not dramatically change the external imaging geometry.
Carefully exploring the differential DSM in Figure 13 reveals that large differential offsets are occurring in areas that include natural and man-made water bodies with lack of texture and along the edges of tall natural and man-made structures. Filtering out those areas from the original differential DSM and calculating some statistics over them shows that the minimum, maximum, and standard deviation (SD) of height difference in those areas are −8.308 m, 8.075 m, and 30 cm respectively. The height-difference histogram in Figure 14, for filtered differential DSM, confirms that the geometry of the reconstructed 3D scene, as reflected by the DSM, can be accurately retrieved with a SD around 2.50 cm. The minimum, maximum, and mean of height-differences within the filtered differential DSM are about −4.85 cm, 5.73 cm, and −0.02 cm, respectively.
It is worth mentioning that numerous environmental and sensor-related factors, as well as flight design parameters, contribute to the quality and the spatial resolution of images captured by the UAS. The texture quality of each individual object in the scene can strongly affect the training and inference phases of the DCNN-based SISR model, which subsequently affects the results of the SfM process. Ambient environmental conditions, such as lighting, or any instability of the platform during image capture, such as that due to wind, can impact the above results. Similarly, flight design, including altitude above ground and camera perspective (e.g., oblique versus nadir), will impact the GSD and the appearance of land cover features. As a result, the visual representation of the same target may deviate from one exposure to another in a single UAS flight mission and across repeat data acquisitions. Thus, the authors emphasize that the results shown here are valid for the specific data set acquired at a certain time over the specific study site. The results presented here, in terms of reconstruction accuracy, cannot necessarily be generalized to other sites with very different targets and textures, or to the same area imaged at a different time and under different environmental conditions, without further experimentation. However, we believe that the high capacity of deep CNN models to efficiently extract informative contextual features from raw UAS images in an end-to-end manner has the potential to be extended further by training DCNN-based SISR models on a time series of UAS images acquired over the same area, or on UAS images captured over the same area under different weather conditions. Also, training and evaluating the performance of a given DCNN-based SISR model on multiple UAS image sets, including images from different areas with a wider range of targets and varying textures, may be considered for further analyses.

Conclusions
SISR seeks to obtain HR images from corresponding LR images, which is a notoriously arduous and ill-posed problem. Investigating different IQMs evaluated on SR images predicted from corresponding LR images in a DCNN-based SISR network revealed two important findings with respect to this study's experiment on UAS imagery. First, the quantitative measures of image quality, including PSNR and SSIM index, applied to the super-resolved UAS imagery, confirm that the DCNN-based super-resolution technique employed here (ERSGAN architecture) can achieve the same level of performance for spatial-resolution and pictorial information enhancement relative to the original HR ground truth image set. Both quantitative and qualitative assessment of SR images showed that the level of additive white noise to the LR image remarkably decreases in the SR image. Furthermore, visual comparison of SR images with corresponding HR images in some areas showed that the SR image may exhibit less amount of noise.
The second important finding relates to the task-based IQM performed using SfM photogrammetry. Results confirmed that the geometry of UAS image acquisition can be recovered from SR images with high accuracy. Camera interior and exterior parameters, estimated by processing the SR images in the auto-calibration module of the SfM photogrammetry procedure, closely approximate the results derived from the same procedure applied to the ground truth HR images. Preserving the geometry of the imagery can significantly increase the reliability of super-resolution techniques in many different RS applications, specifically where spatial information must be extracted from RS images. The densified point cloud generated by SfM photogrammetry from the SR UAS images is about 15 times richer than the point cloud generated from the artificially degraded LR UAS images, and thus provides more detail about the underlying terrain. Furthermore, the differential DSM and the related height-difference histogram show a standard deviation (STD) of around 2.5 cm, which confirms the closeness of the two reconstructed surfaces generated from the SR and HR image sets.
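The DSM comparison summarized above can be sketched as a cell-by-cell difference followed by summary statistics. The snippet below is an illustrative sketch only. It assumes the two DSMs are already co-registered on the same grid, expressed in the same height units, and use NaN for cells without data; the function name `dsm_difference_stats` and the toy arrays are hypothetical.

```python
import numpy as np


def dsm_difference_stats(dsm_a, dsm_b):
    """Summary statistics of the height difference (dsm_a - dsm_b).

    Assumes both rasters share the same grid and units; cells that are
    NaN in either raster are ignored.
    """
    a = np.asarray(dsm_a, dtype=np.float64)
    b = np.asarray(dsm_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("DSMs must be sampled on the same grid")
    diff = a - b
    valid = diff[np.isfinite(diff)]
    if valid.size < 2:
        raise ValueError("need at least two valid cells")
    return {
        "n": int(valid.size),
        "mean": float(valid.mean()),
        "std": float(valid.std(ddof=1)),  # sample standard deviation
        "rmse": float(np.sqrt(np.mean(valid ** 2))),
    }


# Toy example with hypothetical heights in metres (not data from this study).
dsm_sr = np.array([[10.02, 10.51], [11.00, np.nan]])
dsm_hr = np.array([[10.00, 10.50], [11.03, 12.00]])
print(dsm_difference_stats(dsm_sr, dsm_hr))
```

Reporting the mean alongside the STD helps separate a systematic vertical offset between the two surfaces from random disagreement.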
Overall, results from this study's experiment on UAS imagery show that DCNN-based SISR enhancement techniques can exploit spatial and non-spatial information in LR and HR imagery to effectively discriminate signal from noise in image space. This results in high performance in recovering image details and in more visually appealing images for different RS applications. For example, one practical application of the SR technique for UAS mapping is that it can potentially enable flights at higher altitudes, and therefore coarser GSDs, to cover more area in a given time, thereby leading to greater flight efficiency. A DCNN-based SISR technique, such as the one presented in this study, could then be applied to super-resolve the imagery to a specific resolution and to generate a dense point cloud from SfM photogrammetry, and subsequently a DSM or orthoimage, as though the data had been acquired from a UAS flight conducted at a lower altitude and with similar quality.
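The trade-off between flying height and coverage can be made concrete with the standard nadir relation GSD = altitude × pixel pitch / focal length. The sketch below assumes a vertical (nadir) view over flat terrain; the camera parameters and altitudes are hypothetical values chosen for illustration and are not those of the sensor or flight used in this study.

```python
def ground_sample_distance(altitude_m, focal_length_mm, pixel_pitch_um):
    """Nadir GSD in metres per pixel over flat terrain."""
    if altitude_m <= 0 or focal_length_mm <= 0 or pixel_pitch_um <= 0:
        raise ValueError("all inputs must be positive")
    return altitude_m * (pixel_pitch_um * 1e-6) / (focal_length_mm * 1e-3)


def footprint_area_m2(altitude_m, focal_length_mm, pixel_pitch_um,
                      width_px, height_px):
    """Ground area covered by a single nadir image, in square metres."""
    gsd = ground_sample_distance(altitude_m, focal_length_mm, pixel_pitch_um)
    return (gsd * width_px) * (gsd * height_px)


# Hypothetical camera: 8.8 mm focal length, 2.4 um pixels, 5472 x 3648 px.
camera = dict(focal_length_mm=8.8, pixel_pitch_um=2.4)
low, high = 40.0, 160.0  # hypothetical altitudes (m); high = 4 x low

gsd_low = ground_sample_distance(low, **camera)
gsd_high = ground_sample_distance(high, **camera)
area_ratio = (footprint_area_m2(high, width_px=5472, height_px=3648, **camera)
              / footprint_area_m2(low, width_px=5472, height_px=3648, **camera))

print(f"GSD at {low:.0f} m: {gsd_low * 100:.2f} cm/pixel")
print(f"GSD at {high:.0f} m: {gsd_high * 100:.2f} cm/pixel")
print(f"Footprint area ratio (high/low): {area_ratio:.1f}")
```

Under these assumptions, flying four times higher coarsens the GSD by a factor of four while each image covers sixteen times the ground area. A × 4 super-resolution step would then bring the nominal pixel size back to that of the lower flight, although whether the recovered detail matches a true low-altitude acquisition remains an empirical question.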
Future work will investigate the real-world scenario of employing SISR to reduce UAS image acquisition flight time in aerial surveying operations where a relatively large area must be mapped at high resolution. This will be investigated by employing two UAS image sets acquired at two different altitudes over the same area. The performance of the DCNN-based SISR model in super-resolving the LR (high altitude) images can then be assessed by comparing SfM processing results from the super-resolved LR images with those from the original HR (low altitude) images in terms of 3D reconstruction fidelity and image quality. The effect of different lighting and environmental conditions, and the impact of different study sites with objects of varying textures, on model performance may also be explored. Finally, examining the most efficient DCNN-based SISR techniques, with the lowest time complexity in the training and inference phases, may be a topic of great interest, as it could help pave the way for integrating SISR into real-time remote sensing application scenarios.