Remote Sensing Image Super-Resolution for the Visual System of a Flight Simulator: Dataset and Baseline

Abstract: High-resolution remote sensing images are the key data source for the visual system of a flight simulator used to train qualified pilots. However, due to hardware limitations, collecting images at very high spectral and spatial resolutions is expensive. In this work, we approach this issue from another perspective, using image super-resolution (SR) technology. First, we present a new ultra-high-resolution remote sensing image dataset named Airport80, which is captured from the airspace near various airports. Second, we propose a deep learning baseline based on the generative adversarial mechanism, which reconstructs a high-resolution image from a single low-resolution image. Experimental results on our benchmark demonstrate the effectiveness of the proposed network and show that it reaches satisfactory performance.


Introduction
As is well known, air traffic control (ATC) is the key to ensuring the operational safety of air traffic, and it depends heavily on the collaboration between the air traffic controller (ATCO) and the aircrew [1]. The ATCO makes real-time decisions to direct the flight to its destination based on situational information from the ATC system, while the aircrew flies the aircraft in strict accordance with the ATCO's instructions, accurately and promptly [2]. For safety reasons, both the ATCO and the aircrew must be licensed by the relevant administration of their country. To obtain a valid license, they must meet specific requirements, and their skills are re-examined at specified intervals. Training equipment is therefore indispensable, and comprises an ATC simulator for the ATCO and a flight simulator for the aircrew.
Of these, the flight simulator has become a hot research topic because of its direct relevance to flight operations. The simulator is very important for ensuring flight safety, and it can also greatly reduce equipment and maintenance costs [3,4]. The main purpose of the flight simulator is to provide realistic, real-time, immersive scenarios in which pilots complete their training before they fly a real aircraft. The training scenarios cover various flight phases, including airport ground operations, instrument landing, approach, and cruise. They also depend on the location of the target flight; for example, the scene for Chengdu airport is highly distinct from that of Beijing airport. The flight simulator therefore places high requirements on its visual system, with realism given the highest priority.
Currently, remote sensing images are widely used to build the visual systems of flight simulators because they cover wide areas and depict scenes accurately. The development of remote sensing technology in recent years has led to a great increase in the number of satellite images. Remote sensing images have been broadly applied to various research fields, including target/object detection, temperature measurement, biophysical prediction, multi-specialist architecture, etc. However, due to the hardware limitations of sensors and the high cost of collecting such images, it is difficult to obtain very high-resolution images. Therefore, more and more researchers prefer to reconstruct high-resolution (HR) images from low-resolution (LR) images rather than devote time to physical imaging technology.
The single image super-resolution (SISR) task aims to reconstruct high-resolution images from their low-resolution counterparts. It is a significant computer vision and image processing problem with a wide range of practical applications. Normally, the SISR problem can be represented by the following forward observation model with a linear degradation process:

$$Y = HX + \eta$$

where $Y \in \mathbb{R}^{N/s \times N/s}$ is the observed LR image ($N/s \times N/s$ is the resolution of the LR image), and $H \in \mathbb{R}^{N/s \times N/s}$ denotes a downsampling operation (typically, a bicubic interpolation) that resizes an HR input image $X \in \mathbb{R}^{N \times N}$ by a scaling factor $s$. In general, $\eta$ is defined as additive white Gaussian noise with a standard deviation $\sigma$. However, in real-world natural scenes, $\eta$ also accounts for all possible noise introduced during image collection, such as inherent sensor noise, stochastic noise, and compression artifacts. As is well known, recovering $X$ from $Y$ is a typical ill-posed problem, because the downsampling operation $H$ is ill-conditioned or singular and the noise $\eta$ imposed on the images is unknown. Therefore, there are many possible solutions for this task.
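To make the observation model concrete, the following minimal sketch applies a degradation of the form Y = HX + η to a small image. It is an illustration under stated assumptions, not the pipeline used in this work: H is approximated by s × s block averaging rather than bicubic interpolation, and the function name `degrade` and its parameters are hypothetical.

```python
import random


def degrade(hr, s, sigma, seed=0):
    """Return Y = H X + eta for a square image given as a list of rows.

    Assumption: H is s x s block averaging (the text mentions bicubic
    interpolation); eta is white Gaussian noise with std. deviation sigma.
    """
    n = len(hr)
    if n % s != 0 or any(len(row) != n for row in hr):
        raise ValueError("expected an N x N image with N divisible by s")
    rng = random.Random(seed)
    m = n // s
    lr = []
    for i in range(m):
        row = []
        for j in range(m):
            block = [hr[i * s + di][j * s + dj]
                     for di in range(s) for dj in range(s)]
            noise = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
            row.append(sum(block) / (s * s) + noise)
        lr.append(row)
    return lr


# An 8 x 8 ramp image reduced by a factor of 4 gives a 2 x 2 image.
hr = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
lr = degrade(hr, s=4, sigma=0.0)

# A different HR image (each block replaced by its mean) yields the same LR
# image, which illustrates why the inverse problem has many solutions.
hr_alt = [[lr[i // 4][j // 4] for j in range(8)] for i in range(8)]
lr_alt = degrade(hr_alt, s=4, sigma=0.0)
```

Even in this noise-free case, `hr` and `hr_alt` are indistinguishable after downsampling, so any SR method has to rely on prior knowledge to choose among candidate HR images.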
In this work, we attempt to utilize super-resolution technology to reconstruct the LR image into an HR one, which is further applied to build a more accurate and realistic visual system for the flight simulator. Due to the lack of a public remote sensing image dataset for the super-resolution task in this field, we first present a new dataset named Airport80, which consists of 80 ultra-high-resolution remote sensing images. This benchmark was captured from the airspace near the airports of many major cities in Asia, so it contains all kinds of natural scenes.
Next, learning from current state-of-the-art works, we propose a simple yet powerful generative adversarial network (GAN) for the remote sensing image super-resolution task. The adversarial game between the generative and discriminative models is expected to fit the varied image information produced by diverse scenes and to reduce the dependence on training samples. In general, GAN-based SR approaches mainly address the loss of high-frequency information and fine details [5], and are able to obtain perceptually satisfying reconstruction results.
The proposed method is based on the super-resolution generative adversarial network (SRGAN) [5], into which we integrate some of the latest network design methods to improve the model. Since the SISR task is ultimately performed by the generator, our improvements mainly focus on the structure of the generator network. First, we remove batch normalization (BN) layers from the generator. BN layers have been reported to decrease performance in some PSNR-oriented tasks, such as super-resolution, and removing them helps to improve training stability and reduce memory usage. Second, to improve feature extraction, we replace the ReLU activation function with PReLU [6]. Last, inspired by [7], we introduce deformable convolutional kernels into the generator, which learn to adjust the convolution sampling locations and focus on extracting locally related information. Experimental results demonstrate that our approach achieves performance comparable to that of state-of-the-art methods.
We summarize our primary contributions as follows.
• Due to the lack of a dataset for the super-resolution task in the research field of the visual system of a flight simulator, we present a new dataset named Airport80, which contains 80 ultra-high-resolution remote sensing images captured from the airspace near airports.
• We propose a neural network based on the GAN framework to serve as a baseline model of this dataset, in which some of the latest network designs are integrated into the model to improve the SISR performance. The proposed method is capable of generating realistic textures during a single remote sensing image super-resolution.
• Experimental results for the proposed benchmark demonstrate the effectiveness of the proposed method and show it has reached satisfactory performances. We hope that this work can bring better quality data for the visual system of a flight simulator.

Related Work
After decades of research, super-resolution approaches can generally be divided into two types: traditional methods and deep-learning-based methods. The traditional methods focus on building a compact dictionary or manifold space that connects low-resolution and high-resolution patches of an image. The super-resolution task is then carried out through a representation scheme defined over that space. A dictionary-based approach was proposed by Freeman et al. [8], in which key dictionaries were pre-defined to represent pairs of low-resolution and high-resolution patches. In that work, the nearest neighbor (NN) algorithm is applied to search the defined dictionary for the patch most similar to the input, and the corresponding high-resolution counterpart is regarded as the reconstructed patch (image area). Later, a manifold embedding technique was proposed by Chang et al. [9] to replace the NN-based search strategy and showed the desired performance improvements. Following this idea, the sparse coding formulation was introduced by Yang et al. [10] as an alternative to the NN algorithm, which further improved the performance of the super-resolution task.
Thanks to the powerful ability of neural networks to capture nonlinear transformations, deep-learning-based approaches were introduced to solve the super-resolution task and showed superior performance over the traditional methods. A deep-learning-based model [11] was first built to perform the image SR task in an end-to-end manner and outperformed previous works. Because that architecture was shallow, a CNN-based deep learning model [12] with more convolutional layers (up to 20) was designed to improve the final performance, in which the residual learning mechanism [13] was applied to address the gradient problems during model training. A deeper architecture (up to 52 convolutional layers), called the deep recursive residual network, was designed by Tai et al. [14] to further enhance the accuracy of the SR task. In these methods, the LR input is first upscaled to the size of the HR image before being fed into the network to complete the image reconstruction. This design requires more computational resources (memory) and training time. To solve this issue, Shi et al. [15] proposed a sub-pixel layer, which learns a set of upsampling transformations that integrate the LR feature maps into the HR output more efficiently. This approach not only replaces the bicubic operation of the SR pipeline with more complex upsampling maps but also reduces the computational complexity of the overall SR operation. Later, a deeper and wider network architecture was proposed by Lim et al. [16] to reconstruct HR images from their LR inputs, in which the batch normalization layers [17] are removed to improve the final performance. The dense connection mechanism [18] was also adopted for the SR task, in which all the hierarchical features from the convolutional layers are used to generate high-resolution patches.
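The rearrangement performed by a sub-pixel layer can be illustrated with a short sketch. It assumes the common "pixel shuffle" channel ordering, in which a feature map of shape (C·r², H, W) is rearranged into (C, H·r, W·r); the function name and the nested-list representation are illustrative choices, not taken from [15] or from this work.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Output pixel (y*r + i, col*r + j) of channel c is read from input
    channel c*r*r + i*r + j at position (y, col).
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for col in range(w):
                        out[c][y * r + i][col * r + j] = src[y][col]
    return out


# Four 1 x 1 feature maps become one 2 x 2 map when r = 2.
features = [[[0.0]], [[1.0]], [[2.0]], [[3.0]]]
upscaled = pixel_shuffle(features, r=2)
```

All convolutions before this step operate at LR resolution, which is where the saving in memory and computation comes from.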
As in other computer vision tasks, the perception mechanism was also introduced into SR research. The first such work is the SRGAN model [5], which is able to reconstruct perceptually more pleasing high-resolution images. To pay more attention to the visual quality of the generated images, the perceptual loss function [19] was later introduced into GAN-based SR approaches. In those models, an adversarial loss is combined with the perceptual loss, which makes it possible to produce photo-realistic high-resolution images. To further improve the performance of GAN-based SR models, the enhanced super-resolution generative adversarial network (ESRGAN) [20] was proposed, which to date produces state-of-the-art perceptual SR images. More recently, a benchmark protocol was presented by Lugmayr et al. [21] to recover real-world image corruptions, in which a real-world challenge series [22] is also introduced to describe the influence of the bicubic downsampling operation and separate degradation learning for super-resolution. Later, a downsample generative adversarial network (DSGAN) [23] was proposed to capture the degradation transformation by fitting the transformation distribution in an unsupervised manner, and ESRGAN was also modified into ESRGAN-frequency separation (ESRGAN-FS) to further improve its accuracy in a real-world setting.

Airport80 Dataset
As far as we know, there are few public remote sensing image datasets for super-resolution tasks in the visual system of a flight simulator for the air transportation industry. Therefore, we have created a new dataset named Airport80, containing 80 ultra-high-resolution remote sensing images. This benchmark was captured from the airspace near the airports of many major cities in Asia, so it contains all kinds of real-world structures. We term it Airport80 to be consistent with the naming of other super-resolution datasets, like Set5 [24], Set14 [25], and Urban100 [26]. Due to image content and copyright issues, this dataset is meant for research purposes.
Resolution and Diversity: Each image was captured by a remote sensing satellite with a spatial resolution of 0.6 meters. All 80 images are ultra-high-resolution, which means that each of them has at least 4K pixels on at least one of the axes (horizontal or vertical), and some of them even have a resolution of 20,000 × 20,000. In addition, this dataset includes a wealth of real-world scenes, such as urban settings, ports, deserts, hills, lakes, rivers, and so on. We randomly selected 60 images for training and used the rest for testing. Because of the ultra-high resolution, we cropped the remaining 20 images into 1440 × 1440 sub-images with a fixed step size and obtained 250 sub-images as the final testing set. Figure 1 shows some samples from our new dataset. We hope that this dataset can supplement current super-resolution tasks for remote sensing images, which are further applied to build the visual system of a flight simulator for training a qualified pilot.

Evaluation Metrics: Like other super-resolution benchmarks, two commonly used metrics, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [27], were used for the quantitative evaluation on the Airport80 dataset. PSNR is calculated from the mean squared error (MSE) and the maximum value (denoted as $L$) of the images. Given the target image $I$ and the reconstructed image $\tilde{I}$, the PSNR measurement can be obtained by:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{L^2}{\mathrm{MSE}(I, \tilde{I})}\right)$$

where $L$ equals 255 in 8-bit images. In addition, SSIM is proposed for estimating the structural similarity between two images. In general, the properties of an image, including luminance, contrast, and structure, are independently evaluated and combined to give a fair comparison, as shown below:

$$\mathrm{SSIM}(I, \tilde{I}) = \frac{2\mu_I \mu_{\tilde{I}} + c_1}{\mu_I^2 + \mu_{\tilde{I}}^2 + c_1} \cdot \frac{2\sigma_I \sigma_{\tilde{I}} + c_2}{\sigma_I^2 + \sigma_{\tilde{I}}^2 + c_2} \cdot \frac{\sigma_{I\tilde{I}} + c_3}{\sigma_I \sigma_{\tilde{I}} + c_3}$$

where $\mu_*$ represents the mean of each image, $\sigma_*^2$ represents the variance of each image, and $\sigma_{I\tilde{I}}$ represents the covariance of the two images. $c_1$, $c_2$, $c_3$ are constants used to maintain stability.
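The two metrics can be sketched in a few lines. The code below is an illustration under assumptions, not the evaluation script used for the reported results: PSNR follows 10·log10(L²/MSE), while the SSIM variant is computed once over the whole image rather than over local windows, and the constants c1 = (0.01L)², c2 = (0.03L)², c3 = c2/2 are the commonly used defaults, which the text does not specify. Images are plain lists of rows.

```python
import math


def psnr(target, recon, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    t = [v for row in target for v in row]
    r = [v for row in recon for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(t, r)) / len(t)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def ssim_global(target, recon, peak=255.0):
    """Luminance * contrast * structure using whole-image statistics."""
    t = [v for row in target for v in row]
    r = [v for row in recon for v in row]
    n = len(t)
    mu_t, mu_r = sum(t) / n, sum(r) / n
    var_t = sum((v - mu_t) ** 2 for v in t) / n
    var_r = sum((v - mu_r) ** 2 for v in r) / n
    cov = sum((a - mu_t) * (b - mu_r) for a, b in zip(t, r)) / n
    sd_t, sd_r = math.sqrt(var_t), math.sqrt(var_r)
    # Assumed stabilizing constants (common defaults, not from the text).
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    c3 = c2 / 2
    luminance = (2 * mu_t * mu_r + c1) / (mu_t ** 2 + mu_r ** 2 + c1)
    contrast = (2 * sd_t * sd_r + c2) / (var_t + var_r + c2)
    structure = (cov + c3) / (sd_t * sd_r + c3)
    return luminance * contrast * structure


image = [[0.0, 64.0], [128.0, 255.0]]
shifted = [[v + 1.0 for v in row] for row in image]
```

Here `psnr(image, shifted)` corresponds to an MSE of 1, and `ssim_global(image, image)` equals 1 up to rounding. Practical SSIM implementations average the score over sliding local windows, so their values will differ from this global variant.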

Baseline Model
The architecture of the proposed SR network is shown in Figure 2. Briefly, it is a typical application of the GAN [28] family to super-resolution tasks, with some improvements that make it more suitable for the super-resolution of remote sensing images used in the visual system of a flight simulator. It contains two individual neural networks: a generator G, which estimates the HR counterpart of a given LR image, and a discriminator D, which distinguishes generated samples from real HR (ground-truth) images. The details of the two networks are introduced in the following parts.

The discriminator uses modules of the form Convolution-BatchNorm-LeakyReLU. In total, 8 convolutional layers are stacked to form the discriminator, in which the number of 3 × 3 kernels increases by a factor of 2 from 64 to 512, as in VGG [29]. Convolutions with stride 2 are used to downsample the feature map each time the number of kernels is doubled. Finally, the feature representations are converted into a probability by two fully connected layers and a sigmoid function, which perform the classification.
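The layer layout described for the discriminator can be tabulated with a small helper. This is a sketch under assumptions borrowed from the original SRGAN design rather than details stated here: each kernel width is used by two consecutive 3 × 3 convolutions (padding 1), the first with stride 1 and the second with stride 2, and the input is a 128 × 128 HR crop. The function name is hypothetical.

```python
def discriminator_layout(input_size=128):
    """Return (out_channels, stride, feature_map_size) for the 8 conv layers."""
    layers = []
    size = input_size
    channels = 64
    for _ in range(4):                # widths 64, 128, 256, 512
        for stride in (1, 2):         # assumed order: stride 1, then stride 2
            # 3x3 convolution, padding 1: out = (in + 2*1 - 3) // stride + 1
            size = (size + 2 - 3) // stride + 1
            layers.append((channels, stride, size))
        channels *= 2
    return layers


layout = discriminator_layout(128)
for out_channels, stride, size in layout:
    print(f"conv3x3  out={out_channels:3d}  stride={stride}  map={size}x{size}")
```

Under these assumptions a 128 × 128 input is reduced to an 8 × 8 map with 512 channels before the two fully connected layers.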

Incremental Details
Since the SISR task is finally completed by the generator, our improvements mainly focus on the adjustment of the structure of the generator network. As depicted in Figure 2, the generator is broadly composed of three parts: (1) a series of basic blocks is responsible for extracting convolutional features for the low-resolution image, (2) a skip connection operator is designed for concatenating high-level and low-level features, and (3) a fusion block is used to fuse features and complete the final output. Compared with the original SRGAN [5], we have modified the structure of the basic blocks and fusion blocks.
In the basic blocks, we first remove batch normalization (BN) layers. BN layers have been shown to decrease performance in some PSNR-oriented tasks, like super-resolution, image deblurring, and image dehazing. According to ESRGAN [20], BN layers are more likely to create artifacts when the network goes deeper. These artifacts occasionally appear among iterations and different settings, violating the need for stable performance over training. Thus, removing BN layers helps to improve the stability of training and to save memory, as shown in Figure 3.

In addition, to improve feature extraction, we changed the activation function from ReLU to the parametric rectified linear unit (PReLU) [6]. It is expressed as:

$$f(x) = \max(0, x) + a \cdot \min(0, x)$$

The parameter $a$ is initially set to 0.25 and is updated automatically during training. Because only a few parameters are added to the network, the computation and the risk of over-fitting do not increase much. The curves of the two activation functions are shown in Figure 4.

The inherent limitation of standard convolutional networks is that they are unable to handle geometric transformations due to their fixed-shape kernels. Although some extensions, like dilated convolution [30], have been presented to alleviate this issue, it is still challenging for the standard kernel to align the related locations or salient features in the input image. To solve this issue, recent work [7] introduced the deformable convolutional kernel [31] into the super-resolution task to improve the capability of modeling geometric transformations by adding flexible and learnable offsets. Following this strategy, we simply replaced the standard convolutional kernel with the deformable one, as depicted in Figure 5. The standard convolution at each position $p_0$ in the image is expressed as

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$$

where $x$ denotes the feature maps or inputs, $w$ denotes the sampled weights, and $R$ represents the receptive field, i.e., the regular grid of sampling positions.
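As a brief illustration of the activation function discussed in this section, PReLU can be written as f(x) = max(0, x) + a·min(0, x). The sketch below uses scalar inputs for clarity; in a network, a is a learnable parameter (initialized to 0.25, as stated above) that is applied element-wise, and the helper names are illustrative.

```python
def prelu(x, a=0.25):
    """PReLU: identity for x >= 0, slope a for x < 0."""
    return max(0.0, x) + a * min(0.0, x)


def prelu_grad_a(x):
    """Derivative of prelu with respect to the slope a, i.e. min(0, x)."""
    return min(0.0, x)


# With a = 0 the function reduces to ReLU; with a = 0.25 negative inputs
# are scaled rather than discarded.
samples = [-4.0, -1.0, 0.0, 2.0]
relu_out = [prelu(v, a=0.0) for v in samples]
prelu_out = [prelu(v) for v in samples]
```

Because the gradient with respect to a is non-zero only for negative inputs, the slope is adjusted by exactly those activations that ReLU would have zeroed out.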
In the deformable kernel, $R$ is augmented with offsets $\{\Delta p_n \mid n = 1, \ldots, N\}$, where $N = |R|$, so that the convolution becomes

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

The offsets are learned automatically during the training phase. A standard convolution with a fixed receptive field may introduce irrelevant background noise. By introducing the deformable convolutional kernel, we hope that the network can learn the convolution sampling locations autonomously and focus more on the extraction of locally related information. Figure 6 shows the sampling locations of the two convolutions.
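The effect of the offsets can be sketched for a single output position of a single-channel feature map. This is a minimal illustration, not the layer used in the generator: the offsets are passed in by hand (in a real deformable convolution they are predicted by an additional convolutional layer), fractional positions are resolved with bilinear interpolation as in [31], positions outside the image are treated as zero, and all names are hypothetical.

```python
import math

# Regular 3x3 sampling grid R, as (row, column) displacements from p0.
GRID = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def bilinear(x, r, c):
    """Sample the 2-D list x at fractional (r, c); zero outside the image."""
    r0, c0 = math.floor(r), math.floor(c)
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < len(x) and 0 <= cc < len(x[0]):
                weight = (1 - abs(r - rr)) * (1 - abs(c - cc))
                value += weight * x[rr][cc]
    return value


def deformable_conv_at(x, w, p0, offsets):
    """y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n) for a 3x3 kernel.

    w is a flat list of 9 weights; offsets is a list of 9 (d_row, d_col).
    """
    total = 0.0
    for w_n, (dr, dc), (odr, odc) in zip(w, GRID, offsets):
        total += w_n * bilinear(x, p0[0] + dr + odr, p0[1] + dc + odc)
    return total


feature = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
ones = [1.0] * 9
zero_offsets = [(0.0, 0.0)] * 9

# Zero offsets reproduce the standard convolution over the fixed grid.
standard = deformable_conv_at(feature, ones, (1, 1), zero_offsets)

# Shifting only the centre tap by half a pixel samples between 5 and 6.
centre_only = [0.0] * 4 + [1.0] + [0.0] * 4
shifted = [(0.0, 0.0)] * 4 + [(0.0, 0.5)] + [(0.0, 0.0)] * 4
deformed = deformable_conv_at(feature, centre_only, (1, 1), shifted)
```

With all offsets set to zero the operation is identical to the standard convolution, so the deformable kernel can be seen as a strict generalization of it.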

Loss Function
The model is trained to simultaneously minimize the perceptual loss $L_{percep}$, the adversarial loss $L_G^{Ra}$, and the context loss $L_1$.
Different from the pixel-wise losses, the perceptual loss [19] leverages multi-scale features extracted by a pretrained classification network to estimate high-level perceptual and semantic differences between images. In our implementation, the loss uses VGG-19 [29] pretrained on ImageNet [32] as the loss network $\phi$ and extracts the features from the last layer of each of the first three stages. The perceptual loss is defined as

$$L_{percep} = \sum_{j=1}^{3} \frac{1}{C_j H_j W_j} \left\| \phi_j(\hat{J}) - \phi_j(J) \right\|_2^2$$

where $\phi_j(\hat{J})$ and $\phi_j(J)$, $j = 1, 2, 3$, denote the aforementioned three VGG-19 feature maps associated with the reconstructed image $\hat{J}$ and the ground-truth image $J$, and $C_j$, $H_j$, and $W_j$ specify the dimensions of $\phi_j(\hat{J})$ and $\phi_j(J)$.
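In addition, we replaced the standard discriminator with the relativistic average discriminator (RaD) [33], denoted as $D_{Ra}$. The standard discriminator is defined as $D(x) = \sigma(C(x))$, where $\sigma$ is the sigmoid function and $C(x)$ represents the non-transformed discriminator output. The RaD can then be formulated as

$$D_{Ra}(x_r, x_f) = \sigma\left(C(x_r) - \mathbb{E}_{x_f}[C(x_f)]\right)$$

where $x_r$ and $x_f$ denote real and generated (fake) samples, respectively, and $\mathbb{E}_{x_f}[\cdot]$ means the average over all generated samples in the mini-batch. The loss of the discriminator is then defined as:

$$L_D^{Ra} = -\mathbb{E}_{x_r}\left[\log\left(D_{Ra}(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log\left(1 - D_{Ra}(x_f, x_r)\right)\right]$$

The adversarial loss for the generator is in a symmetrical form:

$$L_G^{Ra} = -\mathbb{E}_{x_r}\left[\log\left(1 - D_{Ra}(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log\left(D_{Ra}(x_f, x_r)\right)\right]$$

where $x_f = G(x_i)$ and $x_i$ stands for the input LR image. Finally, the $L_1$ loss is used as the context loss, $L_1 = \mathbb{E}_{x_i} \| G(x_i) - y \|_1$, which evaluates the 1-norm distance between the reconstructed image $G(x_i)$ and the ground-truth $y$. Overall, the multi-task loss $L$ is a weighted sum of those losses:

$$L = \lambda_1 L_{percep} + \lambda_2 L_G^{Ra} + \lambda_3 L_1$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ are predefined constants indicating the relative strength of each component. To balance the different losses, we set them to 1.0, $5 \times 10^{-3}$, and $1 \times 10^{-2}$, respectively.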
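The relativistic average losses and the weighted sum can be sketched on plain lists of raw discriminator outputs. The code assumes the relativistic average formulation as it is commonly written for ESRGAN-style training, D_Ra(x_r, x_f) = σ(C(x_r) − mean(C(x_f))) and its mirror image for generated samples; the function names are illustrative, and the default weights are the values 1.0, 5 × 10⁻³, and 1 × 10⁻² given above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean(values):
    return sum(values) / len(values)


def rad_losses(c_real, c_fake):
    """Return (discriminator loss, generator adversarial loss).

    c_real / c_fake are the raw outputs C(x) for the real and generated
    samples of one mini-batch.
    """
    mean_real, mean_fake = mean(c_real), mean(c_fake)
    d_real = [sigmoid(c - mean_fake) for c in c_real]  # D_Ra(x_r, x_f)
    d_fake = [sigmoid(c - mean_real) for c in c_fake]  # D_Ra(x_f, x_r)
    loss_d = (-mean([math.log(d) for d in d_real])
              - mean([math.log(1.0 - d) for d in d_fake]))
    loss_g = (-mean([math.log(1.0 - d) for d in d_real])
              - mean([math.log(d) for d in d_fake]))
    return loss_d, loss_g


def total_loss(l_percep, l_adv, l_1, weights=(1.0, 5e-3, 1e-2)):
    """Weighted sum of perceptual, adversarial, and L1 context losses."""
    lam1, lam2, lam3 = weights
    return lam1 * l_percep + lam2 * l_adv + lam3 * l_1


# An undecided discriminator gives both losses the value 2*ln(2).
undecided = rad_losses([0.0, 0.0], [0.0, 0.0])
# A confident discriminator has a small loss, the generator a large one.
confident = rad_losses([3.0], [-3.0])
```

Unlike the standard GAN loss, the generator term here depends on both real and generated samples, so its gradient carries information from the real data as well.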

Training Details
Like SRGAN [5] and ESRGAN [20], all of our experiments were performed with a scaling factor of ×4 between HR and LR images. It is worth noting that only the Airport80 dataset was used as the training data, and no images from other datasets were involved in the training phase. We kept all the training parameters of the unofficial SRGAN implementation provided by MMEditing (https://github.com/open-mmlab/mmediting/ (accessed on 25 February 2021)). We cropped 128 × 128 HR sub-images and set the batch size to 16. Unlike the original SRGAN [5], we did not use a PSNR-oriented pretrained model to initialize the generator. The model was optimized by Adam [34] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The learning rate was initially set to $1 \times 10^{-4}$ and halved at [50k, 100k, 200k, 300k] iterations. All experiments were carried out on a standard PC with an Intel (Santa Clara, USA) i7-6800k CPU and two NVIDIA (Santa Clara, USA) TITAN RTX GPUs.
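The step schedule for the learning rate can be written as a small function. The initial value and the milestones are those given above; treating a milestone as already passed when the iteration counter equals it is an assumption, as is the function name.

```python
def learning_rate(iteration, base_lr=1e-4,
                  milestones=(50_000, 100_000, 200_000, 300_000)):
    """Halve base_lr once for every milestone that has been reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * 0.5 ** passed


# Learning rate at a few points during training.
for it in (0, 50_000, 150_000, 250_000, 400_000):
    print(it, learning_rate(it))
```

After the last milestone the rate stays at one sixteenth of its initial value for the remainder of training.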

Ablation Study
To investigate the effectiveness of our improvements, we first trained several PSNR-oriented models and conducted ablation studies. As mentioned above, our network is built on SRGAN [5], so the generator of SRGAN, named SRResNet, was selected as our baseline model. The PSNR-oriented models were trained with the $L_1$ loss only, and the learning rate was initially set to $2 \times 10^{-4}$ and halved every $2 \times 10^{5}$ iterations. The comparison results are listed in Table 1. As can be seen, each adaptation of our model improves on the baseline model in both metrics. Compared with the other changes, the improvement obtained by replacing the activation function is small. However, this adjustment is easy to implement and changes the network very little, so we kept it to obtain better performance. Finally, we integrated all of the improvements and obtained a further gain in each evaluation value, which demonstrates that the proposed components are effective for the super-resolution task.

Experimental Results
For a fair comparison, we evaluated the proposed network on the Airport80 dataset and compared it quantitatively with other methods, including nearest-neighbor interpolation, bicubic interpolation, SRCNN [11], SRGAN [5], and SRResNet [5]. All the implementations came from the MMEditing image and video editing toolbox. It is worth noting that SRCNN and SRResNet are PSNR-oriented methods, while SRGAN and our method are perceptual-driven approaches. According to [35], PSNR only deals with the differences between corresponding pixels instead of visual perception, which usually makes it a poor indicator of reconstruction quality in natural scenes, where we are usually more concerned with human perception. Therefore, the PSNR and SSIM values in Table 2 are provided for reference. It can be observed from Figure 7 that the proposed method outperforms the above-mentioned approaches in both detail and sharpness. Although Bicubic and SRCNN obtain higher PSNR and SSIM, their reconstructions are generally blurry and are perceived as less pleasing. In contrast, SRGAN and our method, which are perceptual-driven, achieve better edge and texture details. This also suggests that PSNR and SSIM are not effective metrics for perceptual quality. Compared with SRGAN [5], our method controls the color consistency better, as shown in Figure 7, where some unpleasant color patches appear in the images produced by SRGAN. It is worth noting that none of the above methods can handle fine textures, such as the farmland in the lower right corner of the fourth sample.

Conclusions
In this paper, we started from the perspective of computer vision and used super-resolution technology to tackle the problem of high-resolution remote sensing image acquisition for the visual system of a flight simulator. First, due to the lack of relevant datasets in this field, we created a new dataset named Airport80, which contains 80 ultra-high-resolution remote sensing images and can be used for training and testing super-resolution algorithms. Second, a baseline model based on GAN and integrating some of the latest network designs was presented to generate realistic high-resolution images from low-resolution ones. Finally, the experimental results on our dataset demonstrate the effectiveness of the proposed method and show that it reaches satisfactory performance. We hope that this work can supplement the current remote sensing image super-resolution field.
In the next step, we plan to combine some object detectors with our super-resolution network and test its application in real scenes, for example, detecting vehicles, ships, and buildings in low-resolution remote sensing images.

Data Availability Statement: Data available on request due to restrictions, e.g., privacy or ethical.

Conflicts of Interest:
The authors declare no conflict of interest.