Super-Resolution Restoration of Spaceborne Ultra-High-Resolution Images Using the UCL OpTiGAN System

Abstract: We introduce a robust and lightweight multi-image super-resolution restoration (SRR) method and processing system, called OpTiGAN, using a combination of a multi-image maximum a posteriori approach and a deep learning approach. We show the advantages of using a combined two-stage SRR processing scheme for significantly reducing inference artefacts and improving effective resolution in comparison to other SRR techniques. We demonstrate the optimality of OpTiGAN for SRR of ultra-high-resolution satellite images and video frames from 31 cm/pixel WorldView-3, 75 cm/pixel Deimos-2 and 70 cm/pixel SkySat. Detailed qualitative and quantitative assessments are provided for the SRR results on a CEOS-WGCV-IVOS geo-calibration and validation site at Baotou, China, which features artificial permanent optical targets. Our measurements have shown a 3.69 times enhancement of effective resolution from 31 cm/pixel WorldView-3 imagery to 9 cm/pixel SRR.


Introduction
Increasing the spatial resolution of spaceborne imagery and video using ground-based processing, or, where feasible, onboard a smart satellite, allows greater amounts of information to be extracted about the scene content. Such processing is generally referred to as super-resolution restoration (SRR). SRR combines image information from repeat observations or continuous video frames and/or exploits information derived (learned) from different imaging sources, to generate images at much higher spatial resolution.
SRR techniques are applicable to images and videos without the increased costs and mass associated with the increased bandwidth or larger/heavier optical components normally required for achieving higher resolution. In particular, enhancing ultra-high spatial resolution Earth observation (EO) images, or high definition (HD) videos, is an active driver for many applications in the fields of agriculture, forestry, energy and utility maintenance and urban geospatial intelligence. The ability to further improve 30 cm/80 cm EO images and videos into 10 cm/30 cm resolution SRR images and videos will allow artificial intelligence-based (AI-based) analytics to be performed in transformative ways.
This work builds on our previous development from the UKSA CEOI funded SuperRes-EO project, where we developed the MAGiGAN SRR system [1], i.e., Multi-Angle Gotcha image restoration [2,3] with generative adversarial network (GAN) [4]. MAGiGAN was developed to improve the effective resolution of an input lower resolution (LR) image using a stack of overlapping multi-angle (more than 15°) observations. In this paper, we propose a lightweight multi-image SRR system, called OpTiGAN, using optical-flow [5] and total variation [6] image restoration, to replace the extremely computationally expensive Gotcha (Grün-Otto-Chau) [2] based multi-angle restoration [1,3], for continuous image sequences that do not have much change in viewing angles (less than 3°).
We demonstrate the proposed OpTiGAN SRR system with ultra-high resolution DigitalGlobe® WorldView-3 panchromatic (PAN) band images (at 31 cm/pixel), obtained through the third-party mission programme of the European Space Agency (ESA), Deimos Imaging® Deimos-2 PAN band images (at 75 cm/pixel), through EarthDaily Analytics, and Planet® SkySat HD video frames (at 70 cm/pixel). Image quality evaluation, effective resolution measurements and inter-comparisons of the OpTiGAN SRR results with other SRR techniques were achieved based on a geo-calibration and validation site at Baotou, China [7] (hereafter referred to as Baotou Geocal site). Our quantitative assessments have suggested effective resolution enhancement factors of 3.69 times for WorldView-3 (using five LR inputs), 2.69 times for Deimos-2 (using three LR inputs) and 3.94 times for SkySat (using five LR inputs). An example of the original 31 cm/pixel WorldView-3 image and the 9 cm/pixel OpTiGAN SRR result is shown in Figure 1.

Previous Work
SRR refers to the process of restoring a higher resolution (HR) image from a single or a sequence of LR images. SRR is traditionally achieved via fusing non-redundant information carried within multiple LR inputs and is mostly achieved nowadays using a deep learning process.
Over the past 30 years, most of the successful multi-image SRR techniques have focused on spatial domain approaches, trying to inverse the degraded imaging process by optimising the image formation and degradation models. Iterative back projection methods were amongst the earliest methods developed for SRR [8][9][10][11][12]. Such methods attempted to define an imaging model to simulate LR images using real observations, then iteratively refine an initial guess of the HR image by comparing its simulated versions of LR images with the provided LR inputs. Later on, maximum likelihood [13][14][15][16][17] and maximum a posteriori (MAP) [6,14,[18][19][20][21][22][23][24][25] based approaches attempted to resolve the inverse process stochastically by introducing a priori knowledge about the desired HR image.
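The iterative back-projection idea can be sketched in a few lines: simulate LR observations from the current HR estimate, compare them with the real LR input and back-project the residual onto the HR grid. The following is a minimal 1D, single-image sketch of the principle (our own illustration, with a crude block-average imaging model, not the implementation of [8-12]):

```python
def downsample(x, factor=2):
    """Simulated imaging model: block-average the HR signal (crude PSF + decimation)."""
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]

def upsample(x, factor=2):
    """Back-projection operator: replicate each LR residual sample onto the HR grid."""
    return [v for v in x for _ in range(factor)]

def iterative_back_projection(lr, factor=2, iterations=50, step=0.5):
    """Refine a blank HR estimate until its simulated LR version matches the observed LR input."""
    hr = [0.0] * (len(lr) * factor)  # initial HR guess
    for _ in range(iterations):
        # residual between the observed LR input and the simulated LR image
        residual = [o - s for o, s in zip(lr, downsample(hr, factor))]
        # back-project the residual and take a damped step
        hr = [h + step * c for h, c in zip(hr, upsample(residual, factor))]
    return hr
```

MAP-based methods follow the same inverse-problem structure but add a prior term (e.g., total variation) to the objective being minimised.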

Of particular relevance to this work, we previously proposed two SRR systems, namely, GPT-SRR (Gotcha partial differential equation (PDE) based total variation (TV)) and MAGiGAN SRR, in [1,3], for Mars imagery and EO satellite imagery, respectively, adopting the multi-angle imaging properties and, for the latter, combining the multi-angle approach with GAN-based inference. GPT-SRR is able to reconstruct the non-redundant information from multi-angle views, based on the MAP framework. MAGiGAN improves upon GPT-SRR and applies a two-stage reconstruction scheme, which combines the advantages of GPT-SRR and GAN SRR and effectively eliminates potential artefacts from using GAN alone. However, the key limitation of [1,3] is that they are both based on the computationally expensive Gotcha process [2], which is not suitable for GPU computing solutions.
In this work, we introduce a new OpTiGAN SRR system that contains modifications and improvements on top of the MAGiGAN SRR system [1], is about 20 times faster in processing speed, uses non-multi-angle observations, e.g., continuous satellite image sequences or video frames, and in particular, is ideal for SRR of ultra-high-resolution (less than 80 cm/pixel) imaging data.

Datasets
In this work, our test datasets are WorldView-3 (provided by ESA third-party missions from Maxar ® , in 2020), Deimos-2 PAN band images (provided by Deimos Imaging, S.L., in 2021) and SkySat HD video frames (provided by Planet ® , in 2019). The training dataset, used for the GANs, is the Deimos-2 4 m/pixel multispectral (MS) green band and 1 m/pixel (downsampled from 75 cm/pixel) PAN band images (provided by UrtheCast Corp. (now EarthDaily Analytics), in 2018).
The Maxar® WorldView-3 is the first multi-payload, multi-spectral, high-resolution commercial satellite. WorldView-3 captures images at 31 cm/pixel spatial resolution for the PAN band, 1.24 m/pixel for the MS band, 3.7 m/pixel for the short-wave infrared (SWIR) band and 30 m/pixel for the Clouds, Aerosols, Vapors, Ice and Snow (CAVIS) band, from an operating orbital altitude of 617 km (see https://www.maxar.com/constellation (accessed on 9 June 2021) and https://earth.esa.int/web/eoportal/satellite-missions/v-wx-y-z/WorldView-3 (accessed on 9 June 2021)) with a swath width of 13.1 km (at nadir). WorldView-3 has an average revisit time of less than one day and is capable of collecting up to 680,000 km² area per day. The WorldView-3 data (for research purposes) are available via application through the ESA site (see https://earth.esa.int/web/guest/pi-community (accessed on 9 June 2021)).
Deimos-2 is a follow-on imaging mission of Deimos-1 for high resolution EO applications owned and operated by the UrtheCast Corp. and Deimos Imaging, S.L. (see https://elecnor-deimos.com/project/deimos-2/ (accessed on 9 June 2021)). Deimos-2 collects 0.75 m/pixel PAN band and 4 m/pixel MS band images with a swath width of 12 km (at nadir) from an orbit at ~600 km. Deimos-2 has a collection capacity of more than 150,000 km² area per day, with a two-day average revisit time worldwide (see https://earth.esa.int/eogateway/missions/deimos-2 (accessed on 9 June 2021)). The MS capability includes four channels in the visible: red, green and blue bands and near-infrared (NIR) band. The Deimos-2 satellite is capable of achieving up to ±45° off-nadir pointing and has a nominal acquisition angle up to ±30° to address particular multi-angle applications.
SkySat is a constellation of 21 high-resolution Earth imaging satellites owned and operated by the commercial company Planet® (see https://www.planet.com/products/hires-monitoring/ (accessed on 9 June 2021)). SkySat satellites operate at different orbital altitudes of 600 km, 500 km and 400 km, with different swath widths of 8 km, 5.9 km and 5.5 km (all at nadir), for SkySat-1 and SkySat-2, SkySat-3 to SkySat-15 and SkySat-16 to SkySat-21, respectively. SkySat has an image collection capacity of 400 km² per day and the SkySat constellation has a sub-daily revisit time (6-7 times on worldwide average and 12 times maximum; see https://earth.esa.int/eogateway/missions/skysat (accessed on 9 June 2021)). SkySat captures ~70 cm/pixel resolution still images or HD videos. Full videos are collected for between 30 and 120 s (at 30 frames per second) by the PAN camera from any of the SkySat constellation while the spacecraft pointing follows a target.

An Overview of the Original MAGiGAN SRR System
The original MAGiGAN SRR system is based on multi-angle feature restoration, estimation of the imaging degradation model and using GAN as a further refinement process. A simplified flow diagram is shown in Figure 2. The overall process of the MAGiGAN SRR system has 5 steps, including: (a) image segmentation and shadow labelling; (b) initial feature matching and subpixel refinement; (c) subpixel feature densification with multi-angle off-nadir view interpolation onto an upscaled nadir reference grid; (d) estimation of the image degradation model and iterative SRR reconstruction; (e) GAN-(pre-trained) based SRR refinement.

MAGiGAN operates with a two-stage SRR reconstruction scheme. The first stage, i.e., steps (a)-(d), provides an initial SRR with an upscaling factor of 2 times of the original LR resolution, followed by the second stage, i.e., step (e), for a further SRR with an upscaling factor of 2 times of the resolution of the intermediate SRR result from the first stage output. A detailed description of the above steps can be found in [1]. It should be noted that, in order to produce sufficient effective resolution enhancement (≥3 times), the input LR images for MAGiGAN must meet one key criterion, which is to contain a wide range (a minimum of 15° and preferably 30°) of multi-angle (off-nadir) views.
Figure 2. Simplified flow diagram of the MAGiGAN SRR system described in [1]. N.B.: darker coloured boxes represent the inputs and outputs.

The Proposed OpTiGAN SRR System
In this paper, we propose the OpTiGAN SRR system that is based on the original MAGiGAN framework, but with three key modifications for LR inputs of continuous image sequences or video that do not contain viewing angle changes, to produce a similar quality of SRR result (~3 times effective resolution enhancement), compared to MAGiGAN, but with significantly reduced computation time/cost. The flow diagram of the proposed OpTiGAN SRR system is shown in Figure 3, highlighting (in yellow) the modifications, in comparison to the MAGiGAN SRR system. The three modifications are listed as follows.
(a) Firstly, the shadow labelling module of MAGiGAN is removed in OpTiGAN. Given the LR inputs of continuous image sequences or video frames, the time differences between the LR images are usually minor (i.e., from seconds to minutes), thus shadow/shading differences between the LR inputs can be ignored. Therefore, there is no need to keep the shadow labelling module in OpTiGAN, whereas when dealing with multi-angle LR inputs using MAGiGAN, the input LR images are normally acquired with much longer time differences (i.e., from days to months).
(b) Secondly, we compute the dense optical flow between each LR input using the Gunnar Farneback algorithm [5], to produce translation-only transformations of each and every pixel and to replace the computationally expensive Gotcha algorithm, which calculates the affine transformations of each and every pixel. Theoretically, there should be a reduction of the SRR enhancement/quality due to the absence of multi-angle information. We compensate for this reduction by implementing the third modification, i.e., (c), which is introduced next. In return, by replacing Gotcha, we obtain a ~20 times speedup.

(c) Thirdly, we replace the original GAN prediction (refinement) module with the MARSGAN model described in [43], where we demonstrated state-of-the-art single-image SRR performance for 4 m/pixel Mars imaging datasets. The network architecture of MARSGAN can be found in [43]. In this work, we re-train the MARSGAN network with the same training dataset used in [44].
The updated GAN module offers improvement to the overall process of OpTiGAN, which compensates for the shortfall introduced in (b).
With the aforementioned modifications, OpTiGAN is able to achieve similar SRR enhancement (in comparison to MAGiGAN [1]) in a much shorter (~20 times shorter) processing time, for continuous image sequences and video frames with "zero" (or little) viewing angle changes. The rest of the processing components are the same as in MAGiGAN [1]. The overall process of the OpTiGAN SRR system has four steps, including: (a) initial image feature matching and subpixel refinement; (b) calculation of dense sub-pixel translational correspondences (motion prior) with optical flow; (c) estimation of the image degradation model using the computed motion prior from (b) and initialisation of an intermediate SRR reconstruction using PDE-TV; (d) MARSGAN (pre-trained) SRR refinement on top of the intermediate SRR output from step (c).
OpTiGAN also operates on a two-stage SRR reconstruction process. In the first stage processing of OpTiGAN, i.e., steps (a)-(c), 2 times upscaling is achieved with the optical-flow PDE-TV, followed by a second stage, i.e., step (d), for a further 2 times upscaling, using the pre-trained MARSGAN model, resulting in a total of 4 times upscaling for the final SRR result.
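For illustration, the two-stage composition can be sketched as a simple function pipeline. The sketch below is our own; `upscale2x` is a nearest-neighbour placeholder standing in for the real OFTV and MARSGAN stages, and only the 2 × 2 = 4 times upscaling geometry is illustrated:

```python
def upscale2x(image):
    """Placeholder 2x upscaler (nearest neighbour); stands in for OFTV / MARSGAN."""
    out = []
    for row in image:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def optigan_pipeline(lr_stack, stage1=upscale2x, stage2=upscale2x):
    """Stage 1 (multi-image MAP; here applied to the reference frame only) gives 2x;
    stage 2 (single-image GAN refinement) gives a further 2x, i.e., 4x overall."""
    intermediate = stage1(lr_stack[0])  # the real OFTV stage fuses the whole LR stack
    return stage2(intermediate)
```

In the real system, `stage1` consumes all co-registered LR frames and `stage2` is a forward pass of the trained MARSGAN generator; only the composition and the output dimensions carry over from this sketch.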
For the implementation of the optical flow-based motion prior estimation, we use the OpenCV implementation (see https://docs.opencv.org/3.4/de/d9e/classcv_1_1FarnebackOpticalFlow.html (accessed on 9 June 2021)) of the Gunnar Farneback algorithm [5]. This method uses a polynomial expansion transform to approximate pixel neighbourhoods of each input LR image and the reference image (which could be any image in the inputs; usually the first one) and then estimates displacement fields from the polynomial expansion coefficients. With dense optical flow, the affine correspondences of local pixel points are simplified to translation-only correspondences. The omnidirectional displacement values from dense optical flow are then passed through to the MAP process (PDE-TV) [1,3] as the motion prior.
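For readers without OpenCV to hand, the translation-only motion prior can be illustrated with a brute-force integer-shift search, a much-simplified stand-in for the Farneback dense flow (which additionally yields per-pixel, sub-pixel displacements via `cv2.calcOpticalFlowFarneback`). The function below is our own illustration, not part of the OpTiGAN code:

```python
def estimate_shift(ref, img, max_shift=3):
    """Exhaustive mean-SSD search for the integer (dy, dx) that best aligns img to ref.
    A global, translation-only analogue of the dense optical-flow motion prior."""
    h, w = len(ref), len(ref[0])
    best, best_shift = float("inf"), (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            ssd, n = 0.0, 0
            for y in range(h):
                for x in range(w):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:  # only compare overlapping pixels
                        ssd += (ref[y][x] - img[yy][xx]) ** 2
                        n += 1
            if n and ssd / n < best:
                best, best_shift = ssd / n, (dy, dx)
    return best_shift
```

The Farneback method replaces this global search with local polynomial fits, producing a dense, sub-pixel displacement field rather than a single offset.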
Using the two-stage SRR reconstruction scheme (both in MAGiGAN and the proposed OpTiGAN), we found, in [1], that the MAP-based approaches (i.e., the first stage of MAGiGAN/OpTiGAN) are highly complementary with the deep learning-based approaches (e.g., the second stage of MAGiGAN/OpTiGAN), in terms of restoring and enhancing different types of features. In particular, the first stage of the OpTiGAN processing retrieves sub-pixel information from multiple LR images and tends to produce robust restorations ("artefact free") of small objects and shape outlines, whereas the second stage of the OpTiGAN processing contributes more to the reconstruction of the high-frequency textures. Besides, experiments in [1] and in this paper (see Section 3 for demonstration) have shown that using GAN inference alone, i.e., without the first stage of the OpTiGAN processing, can result in artificial textures or even synthetic objects, whereas using GAN inference on top of an intermediate SRR result produced from a classic MAP-based approach produces the highest effective resolution, in terms of resolvable small objects and edge/outline sharpness and texture details, with the least artefacts.
On the other hand, in the second stage of OpTiGAN processing, we replace the original GAN implementation, which was an optimised version of SRGAN [4], with our recently developed MARSGAN model [43]. This follows a general GAN framework, wherein MARSGAN trains a generator network to generate potential SRR solutions and a relativistic adversarial network [32,43,45] to pick up the most realistic SRR solution. MARSGAN uses 23 Adaptive Weighted Residual-in-Residual Dense Blocks (AWRRDBs), followed by an Adaptive Weighted Multi-Scale Reconstruction (AWMSR) block in the generator network, providing much higher network capacity and better performance, compared to the original SRGAN-based model used in MAGiGAN (see Section 3 for comparisons). Besides, MARSGAN uses a balanced PSNR-driven and perceptual quality-driven [43] loss function to produce high quality restoration while limiting synthetic artefacts. A simplified network architecture of the MARSGAN model is shown in Figure 4. A detailed description of the MARSGAN network can be found in [43].
With the above changes, OpTiGAN operates best on slowly "drifting" scenes with less than 3° of change in camera orientation, while MAGiGAN operates best on "point-and-stare" and/or multi-angle repeat-pass observations with more than 15° of change in camera orientation.

Training Details of the MARSGAN Model
In this work, we retrained the MARSGAN model with the Deimos-2 images used in [44], using a similar hyperparameter set-up as used previously in [43], i.e., batch size of 64, adversarial loss weight of 0.005, low-level and high-level perceptual loss weights both set to 0.5, pixel-based loss weight of 0.5 and an initial learning rate of 0.0001, which is then halved at 50 k, 100 k, 200 k and 400 k iterations.
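The stepwise learning-rate schedule above reduces to a simple milestone lookup. The milestone values and base rate are exactly those stated here; the function itself is our own convenience sketch, not the training code:

```python
def learning_rate(iteration, base=1e-4, milestones=(50_000, 100_000, 200_000, 400_000)):
    """Halve the base learning rate at each milestone iteration already passed."""
    halvings = sum(1 for m in milestones if iteration >= m)
    return base / (2 ** halvings)
```

For example, the rate stays at 1e-4 until 50 k iterations and ends at 1e-4 / 16 = 6.25e-6 after 400 k iterations.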
We have re-used the same training datasets as described in [44]. The training datasets were formed from 102 non-repeat and cloud-free Deimos-2 images, including the 4 m/pixel MS green band images (bicubic upsampled to 2 m/pixel for OpTiGAN training) and the 1 m/pixel (downsampled) PAN band images. The 102 resampled (2 m/pixel and 1 m/pixel) Deimos-2 MS green band and PAN band images were then cropped and randomly selected (50% of total samples are reserved) to form 300,512 (32 by 32 pixels) LR training samples and 300,512 (64 by 64 pixels) HR training samples. It should be noted that additional training (for validation and comparison purposes) of the SRGAN, ESRGAN and MARSGAN networks used the original 4 m/pixel MS green band images (without upsampling) and a larger HR sample size (128 by 128 pixels), in order to achieve the unified upscaling factor (4 times) for intercomparisons.
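The paired LR/HR sample extraction can be sketched as follows; the key point is that each HR crop must cover exactly the footprint of its LR counterpart, so the HR coordinates and patch size are the LR ones scaled by the resolution ratio (here 2). The function name, arguments and deterministic seed are illustrative only:

```python
import random

def paired_crops(lr_img, hr_img, lr_patch=32, scale=2, n=4, seed=0):
    """Cut co-located LR/HR training pairs; the HR window is the LR window scaled by `scale`."""
    rng = random.Random(seed)
    lr_h, lr_w = len(lr_img), len(lr_img[0])
    pairs = []
    for _ in range(n):
        # random top-left corner on the LR grid, keeping the patch in bounds
        y = rng.randrange(lr_h - lr_patch + 1)
        x = rng.randrange(lr_w - lr_patch + 1)
        lr_crop = [row[x:x + lr_patch] for row in lr_img[y:y + lr_patch]]
        # same footprint on the HR grid: coordinates and size scaled by `scale`
        hr_crop = [row[x * scale:(x + lr_patch) * scale]
                   for row in hr_img[y * scale:(y + lr_patch) * scale]]
        pairs.append((lr_crop, hr_crop))
    return pairs
```

With `lr_patch=32` and `scale=2` this yields the 32 by 32 pixel LR and 64 by 64 pixel HR samples described above.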
All training and subsequent SRR processing were performed on the latest Nvidia ® RTX 3090 GPU and an AMD ® Ryzen-7 3800X CPU.
It should be noted that the final trained MARSGAN model scales well across datasets of different resolutions (from 30 cm/pixel to 10 m/pixel). Other than the demonstrated ultra-high-resolution datasets, it also works for the 4 m/pixel Deimos-2 MS band images and 10 m/pixel Sentinel-2 images.

Assessment and Evaluation Methods
In this work, our target testing datasets are amongst the highest resolution satellite optical data for EO, therefore there is no "reference HR" image available for direct comparison and evaluation. Standard image quality metrics, e.g., PSNR and structural similarity index metric, that require a reference HR image, cannot be used for this work. However, we visually examined and compared the restorations of different artificial targets (i.e., the bar-pattern targets and fan-shaped targets, available at the Baotou Geocal site) in the SRR results against fully measured reference truth (see Figure 5). The smallest bars that were resolvable in the SRR results are summarised in Section 3.
Figure 5. Illustration of the artificial optical targets at the Baotou Geocal site. Photo pictures (left) courtesy of [48] and map of the bar-pattern target (right) provided by L. Ma [48] and Y. Zhou [49] (private correspondence, 2021).

Experimental Overview
The Baotou Geocal site (50 km away from Baotou city), located at 40°51′06.00″N, 109°37′44.14″E, Inner Mongolia, China, was used as the test site in this study, in order to obtain assessments that can be compared against other published ones. The permanent artificial optical targets (see Figure 5) at the Baotou Geocal site provide a broad dynamic range, good uniformity, high stability and multi-function capabilities. Since 2013, the artificial optical targets have been successfully used for payload radiometric calibration and on-orbit performance assessment for a variety of international and domestic satellites. The artificial optical targets were set up on a flat area of approximately 300 km², with an average altitude of 1270 m. The Baotou Geocal site features a cold semi-arid climate that has (an average of) ~300 clear-sky days every year, which has made it an ideal site for geo-calibration and validation work. Some illustration photos (courtesy of [48]) of the artificial optical targets, including a knife-edge target, a fan-shaped target and a bar-pattern target, at the Baotou Geocal site, as well as the fully measured reference truth (provided by L. Ma [48] and Y. Zhou [49] in private correspondence) for the bar-pattern target, can be found in Figure 5.
We tested the proposed OpTiGAN SRR system with three ultra-high-resolution satellite datasets, i.e., the 31 cm/pixel WorldView-3 PAN images, the 75 cm/pixel Deimos-2 PAN images and the 70 cm/pixel SkySat HD video frames. Table 1 shows the input image IDs used in this work. Note that the first row was used as the reference image in the case of multi-image SRR processing and it was also the sole input in the case of single-image SRR processing.
In addition, we performed edge sharpness measurements using the Imatest® software (https://www.imatest.com/ (accessed on 9 June 2021)) to measure the effective resolutions of the SRR results. The Imatest® edge sharpness measurement calculates the average number of pixels spanned by the 20% to 80% rise of slanted-edge profiles within a given high-contrast area. If the number of pixels spanned by such a profile rise in the SRR image is compared against the number of pixels involved in its LR counterpart, for the same slanted-edge profile, then their ratio can be used to estimate an effective resolution enhancement factor between the two images (biased towards the measured edge in particular). We performed this test using the knife-edge target visible at the Baotou Geocal site (see Figure 5) with multiple slanted edges and then averaged the resolution enhancement factors, estimated with the different slanted edges, to obtain an averaged effective resolution enhancement factor for the SRR results.
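The 20-80% rise measurement on a single edge profile reduces to finding the two threshold crossings of the normalised profile; the ratio of rise widths, after accounting for the grid-scale change, then gives the enhancement factor. The following is a minimal sketch of that calculation (our own, not the Imatest implementation, which additionally handles edge-angle projection and oversampling):

```python
def rise_width(profile, lo=0.2, hi=0.8):
    """Distance in pixels between the 20% and 80% crossings of a monotonic edge profile."""
    mn, mx = min(profile), max(profile)
    norm = [(v - mn) / (mx - mn) for v in profile]

    def crossing(level):
        for i in range(1, len(norm)):
            if norm[i - 1] < level <= norm[i]:
                # linear interpolation between the two bracketing samples
                return i - 1 + (level - norm[i - 1]) / (norm[i] - norm[i - 1])
        raise ValueError("level never crossed")

    return crossing(hi) - crossing(lo)

def enhancement_factor(lr_profile, srr_profile, upscale):
    """Effective-resolution gain: LR rise width expressed on the SRR grid over SRR rise width."""
    return rise_width(lr_profile) * upscale / rise_width(srr_profile)
```

For example, if an edge rises over the same number of its own pixels in both images, a 4 times upscaled SRR image with no residual blur would report a factor of 4; residual blur widens the SRR rise and lowers the factor accordingly.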

Experimental Overview
The Baotou Geocal site (50 km away from the city of Baotou), located at 40°51′06.00″N, 109°37′44.14″E, Inner Mongolia, China, was used as the test site in this study, in order to obtain assessments which can be compared against other published ones. The permanent artificial optical targets (see Figure 5) at the Baotou Geocal site provided broad dynamic range, good uniformity, high stability and multi-function capabilities. Since 2013, the artificial optical targets have been successfully used for payload radiometric calibration and on-orbit performance assessment for a variety of international and domestic satellites. The artificial optical targets were set up on a flat area of approximately 300 km², with an average altitude of 1270 m. The Baotou Geocal site features a cold semi-arid climate with an average of ~300 clear-sky days every year, which has made it an ideal site for geo-calibration and validation work. Some illustration photos (courtesy of [48]) of the artificial optical targets, including a knife-edge target, a fan-shaped target and a bar-pattern target, at the Baotou Geocal site, as well as the fully measured reference truth (provided by L. Ma [48] and Y. Zhou [49] in private correspondence) for the bar-pattern target, can be found in Figure 5.
We tested the proposed OpTiGAN SRR system with three ultra-high-resolution satellite datasets, i.e., the 31 cm/pixel WorldView-3 PAN images, the 75 cm/pixel Deimos-2 PAN images and the 70 cm/pixel SkySat HD video frames. Table 1 shows the input image IDs used in this work. Note that the first row was used as the reference image in the case of multi-image SRR processing and was also the sole input in the case of single-image SRR processing. Since MAGiGAN and OpTiGAN require different inputs, i.e., LR images with and without viewing angle differences, respectively, we are not able to provide a direct performance comparison between these two SRR systems. However, in this work, we provide intercomparisons against four different SRR techniques: (1) an optimised version of the SRGAN [4] single-image SRR network that was used in MAGiGAN [1] (hereafter referred to as SRGAN in all text, figures and tables); (2) the ESRGAN [32] single-image SRR network; (3) the MARSGAN [43] single-image SRR network, which was also used as the second-stage processing of OpTiGAN; (4) the optical-flow PDE-TV multi-image SRR (hereafter referred to as OFTV in all text, figures and tables), which was also used as the first-stage processing of OpTiGAN. It should be noted that all deep network-based SRR techniques, i.e., SRGAN, ESRGAN, MARSGAN and OpTiGAN, were trained with the same training datasets, as described in Section 2.3.

Demonstration and Assessments of WorldView-3 Results
For the WorldView-3 experiments, we used a single input image for the SRGAN, ESRGAN and MARSGAN SRR processing and used five overlapped input images for the OFTV and OpTiGAN SRR processing. Four cropped areas (A-D) covering the different artificial targets at the Baotou Geocal site are shown in Figure 6 for comparisons of the SRR results and the original input WorldView-3 image.
Area A showed the 10 m × 2 m and 5 m × 1 m bar-pattern targets. We can observe that all five SRR results showed good restoration of these larger sized bars. The SRGAN result of Area A showed some rounded corners on the 5 m × 1 m bars, whereas the ESRGAN result showed sharper corners, but with some high-frequency noise. The MARSGAN result showed the best overall quality among the three single-image deep learning-based SRR techniques. In comparison to the deep learning-based techniques, OFTV showed smoother edges/outlines, but with no observable artefacts. The OpTiGAN result, which was based on OFTV and MARSGAN, showed the best overall quality, i.e., similar sharpness on edges/outlines/corners, but with no artefacts or high-frequency noise.
Area B showed the central area of the knife-edge target. All three deep learning-based techniques showed sharp edges of the target; however, the SRGAN result demonstrated some artefacts at the centre corner and at the edges, ESRGAN showed some high-frequency noise, whilst MARSGAN had some artefacts at the edges. OFTV showed a blurrier edge compared to SRGAN, ESRGAN and MARSGAN, but it showed the fewest artefacts. OpTiGAN showed sharp edges that were similar to ESRGAN and MARSGAN, but with much less noise and fewer artefacts.
Area C showed a fan-shaped target. We can observe that all five SRR results showed good restoration of the target at mid-range radius. The MARSGAN and OpTiGAN results showed reasonable restoration at the centre radius, i.e., towards the end of the target, with only a few artefacts at the centre and in between each strip pattern. ESRGAN also showed some reasonable restoration at small radius, but the result was much noisier compared to MARSGAN and OpTiGAN. SRGAN showed obvious artefacts at small radius. OFTV did not show as much detail as the other techniques, but it showed no observable noise or artefacts. In terms of sharpness, ESRGAN, MARSGAN and OpTiGAN had the best performance. However, MARSGAN and OpTiGAN had less noise compared to ESRGAN, with OpTiGAN showing the fewest artefacts among the three.
Area D showed a zoom-in view of the larger 10 m × 2 m bar-pattern targets along with multiple smaller bar-pattern targets with sizes ranging from 2.5 m × 0.5 m, 2 m × 0.4 m, 2 m × 0.3 m, 2 m × 0.2 m, to 2 m × 0.1 m, from bottom-right to top-right (see Figure 5). We can observe that all five SRR techniques were able to restore the 2.5 m × 0.5 m bars.
In Table 2, we show the BRISQUE and PIQE image quality scores (0-100, with lower scores representing better image quality) that were measured from the full image (see Supplementary Materials for the full image) at the Baotou Geocal site. We can observe improvements in terms of image quality from all five SRR techniques. MARSGAN achieved the best image quality score from BRISQUE, whereas ESRGAN achieved the best image quality score from PIQE. The image quality scores for OpTiGAN were close to those of ESRGAN and MARSGAN. The lower image quality scores reflected better overall image sharpness and contrast. However, since this measurement did not account for incorrect high-frequency texture, incorrect reconstruction of small-sized targets or synthetic artefacts, the better scores do not reflect absolutely better quality of the SRR results. More quantitative assessments are given in Tables 3 and 4.
Table 3. Summary of slanted-edge measurements, as shown in Figure 7, for total pixel counts for the 20-80% rise of the edge profile and the suggested effective resolution enhancement factor compared to the input WorldView-3 image (upscaled by a factor of 4 for comparison).
Table 4. Summary of the effective resolution derived from Figure 7 and Table 3, the smallest resolvable bar-pattern targets observed from the WorldView-3 image and each of the SRR results, in comparison to Figure 5, as well as the number of input images used and computing time for the Baotou Geocal site.
In the field of photo-realistic SRR, many techniques present visually impressive images with a four times or even eight times upscaling factor, but their effective resolution never reaches the upscaling factor. In this paper, we present a more quantitative assessment using the Imatest® slanted-edge measurement for the knife-edge target at the Baotou Geocal site. In Figure 7, we show three automatically detected edges and their associated 20-80% profile rise analyses for each of the input WorldView-3 image and the SRGAN, ESRGAN, MARSGAN, OFTV and OpTiGAN SRR results. The total pixel counts for the 20-80% profile rise of each slanted edge are summarised in Table 3. We divided the total pixel counts of the input WorldView-3 image (upscaled by a factor of 4 for comparison) by the total pixel counts of the SRR results to obtain the effective resolution enhancement factor for each of the measured edges. Note that the edge sharpness measurements are generally similar, but may differ from area to area, even within the same image; thus, we averaged the three measurements to obtain the final effective resolution enhancement factor, shown in the last row of Table 3.
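The per-edge ratios and their averaging can be expressed as follows (a minimal sketch using placeholder pixel counts for illustration, not the actual values reported in Table 3):

```python
# Placeholder 20-80% rise pixel counts for three matched slanted edges,
# measured on the upscaled LR input and on the SRR result respectively
# (hypothetical values; the real counts are those in Table 3).
lr_rise_pixels = [12.0, 11.5, 12.5]
srr_rise_pixels = [3.5, 3.0, 3.2]

# Per-edge enhancement factor: LR rise width divided by SRR rise width.
factors = [lr / sr for lr, sr in zip(lr_rise_pixels, srr_rise_pixels)]

# The final effective resolution enhancement factor is the average.
avg_factor = sum(factors) / len(factors)
print(round(avg_factor, 2))  # → 3.72
```

Averaging over several edges reduces the bias any single edge introduces into the estimate.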

Another quantitative assessment of image effective resolution was achieved by visually checking the smallest resolvable bar targets for each of the SRR results. This is summarised in Table 4. We checked the smallest resolvable bar targets in terms of "recognisable" and "with good visual quality", where "recognisable" means visible and identifiable, disregarding noise or artefacts, and "with good visual quality" means clearly visible with little or no artefacts. We can see from Table 4 that, with a 31 cm/pixel resolution WorldView-3 image, we can resolve 60 cm/80 cm bar targets and, with the 9 cm/pixel OpTiGAN SRR, we can resolve 20 cm/30 cm bar targets. Table 4 also shows the effective resolution calculated from Table 3, along with input and processing information. Note that the proposed OpTiGAN system significantly shortened the required processing time, from a few hours to around 10 min for five 420 × 420-pixel input images, in comparison to MAGiGAN [1], on which we based the aforementioned modifications.
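Assuming the effective resolution is simply the input ground sampling distance divided by the measured enhancement factor, the WorldView-3 figure quoted above follows directly:

```python
def effective_resolution(input_gsd_cm, enhancement_factor):
    """Effective resolution of an SRR result, assuming it equals the
    input ground sampling distance divided by the enhancement factor."""
    return input_gsd_cm / enhancement_factor

# WorldView-3: 31 cm/pixel input with the measured 3.69x enhancement,
# consistent with the ~9 cm/pixel effective resolution reported here.
print(round(effective_resolution(31.0, 3.69), 1))  # → 8.4
```

The same relation applied to the 70 cm/pixel SkySat frames with the 3.94 times factor reported later yields roughly 18 cm/pixel, matching the OpTiGAN SkySat result.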

Demonstration and Assessments of Deimos-2 Results
For the Deimos-2 experiments, we used one input for the SRGAN, ESRGAN and MARSGAN SRR processing and three repeat-pass inputs for OFTV and OpTiGAN SRR processing. Four cropped areas (A-D) covering the different artificial targets at the Baotou Geocal site are shown in Figure 8 for comparison with the SRR results and the original input Deimos-2 image.
Area A showed the overall area of the bar-pattern targets that ranged from 25 m × 5 m to 2 m × 0.1 m. We can observe that all five SRR techniques were able to bring out the correct outlines for the 10 m × 2 m bars; however, the ESRGAN results displayed some artefacts (incorrect shape) and the OFTV results were slightly smoother than the others. For the smaller 5 m × 1 m bars, only ESRGAN, MARSGAN and OpTiGAN results showed some reasonable restoration, but with added noise. Some details could be seen for the 4.5 m × 0.9 m bars with MARSGAN and OpTiGAN, but with too much noise.
Area B showed the centre of the knife-edge target. We can observe that the SRGAN, ESRGAN and OFTV results were quite blurry at the centre and edges. In addition, ESRGAN tended to produce some artificial textures that made the edges even more blurry. MARSGAN showed sharper edges in comparison to SRGAN, ESRGAN and OFTV. OpTiGAN showed further improvement on top of MARSGAN.
Area C showed a zoom-in view of the smaller 10 m × 2 m and 5 m × 1 m bar-pattern targets. We could clearly observe the noise in SRGAN and ESRGAN for the 10 m × 2 m bars. The OFTV result was blurry, but without any artefact. MARSGAN and OpTiGAN had the best (and similar) restoration for the 10 m × 2 m bars. For the 5 m × 1 m bars, only OpTiGAN produced reasonable restoration.
Area D showed a fan-shaped target. We can observe that all five SRR results showed good restoration of the target at mid-to-long radius. At the centre of the fan-shaped target, the SRGAN and OFTV results were blurry and the ESRGAN, MARSGAN and OpTiGAN results showed different levels of artefact.
In Table 5, we show the BRISQUE and PIQE image quality scores (0-100, with lower scores representing better quality) that were measured with the full image (see Supplementary Materials for the full image) at the Baotou Geocal site. We can observe improvements in terms of image quality from all five SRR techniques. MARSGAN and OpTiGAN received the best overall scores for the Deimos-2 results.
In Figure 9, we present a quantitative assessment achieved via the Imatest® slanted-edge measurements for the knife-edge target at the Baotou Geocal site. Three automatically detected edges and their associated 20-80% profile rise analyses are shown for each of the input Deimos-2 image and the SRGAN, ESRGAN, MARSGAN, OFTV and OpTiGAN SRR results. The total pixel counts for the 20-80% profile rise of each slanted edge are summarised in Table 6. We divided the total pixel counts of the input Deimos-2 image (upscaled by a factor of 4 for comparison) by the total pixel counts of the SRR results to obtain the effective resolution enhancement factor for each of the measured edges. The three measurements were then averaged to obtain the final effective resolution enhancement factor, as shown in the last row of Table 6. For the Deimos-2 experiments, we can observe that the effective resolution enhancements were generally not good with SRGAN, ESRGAN and OFTV; however, with OpTiGAN, we still obtained a factor of 2.96 times improvement.
Table 6. Summary of slanted-edge measurements, as shown in Figure 9, for total pixel counts for the 20-80% rise of the edge profile and the indicated effective resolution enhancement factor compared to the input Deimos-2 image (upscaled by a factor of 4 for comparison).
The other quantitative assessment of image effective resolution was achieved via visual checking of the smallest resolvable bar targets for each of the SRR results. The results are summarised in Table 7. We can observe from Table 7 that, even though the effective resolutions for edges were generally not good for SRGAN, ESRGAN and OFTV, as shown in Table 6, the results still showed reasonable restorations of the bar targets. We can see from Table 7 that, with a 75 cm/pixel resolution Deimos-2 image, we can resolve 2 m/3 m bar targets and, with the 28 cm/pixel OpTiGAN SRR, we can resolve 1 m/2 m bar targets.
Table 7. Summary of the effective resolution derived from Figure 9 and Table 6, the smallest resolvable bar-pattern targets observed from the Deimos-2 image and each of the SRR results, in comparison with Figure 5, as well as the number of input images used and computing time for the Baotou Geocal site.

Demonstration and Assessments of SkySat Results
For the SkySat experiments, we used one input for the SRGAN, ESRGAN and MARSGAN SRR processing and five continuous video frames for the OFTV and OpTiGAN SRR processing. Four cropped areas (A-D) covering the different artificial targets available at the Baotou Geocal site are shown in Figure 10 for comparison with the SRR results and the original input SkySat video frames.
Area A showed the overall area of the bar-pattern targets that ranged from 25 m × 5 m to 2 m × 0.1 m. We can observe that SRGAN, MARSGAN, OFTV and OpTiGAN were able to bring out the correct outlines of the 10 m × 2 m bars. The ESRGAN result showed some high-frequency textures and also some reconstruction of the 5 m × 1 m bars; however, its result was very noisy. OFTV was blurry, but with the fewest artefacts. The MARSGAN and OpTiGAN results showed the best restoration, with little noise and artefacts, for the bar-pattern targets, especially in bringing out the 5 m × 1 m bars. A few textures were revealed for the 4.5 m × 0.9 m bars by OpTiGAN, but no individual 4.5 m × 0.9 m bar was resolved in any SRR result.
Area B showed the centre of the knife-edge target. We can observe artefacts from SRGAN, noise from ESRGAN, and blurring effects from MARSGAN and OFTV. In terms of edge sharpness for this area, the SRGAN and OpTiGAN results were the best, but OpTiGAN had significantly fewer artefacts and less noise compared to SRGAN.
Area C showed a zoom-in view of the smaller 10 m × 2 m and 5 m × 1 m bar-pattern targets. For the smaller 5 m × 1 m bars, the MARSGAN and OpTiGAN results showed some good restoration, but both with artefacts and noise.
Area D showed the fan-shaped target. The ESRGAN result showed the best restoration at the centre, but it was also the noisiest. SRGAN displayed artefacts towards the centre. MARSGAN and OpTiGAN showed the best restoration for mid-to-long radiuses.
Finally, we give the BRISQUE and PIQE image quality scores (0-100, with lower scores representing better quality) in Table 8. We can observe significant improvements in terms of image quality from all five SRR results. ESRGAN, MARSGAN and OpTiGAN demonstrated the best overall scores for the SkySat experiments. In Figure 11, we present the Imatest® slanted-edge measurements for the knife-edge target. Three automatically detected edges and their associated 20-80% profile rise analyses are shown for each of the input SkySat video frame and the SRGAN, ESRGAN, MARSGAN, OFTV and OpTiGAN SRR results. The total pixel counts for the 20-80% profile rise of each slanted edge are summarised in Table 9. We divided the total pixel counts of the input SkySat video frame (upscaled by a factor of 4 for comparison) by the total pixel counts of the SRR results to obtain an indicative effective resolution enhancement factor for each of the measured edges. The three measurements were then averaged to obtain the final effective resolution enhancement factor, as shown in the last row of Table 9. For the SkySat experiments, we observed only marginal effective resolution enhancements for SRGAN, ESRGAN, MARSGAN and OFTV; however, with OpTiGAN, the effective resolution enhancement factor was much higher, at 3.94 times.
The smallest resolvable bar targets from the original SkySat video frame and its SRR results are summarised in Table 10. With a 70 cm/pixel resolution SkySat video frame, we can resolve 2 m/5 m bar targets. SRGAN brought the 3 m bars to visually good quality. ESRGAN did not improve the quality of the 5 m and 3 m bars, but made the 1 m bars resolvable. MARSGAN improved the quality of the 5 m and 3 m bars and also made the 1 m bars resolvable. OFTV did not make the 1 m bars resolvable, but improved the visual quality of the 5 m, 3 m and 2 m bars. Finally, with the 18 cm/pixel OpTiGAN SRR, the 1 m bars were resolvable and the qualities of the 5 m, 3 m and 2 m bars were all improved.
Table 9. Summary of slanted-edge measurements, as shown in Figure 11, for total pixel counts for the 20-80% rise of the edge profile and the indicated effective resolution enhancement factor compared to the input SkySat video frame (upscaled by a factor of 4 for comparison).
Table 10. Summary of the effective resolution derived from Figure 11 and Table 9, the smallest resolvable bar-pattern targets observed from the SkySat image and each of the SRR results, in comparison to Figure 5, as well as the number of input images used and computing time for the Baotou Geocal site.

OpTiGAN Results Demonstration over Different Areas
In this section, we demonstrate further OpTiGAN SRR results for 31 cm WorldView-3 images and 75 cm Deimos-2 PAN band images over different areas. These included small building blocks within the Baotou site, from WorldView-3 (Figure 12), a non-urban (snow-covered) area in Greenland, from WorldView-3 (Figure 13), small and flat residential building blocks in Adelaide, from Deimos-2 (Figure 14), and tower buildings, ships and highway roads over Dubai, from Deimos-2 (Figure 15). It should be noted that all OpTiGAN SRR results in this section were produced with four input images from the two datasets (WorldView-3 and Deimos-2) and that the image IDs of the reference images are given in the figure captions. We can observe, from the different examples of urban and non-urban scenes, that OpTiGAN was able to restore structural outlines (e.g., buildings, roads and geological surface features) and small objects (e.g., windows of a building, cars and ships) with much higher edge sharpness and no obvious artefacts.

Discussion
We can observe from the results that OFTV played an important role in producing the initial SRR image (the intermediate output of the first-stage processing of OpTiGAN), providing initial controls on potential artefacts and noise arising from the follow-on GAN-based refinement process (the second-stage processing in OpTiGAN). Although SRGAN, ESRGAN and MARSGAN alone were able to produce visually pleasing SRR results, due to artefacts and noise, we did not observe significant improvement in terms of resolving smaller bar targets in any of the three experiments.
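As a schematic illustration of this two-stage design, the following Python sketch uses simple stand-ins for both stages (plain multi-frame averaging in place of OFTV and an unsharp mask in place of the MARSGAN network); it conveys only the data flow of the pipeline, not the actual algorithms:

```python
import numpy as np

def two_stage_srr(lr_frames, upscale=4):
    """Schematic two-stage SRR: a multi-image first stage builds an
    initial, artefact-constrained estimate; a single-image second
    stage refines detail. Both stages are simple stand-ins here."""
    # Stage 1 stand-in (in place of OFTV): fuse co-registered frames
    # by averaging, then upscale with nearest-neighbour replication.
    fused = np.mean(np.stack(lr_frames), axis=0)
    initial = fused.repeat(upscale, axis=0).repeat(upscale, axis=1)
    # Stage 2 stand-in (in place of the GAN network): a mild unsharp
    # mask standing in for learned detail refinement.
    blurred = (np.roll(initial, 1, 0) + np.roll(initial, -1, 0) +
               np.roll(initial, 1, 1) + np.roll(initial, -1, 1)) / 4.0
    return np.clip(initial + 0.5 * (initial - blurred), 0.0, 255.0)

out = two_stage_srr([np.full((8, 8), 100.0) for _ in range(5)])
print(out.shape)  # → (32, 32)
```

The key design point is that the second stage refines an already-fused estimate rather than a single raw frame, which is what constrains the artefacts the GAN could otherwise hallucinate.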
In addition, the effective resolutions achieved by SRGAN, ESRGAN and MARSGAN alone, for ultra-high-resolution satellite imagery, when the network did not have any prior knowledge of the HR counterpart of the input images (i.e., the network had not been trained with any HR truth at such ultra-high spatial resolution), were generally limited, as demonstrated in the slanted-edge measurements. OpTiGAN provides a solution for SRR of ultra-high-resolution satellite image sequences or videos.
The proposed multi-image OpTiGAN SRR system benefits from more input images; we based our test on a minimal number of LR inputs and tested its limit for only three LR inputs for the Deimos-2 datasets. We can observe better effective resolution enhancement factors with OpTiGAN using five LR inputs for WorldView-3 and SkySat, in comparison to using three LR inputs for Deimos-2.
However, more LR inputs (or a larger input image size) require more computing time. The processing speed of an SRR system that involves MAP approaches is generally not comparable with a deep learning-based inference system. However, with OpTiGAN, as most of the components are portable to GPU, we were able to produce SRR results for small inputs (≤300 × 300 pixels, ≤5 inputs) within a fairly short time (≤10 min). The other downside is that we observed some artefacts, in our experiments, from the first stage (OFTV) of the OpTiGAN processing when using large input LR images (typically for input images >1000 × 2000 pixels). This issue can be fully eliminated by using tiles of smaller input images (≤500 × 500 pixels).
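The tiling workaround can be sketched as follows; this is a minimal non-overlapping tiler for illustration only (the exact tiling and stitching scheme used by OpTiGAN, including any overlap handling, is not specified here):

```python
import numpy as np

def tile_image(image, tile=500):
    """Split an image into tiles of at most `tile` x `tile` pixels,
    returning each tile with its top-left offset so that the SRR
    outputs can be stitched back in place afterwards."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            tiles.append(((y, x), image[y:y + tile, x:x + tile]))
    return tiles

# A 1200 x 1700 image yields 3 x 4 = 12 tiles of <= 500 x 500 pixels
tiles = tile_image(np.zeros((1200, 1700)))
print(len(tiles))  # → 12
```

Each tile can then be processed through the SRR pipeline independently and the outputs reassembled using the stored offsets (scaled by the upscaling factor).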
In the future, different optical flow algorithms and/or different SRR networks can be explored to further improve the performance of each of the two processing stages of the OpTiGAN system. A multi-scale approach can be explored to replace the current sequential approach, in order to better integrate the traditional MAP solution with the deep learning-based result. Better training of the MARSGAN model is expected in the future, when more training images become available. In addition, separately training the MARSGAN model for different types of targets/surface features may potentially improve performance. In this case, an automated system to identify such different types of targets and surface features will be helpful both for training (scene classification) and for inference (trained model selection).