Single Image Super-Resolution Restoration of TGO CaSSIS Colour Images: Demonstration with Perseverance Rover Landing Site and Mars Science Targets

Abstract: The ExoMars Trace Gas Orbiter (TGO)’s Colour and Stereo Surface Imaging System (CaSSIS) provides multi-spectral optical imagery at 4–5 m/pixel spatial resolution. Improving the spatial resolution of CaSSIS images would allow greater amounts of scientific information to be extracted. In this work, we propose a novel Multi-scale Adaptive weighted Residual Super-resolution Generative Adversarial Network (MARSGAN) for single-image super-resolution restoration of TGO CaSSIS images, and demonstrate how this provides an effective resolution enhancement factor of about 3 times. We demonstrate with qualitative and quantitative assessments of CaSSIS SRR results over the Mars2020 Perseverance rover’s landing site. We also show examples of similar SRR performance over 8 science test sites mainly selected for being covered by HiRISE at higher resolution for comparison, which include many features unique to the Martian surface. Application of MARSGAN will allow high resolution colour imagery from CaSSIS to be obtained over extensive areas of Mars beyond what has been possible to obtain to date from HiRISE.


Introduction
Orbital imaging has been a highly effective way of exploring the Martian surface. The ExoMars Trace Gas Orbiter (TGO)'s Colour and Stereo Surface Imaging System (CaSSIS) provides multi-spectral optical imagery at 4-5 m/pixel spatial resolution [1]. Compared to the Mars Reconnaissance Orbiter (MRO) Context Camera (CTX) images at 6 m/pixel [2], CaSSIS offers higher spatial resolution and image quality, as well as colour bands. However, the spatial resolution of CaSSIS is limited compared to the details revealed by the MRO High Resolution Imaging Science Experiment (HiRISE) images, typically at 25-50 cm/pixel resolution [3]. CaSSIS has much better global coverage than HiRISE (<4% coverage since 2006) and will provide more repeat and stereo observations in the future.
Improving the spatial resolution of CaSSIS images would allow greater amounts of information to be extracted about the nature of the surface and how it formed or changes over time. One of the options to achieve a greater spatial resolution is through the use of Super-Resolution Restoration (SRR/SR) techniques. This was first demonstrated with HiRISE [...]
MARSGAN SRR can be used not only to support image analysis for improved scientific understanding of the Martian surface, but also to support existing and future rover missions. SRR images can be employed for a wide range of applications such as the detection of objects which may present hazards for landers and rover navigation and path planning, the improvement of colour and hyperspectral images for better understanding of surface mineralogy, the detection of spacecraft hardware, and better definition of dynamic features. Such techniques can also be applied to time series for change detection, for example in tracking dynamic features.
In this work, we describe in detail and show qualitative and quantitative assessments of the proposed single-image MARSGAN CaSSIS SRR result for the Mars2020 Perseverance Rover's landing site [6,7], Jezero Crater. In addition, we demonstrate other potential applications with 8 further test sites with scientifically interesting features.
The layout of this paper is as follows. In Section 1.1, we introduce the 8 study sites. In Section 1.2, we review previous work in image SRR. In Section 2.1, we describe the MARSGAN architecture. In Section 2.2, we present the MARSGAN loss functions. In Section 2.3, we introduce different assessment methods. In Section 2.4, we provide training and experimental details. In Section 3.1, we demonstrate CaSSIS SRR results for Jezero crater and provide assessment details. In Section 3.2, we demonstrate CaSSIS SRR results for 8 selected science targets. In Section 4.1, we discuss the perceptual-driven and PSNR-driven SRR solutions. In Section 4.2, we broadly compare the proposed single-image and deep-learning based approach with a traditional multi-image computer-vision based approach. In Section 4.3, we briefly demonstrate the potential of the MARSGAN model for HiRISE, CTX, and CRISM data. In Section 5, we summarise conclusions and discuss future work.

Study Sites
Our selected science targets include bedrock layers (Site-1), bright and dark slope streaks (Site-2), defrosting dunes and dune gullies (Site-3), gullies at Gasa crater (Site-4), recurring slope lineae at Hale Crater (Site-5), scalloped depressions and dust devils at Peneus Patera (Site-6), gullies at Selevac crater (Site-7), and defrosting (so-called) spiders (Site-8). Figure 2 shows cropped samples of the CaSSIS colour images for the above-mentioned science targets. The CaSSIS and HiRISE image IDs for the above study sites are listed in Table 1 and can be found in Section 2.4.
Remote Sens. 2021, 13, 1777
Site 1 is an image of the floor of a 41 km diameter crater located to the north of the Argyre Basin whose floor exposes ancient layered rock deposits. The light tone of the layers in these deposits suggests they could be ancient clays, e.g., [8,9], and therefore may represent an ancient aqueous environment. The crater floor also hosts a number of dark sand dunes, e.g., [10][11][12], and transverse aeolian ridges, e.g., [13,14]. Many dark dunes on Mars have been shown to be currently in motion [15][16][17][18], whereas transverse aeolian ridges are thought to be inactive [19,20].
Site 2 captures the northern rim slope and floor of an ancient ~45 km diameter crater in Arabia Terra. The steeply sloping hillslopes have many slope streaks, believed to represent avalanches of dust [21][22][23]. Many new slope streaks have been observed and they have also been observed to fade [24,25]. Their exact trigger is still an open question [26][27][28]. The flatter areas host numerous transverse aeolian ridges.
Site 3 comprises a ~55 km diameter crater in Noachis Terra which hosts a dunefield (USGS dune database 0175-546) with active dune gullies [29][30][31]. The CaSSIS image is taken at a time of year when the seasonal frosts are retreating, creating distinctive albedo patterns on the surface [32][33][34]. These defrosting spots represent areas where dark dust has been deposited on top of the bright seasonal ices (mainly carbon dioxide ice) by CO2 gas escaping from underneath the ice [35]. Flows of dark sand along gullies are also thought to occur at this time of year [29,36], driven by CO2 sublimation.
Site 4 is at Gasa crater, a 6.5 km diameter crater located inside Cilaos crater, a 21.4 km diameter crater. Gasa Crater has annually active gullies along its south-facing wall [36][37][38][39]. Gullies are also located on the south-facing wall of the larger host crater. Simulated and actual CaSSIS images were able to pick out new deposits in this crater based on their colour contrast [40,41]. The gullies in this crater have exceptionally well-developed source alcoves into the bedrock. This crater has a pitted floor, which is thought to indicate an impact into icy materials and subsequent volatile release from the impact melt deposits [40,41].
Site 5 is located on the central peak of Hale Crater, a 120-150 km diameter crater located on the northern rim of the Argyre basin. The slopes here host Recurring Slope Lineae (RSL) and gullies. RSL are dark linear markings that grow downslope during the warmest periods of the year and were initially thought to be liquid water seeps, e.g., [42][43][44], although that interpretation has been overturned and re-established many times over in the last decade, e.g., [45][46][47][48][49][50]. The southern edge of the central peak area is bounded by dark aeolian dunes. Further south the crater floor is intensely pitted, and this texture indicates that the Hale impact liberated volatiles [41,51–53].
Site 6 shows terrain on the flank of Peneus Patera which hosts "scalloped depressions". These depressions are believed to be formed by loss of interstitial ice via sublimation and subsequent collapse of the overlying terrain, and are often compared to terrestrial thermokarst developed in permafrost terrains [54][55][56][57][58]. This particular image also shows dust devil tracks: dark tracks left on the ground by the passage of small wind vortexes which remove a thin layer of surface dust [59][60][61][62]. Dust devil tracks are constantly being formed and fading, their pattern rarely remaining similar between two orbital images.
Site 7 is the 7.3 km diameter Selevac Crater whose south-facing walls host numerous gullies, some of which have been active over the last decade [36]. The north-facing walls host talus features typical of fresh impact craters. This crater has a pitted floor similar to that shown by Site 4, Gasa Crater. The terrain to the south of this crater hosts the subdued crater rim of a Noachian aged crater that seems to be almost totally infilled and perhaps breached by fluvial erosion [63,64].
Site 8 is located near the south pole of Mars in a terrain intensely patterned by "spiders". These enigmatic surface features are characterised by hierarchical branching networks of depressions, leading to one or more deeper foci. They are believed to be formed by repeated erosion of the surface caused by gas escaping from under the metre-thick seasonal ice deposits [35,65–69]. In this image, dark spots associated with defrosting can be seen, as described already for Site 3. However, no perennial changes have been observed in spider systems, so whether they are active today is a subject of debate.

Previous Work
SRR (or SR) refers to the task of enhancing the spatial resolution of an image from Lower-resolution (LR) to Higher-resolution (HR). In the past, SRR was based on the idea that a combination of the non-redundant information contained in multiple LR images can be used to generate a HR image. This is also referred to as multi-image SRR in the field of computer vision. This was built on the fundamental basis of using image coregistration, followed by multi-image sparse coding [70] or multi-image non-uniform interpolation [71]. The actual enhancement of resolution, as well as their robustness to noise, are generally limited with the simple forward techniques. Later around the 2010s, SRR techniques followed the Maximum a Posteriori (MAP) approach [72][73][74] to resolve the inverse process stochastically by assuming a model that each LR image is a downsampled, distorted, blurred, and noise added version of the true scene, i.e., the HR image. Building on the MAP techniques, we previously proposed two SRR systems in [4] and [75] for Mars orbital imagery and Earth observation satellite imagery, adopting the multi-angle imaging properties, and for the latter one, combining deep learning techniques.
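The observation model assumed by the MAP approaches, in which each LR image is a blurred, downsampled, and noisy version of the HR scene, can be illustrated with a toy forward model. The sketch below is only an illustration under stated assumptions: the names `box_blur` and `degrade`, the 3 × 3 mean filter standing in for the blur kernel, and plain decimation are hypothetical choices, and the geometric distortion term is omitted.

```python
import random


def box_blur(img):
    """3x3 mean filter with edge replication (hypothetical stand-in for the blur kernel)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index at the border
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index at the border
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def degrade(hr, scale=2, noise_sigma=0.0, seed=0):
    """Toy LR formation: blur, keep every `scale`-th pixel, add Gaussian noise."""
    rng = random.Random(seed)
    blurred = box_blur(hr)
    decimated = [row[::scale] for row in blurred[::scale]]
    return [[v + rng.gauss(0.0, noise_sigma) for v in row] for row in decimated]


# An 8 x 8 ramp image becomes a 4 x 4 LR image at scale 2.
hr = [[float(x + y) for x in range(8)] for y in range(8)]
lr = degrade(hr, scale=2, noise_sigma=0.5)
```

SRR attempts to invert this process, which is ill-posed because many HR scenes map to the same LR observation.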
Deep learning based SRR techniques have been fairly successful during the past decade in solving the problem of resolution enhancement and texture synthesis of real-life images and videos. The pioneering work among them is the three-layer Convolutional Neural Network (CNN) based SRR algorithm (SRCNN) [76], which performs non-linear mapping between LR patches and HR patches, represented using convolutional filters. A simple Mean Squared Error (MSE) loss function is used to train the SRCNN network. Compared to SRCNN, Very Deep Super-Resolution (VDSR) [77] uses a deeper network with smaller convolutional filters to learn only the residual (high-frequency information) between LR and HR images. VDSR is based on the popular VGG (named after the Visual Geometry Group at the University of Oxford) architecture that was originally proposed in [78] for large-scale image classification tasks. Instead of trying to learn high-frequency details at the up-sampled scales as in SRCNN and VDSR, Fast SRCNN (FSRCNN) [79] and Efficient Sub-Pixel CNN (ESPCN) [80] learn the high-frequency details through a deconvolutional layer and a sub-pixel convolutional layer at the end of their architectures, respectively, which significantly reduces unnecessary computational overhead.
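The sub-pixel convolutional layer used by ESPCN ends with a rearrangement step, often called pixel shuffle, that turns r × r low-resolution feature maps into one map that is r times larger in each dimension. The following is a minimal pure-Python sketch of that rearrangement only (the preceding convolution is omitted); the function name `pixel_shuffle` and the channel ordering (row offset first, then column offset) are assumptions made for illustration.

```python
def pixel_shuffle(channels, r):
    """Rearrange r*r feature maps of size H x W into one map of size rH x rW.

    Channel c is written to the sub-pixel offset (c // r, c % r) of every
    r x r output cell (assumed ordering).
    """
    if len(channels) != r * r:
        raise ValueError("expected r*r input channels")
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0] * (w * r) for _ in range(h * r)]
    for c, fmap in enumerate(channels):
        dy, dx = divmod(c, r)
        for y in range(h):
            for x in range(w):
                out[y * r + dy][x * r + dx] = fmap[y][x]
    return out


# Four 1 x 1 feature maps become a single 2 x 2 map.
upscaled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2)
```

Because all convolutions run at LR size and only this final step produces HR pixels, the computational cost stays low.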
Recently, residual-network based architectures were fairly successful in SRR tasks. The most representative ones are Enhanced Deep residual SR Network (EDSR) [81], Wide activation Deep residual SR (WDSR) [82] and CAscading Residual Network (CARN) [83]. EDSR is based on the original ResNet [84] and SRResNet [85] architectures using residual learning, and with a Rectified Linear Unit (ReLU) layer and Batch normalisation (BN) layers being removed. Based on EDSR, WDSR further demonstrated expanding features before ReLU activation leads to significant improvements, without adding additional parameters and computation, and used Weight Normalisation (WN) to replace BN for faster convergence and better accuracy. Both EDSR and WDSR have adopted the idea of not using up-sampled input for CNN and used the sub-pixel shuffling at the end of their architecture as proposed in the aforementioned ESPCN. On the other hand, CARN improved on top of the traditional residual network and proposed a cascading mechanism at both the local and global level in order to receive more information, and allow more efficient flow of information, while keeping the network lightweight.
Other successful SRR architectures employed recursive networks that use shared network parameters in convolutional layers in order to reduce memory usage. The most representative ones are Deep Recursive Convolutional Network (DRCN) [86] and Deep Recursive Residual Network (DRRN) [87]. DRCN reuses weight parameters and stacks recursive blocks to improve SRR performance without introducing new parameters for convolutions. DRRN improves on DRCN by stacking residual blocks with shared parameters to achieve superior results.
Unlike the aforementioned SRR networks that treat all spatial locations, features, scales, and channels of an image equally, some novel SRR networks assign adaptively weighted importance to different locations, features, scales, and channels of an image. In Adaptive Weighted Super-Resolution Network (AWSRN) [88], the authors proposed a lightweight SRR network that uses a sequence of Adaptive Weighted Residual Units (AWRUs), replacing the original Residual Units used in WDSR, to form a Local Fusion Block (LFB), and then uses a sequence of LFBs to perform the non-linear mapping of extracted features. AWSRN also proposed an Adaptive Weighted Multi-Scale (AWMS) reconstruction module to selectively "stack and fuse" multi-scale convolutions in order to use the feature information, derived from the non-linear mapping module, more effectively. Another successful architecture that uses the idea of "selective attention" is the deep Residual Channel Attention Network (RCAN) [89]. RCAN places emphasis on the discriminative learning ability across different feature channels via selective downscaling and upscaling of feature maps, using Residual Channel Attention Blocks (RCABs) in Residual Groups (RGs), i.e., Residual in Residual (RIR), to focus on more informative components of the LR features. Long and short skip connections were used in RIR to help bypass low-frequency information and stabilise the training process of their very deep network.
More recently, Generative Adversarial Networks (GANs) have become more popular in the field of SRR that exploit perceptual differences rather than the pixel differences between LR and HR images. GANs operate by training a generative model with the goal of restoring high frequency textures, while in parallel, training a discriminator to distinguish SRR images from HR truth. SRGAN (Super-Resolution GAN), proposed in [85], first used a GAN based architecture to generate visually pleasant SRR images. SRGAN uses ResNet/SRResNet [84,85] as a backend and employs a weighted combination of content loss that is defined on feature maps of high level features (the Euclidean distance between the feature representations of generated image and reference image) from the VGG network [78], and the adversarial loss that was originally defined in [90], to achieve visually optimal results. The generator network in SRGAN has 16 identical residual blocks that consist of 2 convolutional layers, BN, and Parametric ReLU, followed by 2 subpixel convolutional layers, that were proposed in [80], for upscaling. The discriminator network in SRGAN contains 8 convolutional layers/BN/Leaky ReLU (LReLU), with increasing number of feature maps and down-sampling when the number of features is doubled, followed by 2 dense layers and a sigmoid activation. In parallel with SRGAN, an independent group of researchers proposed a similar network called EnhanceNet [91]. The generator network of EnhanceNet has 10 residual blocks followed by 2 nearest neighbour up-sampling (of feature activation) layers and followed by a convolutional layer to cancel checkerboard artefacts. In comparison to SRGAN, the major difference is that the EnhanceNet uses an additional texture matching loss, which is computed from the Euclidean distance of local (patch-wise) texture statistics, on top of the perceptual loss and adversarial loss, to enforce locally similar textures between SRR and HR truth. 
The SRR result from EnhanceNet is perceptually significantly sharper but suffers from more synthetic artefacts. Building on SRGAN, the Enhanced SR GAN (ESRGAN) [92] uses the basic architecture of SRGAN, replacing the original RBs with a deeper basic block, namely the RIR Dense Block (RRDB), and also uses an improved loss function that incorporates a new perceptual loss, a relativistic adversarial loss, together with the traditional MSE loss.
In this work, we propose a novel MARSGAN network for single image SRR. MARSGAN improves upon ESRGAN, which is used as our backbone architecture, in three aspects: (1) use of an adaptive weighted basic block, called AW-RRDB (AWRRDB), with noise inputs, for more effective residual learning while allowing local stochastic variations; (2) use of a multi-scale reconstruction scheme to make full use of both low-frequency and high-frequency residuals; (3) use of a fine-tuned loss function to balance between perceptual quality and synthetic artefacts. MARSGAN is fully trained with HiRISE images and is used, in this work, for CaSSIS single image SRR.

MARSGAN Architecture
GANs provide a state-of-the-art framework for producing high-quality and "photorealistic" SRR images. Recent GAN variations [92][93][94] have been focusing on optimisations of the original residual architecture of the generator network [85,91] and/or on better modelling of the perceptual loss, in order to improve the visual quality of the SRR results. In this work, we based our model on the ESRGAN architecture [92] due to its solid performance on real-world images. Inspired by the adaptive weighted learning process proposed in AWSRN [88] and the optimisations introduced in ESRGANplus [93], we propose an Adaptive Weighted RRDB with noise inputs (AWRRDB) to replace the original RRDB basic block in ESRGAN for more effective and efficient residual learning. Moreover, we use a multi-scale reconstruction scheme [88] based on subpixel-shuffling [80,85] to replace the up-sampling layers used in ESRGAN to make full use of both low-frequency and high-frequency residuals while avoiding the checkerboard patterned artefacts from using up-sampling layers [91]. We follow the standard discriminator network architecture that was proposed in SRGAN [85] and adopt the relativistic average discriminator concept that was proposed in [95] and employed in [92].
Our proposed Multi-scale Adaptive-weighted Residual SRR GAN (MARSGAN) network architecture is shown in Figure 3. With MARSGAN, our goal is to estimate a super-resolved image $I^{SR}$ from a lower-resolution input image $I^{LR}$. Here $I^{LR}$ is the lower-resolution version of its higher-resolution counterpart $I^{HR}$. Note that $I^{HR}$ is only available during training.
The MARSGAN generator starts from a single convolutional layer (3 × 3 kernels, 64 feature maps, stride 1) for initial feature extraction, which can be formulated as
$$x_0 = f_{ext}(I^{LR}) \tag{1}$$
where $f_{ext}$ denotes the initial feature extraction function for $I^{LR}$ and the output feature map from the first convolutional layer is $x_0$.
In MARSGAN, our basic residual unit for non-linear feature mapping is AWRRDB. AWRRDB is based on the original Dense Block (DB) structure and applies two modifications on top of the RRDB basic residual units that were used in ESRGAN [92]. RRDB has a much deeper and more complex structure (see Figure 3), compared to the Residual Blocks (RBs) used in SRGAN [85], in order to have much higher network capacity benefiting from dense connections. The first modification to improve the RRDB basic blocks is through use of the concept of AWRU, inspired by AWSRN [88]. Instead of applying a fixed value of residual scaling [81] in each DB, i.e., 0.2 used in ESRGAN, we use 11 independent weights for each AWRRDB (see Figure 3), which are adaptively learned from a given initial value, to help information and gradients flow more effectively. The second modification to the RRDB structure is adding a Gaussian noise input after each DB. The additive Gaussian noise inputs were demonstrated as being useful in [93] in terms of adding stochastic variation to the generator network, while keeping their effects very localised, i.e., without changing the global perception of the images. Note that 3 of the 11 weights are scaling factors for the additive Gaussian noise.
There was another potential improvement to the DB structure, which was also proposed in [93], called Residual DB (RDB), by adding a residual every two layers to augment the generator network capacity. However, we found the improvement of using RDB is marginal compared to DB in AWRRDB, and therefore we keep ESRGAN's design of DB in AWRRDB.
Defining the DB used in the original ESRGAN architecture as $f_{DB}$, the output of the $n$-th proposed AWRRDB unit, denoted as $x_{n+1}$, for input $x_n$ $(n = 0, 1, 2, \ldots, 22)$, can be expressed as
$$x_{n+1} = \lambda^n_a x^n_3 + \lambda^n_b x_n \tag{2}$$
where $\lambda^n_a$ and $\lambda^n_b$ are two independent weights for the $n$-th AWRRDB unit, and $x^n_3$ can be solved via
$$x^n_{k+1} = \lambda^n_{k_r} f_{DB}(x^n_k) + \lambda^n_{k_x} x^n_k + \lambda^n_{k_n} G_n, \quad k = 0, 1, 2, \quad x^n_0 = x_n \tag{3}$$
where $\lambda^n_{k_r}$, $\lambda^n_{k_x}$, and $\lambda^n_{k_n}$, $(k = 0, 1, 2)$, are three independent sets of weights for each DB unit and $G_n$ is the additive Gaussian noise input. The non-linear feature mapping is represented by a sequence (23 in this work) of the proposed AWRRDBs. As shown in Figure 3, each AWRRDB contains 3 DBs, and each DB contains 5 convolutional layers (3 × 3 kernels, 32 feature maps, stride 1) and 4 LReLU activations with a negative slope of 0.2. Merging Equation (2) and Equation (3), the output of the non-linear mapping, $x_{n+1}$, for the $n$-th AWRRDB unit, given the initial input, $x_0$, from Equation (1), can be expressed as
$$x_{n+1} = \lambda^n_a \left[ \lambda^n_{2_r} f_{DB}(x^n_2) + \lambda^n_{2_x} x^n_2 + \lambda^n_{2_n} G_n \right] + \lambda^n_b x_n \tag{4}$$
where $x^n_2$ follows from Equation (3) and the recursion over $n$ starts from $x_0$.
After the non-linear feature mapping, we use an Adaptive Weighted Multi-Scale Reconstruction (AWMSR) scheme [88] based on subpixel-shuffling [80,85] to replace the up-sampling layers used in ESRGAN for SRR image reconstruction. The AWMSR unit (see Figure 3), which was originally introduced in [88] and demonstrated to be helpful on top of the WDSR results [82], stacks 4 different levels of scaling convolutions (3 × 3, 5 × 5, 7 × 7, 9 × 9 kernels) with adaptive weights (initialised with an equal weight of 0.25) to make full use of the learned low-frequency and high-frequency information during SRR reconstruction.
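To make the adaptive weighting inside one AWRRDB concrete, the sketch below applies three weighted dense-block steps followed by the outer weighted skip connection, using 3 × 3 + 2 = 11 scalar weights. It is a toy illustration under stated assumptions rather than the trained network: features are flat lists of floats, `toy_db` is a hypothetical placeholder for the dense block, the noise is drawn independently per element, and the function and argument names are invented for this example.

```python
import random


def awrrdb(x, f_db, db_weights, outer_weights, noise_sigma=1.0, seed=0):
    """One adaptive-weighted RRDB step on a flat feature vector.

    db_weights: three (w_r, w_x, w_n) tuples, one per dense block, scaling the
        dense-block output, the block input, and the Gaussian noise.
    outer_weights: (w_a, w_b), scaling the third block's output and the unit input.
    """
    rng = random.Random(seed)
    x_k = list(x)
    for w_r, w_x, w_n in db_weights:
        residual = f_db(x_k)
        x_k = [w_r * r + w_x * v + w_n * rng.gauss(0.0, noise_sigma)
               for r, v in zip(residual, x_k)]
    w_a, w_b = outer_weights
    return [w_a * a + w_b * b for a, b in zip(x_k, x)]


def toy_db(features):
    """Hypothetical stand-in for a dense block: doubles every feature."""
    return [2.0 * t for t in features]


# Fixed ESRGAN-style scaling (0.2 residual, unit skip, no noise) as a special case.
esrgan_like = awrrdb([1.0], toy_db, [(0.2, 1.0, 0.0)] * 3, (0.2, 1.0))
```

With all residual weights fixed at 0.2, skip weights at 1, and noise weights at 0, the unit reduces to fixed residual scaling; learning the 11 values instead lets each block choose its own balance.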
Here the output $x_{n+1}$ is fed to the AWMSR unit (see Figure 3), denoted as $f_{AWMSR}$, followed by a final convolutional layer, denoted as $f_{rec}$, to generate $I^{SR}$, which can be expressed as
$$I^{SR} = f_{rec}\left(f_{AWMSR}(x_{n+1})\right) \tag{5}$$
For the discriminator, we use the same network architecture as described in SRGAN [85] and ESRGAN [92], which contains 8 convolutional layers with an increasing number of feature maps and strides of 2 each time the number of features is doubled (3 × 3 kernels; 64 feature maps, stride 1; 64 feature maps, stride 2; 128 feature maps, stride 1; 128 feature maps, stride 2; . . . ; 512 feature maps, stride 1; 512 feature maps, stride 2). The resulting 512 feature maps are followed by two fully connected dense layers together with a final sigmoid activation function for output. We adopt the relativistic concept that was originally proposed in RaGAN [95] and was applied in ESRGAN [92], to use a "relativistic discriminator", which estimates the probability that the given real data is relatively more realistic than fake data on average, instead of simply predicting real or fake. The relativistic discriminator network is optimised in an alternating manner [90] along with the generator network to solve the adversarial min-max problem. Given the standard discriminator, denoted as $D_s$, for real input image $I_r$ and fake input image $I_f$, then
$$D_s(I_r) = \sigma(C(I_r)) \rightarrow 1, \qquad D_s(I_f) = \sigma(C(I_f)) \rightarrow 0 \tag{6}$$
where $\sigma$ is the sigmoid function, and $C$ is the non-transformed discriminator output. Then the relativistic average discriminator, denoted as $D_{Ra}$, for real input image $I_r$ and fake input image $I_f$, can be formulated as
$$D_{Ra}(I_r, I_f) = \sigma\left(C(I_r) - E_{I_f}[C(I_f)]\right), \qquad D_{Ra}(I_f, I_r) = \sigma\left(C(I_f) - E_{I_r}[C(I_r)]\right) \tag{7}$$
where $E_{I_f}$ represents the operation of computing the mean of all fake data in a mini-batch, and $E_{I_r}$ represents the operation of computing the mean of all real data.
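As a small numerical illustration of the relativistic average discriminator, the sketch below takes the raw (non-transformed) discriminator outputs for a mini-batch of real and fake images and compares each output against the mean output of the opposite class before applying the sigmoid. The function names and the use of plain Python lists for a mini-batch are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relativistic_average(c_real, c_fake):
    """Return (D_Ra for each real sample, D_Ra for each fake sample).

    c_real, c_fake: raw discriminator outputs C(.) for one mini-batch.
    """
    mean_real = sum(c_real) / len(c_real)
    mean_fake = sum(c_fake) / len(c_fake)
    d_real = [sigmoid(c - mean_fake) for c in c_real]  # real relative to average fake
    d_fake = [sigmoid(c - mean_real) for c in c_fake]  # fake relative to average real
    return d_real, d_fake


d_real, d_fake = relativistic_average([2.0, 2.0], [0.0, 0.0])
```

When real and fake outputs are indistinguishable, both values sit at 0.5; they move towards 1 and 0 respectively only when real images score higher than the average fake image.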

Loss Functions
The loss function plays an important role in deep learning based SRR techniques. Up until the work of SRGAN [85] and EnhanceNet [91], classic SRR networks mostly optimised for the Peak Signal to Noise Ratio (PSNR), i.e., minimised the MSE, between the recovered SRR image and the ground truth (HR). Due to the ill-posed nature of the SRR problem, texture details are typically synthetic textures (if not absent) in the reconstructed SRR image and therefore cannot be "pixel-to-pixel" matched with the ground truth, leading to a smoother solution that averages all potential synthetic solutions within a PSNR oriented model. In SRGAN [85], the authors proposed to replace the MSE based content loss (in pixel space) with a loss (in feature space) defined by feature maps, denoted as $\phi(i, j)$, where $j$ and $i$ indicate the $j$-th convolution (after activation) before the $i$-th maxpooling layer within the pretrained VGG19 network. SRGAN used $\phi(2, 2)$ and $\phi(5, 4)$ in their experiments. Instead of completely removing the content loss term, in EnhanceNet [91], the authors explicitly experimented with different combinations (weighted averages) of the content loss ($\tau_E$), perceptual loss ($\tau_P$), adversarial loss ($\tau_A$), and an additional texture loss ($\tau_T$; defined by matching patch-wise statistics of textures). Their study shows that the network optimised by $\tau_E$ has the smoothest and most artefact-free results, whereas $\tau_P$ or $\tau_P + \tau_A$ are much sharper but full of artefacts, and $\tau_E + \tau_P + \tau_A$ and $\tau_E + \tau_P + \tau_A + \tau_T$ produce more "balanced" results that are comparably sharper (more detailed) but with fewer artefacts.
ESRGAN demonstrated that, relative to SRGAN, it is better to use the VGG features before the activation layers, which gives denser feature representations while keeping the brightness of the reconstructed SRR image consistent [92]. In addition, ESRGAN kept the $l_1$ norm-based content loss term (weighted by a factor of 0.01) to balance the perceptual-driven solutions. Due to the very small weight of the content loss, ESRGAN proposed the use of a "network interpolation" method, which is a weighted average of the two networks trained with perceptual loss and $l_1$ loss, to balance the perceptual-driven and PSNR-driven solutions.
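The "network interpolation" idea, a weighted average of two networks with identical architecture, amounts to blending their parameters one by one. A minimal sketch follows, assuming each network is represented as a dictionary that maps a parameter name to a flat list of floats; the function name, the parameter names, and the blending factor `alpha` are hypothetical.

```python
def interpolate_networks(params_psnr, params_gan, alpha):
    """Blend two parameter sets: (1 - alpha) * PSNR-driven + alpha * perceptual-driven."""
    if params_psnr.keys() != params_gan.keys():
        raise ValueError("both networks must share the same parameter names")
    return {
        name: [(1.0 - alpha) * p + alpha * g
               for p, g in zip(params_psnr[name], params_gan[name])]
        for name in params_psnr
    }


psnr_net = {"conv1.weight": [0.0, 4.0]}  # toy values, not trained weights
gan_net = {"conv1.weight": [4.0, 0.0]}
blended = interpolate_networks(psnr_net, gan_net, alpha=0.25)
```

Setting `alpha` to 0 recovers the PSNR-driven network and 1 recovers the perceptual-driven one, so a single scalar trades smoothness against sharpness without retraining.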
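The $l_1$ and MSE based content losses, denoted as $l^{l_1}_{SR}$ and $l^{MSE}_{SR}$, respectively, can be formulated as
$$l^{l_1}_{SR} = \frac{1}{s^2 W H} \sum_{x=1}^{sW} \sum_{y=1}^{sH} \left| I^{HR}_{x,y} - G(I^{LR})_{x,y} \right|, \qquad l^{MSE}_{SR} = \frac{1}{s^2 W H} \sum_{x=1}^{sW} \sum_{y=1}^{sH} \left( I^{HR}_{x,y} - G(I^{LR})_{x,y} \right)^2 \tag{8}$$
where $G$ represents the generator function, $W$ and $H$ denote the width and height of $I^{LR}$, and $s$ denotes the scaling factor for $I^{SR}$ (and $I^{HR}$) with respect to $I^{LR}$.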
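The two pixel-space content losses differ only in whether the per-pixel difference is taken in absolute value or squared before averaging over all HR pixels. The sketch below computes both for single-band images stored as nested lists; the function names are hypothetical and `sr` stands for the generator output.

```python
def l1_loss(hr, sr):
    """Mean absolute difference over all pixels of two equally sized images."""
    n = len(hr) * len(hr[0])
    return sum(abs(a - b) for row_h, row_s in zip(hr, sr)
               for a, b in zip(row_h, row_s)) / n


def mse_loss(hr, sr):
    """Mean squared difference over all pixels of two equally sized images."""
    n = len(hr) * len(hr[0])
    return sum((a - b) ** 2 for row_h, row_s in zip(hr, sr)
               for a, b in zip(row_h, row_s)) / n


hr = [[0.0, 2.0], [4.0, 6.0]]
sr = [[1.0, 2.0], [4.0, 4.0]]
l1_value = l1_loss(hr, sr)    # (1 + 0 + 0 + 2) / 4
mse_value = mse_loss(hr, sr)  # (1 + 0 + 0 + 4) / 4
```

The squared version penalises the single 2-unit error more strongly than the 1-unit error, which is one reason MSE-trained networks tend towards smooth, averaged solutions.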
The VGG based perceptual loss, denoted as $l^{VGG/\phi(i,j)}_{SR}$, can be expressed as
$$l^{VGG/\phi(i,j)}_{SR} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G(I^{LR}))_{x,y} \right)^2 \tag{9}$$
where $W_{i,j}$ and $H_{i,j}$ represent the dimensions of the respective feature maps $\phi(i, j)$ within the VGG network. The authors of ESRGAN also experimented with the VGG loss based on a fine-tuned VGG network for material recognition and concluded the gain is marginal. We also tested both VGG networks on HiRISE images and visually checked the results; there is no visible difference. Future experiments on perceptual loss that focuses on texture still have the potential to improve the SRR results, but in this work, we stick with the original pre-trained VGG19 network [78] for feature representation.
Based on Equation (7), the discriminator loss of RaGAN, denoted as $l^{Ra}_D$, can be expressed as
$$l^{Ra}_D = -E_{I_r}\left[\log\left(D_{Ra}(I_r, I_f)\right)\right] - E_{I_f}\left[\log\left(1 - D_{Ra}(I_f, I_r)\right)\right] \tag{10}$$
The adversarial loss for the generator, denoted as $l^{Ra}_{SR}$, can be expressed as a symmetrical form of Equation (10), as
$$l^{Ra}_{SR} = -E_{I_r}\left[\log\left(1 - D_{Ra}(I_r, I_f)\right)\right] - E_{I_f}\left[\log\left(D_{Ra}(I_f, I_r)\right)\right] \tag{11}$$
Given that the aim of this work is to derive scientifically meaningful results, we adjust the total loss function to encourage solutions towards an ideal scenario, that is, better than using MSE loss alone, but with minimal tolerance to artefacts. On the other hand, we empirically found that using the same loss function as used in ESRGAN tends to produce fine-scale synthetic textures that contain similar noise patterns introduced from the original HiRISE images (this is further discussed in Section 4.1). Although optimisation of perceptual based loss functions is better suited for photo-SRR applications, it does not appear to be suitable for remote sensing applications, with the current state of the art of deep learning based SRR. We rebalance the lower-level and higher-level perceptual losses derived from the VGG network to act together as the perceptual loss, and also give a higher weight to the traditional MSE based content loss, in order to minimise the creation of hallucinated fine details.
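The relativistic average discriminator and generator losses can be evaluated directly from raw discriminator outputs, which makes their symmetry easy to see. The sketch below assumes the standard RaGAN cross-entropy form, with mini-batches given as plain lists of raw outputs; the function names and the small `eps` added inside the logarithms for numerical safety are assumptions made for illustration.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _mean(values):
    return sum(values) / len(values)


def ragan_losses(c_real, c_fake, eps=1e-12):
    """Return (discriminator loss, generator adversarial loss) for one mini-batch."""
    mean_real, mean_fake = _mean(c_real), _mean(c_fake)
    d_rf = [_sigmoid(c - mean_fake) for c in c_real]  # real vs. average fake
    d_fr = [_sigmoid(c - mean_real) for c in c_fake]  # fake vs. average real
    loss_d = (-_mean([math.log(d + eps) for d in d_rf])
              - _mean([math.log(1.0 - d + eps) for d in d_fr]))
    loss_g = (-_mean([math.log(1.0 - d + eps) for d in d_rf])
              - _mean([math.log(d + eps) for d in d_fr]))
    return loss_d, loss_g


# A confident discriminator: small discriminator loss, large generator loss.
loss_d, loss_g = ragan_losses([3.0], [-3.0])
```

Unlike a standard GAN generator loss, the generator term here depends on both real and fake samples, so gradients from real images also reach the generator.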
The total generator loss, $L^{MARSGAN}_{G}$, used in this work is a weighted sum of the content loss formulated in Equation (8), the lower-level and higher-level perceptual losses formulated in Equation (9), and the adversarial loss formulated in Equation (11). It is given in Equation (12), where $\gamma$, $\lambda$, and $\eta$ are the hyperparameters that balance the different loss terms. For comparison, the total loss used in ESRGAN, $L^{ESRGAN}_{G}$, is given in Equation (13).
In order to show the effectiveness of the MARSGAN architecture, we first train the MARSGAN model with ESRGAN's loss function, as shown in Equation (13), for Jezero crater, as demonstrated in Section 3.1. In Section 3.2, we use our fine-tuned loss function, shown in Equation (12), for the MARSGAN model, for the 8 science sites.
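The weighted-sum structure of the generator loss can be sketched as follows. This is only an illustration: the helper `total_generator_loss` and the dictionary keys are hypothetical names. The example weights follow the published ESRGAN formulation (perceptual term plus λ times the adversarial term plus η times the l1 term, with λ = 5 × 10⁻³ and η = 1 × 10⁻²), which is assumed here to correspond to Equation (13). The mapping of γ, λ and η onto the terms of Equation (12) is deliberately left to the caller, since it is defined by that equation and not by this sketch.

```python
def total_generator_loss(terms, weights):
    """Weighted sum of named loss terms.

    terms   : dict mapping a term name to its scalar loss value
    weights : dict mapping the same names to their balancing weights
    """
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"no loss value supplied for: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Assumed ESRGAN-style weighting (published ESRGAN values, see lead-in).
ESRGAN_WEIGHTS = {"perceptual": 1.0, "adversarial": 5e-3, "content_l1": 1e-2}

# Made-up loss values, used only to show the arithmetic.
example_terms = {"perceptual": 2.0, "adversarial": 10.0, "content_l1": 100.0}
print(total_generator_loss(example_terms, ESRGAN_WEIGHTS))
```

Keeping the weights in a dictionary makes it straightforward to rebalance the terms, e.g., raising the content-loss weight, without touching the loss code itself.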

Assessment Methods
For validation and quality assessment, we follow the standard image quality metrics, which include PSNR, Mean Structural Similarity Index Metric (MSSIM), Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE), and Perception-based Image Quality Evaluator (PIQE), using HiRISE and downsampled HiRISE (at 1 m/pixel) as the reference/validation dataset. CaSSIS images and SRR results are co-registered with HiRISE using our in-house multi-resolution image co-registration pipeline [96] in order to calculate PSNR and MSSIM, and also to compare against the BRISQUE and PIQE scores. These metrics are available via Matlab's "Image Quality Metrics" bundle (https://uk.mathworks.com/help/images/image-quality.html (accessed on 1 May 2021)) and can be summarised as follows:
(1) PSNR. PSNR is derived from the MSE and indicates the ratio of the maximum pixel intensity to the power of the distortion. PSNR can be formulated as
$$\mathrm{PSNR}(T, R) = 10 \log_{10}\left(\frac{\mathrm{PeakVal}^2}{\mathrm{MSE}(T, R)}\right)$$
where $T$ denotes the target image (the CaSSIS SRR image in our case), $R$ denotes the reference image (the down-sampled HiRISE image in our case), and PeakVal is the maximum value of the reference image (normalised to 255 for an 8-bit image in our case).
(2) MSSIM [97]. MSSIM is the mean of the locally computed structural similarity. The structural similarity index is derived using patterns of pixel intensities among neighbouring pixels with normalised brightness and contrast. MSSIM can be formulated as
$$\mathrm{MSSIM}(T, R) = E\left[\frac{(2\mu_T \mu_R + C_1)(2\sigma_{T,R} + C_2)}{(\mu_T^2 + \mu_R^2 + C_1)(\sigma_T^2 + \sigma_R^2 + C_2)}\right]$$
where $E$ represents the mean operation, $\mu_T$, $\mu_R$, $\sigma_T$, $\sigma_R$, and $\sigma_{T,R}$ are the local means, standard deviations, and cross-covariance of the target image and reference image, respectively, and $C_1$ and $C_2$ are constants based on the dynamic range of pixel values.
(3) BRISQUE [98]. The BRISQUE model provides subjective quality scores based on a pre-trained model using images with known distortions. The score range is [0,100] and lower values reflect better perceptual quality.
(4) PIQE [99]. PIQE measures the quality of images using block-wise calculation against arbitrary distortions. The score range is [0,100] and lower values reflect better perceptual quality.
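The two full-reference metrics above can be sketched in NumPy as follows. This is a simplified illustration and not the Matlab implementation used for the reported scores: `block_mssim` is a hypothetical helper that averages the structural similarity over non-overlapping square blocks instead of sliding weighted windows, and the constants C1 = (0.01·PeakVal)² and C2 = (0.03·PeakVal)² are an assumed, commonly used default.

```python
import numpy as np


def psnr(target, reference, peak_val=255.0):
    """PSNR in dB of a target image against a reference image."""
    t = np.asarray(target, dtype=np.float64)
    r = np.asarray(reference, dtype=np.float64)
    mse = np.mean((t - r) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images have no distortion
    return float(10.0 * np.log10(peak_val ** 2 / mse))


def block_mssim(target, reference, block=8, peak_val=255.0):
    """Mean structural similarity over non-overlapping block x block tiles."""
    t = np.asarray(target, dtype=np.float64)
    r = np.asarray(reference, dtype=np.float64)
    if t.shape != r.shape or min(t.shape) < block:
        raise ValueError("images must share a shape of at least one block")
    c1 = (0.01 * peak_val) ** 2  # assumed constants, see lead-in
    c2 = (0.03 * peak_val) ** 2
    scores = []
    rows, cols = t.shape
    for i in range(0, rows - block + 1, block):
        for j in range(0, cols - block + 1, block):
            tb = t[i:i + block, j:j + block]
            rb = r[i:i + block, j:j + block]
            mu_t, mu_r = tb.mean(), rb.mean()
            var_t = np.mean((tb - mu_t) ** 2)
            var_r = np.mean((rb - mu_r) ** 2)
            cov = np.mean((tb - mu_t) * (rb - mu_r))
            num = (2.0 * mu_t * mu_r + c1) * (2.0 * cov + c2)
            den = (mu_t ** 2 + mu_r ** 2 + c1) * (var_t + var_r + c2)
            scores.append(num / den)
    return float(np.mean(scores))


reference = np.arange(256, dtype=np.float64).reshape(16, 16)
print(psnr(reference + 1.0, reference), block_mssim(reference, reference))
```

Both functions require the target and reference to be on the same pixel grid, which is why co-registration and resampling to a common 1 m/pixel scale precede the metric calculation.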
In practice, due to large differences in imaging time (year and local Mars time) between the HiRISE and CaSSIS images (as explained in Section 2.4), these measurements may not always be appropriate. We try to assess each CaSSIS SRR result against the "closest" HiRISE image, in terms of imaging date and local Mars time, in order to maintain the most "compatible" brightness, contrast and shading characteristics between HiRISE and CaSSIS, even though the choices are extremely limited.
Complementary to the quality measurements, we also demonstrate the SRR results of the Jezero crater site using sharpness measurements of high-contrast slanted edges (see Section 3.1). This is a direct way of measuring image spatial resolution and does not depend on a reference HiRISE image, which is subject to changes in appearance due to different imaging times (the closest CaSSIS and HiRISE "pairs" used in this work are still ~1 month apart). Such measurements are critical for remote sensing applications in order to quantify the effective resultant SRR resolution.

Training and Testing
Our training dataset included ~1.8 million pairs of HiRISE (i.e., HR; at 0.25 m/pixel) and down-sampled HiRISE (i.e., LR; at 1 m/pixel) cropped samples. The training HR samples were extracted from 466 unique HiRISE images, containing non-overlapping unique features of Mars (see Figure 4), including dunes (Figure 4a), craters (Figure 4b), hills (Figure 4c), layering (Figure 4d), slopes (Figure 4e), cones (Figure 4f), scallops (Figure 4g), gullies (Figure 4h), falls (Figure 4i), deposits (Figure 4j), rocks (Figure 4k), chaos (Figure 4l), and other unique features (a list of these features with associated HiRISE image IDs is provided in the supplementary material). Note that all experiments (training and testing) in this work are performed with a scaling factor of 4× between the LR and HR (or SR) images. The training LR samples are produced by applying a Gaussian filter to the corresponding HR samples, followed by bicubic down-sampling.
We train three models. In Experiment-1, we train the original ESRGAN network with the original loss function, as shown in Equation (13). In Experiment-2, we train the proposed MARSGAN network optimised with the same loss function that was used in ESRGAN. Finally, in Experiment-3, we train the proposed MARSGAN network optimised with our rebalanced loss function, as shown in Equation (12). Comparisons of the results from the three trained models are given in Section 3.1. For the further processing results on the proposed CaSSIS science scenes, demonstrated in Section 3.2, we use the MARSGAN model trained in Experiment-3.
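The LR sample generation described above (Gaussian filtering followed by 4× bicubic down-sampling) can be sketched in plain NumPy as below. The Gaussian standard deviation is not stated in the text, so `sigma=1.0` is purely an assumed placeholder, and the function names are illustrative. The bicubic step uses the common Keys cubic kernel with a = -0.5 and no additional anti-aliasing, on the assumption that the Gaussian pre-filter takes that role.

```python
import numpy as np


def gaussian_blur(img, sigma):
    """Separable Gaussian blur with reflect padding (output keeps the input shape)."""
    radius = max(1, int(np.ceil(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(np.asarray(img, dtype=np.float64), radius, mode="reflect")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def cubic_kernel(x, a=-0.5):
    """Keys cubic convolution kernel, non-zero for |x| < 2."""
    x = np.abs(x)
    near = (a + 2.0) * x ** 3 - (a + 3.0) * x ** 2 + 1.0
    far = a * x ** 3 - 5.0 * a * x ** 2 + 8.0 * a * x - 4.0 * a
    return np.where(x <= 1.0, near, np.where(x < 2.0, far, 0.0))


def bicubic_downsample_axis(img, factor, axis):
    """Bicubic interpolation at the LR pixel centres along one axis."""
    n_in = img.shape[axis]
    n_out = n_in // factor
    centres = (np.arange(n_out) + 0.5) * factor - 0.5  # LR centres in HR coordinates
    base = np.floor(centres).astype(int)
    out = np.zeros_like(np.take(img, base, axis=axis), dtype=np.float64)
    for offset in (-1, 0, 1, 2):  # four neighbouring HR samples per LR pixel
        idx = base + offset
        weights = cubic_kernel(centres - idx)
        samples = np.take(img, np.clip(idx, 0, n_in - 1), axis=axis)
        shape = [1] * img.ndim
        shape[axis] = n_out
        out += samples * weights.reshape(shape)
    return out


def make_lr(hr, factor=4, sigma=1.0):
    """Blur an HR patch, then down-sample it by `factor` along both axes."""
    blurred = gaussian_blur(hr, sigma)
    lr = bicubic_downsample_axis(blurred, factor, axis=0)
    return bicubic_downsample_axis(lr, factor, axis=1)


hr_patch = np.full((128, 128), 7.0)
print(make_lr(hr_patch).shape)
```

With the default factor of 4, a 128 × 128 HR patch yields a 32 × 32 LR patch, matching the HR/LR patch size ratio used in training.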
For Experiment-1, the batch size is set to 64 and the spatial sizes of the HR and LR patches are set to 256 × 256 pixels and 64 × 64 pixels, respectively. This is 4 times larger for each HR/LR patch compared to the spatial sizes used in [92] and ~7 times larger compared to [85]. It was observed in [92] that training a deep SRR network benefits from a larger patch size due to an enlarged receptive field, at the cost of more computing resources and a longer training time. We follow the two-stage training process proposed in [92] for ESRGAN: a PSNR-oriented model is trained initially with the $l_1$ loss in Equation (8), followed by perceptual-oriented training with the perceptual loss in Equation (13), with $\lambda = 5 \times 10^{-3}$ and $\eta = 1 \times 10^{-2}$. For Experiment-2, the batch size is 64 and the spatial sizes of the HR and LR patches are set to 128 × 128 pixels and 32 × 32 pixels, respectively, for a shorter training time. The same two-stage training process (with the same hyperparameters) is followed as in Experiment-1.
In Experiment-3, the batch size and spatial patch sizes are the same as in Experiment-2. We re-use the pre-trained MARSGAN model from Experiment-2 to initialise the generator and continue training with the MARSGAN loss in Equation (12), with $\lambda = 5 \times 10^{-3}$, $\gamma = 0.5$, and a higher $\eta = 0.5$ in order to encourage solutions with minimised synthetic artefacts. The initial learning rate is $10^{-4}$ and is halved at 50k, 100k, 300k, and 500k iterations. Standard Adam optimisation [100] is used with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Training and testing are performed on an NVIDIA RTX 3090 GPU.
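The step-wise learning rate schedule described above can be written as a small helper. This is a sketch: the function name `learning_rate` is illustrative, and it assumes that each halving takes effect from the stated iteration onwards.

```python
def learning_rate(iteration, base_lr=1e-4,
                  milestones=(50_000, 100_000, 300_000, 500_000)):
    """Learning rate after halving at every milestone already reached."""
    halvings = sum(1 for m in milestones if iteration >= m)
    return base_lr * 0.5 ** halvings


for it in (0, 50_000, 100_000, 300_000, 500_000):
    print(it, learning_rate(it))
```

After the last milestone the rate is the initial value divided by 16.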
Our testing dataset is a collection of CaSSIS colour images of the Perseverance rover's landing site, as well as several selected science-oriented scenes introduced in Section 1.1. Note that our HiRISE training dataset is in greyscale. To handle the colour channels of a CaSSIS image, we can either feed the CaSSIS colour image directly into the MARSGAN prediction module, which works on the brightness channel (V) in the Hue-Saturation-Value (H-S-V/HSV) colour space, or we can produce SRR on each individual colour channel in the Red-Green-Blue (R-G-B/RGB) colour space and merge the channels afterwards for colour output. In our experiments, we found that the two approaches result in similar SRR quality; however, the latter approach leads to a slightly different colour appearance compared to the input image (see Figure 5 for a demonstration of the differences). Theoretically, the texture/sharpness manipulation in the separate R-G-B channels should not affect the brightness/reflectance of each channel alone, but it has an effect when the three channels are merged back in R-G-B colour space. On the other hand, texture/sharpness changes in the V channel affect neither the brightness/reflectance nor the colour appearance, which is controlled by the H and S channels. Therefore, when merging the three channels in H-S-V colour space, only texture and sharpness change, while brightness/reflectance and colour are preserved. We follow this approach for all CaSSIS SRR results presented in this paper.
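The V-channel approach described above can be sketched as follows. The sketch makes several assumptions that are not stated in the text: `srr_fn` is a hypothetical stand-in for the MARSGAN prediction on a single-channel image, and the H and S channels are brought to the output grid by nearest-neighbour replication. The standard-library `colorsys` conversions are used for clarity, not speed.

```python
import colorsys

import numpy as np

# Element-wise RGB <-> HSV conversions built on the standard library.
_rgb_to_hsv = np.vectorize(colorsys.rgb_to_hsv, otypes=[float, float, float])
_hsv_to_rgb = np.vectorize(colorsys.hsv_to_rgb, otypes=[float, float, float])


def nearest_upscale(channel, factor):
    """Replicate every pixel factor x factor times."""
    return np.repeat(np.repeat(channel, factor, axis=0), factor, axis=1)


def srr_colour_via_v(rgb, srr_fn, factor=4):
    """Super-resolve only the V channel and re-attach the original H and S.

    rgb    : float array of shape (H, W, 3) with values in [0, 1]
    srr_fn : callable mapping an (H, W) array to a (factor*H, factor*W) array
    """
    h, s, v = _rgb_to_hsv(rgb[..., 0], rgb[..., 1], rgb[..., 2])
    v_sr = np.clip(srr_fn(v), 0.0, 1.0)   # texture/sharpness change happens here
    h_up = nearest_upscale(h, factor)     # assumed handling of hue
    s_up = nearest_upscale(s, factor)     # assumed handling of saturation
    r, g, b = _hsv_to_rgb(h_up, s_up, v_sr)
    return np.stack([r, g, b], axis=-1)


rng = np.random.default_rng(0)
lr_rgb = rng.random((4, 4, 3))
sr_rgb = srr_colour_via_v(lr_rgb, lambda v: nearest_upscale(v, 4))
print(sr_rgb.shape)
```

Because hue and saturation are passed through unchanged, any difference between input and output is confined to the V channel, which is the property this approach relies on.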
In addition, CaSSIS uses multiple combinations of long-to-short wavelengths to synthesise colour in R-G-B colour space [101], e.g., NIR-PAN-BLU. Empirically, CaSSIS images generally have a better SNR on their longer wavelength channels, i.e., the NIR band (centred at 936.7 nm), RED band (centred at 836.2 nm) and PAN band (centred at 675 nm), and a lower SNR on their shorter wavelength channel, i.e., the BLU band (centred at 499.9 nm). Figure 6 shows that the BLU band is noticeably noisier than the NIR and PAN bands. Therefore, running SRR on the R-G-B colour channels separately may provide an opportunity to produce a better SRR result via different treatment of the three channels, e.g., applying denoising to the BLU channel. However, the issue of the resulting different colour appearance needs to be tackled in a future study.
Finally, for validation purposes, we only pick CaSSIS testing images that have one or more corresponding HiRISE observations. If multiple HiRISE images are available for comparison, the one that was captured with the closest date and/or Solar Longitude ($L_s$) to the CaSSIS scene is used. Note that none of these validation HiRISE images are used for training (see Supplementary Materials for a list of the training HiRISE image IDs). The testing and validation datasets for the selected science targets are presented in Table 1. The proposed quantitative assessment is only applied to the results of Jezero Crater (see Section 3.1). For the science-oriented scenes, only visual qualitative comparisons are given (see Section 3.2). A collection of examples of the CaSSIS testing images for the proposed science targets can be found in Figure 2.

Results and Assessment for Jezero Crater
In this section, we first demonstrate our CaSSIS SRR results over the Mars2020 Perseverance rover's landing site, Jezero Crater. The input LR image is the 4 m/pixel CaSSIS NPB colour image (native resolution is about 4.5 m/pixel), which was captured on 23 February 2021 during the morning (local Mars time). The HR reference image (for validation) is the 25 cm/pixel HiRISE RED band image (native resolution is 29.3 cm/pixel; see https://www.uahirise.org/ESP_068294_1985 (accessed on 1 May 2021)), which was captured on 19 February 2021 at 14.55 in the afternoon (local Mars time). As the two images were captured only four days apart, no obvious difference of the Martian surface at the 1 m/pixel scale is expected. For example, both the CaSSIS and HiRISE images have captured components from the rover, which landed on 18 February 2021. However, due to the different solar illumination directions (morning and afternoon lighting), some surface features may look different because of surface bi-directional reflectance effects.
In this work, we down-sample the 25 cm HiRISE image to 1 m using GDAL's "cubicspline" down-sampling method (https://gdal.org/programs/gdal_translate.html (accessed on 1 May 2021)) to simulate a 1 m view of the surface, in order to compare with the SRR results (with an effective resolution enhancement factor of ~3) at the scale of 1 m. For the 8 cropped regions shown in Figure 7, a visual comparison of the SRR results from the three experiments (refer to Section 2.4) against the down-sampled HiRISE image (1 m/pixel), as well as the original resolution HiRISE image (at 0.25 m/pixel), can be found in Figure 8. The first experiment (second column of Figure 8) refers to CaSSIS SRR processing with the ESRGAN network trained with HiRISE images. The second experiment (third column of Figure 8) refers to CaSSIS SRR processing with the proposed MARSGAN network trained with the same HiRISE training dataset (but optimised with ESRGAN's loss function and with a smaller patch size for faster convergence; hereafter referred to as MARSGAN-m1). The third experiment (fourth column of Figure 8) refers to CaSSIS SRR processing with our proposed MARSGAN network with our rebalanced loss function (hereafter referred to as MARSGAN-m2). For more details of the three experiments, please refer to Section 2.4.
Figure 8 shows zoom-in views of the 8 selected areas for detailed comparison. The selected areas (crop-A to H) contain different types of features around the landing area, and also include a view of the rover's jettisoned parachute and back-shell, shown in crop-A. The rover itself is not visible in the CaSSIS image, and thus not in the SRR results, but is visible in the HiRISE images in between two "blast patterned" bright features.
From Figure 8, we can observe that, generally speaking, both the ESRGAN and MARSGAN results show a resolution enhancement of 2-4 times over the input CaSSIS image when compared against the reference HiRISE image. Our proposed MARSGAN models (MARSGAN-m1 and MARSGAN-m2) outperform the original ESRGAN model in terms of edge sharpness and realistic texture details. Although the larger-scale structural features (e.g., crater ridges, big rocks, dune patterns) are closely consistent between the CaSSIS SRR image and the 1 m HiRISE image (except for their different illumination directions), some of the very fine scale features (e.g., rocks, ground textures) still show considerable differences between the SRR results and the 1 m HiRISE image. This is due to the ill-posed nature of SRR: if the information is completely missing from the LR image, then it cannot be recovered. Although texture synthesis is involved, we limit this process at the training stage to discourage perceptually pleasing solutions that contain artefacts or false textures (refer to Sections 2.2 and 2.4).
Table 2 shows the statistics of the standard image quality metrics (refer to Section 2.3) for the input CaSSIS image, ESRGAN SRR, MARSGAN-m1 SRR, MARSGAN-m2 SRR, and validation HiRISE image (at 1 m/pixel resolution), for the 8 cropped areas shown in Figures 7 and 8. In order to calculate the PSNR and MSSIM using the 1 m/pixel down-sampled HiRISE images as references, the input 4 m/pixel CaSSIS images are upscaled by a factor of 4, to 1 m/pixel, using GDAL's bicubic resizing function (see https://gdal.org/programs/gdal_translate.html (accessed on 1 May 2021)). As mentioned in Section 2.4, all SRR results in this work already have an upscaling factor of 4 and are therefore at the same scale as the reference 1 m/pixel down-sampled HiRISE images. All the HiRISE images are co-registered to the CaSSIS images.
In general, as shown in Table 2 On the other hand, the BRISQUE and PIQE measurements directly reflect the image quality in terms of sharpness, contrast, perceptual quality, and SNR. BRISQUE and PIQE scores range from 0 to 100, and lower values mean better image quality (see Section 2.3). From Table 2, we can observe much better image quality scores for the MARSGAN-m2 SRR results compared to the input CaSSIS images, as well as the ESRGAN SRR results. For some of the areas, MARSGAN-m2 achieves even better BRISQUE (crop-C, D, E, G) and PIQE (crop-C and E) scores than the 1 m HiRISE images. The BRISQUE and PIQE scores do not necessarily correlate with the amount of information in the image. For example, down-sampled HiRISE images always contain finer-scale information that is not recorded in the original CaSSIS images and hence is not resolvable in the CaSSIS SRR images; BRISQUE and PIQE scores, however, reflect the quality of the images based on their existing information.
Nonetheless, these image quality metrics do not directly reflect the achieved image resolution of the SRR results. In order to estimate the resolution of the SRR results, we perform edge sharpness measurements on high-contrast slanted edges using the Imatest ® software (https://www.imatest.com/ (accessed on 1 May 2021)). The edge sharpness measurement counts the number of pixels spanned by the 10% to 90% rise of a high-contrast edge profile (see Figure 9). When this pixel count is compared with that of the same edge profile in the LR counterpart, the ratio can be used to estimate an enhancement factor between the two images. We perform this test for a rippled dune area at the northwest side of the largest crater in the same CaSSIS scene (MY36_014520_019_0), where many high-contrast edges are present. The MARSGAN-m2 result is compared against the original CaSSIS image at the 1 m/pixel scale (only the PAN band is used for this measurement).
In this assessment, we perform the slanted-edge measurement (https://www.imatest.com/docs/#sharpness (accessed on 1 May 2021)) using the Imatest ® software for 20 high-contrast edges within 20 Regions of Interest (ROIs). Figure 9 shows zoom-in views of the 20 ROIs from the CaSSIS image and the corresponding SRR image, 40 plots of the corresponding edge profiles (the orthogonal lines crossing the automatically detected edges), and the total number of pixels for a 10% to 90% rise along each profile line. The statistics in Figure 9 are summarised in Table 3. An enhancement factor between the MARSGAN SRR image and the input CaSSIS image is calculated for each slanted edge by dividing the total number of pixels involved in the 10% to 90% profile rise of the original CaSSIS image by the total number of pixels involved in the 10% to 90% profile rise of the MARSGAN SRR image. The average over the 20 slanted-edge measurements indicates a resolution enhancement factor of 2.9625 ± 0.7× (~3×) for the MARSGAN SRR result compared to the CaSSIS image. This agrees with our visual observation that is illustrated in Figure 8.
Table 3. Summary of the statistics from Figure 9, and estimation of the enhancement factor, from the total pixel counts of the 10% to 90% profile rise crossing the 20 automatically detected slanted edges, for the input CaSSIS image (MY36_014520_019_0) and the MARSGAN SRR image.
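The enhancement-factor calculation described above can be illustrated with a simplified sketch. It does not reproduce the Imatest slanted-edge algorithm: the helpers `rise_distance` and `enhancement_factor` are hypothetical, operate on a single 1-D edge profile that is assumed to rise from dark to bright, and locate the 10% and 90% levels by linear interpolation between samples. The two profiles at the end are synthetic ramps, not data from Table 3.

```python
import numpy as np


def _first_crossing(p, level):
    """Sub-pixel position where the normalised profile first reaches `level`."""
    i = int(np.argmax(p >= level))  # first sample at or above the level
    if i == 0:
        return 0.0
    return (i - 1) + (level - p[i - 1]) / (p[i] - p[i - 1])


def rise_distance(profile, lo=0.1, hi=0.9):
    """Width in pixels of the 10% to 90% rise of a rising edge profile."""
    p = np.asarray(profile, dtype=np.float64)
    p = (p - p.min()) / (p.max() - p.min())  # normalise to [0, 1]
    return float(_first_crossing(p, hi) - _first_crossing(p, lo))


def enhancement_factor(lr_profile, sr_profile):
    """Ratio of the LR rise width to the SR rise width for the same edge."""
    return rise_distance(lr_profile) / rise_distance(sr_profile)


x = np.arange(30, dtype=np.float64)
lr_edge = np.clip((x - 5.0) / 12.0, 0.0, 1.0)   # synthetic soft edge
sr_edge = np.clip((x - 10.0) / 4.0, 0.0, 1.0)   # synthetic sharper edge
print(enhancement_factor(lr_edge, sr_edge))
```

Both profiles must be sampled on the same pixel grid (here 1 m/pixel) for the ratio to be meaningful, which matches the comparison set-up described in the text.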

Results and Visual Demonstration of Science Targets/Sites
Further to the initial assessment and validation work of the Perseverance rover's landing site, we demonstrate CaSSIS SRR results using the proposed MARSGAN model (i.e., MARSGAN-m2), for 8 more CaSSIS scenes, containing different science targets introduced in Section 1.
Figure 10 shows exposed bedrock) and transverse aeolian ridges on the crater floor. We can observe clearer shapes and outlines of features from the CaSSIS SRR result and the 1 m HiRISE image. Despite some finer scale textures shown in HiRISE, the larger scale structural features shown in CaSSIS SRR are similar to those in the HiRISE reference image.
Figure 11 shows examples of 4 cropped regions of the MARSGAN SRR result for Site-2, in comparison with the input 4 m CaSSIS image MY35_007017_173_0 and down-sampled 1 m HiRISE image ESP_012383_1905. For this site, the CaSSIS image was also taken in the morning and is illuminated from the other side compared to the HiRISE image. Crops A-C in Figure 11 show bright and dark slope streak features. The CaSSIS SRR result reveals clearer boundaries of the slope streak features and has a higher SNR compared to the original input. Crop D in Figure 11 shows transverse aeolian ridges inside a small crater. The CaSSIS SRR result has enhanced sharpness and structural clarity for the aeolian features and agrees broadly with the HiRISE image.
Figure 12 shows examples of 4 cropped regions of the MARSGAN SRR result for Site-3, in comparison with the input 4 m CaSSIS image MY35_010749_247_0 and down-sampled 1 m HiRISE image ESP_059289_1210. This site highlights dunes and associated defrosting features on the Martian surface. The CaSSIS image was taken in the afternoon, but due to large $L_s$ (seasonal) differences, frost is no longer present in the HiRISE image, so the albedo patterns are no longer apparent. The dunes were covered by frost in the CaSSIS/SRR image. Crop D shows gully-channels on dune slip faces with new deposits visible in the CaSSIS SRR image. The CaSSIS SRR image visually shows improved resolution of the dark defrosting spots and a higher SNR compared to the input.
Figure 13 shows examples of 4 cropped regions of the MARSGAN SRR result for Site-4, in comparison with the input 4 m CaSSIS image MY35_012112_221_0 and down-sampled 1 m HiRISE image ESP_065469_1440. The CaSSIS and HiRISE images were taken at very similar local Mars times and just under a month apart, resulting in very similar illumination/contrast. Crops A-C in Figure 13 show gully channels between bedrock outcrops at the rim of Gasa Crater, and crop D in Figure 13 shows small bedrock outcrops on the floor of Gasa Crater. The CaSSIS SRR result has brought out the details of the gullies and bedrock outcrops and is in good agreement with the reference HiRISE image. Note that there are local mis-registrations/distortions between the CaSSIS/SRR and HiRISE images, which are due to the very limited overlapping area between the original CaSSIS and HiRISE images.
Figure 15, which are the dust devil tracks, are different in the CaSSIS/SRR and HiRISE because they were imaged 10 years apart (and these features change on a sub-annual timescale). We can observe better structural information of scalloped features, from the CaSSIS SRR result, in crop B of Figure 15, and more fine scale details in crops A and C. The overall noise level for this site is higher than for the other sites and, especially for crop D, the improvement of SNR in the SRR result is limited. This is probably due to the lack of any patterned textures or structures in the original input image.
Figure 16 shows larger-scale and finer-scale gullies and rock outcrops at the crater's rim. Sharper edges, clearer structural detail, and better SNR can be observed in the CaSSIS SRR result in comparison to the input CaSSIS image. Finer scale textures are missing from the SRR result in comparison to HiRISE but, as previously mentioned, we do not seek to introduce details that were not initially present in the LR input in this work.
Finally, Figure 17 shows examples of 4 cropped regions of the MARSGAN SRR result for Site-8 in comparison with the input 4 m CaSSIS image MY35_011777_268_0. The co-registration of the HiRISE (PSP_002081_1055) and CaSSIS/SRR images was not possible for this site due to large time and seasonal differences of high latitude features. Therefore, no reference samples are shown for this site. Site-8 highlights spider feature terrain with frost. Better SNR and clearer structures of such features are observable in the CaSSIS SRR result in comparison to the input CaSSIS image.

Perceptual-Driven Solution or PSNR-Driven Solution
Perceptual-driven models generally produce SRR results with sharper edges and richer textures, which lead to visually more pleasing results, in comparison to PSNR-driven models. However, due to the ill-posed nature of SRR, lost information or missing textures cannot be fully and correctly recreated based on the LR image. Therefore, the sharper and richer the details are, the more stochastic the solutions become. On the other hand, PSNR-driven SRR solutions are generally smoother and have less texture detail, but they have a much lower chance of creating artefacts and synthetic textures. It was demonstrated in [85] and [91] that PSNR-driven solutions encourage the models to find pixel-wise averages of all potential solutions that have high and sharp texture details. The averaged solutions are, therefore, smoother but less "synthetic".
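The averaging behaviour of PSNR-driven solutions can be illustrated with a toy 1-D example. This is a sketch with made-up data, not an experiment from this work: three equally plausible sharp edges at different positions stand in for the set of potential HR solutions, and the helper `expected_mse` is hypothetical. The pixel-wise mean of the candidates attains a lower expected MSE than any single sharp candidate, but it is a blurred, staircase-like edge.

```python
import numpy as np


def expected_mse(estimate, candidates):
    """Mean squared error of one estimate against all candidate solutions."""
    return float(np.mean((candidates - estimate) ** 2))


x = np.arange(16)
# Three equally plausible sharp step edges located at pixels 6, 8 and 10.
candidates = np.stack([(x >= k).astype(np.float64) for k in (6, 8, 10)])

mean_solution = candidates.mean(axis=0)          # pixel-wise average (smooth)
mse_of_mean = expected_mse(mean_solution, candidates)
mse_of_sharp = [expected_mse(c, candidates) for c in candidates]

print(mse_of_mean, mse_of_sharp)
print(np.unique(mean_solution))                  # intermediate grey levels appear
```

The intermediate values in the averaged profile correspond to the loss of edge sharpness that an MSE-only objective tends to produce.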
Although SRR networks that are optimised for the best perceptual quality are currently popular for SRR research in general computer vision tasks, they are not fit for purpose for remote sensing or scientific applications. The issue is illustrated in Figure 18 using two small example samples of HiRISE images (ESP_029674_1650 and PSP_007455_1785). The first column shows the input LR images, the second column shows the SRR images produced with an ESRGAN model optimised with the $l_1$ loss (representing "PSNR-driven" solutions), the third column shows the SRR images produced with an ESRGAN model optimised using a balanced perceptual and $l_1$ loss, e.g., $\eta = 1$ in Equation (13) (representing "Balanced" solutions), and the fourth column shows the SRR images produced with an ESRGAN model trained with the perceptual loss only (representing "Perceptual-driven" solutions). We can observe from Figure 18 that the PSNR-driven solution does not produce any artefacts, but neither does it produce a sharp SRR result. On the other hand, the perceptual-driven solution produces the sharpest result and richest texture. However, in the case of ESP_029674_1650, synthetic textures have been brought into the image, and in the case of PSP_007455_1785, the shapes of the small rocks have been synthetically altered compared to the original LR image. As shown in the third column of Figure 18, the ESRGAN model with a balanced perceptual- and PSNR-oriented optimisation produces a good-quality result with no visible artefacts.
Our MARSGAN SRR solution feeds the model with stochastic variations by including the perceptual loss as a weighted term during training, but also keeps a highly weighted MSE loss term to minimise texture synthesis and the production of artefacts (refer to Section 2.4). A balanced SRR solution, with the best possible resolution enhancement and minimised artefact creation, is the overall objective of this work. As demonstrated in Section 3.2 with different science targets, no obvious synthetic artefacts were found in our proposed MARSGAN SRR results.
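The idea of weighting a pixel-wise term against a perceptual term can be sketched as follows. This is only a minimal illustration and not the MARSGAN loss of Section 2.4: the weights `w_pixel` and `w_perceptual` are hypothetical, the adversarial term is omitted, and `gradient_features` is a simple stand-in for the deep (e.g., VGG) features that a real perceptual loss would compare.

```python
import numpy as np

def pixel_loss(sr, hr):
    """Mean squared error between the SRR output and the HR reference."""
    return float(np.mean((sr - hr) ** 2))

def gradient_features(img):
    """Stand-in feature extractor: vertical and horizontal finite differences.
    A real perceptual loss would use deep network features instead."""
    return np.diff(img, axis=0), np.diff(img, axis=1)

def perceptual_loss(sr, hr):
    """Mean squared error measured in the stand-in feature space."""
    return float(sum(np.mean((fs - fh) ** 2)
                     for fs, fh in zip(gradient_features(sr),
                                       gradient_features(hr))))

def balanced_loss(sr, hr, w_pixel=1.0, w_perceptual=0.1):
    """Weighted sum of both terms; the default weights are hypothetical."""
    return w_pixel * pixel_loss(sr, hr) + w_perceptual * perceptual_loss(sr, hr)

rng = np.random.default_rng(0)
hr = rng.random((16, 16))              # synthetic "HR reference"
smooth = np.full_like(hr, hr.mean())   # over-smoothed estimate with no texture

print("pixel term only:", pixel_loss(smooth, hr))
print("balanced loss:  ", balanced_loss(smooth, hr))
```

Raising `w_pixel` relative to `w_perceptual` pushes the optimum towards the smoother, lower-risk solutions discussed above, while the opposite choice rewards texture at the cost of a higher chance of synthetic detail.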

Figure 18. Illustration with HiRISE SRR images using ESRGAN models that were optimised with the l1 loss function ("PSNR Oriented"), the VGG loss function ("Perceptual Oriented"), and our balanced loss function ("Balanced"), showing the impact of perceptual-driven versus PSNR-driven training and prediction.

Single Image SRR or Multi-Image SRR
SRR techniques are divided into single-image and multi-image techniques (the latter including video SRR). Theoretically, multi-image SRR techniques have more information to use, for example, the classic multi-frame subpixel information [70], multi-angle-view information [4], and information from spatial-temporal correlations [102,103]. Therefore, multi-image SRR techniques could theoretically recover more detail. This is demonstrated in Figure 19, in which a MARSGAN single-image SRR result, using a single HiRISE image (PSP_010097_1655_RED) as input, is compared with the GPT [4] multi-image SRR result, using a sequence of 8 overlapping HiRISE images (with different viewing angles) as input, and with the original 25 cm/pixel HiRISE image (PSP_010097_1655_RED), over the Homeplate area visited by MER-A, Spirit. We can observe that the multi-image SRR result brings out more detail, e.g., surface deposits and small rocks, while the single-image SRR result appears to have a sharper reconstruction of the rover tracks.
On the other hand, the GPT multi-image SRR result (for 8 repeat input images of 1000 × 1000 pixels) took a whole day to process on a high-spec CPU machine (Intel Core i7 @ 2.8 GHz), while the MARSGAN SRR prediction for a single input image of the same size took only a few minutes on the same CPU and takes less than a second on an NVIDIA® RTX 3090 GPU. The trade-off becomes obvious when we want to process a large image, such as a full-strip CaSSIS or HiRISE image. Note that the GPT SRR [4] is based on multi-angle view information rather than deep learning, and its key component is not suitable for GPU implementation.
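To make the trade-off concrete, the following back-of-the-envelope sketch scales the per-tile timings quoted above to a larger image. The strip dimensions are hypothetical and chosen only for illustration; the per-tile costs take "a whole day" as 24 hours and "less than a second" as an upper bound of 1 second, and tile overlap is ignored.

```python
import math

def tile_count(width, height, tile=1000):
    """Number of tile x tile blocks needed to cover a width x height image."""
    return math.ceil(width / tile) * math.ceil(height / tile)

def total_hours(n_tiles, seconds_per_tile):
    """Total sequential processing time in hours."""
    return n_tiles * seconds_per_tile / 3600.0

# Hypothetical image strip size in pixels (an assumption, not a CaSSIS or
# HiRISE specification).
strip_w, strip_h = 2000, 20000
n = tile_count(strip_w, strip_h)

# Per-tile costs taken from the timings quoted in the text.
gpt_hours = total_hours(n, 24 * 3600)      # multi-image GPT SRR on CPU
marsgan_gpu_hours = total_hours(n, 1.0)    # MARSGAN on GPU, upper bound

print(f"{n} tiles: GPT ~{gpt_hours / 24:.0f} days, "
      f"MARSGAN (GPU) < {marsgan_gpu_hours * 3600:.0f} s")
```

Under these assumptions, sequential multi-image processing is measured in weeks, whereas single-image GPU prediction stays under a minute.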

Extendability with Other Datasets
This paper focuses on SRR processing of the TGO CaSSIS images. However, it should be pointed out that the proposed MARSGAN model can also be applied to other Mars imaging datasets of extra-high resolution, e.g., 0.25 m HiRISE, or of medium-to-high resolution, e.g., 6 m CTX and 18 m Compact Reconnaissance Imaging Spectrometer for Mars (CRISM). Figure 20 shows an example of the MARSGAN SRR result, in comparison to the original HiRISE colour image (ESP_068294_1985; https://www.uahirise.org/ESP_068360_1985 (accessed on 1 May 2021)) of the Perseverance rover's parachute at the landing site. Figure 21 shows examples of the MARSGAN SRR result, in comparison to the original CTX image (rectJ21_052811_1983_XN_18N282W_v7pt1_6m_Eqc_latTs0_lon0; https://planetarymaps.usgs.gov/mosaic/mars2020_trn/CTX/ (accessed on 1 May 2021)) over Jezero crater. Figure 22 shows an example of the MARSGAN SRR result, in comparison to the original CRISM image (using bands 233, 78, and 13 of frt0000d3a4_07_if164l_trr3_raw downloaded from PlanetServer at http://planetserver.eu/ (accessed on 1 May 2021)), over Capri Chaos, Valles Marineris. In the future, further improvement of the MARSGAN model can be expected with cross-instrument training, i.e., using different datasets with different resolutions to form the LR/HR training dataset.

Conclusions
In this paper, we introduced the network architecture and training details of the proposed MARSGAN model for single-image SRR of TGO CaSSIS images. MARSGAN improves on the ESRGAN model by using adaptive weighted basic residual blocks, a multi-scale reconstruction scheme, and a rebalanced loss function. We showed the improvements of MARSGAN in comparison with ESRGAN for CaSSIS SRR over the Perseverance rover's landing area. Image-quality-based assessment (against down-sampled HiRISE images) and edge-sharpness-based effective resolution measurement were demonstrated for the landing site image. A resolution enhancement factor of about 3 is estimated from Imatest® slanted-edge measurements. Further demonstrations of CaSSIS SRR are given for 8 selected science-oriented scenes, which include many features unique to the Martian surface (e.g., bedrock layers, slope streaks, defrosting dunes, gullies, RSL, scalloped depressions, dust devils, and defrosting spiders). For these science study sites, we demonstrated a general improvement of image SNR, improvement of edge sharpness for different feature outlines, and enhancement of high-frequency details. Finally, the potential extendibility of the proposed MARSGAN model was demonstrated with examples from HiRISE, CTX, and CRISM images. Future work will include scientific studies to demonstrate what new information can be derived from the SRR results. In addition, SRR of multi-spectral data (i.e., CRISM) will be explored in the wavelength domain.
