Multi-image Super Resolution of Remotely Sensed Images using Residual Feature Attention Deep Neural Networks

Convolutional Neural Networks (CNNs) have been consistently proved state-of-the-art results in image Super-Resolution (SR), representing an exceptional opportunity for the remote sensing field to extract further information and knowledge from captured data. However, most of the works published in the literature have been focusing on the Single-Image Super-Resolution problem so far. At present, satellite based remote sensing platforms offer huge data availability with high temporal resolution and low spatial resolution. In this context, the presented research proposes a novel residual attention model (RAMS) that efficiently tackles the multi-image super-resolution task, simultaneously exploiting spatial and temporal correlations to combine multiple images. We introduce the mechanism of visual feature attention with 3D convolutions in order to obtain an aware data fusion and information extraction of the multiple low-resolution images, transcending limitations of the local region of convolutional operations. Moreover, having multiple inputs with the same scene, our representation learning network makes extensive use of nestled residual connections to let flow redundant low-frequency signals and focus the computation on more important high-frequency components. Extensive experimentation and evaluations against other available solutions, either for single or multi-image super-resolution, have demonstrated that the proposed deep learning-based solution can be considered state-of-the-art for Multi-Image Super-Resolution for remote sensing applications.


I. INTRODUCTION
Super-resolution (SR) algorithms serve the purpose of reconstructing high-resolution (HR) images from either single or multiple low-resolution (LR) images. Due to constraints such as sensor limitations and exceedingly high acquisition costs, it is often challenging to obtain HR images. In this regard, SR algorithms provide viable opportunity to enhance and reconstruct HR images from LR images recorded by the sensors. Over more than three decades, progress has steadily been observed in the development of Super-resolution, as both multi-frame and single-frame SR now have substantial applications that can use the image generation purposefully.
SR is very significant to Remote Sensing because it provides opportunity to enhance LR images despite the inherent problems often faced in remote-sensing scenarios. The hardware The authors are with Politecnico di Torino -Department of Electronics and Telecommunications, PIC4SeR, Politecnico di Torino Interdepartmental Centre for Service Robotics and SmartData@PoliTo, Big Data and Data Science Laboratory, Italy. Email: {name.surname}@polito.it. and material costs for smaller missions around data accumulation are very high. Additionally, onboard instruments on satellites continue to generate ever-increasing data as spatial and spectral resolutions also increase, and this has progressively become challenging for compression algorithms [1], as they try to meet the bandwidth restrictions [2], [3]. Remote sensing is fundamental in obtaining images covering most of the globe, permitting many vital projects such as disaster monitoring, military surveillance, urban maps, and vegetation growth monitoring. It is thus imperative that enhancements and progress be made in post-processing techniques to overcome obstacles of increasing spatial resolution.
There are two main methods used in Super-resolution: Single-image SR (SISR) and Multi-image SR (MISR). SISR employs a single image to reconstruct a HR version of it. However, a single image is quite limited in the amount of information that it provides, mainly post the LR image formation process. Contrastingly, MISR involves multiple LR images of the same scene acquired from the same or different sensors to construct an HR image. The significant advantage MISR holds over SISR is in how it can draw out otherwise unavailable information from the different image observations of the same scene. It consequently constructs high spatial resolution image. However, to achieve the additional benefits of MISR, a multitude of problems need to be solved. Conventionally, multiple images are obtained by either a satellite during its multiple orbits or by different satellites at different times or different sensors acquiring images at the same time. With so many variables involved, many complications need to be considered, such as cloud coverage, time variance in scene content, and invariance to absolute brightness variability.
There has been significant progress in Single-image SR as deep learning methods and deep neural networks have been brought into use, allowing a better efficient generation of non-linear maps to deal with complex degradation models. However, there has not been any similar progress in MISR.
In this paper, building over the latest breakthroughs in SISR [4]- [8], we propose a deep learning MISR solution for remotesensing applications that exploits both spatial and temporal correlations to combine multiple low-resolution acquisitions smartly. Indeed, our model provides a real end-to-end efficient solution to recover high-resolution images, overcoming limitations of previous similar methodologies, and providing enhanced reconstruction results. So, the presented research constitutes an exceptional opportunity, easily replicable, to access better quality and more useful information for the remote-sensing community. In particular, the main contribution of our work lies in: 1) The use of 3D convolutions to efficiently extract, directly from the stack of multiple low-resolution images, high-level representations, simultaneously exploiting spatial and temporal correlations. 2) The introduction of a novel feature attention mechanism for 3D convolutions that lets the network focus on most promising high-frequency information largely overcoming main locality limitations of convolutional operations. Moreover, the concurrent use of multiple nested residuals, inside the network, let low-frequency components flow directly to the output of the model. 3) The conceptualization and development of an efficient, highly replicable, deep learning neural network for MISR that makes use of 2D and 3D convolutions exclusively in the low-resolution space. It has been extensively evaluated on a major multi-frame opensource remote-sensing dataset proving state-of-the-art results with a considerable margin. So, it constitutes an exceptional tool and opportunity for the remote-sensing research community. The remainder of this paper is structured as follows. Section II covers the related work on SR and its developments in techniques for both SISR and MISR. Section III explains the overall methodology, network architecture and its subsequent blocks, and training process. Section IV discusses the experimentation, the Proba-V dataset, data pre-processing, and results. Section V draws some conclusions and future directions.

II. RELATED WORK
Related literature is organized as follows. Firstly, a wide range of studies related to SISR are discussed which involve state-of-the-art methods and recent developments in SISR techniques, which is the basis of every SR method. Secondly, studies performed for SR in remote sensing domain are discussed. Lastly, MISR related studies, which are rarely addressed, are discussed including latest developments.

A. Single-image Super-resolution
Ever since the late eighties and the early nineties, there has been an eager interest in SR, comprehensively reviewed by Borman and Stevenson [9]. Following forth in the works of Tsai and Huang [10] and afterward, Kim et al. [11], the new approaches considered processing images in the frequency domain to recover lost information of higher-frequency. These first works had certain drawbacks, like the level of difficulty observed in successfully incorporating the prior available spatial information. Several studies performed by Irani and Peleg [12]- [14] focused over the spatial domain, proposing methods for SR reconstruction. Learning-based methods build upon the relation between LR-HR images, and there have been many recent advancements in this approach, mostly due to deep convolutional neural networks (CNNs) [4], [15], [16]. The leading force for this was Dong et al. [17], who achieved superior results by proposing a Super-resolution CNN (SRCNN) framework. Kim [19] and memory blocks in MemNet [20]. So going forth, particular emphasis has been placed on proper upscaling of spatial resolutions at network tail-ends, as well as extracting information of the original scale LR inputs. To that end, some enhancements have been proposed for accelerating the testing and training needed for SRCNN, a faster network structure FSRCNN [15]. Ledig et al. [21] proposed SRGAN, a generative adversarial network (GAN) for photo-realistic SR with perceptual losses [22], and K. He et al. introduced ResNet [23] for image SR and to make a deeper network SRResNet. EnhanceNet [24] also used a GAN based model to merge perceptual loss with automated texture synthesis. Though, the predicted results can produce some artifacts and may not be a faithful reconstruction.
In recent past years, enhancements in deep networks have been proposed and showed promising results for SISR, for example, in [5], an Enhanced Deep Super-resolution (EDSR) network was developed to improve the performance by removing unnecessary modules and expanding the model size with the stable training process in conventional residual networks. Yu et. al [6] demonstrated better results in terms of accurate SR by generating models with a wide range of features before ReLU activation and training with normalized weights. Zhang et. al [7] proposed residual channel attention networks (RCAN) that exploits very deep network structure based on residual in residual (RIR) which bypass excessive low-frequency information through multiple skip connections.

B. SR for Remotely Sensed Imagery
With the increasing availability of recent satellite-based multispectral sensors and transmission bandwidth restrictions [25], it is possible to obtain images at different spatial resolutions with multiple spectral bands. Keen attention is being paid to developing better methods of super-resolving the lowerresolution bands but simultaneously keeping the images at a high spatial resolution. An example can be seen in [26], where -through the utilization of only lower resolution bands SR of multispectral remote sensing images is applied with convolutional layers. [27] shows the integration of residual connections into a single image SR based architecture to achieve better SR performance. The performance of image enhancement methods in computer vision can also be increased prominently through generative adversarial networks (GANs) [21], [28]. Moreover, GANs have also been exploited to superresolve remote sensing images. For example, Ma et. al. [29] developed a dense residual generative adversarial network (DRGAN)-based SISR method to super resolve remote sensing images. By designing a dense residual network as the generative network in GAN, their method makes full use of the hierarchical features from low-resolution (LR) images. Dong et. al. [30] proposed a novel multi-perception attention network (MPSR) for Super-resolution of low resolution remotely sensed images, which achieved better results by incorporating the proposed enhanced residual block (ERB) and residual channel attention group (RCAG). Their methodology is capable of dealing with low-resolution remote sensing images via multi-perception learning and multi-level information adaptive weighted fusion. They claimed that, a pre-train and transfer learning strategy can improve the SR performance and stabilize the training procedure. Gargiulo et. al. [31] proposed a CNNs based approach to provide a 10 m superresolved image of the original 20 m bands of remotely sensed Sentinel-2 images. In their experimental results, they claimed that the proposed solution can achieve better performance with respect to most of the state-of-the-art methods, including other deep learning based ones with a considerable saving of computational burden. Recently methods to enhance spatial resolution of remotely sensed images used Parallel Residual Network [32], Bidirectional Convolutional LSTMs [33], Deep Residual Squeez and Excitation Network [34].

C. Multi-image Super-resolution
Multi-image SR (MISR) involves the extraction of information from many LR observations of the same scene to reconstruct HR images [35]. The earliest work for MISR was proposed by Tsai and Huang [10] using a frequencydomain technique, by combining multiple images with subpixel displacements to improve the spatial resolution of images. Due to the some weaknesses of the first proposed method related to incorporate prior information of HR images, several spatial domain MISR techniques were considered [36]. These include projection onto convex sets (POCS) [37], non-uniform interpolation [38], regularized methods [39], [40], and sparse coding [41]. With the availability of more data from the multiple observations of the scene, it is possible to obtain a more accurate reconstruction than through single-image methods. MISR techniques involve different ways of degrading the original image by following an image model, and these involve blurring, warping, noise contamination, and decimation. Then the degradation is reversed by solving an ill-posed optimization problem. To this end, Bayesian reconstruction in the gradient projection algorithm was used alongside subpixel displacement estimation [42]. An enhanced Fast and Robust SR (FRSR) [43] employs estimation of maximum likelihood analysis and simplified regulation. Another proposal in SR was for the Adaptive detail enhancement (SR-ADE) [44], which reconstructs satellite images with the use of a bilateral filter for decomposing input images while also amplifying highfrequency detail information.
Another approach Iterative Back Projection (IBP), introduced by Irani and Peleg [13], used a back-projection of the difference between the actual LR images obtained and the simulated LR images to the SR image. The forward imaging process is inverted and iteratively attempted in updates. As with MISR, there are apparent drawbacks when prior images are difficult to be included, or it is difficult to model an image's degradation process.
In the past few years, many deep learning-based approaches have been exploited to address the MISR problems in the context of enhancing video sequences [45]- [47]. However, MISR is rarely exploited for remotely sensed satellite imagery. Kawulok et al [48] demonstrated the potential benefits of information fusion offered by multiple satellite images reconstruction and learning-based SISR approaches. In their work, EvoNet framework [49] based on several deep CNNs was adopted to exploit SISR in the preprocessing phase of the input data for MISR.
Recently, a challenge was set by the European Space Agency (ESA) to super-resolved multi-temporal PROBA-V satellite imagery 1 . In this context, a new CNN-based architecture DeepSUM was proposed by Molini et. al [50] to super resolve multi-temporal PROBA-V imagery. An end-to-end learning approach was established by exploiting both spatial and temporal correlations. Most recently, Deudon et. al presented HighRes-net based on deep learning to deal with the MISR of remotely sensed PROBA-V satellite imagery [51]. They proposed an end-to-end mechanism that learns the sub-tasks involved in MISR, that are co-registration, fusion, upsampling, and registration-at-the-loss.

III. METHODOLOGY
MISR aims at recovering an HR image I HR from a set of T LR images I LR [1,T ] of the same scene acquired in a certain temporal window. In contrast to SISR, MISR can simultaneously benefit from spatial and temporal correlations, being able to achieve far better reconstruction results theoretically. Either way, SR is an inherently ill-posed problem since a multiplicity of solutions exist for any given set of low-resolution images. So, it is an underdetermined inverse problem, of which solution is not unique. Our proposed methodology, based on a representation learning model, aims at generating a superresolved image I SR applying a function H RAMS to the set of I LR [1,T ] images: where Θ are model parameters learned with an iterative optimization process.
In other words, we propose a fully convolutional Residual Attention Multi-image Super-resolution network (RAMS) that can efficiently extract high-level features concurrently from T LR images and fuse them exploiting a built-in visual attention mechanism. Attention directs the focus of the model only on most promising extracted features, reducing the importance of less relevant ones and mostly transcending limitations of the local region of convolutional operations. Moreover, extensive use of nested residual connections lets all the redundant lowfrequency information, present in the set I LR [1,T ] of LR images, flow through the network, leaving the model focusing its computation only on high-frequency components. Indeed, highfrequency features are more informative for HR reconstruction, while LR images contain abundant low-frequency information that can directly be forwarded to the final output [7]. Finally, as the majority of the model for single-image super-resolution [5], [6], [8], [15], all computations in our network are efficiently performed in the LR feature space requiring only an upsample operation at the final stage of the model.
In the following paragraphs, we present the overall architecture of the network with a detailed overview of the main blocks. Finally, we conclude the methodology section with precise details of the optimization process for training the network.

A. Network architecture
An overview of the RAMS network, with its main three blocks and two branches, is depicted in Fig. 1. As a high-level description, the model takes as input a single set of T lowresolution images I LR [1,T ] that can be represented as a tensor X (i) with shape H × W × T × C where H, W and C are the height, width, and channels of the single low-resolution images, respectively. The upper global residual path proposes a simple SR solution, making an aware fusion of the T input images. On the other hand, the central branch exploits 3D convolutions residual-based blocks in order to extract spatial and temporal correlations from the same set of T LR images and provide a refinement to the residual simple SR image.
More in detail, in the first part of the main path of the model, we use a simple 3D convolutional layer, with each filter of size Then, we apply a cascade of N residual feature attention blocks that increasingly extract higher-level features, exploiting local and non-local, temporal, and spatial correlations. Moreover, we make use of a long skip connection for the shallow features and several short skip connections inside each feature attention block to let flow all redundant low-frequency signals and let the network focus on more valuable high-frequency components. The three dimensions H, W and T are always preserved through reflecting padding. The first part of the main branch can be modeled as a single function H I that maps each tensor X (i) to a new higher dimensional one X (i) In the second part of the main branch, we further process the output tensor X (i) In each block, we intersperse a residual feature attention block with 3D convolutional layer without padding on the temporal T dimension (TR-Conv). So, H, W and F remain invariant and only the temporal dimension is reduced. The output of this second block is a new tensor X (i) Finally, the output tensor X (i) II is processed by a further TR-Conv layer that reduces T to one and an upscale function H UP|3D that generates a tensor X (i) UP|3D of shape sH × sW × C where s is the scaling factor. The overall output X (i) UP|3D of the main branch sums with the trivial solution provided by the global residual. Indeed, the global path simply weights the T LR images of the input tensor X (i) with a residual temporal attention block with filters of size f h × f w . Then it produces an output tensor X (i) UP|2D of shape sH × sW × C that is added to the one of the main branch. So, the final SR prediction of the networkŶ (i) = I SR is the sum of the two contribution: The upscaling procedure is identical for both branches; after several trials with different methodologies, such as transposed convolutions [51], bi-linear resizing and nearest-neighbor upsampling [52], we adopted a sub-pixel convolution layer as explained in detail in [53]. So, for either branch, the last 2D or 3D convolution generates s 2 ·C features in order to produce the final tensors of shape sH × sW × C for the residual sum.
In conclusion, the overall model takes as input a tensor X (i) with shape H × W × T × C, works always efficiently in the LR space and generates only at the final stage an output tensor Y In the following sub-paragraphs, the three major functional blocks, residual feature attention, residual temporal attention, and temporal reduction blocks are further explained and analyzed.

B. Residual attention blocks
Residual attention blocks are at the core of the RAMS model; their specific architecture allows it to focus on relevant high-frequency components and let redundant, low-frequency information flow through the residual connections of the network. Inter-dependencies among features, in the case of feature attention blocks, or temporal steps, in the case of temporal attention blocks, are taken into account computing for each of them, relevant statistics that take into account local and non-local, temporal and spatial correlations. Indeed, either 3D or 2D convolution filters operate with local receptive fields loosing the possibility to exploit contextual information outside of their limited region of view.
1) Residual feature attention: Except for the global residual path, all residual attention blocks are residual feature attention blocks, as shown in Fig. 1. Indeed, each block of features is weighted up in order to trace most promising high-frequency components, and a residual connection lets low-frequency information flow through the network.
More formally, the output of a residual feature attention block with a generic tensor, X (i) n , is equal to: where H F A is the feature attention function and X (i) * is the output of two stacked 3D convolutional layers.
where W 1 ,W 2 and B 1 , B 2 represent the filters with size f h × f w × f t and biases respectively and, ' * ' denotes the 3D convolution operation. The number of filters is always equal to F as the ones extracted by the first 3D convolutional layer. So, all low-frequency components in X (i) n can flow through the residual connection and H F A can focus the attention of the network to more valuable high-frequency signals. To this end, the feature attention block takes the feature-wise global spatial and temporal information into a feature descriptor by using a global average pooling. So, from the tensor X (i) * with shape H × W × T × F we extract z F ∈ R F feature statistics using the following equation: Statistics of the feature z F can be viewed as a collection of descriptors, whose values contribute to express the whole stack of temporal images [54]. In Fig.2, it is possible to observe the global pooling operation which output is a tensor Z (i) F with shape 1 × 1 × 1 × F and last dimension equal to z F . In addition, the output tensor Z (i) F is further processed by a stack of two 3D convolutional layers with a ReLU [55] and sigmoid activation function, respectively. Indeed, as discussed in [54], the stack of two convolutional layers with the filter of size 1 × 1 × 1 concur to create a non-linear mapping function which is able to deeply capture feature-wise dependencies from the aggregated information extracted by the global pooling operation. The first 3D convolutional layer reduces the feature size by a factor of r, and then the second layer restores the original dimension and constraints its values from zero to one with a sigmoid function in a non-mutually exclusive relationship.
Finally, the original tensor X (i) * is weighted up by the processed attention statistics as shown in Eq. 5.
2) Residual temporal attention: The primary purpose of the global residual path is to generate a starting trivial solution for the upsampling problem. More accurate is this starting prediction, and more simplified is the role of the main branch of the network, leading to a lower reconstruction error. However, the input of the model X (i) has T different LR images that have to be merged. Intuitively, for each input sample I LR [1,T ] , there are some LR images more similar to each other. So, giving them more relevance when merging the T LR images would most probably lead to higher quality predictions. In this context, the aim of the residual temporal attention  block is to make an aware weighing of the different input temporal images, letting the network to make an upsample solution with primarily the most similar temporal steps. That is accomplished with an asymmetrical mechanism to the one employed in the residual feature attention blocks and can be summarized by the following formula: where H T A is the temporal attention function and X (i) * is the product of a stack of two 2D convolutional operations as depicted in Fig. 3 with f h × f w and T · C as filter size and number of features, respectively. Then, as already introduced with the feature attention blocks, the temporal block takes the temporal-wise global spatial information into a feature descriptor by using a global average pooling operation. Finally those statistical descriptors are processed by a stack of 2D convolutional layers with ReLU and sigmoid as activation function, respectively, scaling the T · C channels of the input tensor, as shown in Eq. 8. As for feature attention blocks, the first convolutional layer reduces the number of the last dimension by a factor of r, giving the network the possibility to fully capture temporal-wise dependencies from the aggregated output information of the global average pooling operation.

C. Temporal reduction blocks
The aim of the last block of the main branch is to slowly reduce the number of temporal steps so that the temporal depth eventually reduces to one. Indeed, the output tensor X (i) I of the N residual feature attention blocks has T temporal dimensions that need to be merged. To this end, we further process the incoming tensors with T /(f t − 1) − 1 temporal reduction blocks. Each one is composed of a residual feature attention block and a 3D convolutional layer without any reflecting padding in the temporal dimension, denoted TR-Conv. So, at each TR-Conv layer we reduce T of f t −1. The attention blocks allow the network to learn the best space to decouple image features, "highlighting" more promising features to maintain when reducing the temporal dimension. The output of the last temporal reduction block is a tensor X (i) II with shape H ×W ×f t ×F where the temporal dimension T is reduced to f t . The last TR-Conv, before the upsampling function H UP|3D , reduces to one the number of temporal steps and generates s 2 · C features for the sub-pixel convolutional layer.

D. Training process
Learning the end-to-end mapping function H RAMS requires the estimation of model parameters Θ. That is achieved by minimizing a loss L between the reconstructed super-resolved images I SR and the corresponding ground truth high-resolution images I HR .
Several loss functions have been proposed and investigated for the SISR problem, such as L 1 [5], [6], [56], [57], L 2 [4], [11], [20], [50] and perceptual and adversarial losses [21], [22]. However, in typical MISR remote-sensing problems, LR images are taken within a certain time window and they could have an undefined spatial misalignment one to each other. So, we must take into account that the super-resolved output of the model I SR will be inherently not registered with the target image I HR . Moreover, since we can have very different conditions among the images part of the same scene, it is important to make the loss function independent from possible intensity biases between the super-resolved I SR and the target I HR . Indeed, if we get a super-resolved image I SR = I HR + , with constant and low enough to avoid numerical saturation, we can consider its reconstruction perfect, since it represents the scene with the same level of detail of the ground truth.
With these premises, inspired by the metric proposed in [58], we defined I SR crop as the super-resolved output cropped of d pixels on each border and we consider each possible patch I HR u,v , u, v ∈ [0, 2d] of size (sH − 2d) × (sW − 2d) extracted from the ground truth I HR . We compute the mean biases between the cropped I SR crop and the patches I HR u,v as follows: The loss is then defined as the minimum mean absolute error (L 1 loss) between I SR crop and each possible alignment patch I HR u,v . We use the MAE instead of the most used MSE since we experimentally find that provides better results for image restoration problems, as proved by the previous works [5], [7], [59].
where · 1 represents the L 1 norm of a matrix, i.e. the sum of its absolute values.
IV. EXPERIMENTS In this section, we test the proposed methodology in an experimental context, training it on a dataset of real-world satellite images and evaluating its performance in comparison with other approaches, including a state-of-the-art SISR algorithm, to demonstrate the superiority of Multi-image models. We first present the dataset and the preprocessing stages, we define all the parameters used during the experimentation, and then we propose quantitative and qualitative results. We also perform an ablation study to demonstrate the contribution of the global residual branch that implements a temporal attention mechanism. To implement our network, we use the TensorFlow framework. The complete code with a pre-trained version of our model is available online 2 .

A. The Proba-V Dataset
To train our model, we exploit the dataset released by the Advanced Concept Team of the European Space Agency (ESA) [58]. This dataset has been specifically conceived for MISR problems, and it is composed of several images taken by  the Proba-V satellite 3 in the two different spectral bands RED and NIR (near-infrared). Proba-V satellite has been launched by ESA in 2013 and is specifically designed for land covering and vegetation growth monitoring across almost the entire globe. The satellite provides images in two resolutions with different revisit frequency. HR images have a 100m per pixel spatial resolution and are released roughly every five days, while LR images have 300m per pixel resolution and are available almost daily. The characteristics of the Proba-V imagery make it particularly suitable for MISR algorithms since it provides both resolutions natively, allowing for the application of the SR process without the need for artificially degrading and downsampling the HR images.
The dataset has been released for the Proba-V Super Resolution challenge 4 and is composed of two main parts: the train part provides both LR and HR images, while the test part LR images, only. In order to verify the effectiveness of our approach, we consider the train part and not the test part, since it has been conceived for the challenge evaluation only and it does not include the ground truths. Thus, we subdivide the train part in training and validation sets. To ease the comparison with previous methods, we use the same validation images used in [50]. In total, we have 415 scenes for training and 176 for validation for the RED band and 396 for training and 170 for validation for NIR.
Each scene is composed of several LR images (from 9 to 35, depending on the scene) with a dimension of 128x128 pixels and a single HR ground truth with a dimension of 384x384 pixels. The images are encoded as 16-bit png files, even if the actual signal bit-depth is 14 bits. Additionally, each image features a binary mask that distinguishes reliable pixels from unreliable ones (e.g., due to cloud coverage). This information is vital since the images are not taken in the same weather and temporal conditions, but a maximum period of 30 days can be covered in a single scene. For this reason, non-trivial changes in the landscape can occur between different LR images and their HR counterpart and are essential to understand which pixels carry meaningful information and which do not. Trying to infer the value of pixels that are concealed by clouds would mean being able to predict the weather in an unknown time by merely looking at the condition in other unspecified moments. For this reason, it is essential to train the network so that unreliable pixels do not influence the SR process. To assess the quality of each image, we define c as the clearance of the image, i.e. the fraction of reliable pixels in the correspondent binary mask.

B. Data pre-processing
Before training the model, we pre-process the dataset with the following steps: • register each LR image using as reference the one with maximum clearance c • select the clearest T images from each scene that are above a certain clearance threshold c min • pre-augment the training dataset with n p temporal permutations of the LR input images • normalize the images by subtracting the dataset mean intensity value and dividing by the standard deviation Since each LR image is taken at a different time and with some intrinsic spatial misalignment with respect to the others, it is important to resample each pixel value in order to have a coherent reference frame. For each scene of the dataset, we consider as a reference image the one with the maximum clearance c. During the registration process, we consider translation as transformation model, which computes the necessary shifts to register each image for both the axes. Masks are taken into consideration during this process in order to avoid bad registration caused by unreliable pixels. The registration is performed in the Fourier domain using normalized crosscorrelation as in [60]. After computing the shifts, both LR images and the correspondent masks are shifted accordingly. We use a reflect padding to add pixels to LR images and a constant zero padding for masks. In this way, these extra pixels will be considered unreliable.
For each scene, we must select some LR images in order to match the temporal dimension T of the network. We set a threshold c min = 0.85 on the clearance for an image to be accepted to avoid using awful images that can worsen the SR performance. The acceptable images are then sorted in order of clearance, and the best T are selected. In the case of a scene with less than T images, we sample randomly from the set of acceptable images until T are reached. If a scene is only composed of clearances under c min , it is entirely removed from the dataset. This selection process is performed after the registration so that heavily bad registered images are also removed, even if they had an initial clearance above the threshold. Since each scene of the dataset contains at least 9 LR images, we set T = 9 to fully exploit all the available information for most of the scenes.
One of the characteristics of the Proba-V dataset is that the LR images of a particular scene have no clear temporal order. Therefore, there is no reason to prefer a specific order in the T input images to another. The training dataset is, therefore, preaugmented by performing n p random temporal permutations of the selected T input images to help generalization. In this way, we can train the algorithm to identify the best temporal image independently on the position inside the input tensor. We set this permutation parameter to n p = 7, reaching a total of 2905 training data-points for RED and 2751 for NIR.
Finally, each image is normalized by subtracting the mean pixel intensity value computed on the entire dataset and dividing by the standard deviation. After the forward pass in the network, all the tensors are then denormalized, and the subsequent evaluations are performed on the 16 bits unsigned integer arrays.

C. Experimental settings
The scaling factor of the Proba-V dataset is s = 3. Since we have different scenes for RED and NIR data, we treat the problem for the two bands separately. For this reason, we have C = 1, since we consider images with a single channel. We set F = 32 and f h = f w = f t = 3 as number of filters and kernel size respectively for each convolutional layer. Therefore, the number of temporal reduction blocks is T /(f t − 1) − 1 = 3, since each block reduces the temporal dimension of 2. In all the residual attention blocks, we set r = 8 as the reduction factor. After testing different values with a grid search, we set N = 12 as the number of residual feature attention blocks in the main branch of the network. We find that decreasing this number causes a loss of performance while increasing it gives a little improvement in the results at the cost of a high increase in the number of parameters. N = 12 is, therefore, the best compromise between network size and prediction results. In total, our model has slightly less than 1M parameters.
In most of the SR applications present in literature, LR images are obtained from the artificial degradation of the target HR images. In contrast, the real-world nature of the dataset, in which LR images are obtained independently from HR images, causes an unavoidable misalignment between the superresolved output and the ground truth. To take into account this problem, the authors of the dataset consider a maximum shift of ±3 pixels on each axis between I SR and target I HR , computed on the basis of the geolocation accuracy of the Proba-V satellite [58]. When computing the loss function presented in Sec. III-D, we can therefore set d = 3. Besides, since the Proba-V dataset also provides binary mask that marks with one reliable pixel and with 0 unreliable (e.g., concealed by clouds) ones, we adapt the loss function to use this information to refine the training process. During the loss computation, we want pixels marked as unreliable in the target binary mask M HR not to influence the loss computation. Practically, we can simply multiply the cropped super-resolved image I SR crop , and the HR patch I HR u,v by the correspondent cropped mask M HR u,v and average all the quantities over the number of clear pixels. The bias computation is therefore adapted from Eq. 9 as: where · 1 represents the L 1 norm of a matrix, i.e. the sum of its absolute values. In the same way, the loss is adapted from Eq. 10 as: To train the model, we extract from each LR image 16 patches with a size of 32 × 32 pixels and the corresponding HR and masks patches with a size of 96 × 96. We further check every single patch and remove those that have a target mask M HR with less than 0.85 clearance. The total number of training data points obtained is 41678 for RED and 40173 for NIR. During the training process, we further perform data augmentation with random rotations of 90 • , 180 • and 270 • and random horizontal flips.
We set the batch size to 32. Therefore, during training, we have an input tensor with shape 32 × 32 × 32 × 9 × 1 and an output tensor with shape 32×96×96×1. We optimize the loss function with Adam algorithm [61] with default parameters β 1 = 0.9, β 2 = 0.999 and = 10 −7 . We set an initial learning rate η i = 5 × 10 −4 and we reduce it with a linear decay down to η f = 5 × 10 −7 . We train two different networks for RED and NIR spectral bands on a workstation with an Nvidia RTX 2080Ti GPU with 11GB of memory and 64GB of DDR4 SDRAM. We use the TensorFlow 2.0 framework with CUDA 10. In total, we train the models for 100 training epochs for about 16 hours.

D. Quantitative results
To evaluate the obtained results, we need to use a slightly modified version of PSNR and SSIM [62] criteria to take into consideration all the aspects we considered in the previous section to obtain a proper loss function. Martens et al. [58] propose a corrected version of the PSNR, called cPSNS, that is obtained from a corrected mean squared error (cMSE). The computation of the cMSE is performed in the same way as we did for the loss in Eq. 12: it is the minimum MSE between I SR crop + b u,v and the HR patches I HR u,v : where MSE clear represents the mean squared error computed only on pixels marked as clear in the binary mask M HR u,v . Again, we can simply multiply the matrices by the mask to make unreliable pixels irrelevant: where · 2 represents the Frobenius (L 2 ) norm of a matrix, i.e. the square root of the sum of its squared values. We can then compute the cPSNR as: cPSNR = 10 log 10 (2 16 1) Temporal self-ensemble (RAMS+): As in Sec. IV-B, during the training process images are augmented with random permutation in the temporal axis. For this reason, it is possible to maximize the performance of the model, by adopting a self-ensemble mechanism during inference, similarly to what done in previous super-resolution works [5], [7], [63]. For each input scene, we consider a certain number P of random permutations on the temporal axis and we denote as I LR [1,T ], 0 , · · · , I LR [1,T ], P the resulting set of inputs. The output of the inference process is therefore the average of the predictions on the whole set. We call this methodology RAMS+ P , where P is the number of random permutations performed: Fig. 4 shows cPSNR results on the testing dataset for a different number of permutated predictions. The trend clearly shows how increasing P results in better performance on both the spectral bands, with an effect that tends to saturate for P ≥ 25. For the following evaluation, we select P = 20 to present the results for RAMS+. It is worth noting that, even if this method allows to increase the performance of the network sharply, inference time grows linearly with P , with RAMS+ 20 taking roughly 20 times as long as RAMS. Another aspect to highlight is that the permutations are performed randomly, so different results can be achieved even with the same value of P .
2) Comparison with state-of-the-art methods: Tab. I shows the comparison of cPSNR and cSSIM metrics with several methods on the validation set. We consider as the baseline the   bicubic interpolation of the best image of the scene selected considering the clearance, i.e., the number of clear pixels as marked by the binary masks. IBP [13] and BTV [43] methods are tested with the same methodology presented in Molini et al. [50]. They achieve slightly better results than the baseline with both the metrics. RCAN [7] is currently one of the Single-image Superresolution state-of-the-art networks. We trained from scratch two models, one for each spectral band, setting G = 5 and B = 5, as the number of residual groups and residual channel attention blocks respectively, for a total of about 2 million parameters. We train the two models from scratch on the Proba-V dataset, selecting the best image per scene as input. RCAN shows better performance with respect to classical methods but is far beyond the other MISR networks, showing how the additional information coming from both spatial and temporal correlations is vital to boost the super-resolution process.
VSR-DUF [47] has been developed to upsample video signals using a temporal window of several frames. We train two models from scratch, one for each spectral bands, using 9 LR images as input as in our methodology. The authors consider three different architectures depending on the number of convolutional layers and find better results, increasing the depth of the model. We select the baseline 16 layers deep architecture, that already has more than double parameters with respect to RAMS, with the same number of input images.
HighRes-net [51] algorithm got the second place in the Proba-V challenge and featured a single network for both spectral bands that recursively reduce the temporal dimension to fuse the input LR images. We train the model on our training dataset with default architectures. Since the authors designed the architecture to have an input temporal dimension multiple of 2, we set it to 16, as it is closest to 9.
DeepSUM [50] is the algorithm winner of the original Proba-V challenge, and the authors have further developed it with DeepSUM++ [64]. We train our RAMS on the same training dataset as these two works.
Results clearly show how the proposed methodology can obtain the best results with the two metrics on both the spectral bands and thus represents the current state-of-the-art for Multiimage Super-resolution for remote sensing applications. Using temporal self-ensemble, RAMS+ is able to achieve even higher performance. We show the value for RAMS+, setting P = 20 as the size of the ensemble, which is the value at which we experimentally find that the resulting gain starts to saturate. However, further increasing the ensemble size can result in even better performance, though at the cost of a significant inference speed drop.
It is worth mentioning that our methodology reaches a together. Fig. 5 shows a direct comparison between the cPSNR results of RAMS and the bicubic interpolation baseline and RCAN (SISR state-of-the-art). Each cross represents a scene of the validation dataset of the corresponding spectral band. The graphs on the left show how our method strongly beats the bicubic upsampling on almost all the scenes, 98% for RED and 91% for NIR. That is coherent with a general worse behavior of all the methods on the NIR images, probably due to an intrinsic worse information quality of the NIR dataset. The graphs on  the right show, on the other hand, the enormous potential of MISR with respect to SISR methods. It can be observed how again RAMS outperforms RCAN an almost all the scenes, with results only slightly worse than to bicubic interpolation, 92% for RED, and 91% for NIR. That is reasonable since RCAN results are someway in the middle between bicubic and RAMS.
3) Importance of the residual temporal attention branch: As a final analysis, we perform an ablation study to demonstrate the importance of the global residual branch that implements a temporal attention mechanism. We train two alternative networks, one for each spectral band, that have the same architecture of RAMS, except that we delete the residual temporal attention (RTA) branch. These reduced networks are trained from scratch independently from the complete ones. Tab. II shows a significant drop in the results obtained without the global residual branch and demonstrates the importance of selecting the best temporal views to ease the super-resolution process of the main branch. We find this difference particularly relevant for the RED band, since the training repeatedly failed without the RTA branch, with a diverging behavior after some epochs. The result reported in the table is computed with the last valuable parameters before the divergence starts.

E. Qualitative results
A visual comparison between some of the methods taken in the exam is shown in Fig. 6 and 7 for a RED and NIR image respectively. We provide a zoomed patch of the best LR input image of the scene, its bicubic interpolation, and the inference output of RCAN, VSR-DUF, DeepSUM, RAMS and RAMS+ 20 , together with the target HR ground truth. cPSNR and cSSIM scores for the image under analysis are also provided. From this comparison, MISR methods clearly show a better performance with respect to bicubic and SISR (RCAN). However, it is not trivial to understand which method is the better among MISR algorithms with a visual inspection of the results, only. As found by Ledig et al. [21], the task of achieving pleasant-looking results is a different optimization problem from maximizing the fidelity of the reconstructed information. Therefore, results with high content-related metrics as PSNR and SSIM frequently appear less photo-realistic to a human eye. However, in the context of remote sensing, the fidelity of the pixels content is vital to ensure that the super-resolved image are meaningful, thus the quality of results should be inferred by using content-related metrics, rather than by visual inspection.
V. CONCLUSION In this paper, we proposed a novel representation learning model to super-resolve remotely sensed multi-temporal LR images by exploiting concurrently spatial and temporal correlations. We introduced a feature and temporal attention mechanisms with 3D convolutions that, coupled with nestled residual connections, let the network focus on high-frequency components, flow redundant low-frequency information and transcend the local region of convolutional operations. Extensive experiments on the open-source Proba-V MISR dataset, either with single image and multi-image SR methodologies, demonstrated the effectiveness of our proposed methodology. In both NIR and RED spectral bands, our efficient and straightforward solution achieved considerably better results than other literature methodologies obtaining 48.51 dB and 50.44 dB of cPSNR, respectively for the two channels. That is further proved by the score of the official post-mortem Prova-V challenge where RAMS claimed the first place in the leaderboard. Future work may investigate the performance of the RAMS architecture on hyperspectral remote sensing imaging.