Deep Distillation Recursive Network for Remote Sensing Imagery Super-Resolution

Deep convolutional neural networks (CNNs) have been widely used and have achieved state-of-the-art performance in many image and video processing and analysis tasks. In particular, for image super-resolution (SR), previous CNN-based methods have led to significant improvements over shallow learning-based methods. However, previous CNN-based algorithms with simple direct or skip connections perform poorly when applied to the SR of remote sensing satellite images. In this study, a simple but effective CNN framework, namely the deep distillation recursive network (DDRN), is presented for video satellite image SR. DDRN includes a group of ultra-dense residual blocks (UDB), a multi-scale purification unit (MSPU), and a reconstruction module. In particular, through the addition of rich interactive links in and between the multiple-path units of each UDB, features extracted from multiple parallel convolution layers can be shared effectively. Compared with classical dense-connection-based models, DDRN possesses the following main properties. (1) DDRN contains more linking nodes with the same number of convolution layers. (2) A distillation and compensation mechanism, which performs feature distillation and compensation in different stages of the network, is also constructed. In particular, the high-frequency components lost during information propagation can be compensated in MSPU. (3) The final SR image can benefit from both the feature maps extracted from UDB and the compensated components obtained from MSPU. Experiments on the Kaggle Open Source Dataset and Jilin-1 video satellite images illustrate that DDRN outperforms the conventional CNN-based baselines and some state-of-the-art feature extraction approaches.


Introduction
In recent years, remote sensing imaging technology has developed rapidly and now supports a wide range of applications, such as object matching and detection [1][2][3][4], land cover classification [5,6], assessment of urban economic levels, resource exploration [7], etc. [8,9]. In these applications, high-quality or high-resolution (HR) imagery is usually desired for remote sensing image analysis and processing. The most technologically advanced satellites can resolve spatial details finer than one square meter on the Earth's surface. However, owing to the high cost of launch and maintenance, the satellite imagery available to ordinary civilian applications is often of low resolution (LR). Therefore, it is very useful to construct HR remote sensing images from existing LR observations [10].
Compared with general images, the quality of satellite imagery is subject to additional factors, such as ultra-long-distance imaging, atmospheric disturbance, and relative motion. All these factors can impair the spatial resolution or clarity of satellite images, but video satellite imagery is more severely affected owing to over-compression. More specifically, because a video satellite captures continuous dynamic video, its optical imaging system has to sacrifice spatial resolution to improve temporal resolution. At present, the raw data volume of a video satellite has reached the Gb/s level, but the channel transmission capacity of the spaceborne communication system is only at the Mb/s level. To adapt to the transmission capacity of the satellite channel, the video acquisition system has to increase the compression ratio or reduce the spatial sampling resolution. For example, for the video imagery captured by "Jilin No. 1", launched by China in 2015, the frame rate reaches 25 fps but the resolution is only 2048 × 960 pixels (equivalent to 1080P), and hence the imagery looks very blurred. Therefore, the loss of high-frequency details caused by excessive compression is a special concern for video satellite imagery SR.
To address the above-mentioned problems, a series of SR techniques for the restoration of HR remote sensing images have been proposed [10][11][12][13][14]. For example, Merino et al. proposed the super-resolution with variable-pixel linear reconstruction algorithm, named SRVPLR [15], which recombines a set of LR images in a linear, nonuniform, optimal manner. In [16], a hidden Markov tree model is proposed to establish a prior model in the wavelet domain to regularize the ill-conditioned problem of remote sensing image SR restoration. To fully use prior knowledge from a given LR image, Gou et al. [17] presented a non-local pairwise dictionary learning (NPDL) based model. In this model, the photometric, geometric, and feature information of the given LR image is considered to improve the quality of reconstruction.
However, these shallow learning-based frameworks show poor reconstruction performance when a high ground object resolution is required in practical applications. Recently, given the strength of deep CNNs, many CNN-based methods have evolved to deal with complex tasks in various applications [18][19][20], such as medical imaging, satellite imaging, and video surveillance [21,22]. In particular, these effective architectures have achieved very good performance in general image SR reconstruction. For example, Dong et al. [23] introduced a three-layer CNN into single image SR (SISR) and achieved considerable improvement. Then, Kim et al. [24] proposed a residual network called VDSR, which uses adaptive gradient clipping and skip connections to alleviate the training difficulty. More recently, Lai et al. [25] proposed the deep Laplacian pyramid super-resolution network (LapSRN) to reconstruct the sub-band residuals of HR images at multiple pyramid levels. In LapSRN, a weight-sharing mechanism is implemented within the same structure, thus considerably reducing the number of parameters. However, increasing the depth of a deep CNN framework causes loss of information, thus weakening the continuity of information propagation. Moreover, these conventional CNN-based or residual-learning-based structures, which rely on simple direct or skip connections, fail to restore fine texture details under complex imaging conditions. In particular, remote sensing satellite imagery has a complicated degradation process, low ground object resolution, and weak textures, thus posing considerable challenges for SR reconstruction.
Recently, Huang et al. [26] introduced the dense convolutional network (DenseNet) to strengthen feature propagation and encourage feature reuse by connecting each layer to every other layer in a feed-forward manner. Furthermore, in [27], the feature maps of each layer are propagated into all subsequent layers, thus providing an effective method of combining low- and high-level features to boost reconstruction performance. Tai et al. [28] proposed memory blocks to build MemNet, which relies heavily on long-term dense connections to recover more high-frequency information. Although these methods can enforce information propagation by increasing the nodes between layers with skip or dense connections, the features are fused by concatenation, which leads to a large computational burden and high memory consumption.
Following the idea of sharing weights among recursive nodes, recursive learning networks have recently been used to reduce the redundant parameters of the network. For example, Kim et al. [29] used more layers to enlarge the receptive field of the network and proposed a very deep recursive layer to avoid excessive parameters. In addition, skip connections are used to mitigate the training difficulty. Tai et al. [30] proposed a deep recursive residual network that recursively learns the residual unit in a multi-path model to balance model parameters and accuracy. More recently, Yang et al. [31] used the LR image and its edge map to infer sharp edge details of an HR image during the recurrent recovery process. However, the simple connections used in these models [29,30] severely limit the SR reconstruction performance.
In this study, a novel ultra-dense connection scheme is proposed to improve the reconstruction performance, along with a recursive strategy to mitigate memory consumption. Compared with the conventional skip- and dense-connection-based networks [24,26], the proposed UDB contains approximately twice as many short and long paths as the conventional dense block given the same convolution layers, which greatly enhances the representational power of the network. In addition, the parameter-sharing strategy between UDBs greatly relieves the memory burden. We also find that feature distillation in different stages leads to better accuracy for deep SR networks. Thus, we distill the feature maps by selecting part of the output (with a given ratio) in different stages while retaining its integrity. After obtaining the feature maps from the different UDBs, we aggregate these components in a multi-scale purification unit to gain more abundant and efficient information.
The strategy of feature distillation and compensation is clearly different from the knowledge distillation studied in [32,33], which compact deep networks by letting a small simple network learn from a large complex network. In [34], the authors distilled a multi-model complex network by retaining the necessary network knowledge while keeping close performance. In [35], Pintea et al. substantially reduced the parameters by recasting multiple residual layers of the large network into a single simple recurrent layer. In contrast, our proposed distillation and compensation strategy is mainly used to compensate for the high-frequency details lost during information propagation rather than for model compression.
In summary, the main contributions of this work are as follows:

1. We propose a novel deep distillation recursive network DDRN for remote sensing satellite image SR reconstruction in a convenient and effective end-to-end training manner.

2. We propose a novel multiple-path residual block UDB, which provides additional possibilities for feature extraction through ultra-dense connections, quite agreeing with the uneven complexity of image content.

3. We construct a distillation and compensation mechanism to compensate for the high-frequency details lost during information propagation through the network with a special distillation ratio.

The remainder of this paper is organized as follows. In Section 2, we introduce previous works on CNN-based SR reconstruction algorithms, particularly network structures for feature extraction. Section 3 presents the framework of the proposed DDRN. Section 4 presents in detail the design of each key module under the proposed DDRN framework, including UDB, MSPU, resolution lifting, and the loss function. Experimental results are given in Section 5, and the conclusions of this study are given in Section 6.

Related Work
We briefly review previous related works on structure-efficient networks [25,29][36][37][38], from which our network draws inspiration. These deep networks are committed to learning fine texture details by designing a sophisticated structure. In this section, we focus on recent skip- and dense-connection-based methods.
Skip connection: A skip connection that directly connects the input to the output through an identity map, as shown in Figure 1b, was pioneered for SISR by Kim et al. [24]. They proposed a 20-layer CNN model known as VDSR. Instead of learning the actual pixel values, VDSR harnesses the global residual learning paradigm to predict the differences between the ground truth and the bicubic interpolated image. This learning strategy makes the feature maps very sparse, enabling easy training and convergence. Compared with the traditional methods [39][40][41][42], this learning strategy shows significantly better reconstruction performance on the benchmark datasets in terms of visual and quantitative indicators. In addition, DRCN [29] constructs a recursive-supervision structure to further alleviate the difficulty of training a deep residual network. Recently, Lai et al. [25] proposed a deep Laplacian pyramid super-resolution network (LapSRN) to reconstruct the sub-band residuals of HR images at multiple pyramid levels with skip connections.

Figure 1. (a) Flat-net (e.g., SRCNN [23] and FSRCNN [43]): direct connections are commonly used to learn the features. (b) Skip-net (e.g., VDSR [24]): an identity map connecting the input to the output, pioneered for SISR. (c) Dense-net (e.g., DenseNet [26] and SRDenseNet [27]): the feature maps are directly passed from the preceding layers to the current layers through the identity function with much richer connections. (d) UDB: interacting multiple-path units are embedded for extracting local feature maps with richer ultra-dense connections. "C" and "+" denote the concatenation and addition operations, respectively.
Dense connection: Inspired by previous works, Huang et al. [26] recently presented an intensive form of skip connection called dense connection. As shown in Figure 1c, the feature maps of the current layer are connected to every subsequent layer in a feed-forward manner. With rich local dense connections, the current layer can aggregate the information from all of the preceding layers within the dense block for further selection and fusion. These strategies effectively address the vanishing-gradient problem and enhance information propagation, thus strengthening the feature expression and speeding up convergence. Subsequently, Tong et al. [27] proposed an enhanced version called SRDenseNet. In SRDenseNet, the feature maps obtained from each dense block are propagated into the deconvolution layers to reconstruct SR images, providing an effective way to combine the low-level and high-level features, which further boosts the reconstruction performance. In addition, the dense skip connections in the network build short paths that link each layer directly to the output, thus mitigating the vanishing-gradient problem. Regarding research on feature extraction and fusion, the earlier work of Huang et al. [38] is also noteworthy. They proposed a multi-scale dense network for resource-efficient image classification. Their main idea is to train multiple classifiers in different stages using a two-dimensional multi-scale architecture, which preserves the coarse- and fine-level features throughout the network.
Ultra-dense connection: The above-mentioned strategies have been proven effective in addressing the vanishing-gradient problem and in guaranteeing accurate feature extraction and fusion. However, the direct concatenation over all layers in previous works [27,38] leads to high memory consumption and computational burden. In addition, conventional dense-connection-based networks must become deeper as more skip paths are required, and the resulting increase in computational burden and memory consumption is unacceptable.
As shown in Figure 1d, on the basis of the dense network [26], we propose a multiple-path residual block called UDB. Compared with conventional skip or dense networks [24,26,27,29], UDB contains richer short and long paths with the same convolution layers. In particular, given the multiple-path units and the transition layer, the feature channels become shallower, which greatly reduces the parameters and decreases the computational burden and memory consumption.

Network Architecture
As shown in Figure 2, the proposed model is a deep recursive neural network that can be roughly partitioned into three substructures, namely, local feature extraction and fusion, feature distillation, and feature compensation and SR reconstruction. Except for the upsampling operation, and motivated by previous works on SISR [24,25,27,43], the entire process of local feature extraction and fusion is performed in the LR space. $I_{LR}$ and $I_{SR}$ denote the LR input and HR output of the proposed DDRN, respectively. $F_i$ and $B_j$ refer to the output of the $i$-th layer and the $j$-th block, respectively. In this work, the LR RGB images are directly fed into the network and processed with the initial convolutional layers (two layers with 3 × 3 kernels) to extract features as follows:

$$F_1 = H(I_{LR}), \qquad F_2 = H(F_1),$$

where $H(\cdot)$ denotes the convolution operation. $F_1$ and $F_2$ represent the shallow feature maps extracted through the initial convolutional layers, which serve as the input of the UDB. Moreover, the proposed residual block UDB is used as the basic module for local feature extraction in DDRN. For each UDB, the information can not only be shared among layers and multiple-path units but also be used as the input for the subsequent residual blocks with ultra-dense connections. These strategies enforce information propagation and lead to fine feature expression by combining the multi-scale coarse and fine features in different stages. The operation can be defined as follows:

$$B_i = H_{block,i}(B_{i-1}),$$

where $H_{block,i}$ denotes the entire convolution operation in the $i$-th UDB and $B_{i-1}$ refers to the feature maps extracted from the $(i-1)$-th UDB. As shown in Figure 1, compared with the conventional CNN-based modules [24][25][26]29,30], whose commonly used residual block contains only simple direct or skip connections between layers, the proposed UDB module is composed of several interactive multiple-path units and parametric rectified linear units (PReLU). The dedicated architecture of UDB offers more linking paths with the same layers and provides more possibilities for feature extraction than these previous strategies, thus matching the uneven content complexity of remote sensing imagery. Specifically, the simple links are adapted to smooth areas, whereas complex connections are suited to high-frequency texture details.

In previous SISR algorithms [24,27,29,30], the output of the current stage is directly transmitted to the next stage, and the final residual maps are obtained at the top layer for SR reconstruction. However, information loss is inevitable during propagation through the network, thereby weakening the continuity of information propagation. Previous works add a set of nodes, the so-called skip connections [24,29], to shorten the transmission distance, thus boosting information propagation and reducing information loss. However, increasing the nodes between the input and the output not only deepens the network but also increases the computational burden and memory consumption. In contrast, we facilitate information propagation with the multiple-path residual module UDB. Furthermore, we present a distillation and compensation strategy for fine feature expression by compensating for extra high-frequency details. As shown in Figure 3, unlike the traditional network, in which the output of each block is directly transmitted to the subsequent part, our proposed method adaptively distills and preserves the feature maps by selecting part of the information from the current output while retaining its integrity. Then, these feature maps collected from different stages are aggregated and purified in MSPU to infer and compensate for the high-frequency details before the reconstruction operation. In this study, we denote the part preserved from $B_i$ as the distillation unit (DU) with the ratio of $\alpha$. At the same time, $B_i$ is used as the input to the subsequent residual block for further extraction. This process can be formulated as follows:

$$DU_i = S(B_i, \alpha),$$

where $\alpha$ refers to the distillation ratio, which indicates that a fraction $\alpha$ of the feature maps in each stage is distilled and preserved. In our experiments, we set $\alpha$ to {0.0, 0.125, 0.25, 0.5}. $S(\cdot)$ represents the distillation operation, and $DU_i$ denotes the information distilled from the $i$-th residual block $B_i$.
In addition, the reserved feature maps $DU_i$ in different stages are aggregated through a concatenation operation and then fed into the purification unit MSPU, where the HR components lost in the previous blocks are reactivated as a compensation for SR reconstruction:

$$P = M\bigl(H_C(DU_1, DU_2, \ldots, DU_n)\bigr). \qquad (5)$$

In Equation (5), $H_C(\cdot)$ denotes the concatenation operation adopted to collect the distilled information and $M(\cdot)$ refers to the MSPU. Through the distillation and compensation mechanism, the high-frequency components compensated from MSPU can further improve the reconstruction performance.
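The distillation and collection steps can be pictured as a channel split followed by a concatenation. The following is a minimal NumPy sketch under stated assumptions: $S(\cdot)$ is taken to keep the leading fraction $\alpha$ of the channels of each block output, and $H_C(\cdot)$ is taken to concatenate the kept parts along the channel axis. The selection rule and the function names `distill` and `collect` are illustrative choices and are not taken from the paper.

```python
import numpy as np

def distill(block_output, alpha):
    """Keep a fraction alpha of the channels (assumed: the leading ones)."""
    channels = block_output.shape[-1]
    kept = int(alpha * channels)
    return block_output[..., :kept]

def collect(block_outputs, alpha):
    """Concatenate the distilled parts of all blocks along the channel axis."""
    parts = [distill(b, alpha) for b in block_outputs]
    return np.concatenate(parts, axis=-1)

# Three hypothetical block outputs of size 32 x 32 with 64 channels each.
blocks = [np.random.rand(32, 32, 64) for _ in range(3)]
mspu_input = collect(blocks, alpha=0.25)
print(mspu_input.shape)  # (32, 32, 48): 16 channels kept from each block
```

With $\alpha = 0$ the collected tensor has no channels, which corresponds to a network without compensation.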
At the end of the network, the feature maps extracted from the top UDB and the compensated high-frequency details purified by MSPU are combined by a transition layer with a 3 × 3 kernel to infer and restore the HR components. Then, a sub-pixel upsampling operation is used to project these features into the HR space to obtain the residual image. The detailed operation is expressed as follows:

$$I_{SR} = PS\bigl(H_S(D_n, P)\bigr) + I_B, \qquad (6)$$

where $D_n$ and $P$ represent the feature maps extracted from the top UDB and the compensated details from MSPU, respectively. $H_S$ denotes a transition function that contains a 3 × 3 convolution layer to adaptively fuse features and infer HR components. $I_B$ refers to the bicubic interpolated image. $PS(\cdot)$ represents the reconstruction operation, which performs a sub-pixel amplification to obtain the HR residual image at the end of the network.

Feature Extraction and Distillation
In this section, we present in detail the design of each key module under our DDRN framework, including UDB, MSPU, and resolution lifting.

Ultra-Dense Residual Block (UDB)
It is acknowledged that rich dense connections can promote feature expression [26,27]. Therefore, we design a dense connection module for feature extraction. In this study, a multiple-path residual block UDB is constructed to enforce the correlation among layers and blocks with rich dense connections. Compared with existing skip- or dense-connection-based methods, UDB considers diverse short and long linking paths (the multiple-path structure) and exhibits effective information-sharing capability among the layers. Therefore, our network provides additional possibilities for feature extraction, quite agreeing with the uneven complexity of image content. More precisely, simple links are adapted to smooth areas, whereas complex connections are suited to high-frequency texture details. As shown in Figure 1d, UDB includes several interactive multiple-path units, which fuse the feature maps extracted from multiple parallel convolution paths. The information-sharing mechanism aggregates features at different levels to further ensure a rich feature representation. The function of the $i$-th unit is formulated in Equations (7) and (8), which formally describe the operation process in a multiple-path unit. Equation (7) involves the single convolution operation and the feature congregation of multiple convolution layers in each unit. In Equation (8), $y_i$ denotes the feature concatenation in the current unit, $s_{i,n}$ indicates the transition output in the $n$-th path of the $i$-th unit, and $s_{i-1,n}$ represents the output from the $n$-th path of the $(i-1)$-th unit. Functionally, a group of skip connections is used to enforce the correlation among the input and output feature maps, where the transition layers, represented as $H_1$, are embedded to reduce the feature channels with a 1 × 1 convolution kernel.
Unlike skip- or dense-connection-based algorithms [26][27][28], the proposed multiple-path ultra-dense connection block can simultaneously explore and infer local and global features. In particular, the feature maps in the multiple-path unit can not only be shared among the layers in the current unit through aggregation and dense connections but also be used as the input of other units through skip connections. Given the simplicity, effectiveness, and robustness of this strategy, local features can be well expressed through numerous short and long paths. Furthermore, owing to the effective structure for feature extraction in UDB, the network can be shallow in channels but wide in convolution paths, which greatly reduces the parameters and simultaneously boosts the reconstruction performance.

Multi-Scale Purification Unit (MSPU)
In [44], the authors focused on channels and proposed a novel architectural unit termed the "squeeze-and-excitation" (SE) block to adaptively recalibrate channel-wise feature responses by explicitly modeling the interdependencies between channels. The SE block can learn to use global information to selectively emphasise informative features and suppress less useful ones. This model won first place in the ILSVRC2017 classification contest [45].
In this study, we adopt the SE module because of its promising efficiency and efficacy. On this basis, we propose an applicable module, MSPU, for information compensation. The basic structure of the MSPU building unit is illustrated in Figure 4. Contrary to the squeeze-and-excitation network (SEN) [44], the redundant residual connections between SE blocks used for feature transmission are removed. In addition, given that a fully connected layer can destroy the internal structure of the image, we replace it with a 1 × 1 convolution layer. Moreover, we adopt a more robust activation function, the parametric rectified linear unit (PReLU), to replace the rectified linear unit (ReLU) used in the original version.
On the basis of the MSPU process, we further propose a distillation and compensation strategy to compensate for lost details. By partially distilling the components from $B_i$ with the distillation ratio of $\alpha$, as shown in Figure 3, we obtain feature maps originating from UDBs in different stages. Then, these features are aggregated into MSPU to purify them and gain more abundant and efficient information. The extraction functions are defined in Equations (9) and (10). In Equation (9), the input $x$ denotes the concatenation of the distilled components in different stages, equivalent to the output of $H_C(\cdot)$ in Equation (5), and $H(\cdot)$ represents a group of convolutional operations (with 3 × 3 kernels) adopted to fuse the features distilled from different levels. As expressed in Equation (10), $A_P$ denotes the global average pooling, $H_1$ refers to the group of transition layers that comprises the bottleneck structure, and $\sigma$ represents the sigmoid function.
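The channel recalibration described above (global average pooling, a bottleneck of 1 × 1 transition layers with PReLU, and a sigmoid gate) can be illustrated with a short NumPy sketch. It is a simplified illustration under stated assumptions rather than the authors' implementation: biases are omitted, the PReLU slope is fixed instead of learned, the 3 × 3 fusion convolutions are left out, and the reduction from 64 to 16 channels is a hypothetical choice. On a pooled 1 × 1 × C tensor, a 1 × 1 convolution reduces to a matrix product, which is what the code uses.

```python
import numpy as np

def prelu(x, slope=0.25):
    """PReLU with a fixed slope (the slope is learned in a real network)."""
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(features, w_down, w_up):
    """SE-style reweighting with 1 x 1 convolutions instead of dense layers.

    features: (H, W, C); w_down: (C, C_mid); w_up: (C_mid, C).
    """
    pooled = features.mean(axis=(0, 1))   # global average pooling -> (C,)
    hidden = prelu(pooled @ w_down)       # bottleneck: reduce the channels
    weights = sigmoid(hidden @ w_up)      # per-channel weights in (0, 1)
    return features * weights             # broadcast over H and W

rng = np.random.default_rng(0)
feats = rng.standard_normal((32, 32, 64))
w_down = rng.standard_normal((64, 16)) * 0.1   # hypothetical weights
w_up = rng.standard_normal((16, 64)) * 0.1
out = channel_attention(feats, w_down, w_up)
print(out.shape)  # (32, 32, 64)
```

Because every weight lies strictly between 0 and 1, the gate can only attenuate a channel, never amplify it; informative channels are the ones attenuated least.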

Resolution Lifting
To project a single LR image into the HR space, the resolution of the LR image must be increased to match that of the HR image at a certain point. Osendorfer et al. [46] presented a computationally efficient architecture for image SR that leverages fast approximate inference to gradually increase the image resolution in the middle of the network. Another well-known approach achieves spatial resolution enhancement by interpolation [23,24]: the target image resolution is obtained by directly applying the common bicubic interpolation before loading the dataset into the network.
In addition, the early work of Shi et al. [47] is noteworthy when considering the upsampling operation. Contrary to previous works, the authors proposed an efficient sub-pixel convolution layer that increases the image resolution only at the final layer, eliminating the need to perform most of the SR operations in the large HR space. Compared with transposed convolution and bicubic interpolation, sub-pixel magnification [47] is actually a realignment of feature maps without extra parameters, thus considerably decreasing memory consumption and computational cost. These properties enable the network to go deeper and be trained easily.
As expressed in Equation (11), $PS$ is a shuffling operator that rearranges the elements of an $H \times W \times C r^2$ tensor acquired in the top layer into an $rH \times rW \times C$ tensor (where $r$ is the magnification factor of the network, and $C$ refers to the number of image channels). Mathematically, the upsampling function can be expressed as follows:

$$PS(T)_{x,y,c} = T_{\lfloor x/r \rfloor, \lfloor y/r \rfloor}\bigl(\mathrm{mod}(x, r), \mathrm{mod}(y, r)\bigr), \qquad (11)$$

where $T$ indicates the output from the final layer with the size of $H \times W \times C r^2$, $(x, y)$ denotes the output pixel coordinate in the HR space, $(\lfloor x/r \rfloor, \lfloor y/r \rfloor)$ is the corresponding pixel coordinate in the LR space, and $(\mathrm{mod}(x, r), \mathrm{mod}(y, r))$ is the offset within the $r \times r$ sub-pixel block. The $C r^2$ channels of each pixel in the LR space are rearranged into a region of $r \times r \times C$, which corresponds to a sub-block of the HR image, so that the feature maps are rearranged into an HR image of $rH \times rW \times C$.
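The rearrangement can be made concrete with a few lines of NumPy. This is a minimal sketch under one stated assumption about the channel layout: the $C r^2$ channels are taken to be ordered as (vertical offset, horizontal offset, output channel). Other orderings are possible and differ only in the reshape; the function name `pixel_shuffle` is illustrative.

```python
import numpy as np

def pixel_shuffle(t, r):
    """Rearrange an (H, W, C*r*r) array into an (r*H, r*W, C) array."""
    h, w, cr2 = t.shape
    c = cr2 // (r * r)
    t = t.reshape(h, w, r, r, c)      # split channels into (dy, dx, c)
    t = t.transpose(0, 2, 1, 3, 4)    # reorder to (h, dy, w, dx, c)
    return t.reshape(h * r, w * r, c)

# A 2 x 3 feature map with 8 channels becomes a 4 x 6 map with 2 channels.
lr_features = np.arange(2 * 3 * 8).reshape(2, 3, 8)
hr = pixel_shuffle(lr_features, r=2)
print(hr.shape)  # (4, 6, 2)
```

No parameters are involved: the output pixel at $(y, x)$ is read from the LR pixel at $(\lfloor y/r \rfloor, \lfloor x/r \rfloor)$, with the offsets $(\mathrm{mod}(y, r), \mathrm{mod}(x, r))$ selecting the channel group.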
In this work, as in many CNN-based SISR methods [25,47,48], we adopt the sub-pixel upsampling strategy to reconstruct the HR image at the top layer because of its promising efficiency and efficacy.

Loss Function
It is well known that SISR is an ill-posed problem whose solution under the reconstruction constraint is not unique because of the insufficient number of LR images, ill-conditioned registration, and the unknown degradation process. In previous works, the loss function is commonly designed to fit the real target image by minimizing the distance between the reconstructed HR image and the ground truth. The commonly used distance measurements include the pixel-based $l_1$-norm [25] and $l_2$-norm [23,24,29], and the cosine distance at the feature level.
Most of the previous works [23,27,29] constrain the reconstructed image by minimizing the mean squared error (MSE), which is equivalent to maximizing the peak signal-to-noise ratio (PSNR), a common measure used to evaluate SR algorithms [49]. However, the capability of MSE to capture perceptually relevant components, such as high-frequency texture details, is insufficient because it is defined on the basis of pixel-wise image differences [50]. For example, the previous works [23,29,43] use the MSE loss as the cost function and produce overly smooth reconstruction results that are inconsistent with human vision. In [25,51], the authors proposed the Charbonnier loss, an objective function based on the $l_1$-norm, which can recover a large amount of realistic detail that is more faithful to the ground truth. In our work, we therefore introduce the Charbonnier penalty function to penalize the deviation of the prediction from the residuals of the ground truth. The loss function can be expressed as follows:

$$Loss(I_{SR}, I_{HR}, \theta) = \arg\min_{\theta} \sum \rho\bigl(I_{HR} - f(I_{LR}, \theta)\bigr), \qquad (12)$$

where $\theta$ denotes the set of model parameters to be optimized and $\rho(x) = \sqrt{x^2 + \varepsilon^2}$ represents the Charbonnier penalty function (a differentiable variant of the $l_1$-norm). We empirically set the compensation parameter $\varepsilon$ to $10^{-3}$. $I_{SR}$ and $I_{HR}$ refer to the predicted HR image and the ground truth, respectively.
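The penalty in Equation (12) is straightforward to evaluate. The sketch below is a NumPy illustration of the summed Charbonnier penalty for a given prediction; the function name and the use of a plain sum (rather than a mean over pixels or a batch) are assumptions made for the example.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Sum of sqrt(diff^2 + eps^2) over all elements."""
    diff = target - pred
    return np.sum(np.sqrt(diff * diff + eps * eps))

pred = np.zeros((4, 4))
target = np.full((4, 4), 3.0)
print(charbonnier_loss(pred, target))  # close to 16 * 3 = 48
```

For errors much larger than $\varepsilon$ the penalty behaves like the absolute error, whereas near zero it is smooth and has a well-defined gradient, which is the reason it is described as a differentiable variant of the $l_1$-norm.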

Experimental Results and Analysis
In this section, we first describe the experimental settings, including the data collection and model parameters. Then, we assess the effect of the distillation ratio $\alpha$ and the network depth $m$ on the reconstruction performance. Subsequently, we compare our results with those of state-of-the-art techniques and provide a thorough analysis. To ensure a fair comparison, we retrain the comparison algorithms, including SRCNN [23] and VDSR [24], with our training dataset. Moreover, we directly apply the original models [23][24][25], trained with general image datasets, as the anchors.

Data Collection
For general image SR, a large quantity of public training and assessing datasets, such as DIV2K [52], BSD500 [53] and Yang291 [39], are used to evaluate the results.However, few available datasets can be used as the training samples for satellite imagery SR because of the special requirements of ground target resolution.We use two available satellite image datasets, namely, Kaggle Open Source Dataset and Jilin-1 video satellite imagery, to train and evaluate the proposed DDRN method.

1. The first imagery dataset is the Kaggle Open Source Dataset (https://www.kaggle.com/c/draper-satellite-image-chronology/data), which contains more than 1000 HR aerial photographs captured in southern California. The photographs were taken from a plane and are meant as a reasonable facsimile of satellite images. The images are grouped into sets of five, and all images in a set share the same setId. Each set corresponds to one scenario and contains five images captured on different days (not necessarily at the same time each day). The images in each set cover approximately the same area but are not exactly aligned. Images are named according to the convention (setId-day).
In this dataset, each scene has 3099 × 2329 pixels, and there are 324 different scenarios. A total of 1720 satellite images cover agriculture, airplane, buildings, golf course, forest, freeway, parking lot, tennis court, storage tanks, and harbor. In this study, 30 different categories are selected for the test and 10 for the evaluation. Meanwhile, a total of 350 images are used for the training. For the training dataset, the entire images are cropped into many patches of 720 × 720 pixels, whereas only the central 720 × 720 area of each testing image is cropped for testing and evaluation.
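The two cropping modes described above (many 720 × 720 crops per training image, one central 720 × 720 crop per test image) can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the use of non-overlapping tiles is an assumption, since the text does not state whether the training crops overlap.

```python
import numpy as np

def center_crop(img, size=720):
    """Return the central size x size region of an (H, W[, C]) image."""
    h, w = img.shape[:2]
    if h < size or w < size:
        raise ValueError("image is smaller than the requested crop")
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def tile_crops(img, size=720):
    """Cut an image into non-overlapping size x size tiles (assumed layout);
    any remainder at the right and bottom borders is discarded."""
    h, w = img.shape[:2]
    return [img[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]
```

With these assumptions, a 3099 × 2329 scene yields a 4 × 3 grid of training tiles.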

2. The second satellite dataset is from Jilin-1 video satellite imagery. In 2015, the Changchun Institute of Optics, Fine Mechanics, and Physics successfully launched the Jilin-1 video satellite, which had 1.12 m resolution. To cover the duration of the video sequences, we select one of every five frames from each video and crop the central part with the size of 480 × 204 as test samples.
We select several areas in different countries with certain typical surface coverage types, including vegetation, harbor, and a variety of buildings, as the test images.

Model Parameters and Experiment Setup
In our experiments, we use an NVIDIA GTX1080Ti GPU and an Intel I7-8700K CPU for training and testing, respectively. Our model is implemented in TensorFlow with Python 3 under Windows 10, with CUDA 8.0 and cuDNN 5.1. We mainly focus on the up-scaling factor of 4, which is usually the most challenging case in image SR.
The original HR images are downsized by bicubic interpolation to generate LR images for training. We augment the training patches by horizontal or vertical flipping and by rotating them by 90°. Following the settings presented in [54], we feed one batch consisting of 16 LR RGB patches with the size of 32 × 32 from the training datasets to our network each time. The learning rate is initialized to $10^{-3}$ for all layers and halved every $10^{4}$ steps until it reaches $10^{-5}$. In our model, each convolution layer contains 64 filters and is followed by PReLU. We empirically choose the distillation ratio α from {0.0, 0.125, 0.25, 0.5} and set the number of parallel convolution layers n in each multiple-path unit to 3. For the basic module DDRN, the depth of UDB is 15. In our experiments, training a basic module takes approximately 20 h under the previously presented experimental settings.
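The learning-rate schedule and the augmentation described above can be sketched in a few lines. This is an illustrative sketch rather than the training code: the function names are hypothetical, treating the lower learning-rate limit as a hard floor is an assumption, and so is applying each flip or rotation with probability 0.5.

```python
import numpy as np

def learning_rate(step, base_lr=1e-3, halve_every=10_000, min_lr=1e-5):
    """Step decay: halve every `halve_every` steps, never going below `min_lr`."""
    return max(base_lr * 0.5 ** (step // halve_every), min_lr)

def augment_pair(lr_patch, hr_patch, rng):
    """Apply the same random flips and 90-degree rotation to an LR/HR patch pair."""
    if rng.random() < 0.5:  # horizontal flip
        lr_patch, hr_patch = lr_patch[:, ::-1], hr_patch[:, ::-1]
    if rng.random() < 0.5:  # vertical flip
        lr_patch, hr_patch = lr_patch[::-1], hr_patch[::-1]
    if rng.random() < 0.5:  # rotate by 90 degrees in the image plane
        lr_patch, hr_patch = np.rot90(lr_patch), np.rot90(hr_patch)
    return lr_patch, hr_patch
```

The LR and HR patches must receive identical transforms so that each LR pixel still corresponds to the same HR region; for the ×4 case a 32 × 32 LR patch pairs with a 128 × 128 HR patch.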

Quantitative Indicators (QI)
Similar to many previous representative works [23,24,28,29], we select two commonly used evaluation metrics, i.e., PSNR and structural similarity (SSIM), to evaluate the model performance. These evaluation metrics differ in terms of visual perception, but both involve reference images for comparison. However, in real SR scenes, we have only LR images to be super-resolved, without the corresponding HR reference images. Therefore, we need to introduce quantitative non-reference image quality assessment methods. Quality with no reference (QNR) [55,56], generalized quality with no reference (GQNR) [57] and average gradient (AG) [58] are commonly used no-reference image quality evaluation algorithms, which can reasonably assess the clarity of a reconstructed image. Nevertheless, QNR and GQNR are designed for multispectral or hyperspectral images rather than ordinary RGB images, because they require calculating the spectral distortion index and the spatial distortion index. Thus, in this study, we use AG for objective evaluation without reference. This process can be expressed as follows:

$$dx_{(i,j)} = I_{(i+1,j)} - I_{(i,j)}, \tag{14}$$

where $dx$ and $dy$ refer to the horizontal and vertical gradients, respectively, and $I_{(i,j)}$ denotes the pixel value at the coordinate $(i, j)$.
The AG indicator can reasonably assess image clarity because it sensitively reflects content sharpness, detail contrast, and texture diversity. Generally, the larger the AG, the richer the details. Thus, the AG can be used to evaluate the reconstruction quality of satellite imagery in real-world scenes, such as Jilin-1 video satellite imagery.
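As a rough illustration of the indicators in this subsection, the sketch below computes PSNR against a reference and AG without one. The AG formula used here, the mean of $\sqrt{(dx^2 + dy^2)/2}$ over forward differences, is a commonly used definition and is an assumption; the exact normalization used in this study is not reproduced in the text. The function names are hypothetical, and a single-channel image is assumed.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    mse = np.mean((np.asarray(ref, dtype=np.float64)
                   - np.asarray(test, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

def average_gradient(img):
    """No-reference average gradient of a 2-D image (assumed common definition)."""
    img = np.asarray(img, dtype=np.float64)
    dx = img[1:, :-1] - img[:-1, :-1]   # I(i+1, j) - I(i, j)
    dy = img[:-1, 1:] - img[:-1, :-1]   # I(i, j+1) - I(i, j)
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))
```

A constant image has an AG of zero, and sharper edges or denser texture raise the value, which matches the interpretation that a larger AG indicates richer detail.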

Validation of the Ultra-Dense Residual Block
We examine the effectiveness of the proposed deep recursive CNN network DDRN and the multiple-path UDB. Given that SRCNN [23] and VDSR [24] are among the most representative and effective deep-learning-based SR methods, we retrain these two models by using the same training datasets and label them as SRCNN* and VDSR*. Figure 5 shows the comparison results over the training iterations of DDRN, SRCNN, and VDSR. Comparatively, our DDRN exhibits faster convergence and higher scores than the direct-connection-based SRCNN and the skip-connection-based VDSR. This superiority can be mainly attributed to the proposed multiple-path ultra-dense connections, which can readily capture local features. Thus, our framework significantly boosts the SR efficacy of remote sensing imagery.
In Figure 6, we show the evaluation results of the proposed DDRN method and the comparison algorithms on the Kaggle Open Source Dataset to further verify the usefulness of the ultra-dense connection strategy. The test set contains 30 different scenarios, which are labeled 1 to 30 in Figure 6. The figure shows that, by using ultra-dense connections, we obtain better reconstruction results than the conventional CNN-based methods, i.e., SRCNN [23] and VDSR [24]. For the average PSNR, our DDRN shows substantial improvements, surpassing VDSR by 0.92 dB and SRCNN by 1.94 dB. Similarly, SSIM is also considerably improved. In summary, the proposed residual block UDB effectively captures realistic detail textures. Although SRCNN and VDSR are effective, the well-designed deep recursive framework DDRN is more suitable for satellite image SR reconstruction.

Influence of Parameters α and m
On the basis of the basic module DDRN, we implement a distillation and compensation mechanism to compensate for the HR components lost during information propagation to infer and restore more realistic high-frequency details. The improved model with MSPU embedment is called DDRN+. In particular, a couple of comparison simulation experiments are conducted to analyze the influences of (i) the hyperparameter α in Equation (4) for partial feature map distillation and preservation, and (ii) the depth value m of UDB on the reconstruction performance.
We report the training process of the proposed DDRN+ with respect to different distillation ratios to verify the necessity of the proposed distillation and compensation mechanism. When α is set to 0, no components are distilled in the current stage, and MSPU therefore does not function. Figure 7 shows the comparison results of the training process under different distillation ratios. From the figure, we learn that the proposed DDRN+ exhibits better training performance than the basic module DDRN. In addition, we observe that, with an increase in the distillation ratio α, the module exhibits robust and fast convergence. This result can be attributed to the increasing amount of compensated high-frequency details that the MSPU provides at a higher distillation ratio. However, we also observe that the performance starts to decline when α is set to a large value, e.g., 0.5. This result can be mainly attributed to the large distillation rate, which may result in information redundancy. In addition, excessive parameters might lead to overfitting. All of these results indicate that the proposed distillation and compensation mechanism yields substantial improvements by compensating for high-frequency details. Therefore, embedding MSPU into the basic module for satellite image SR reconstruction is an effective and reliable choice.
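To make the role of the distillation ratio concrete, the sketch below shows one possible reading in which a fraction α of the output channels of a UDB is preserved for the MSPU. This is only an assumption made for illustration: Equation (4) is not reproduced in this section, and selecting the leading channels, as well as the helper name, are hypothetical choices rather than the mechanism used in DDRN+.

```python
import numpy as np

def distilled_channels(features, alpha):
    """Keep the first round(alpha * C) channels of an (H, W, C) feature map.

    Channel slicing is an assumed stand-in for the distillation step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    keep = int(round(alpha * features.shape[-1]))
    return features[..., :keep]
```

Under this reading, with 64 filters per layer, the ratios 0.0, 0.125, 0.25, and 0.5 would preserve 0, 8, 16, and 32 feature maps per UDB, and α = 0 leaves nothing for the MSPU to fuse.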
In light of the observations in previous works [26][27][28], fine features can be well inferred by a deep CNN framework. Thus, we gradually increase the depth of the network by simply increasing the number of UDBs (i.e., m is set to 10, 15, 20, 25, 30, and 35) and assess the performance for each value of m. Figure 8 shows the training details of the proposed DDRN+ method at different depths. As m increases to 30, the improvement gradually grows, and the model surpasses the basic module by approximately 0.22 dB at the scale of 4. By contrast, the performance declines when we continue to increase m to 35, and the network converges slowly. This result can be mainly attributed to overfitting and to the fact that convergence becomes more difficult at such a depth. In particular, DDRN denotes the improved module with the distillation ratio α of 0, which is actually the basic module. On the basis of these experiments, we set the distillation ratio α and the UDB depth m to 0.25 and 30, respectively, for satellite image SR reconstruction.

Comparison Results with the State-of-the-Art
We compare our basic model DDRN and the improved version DDRN+ (α = 0.25, m = 30) with other SISR algorithms, including Bicubic, SRCNN [23], VDSR [24], and LapSRN [25], at the scaling factors of ×2, ×3, and ×4. The implementations of these anchor methods have been released online and can thus be run on the same test datasets.
The reconstruction results obtained on the above-mentioned Kaggle Open Source Dataset for the proposed approaches and the comparison methods are shown in Figure 9. We select several different but representative scenarios (i.e., crossroads, factory, freeway, tennis court, and parking lot) for visual presentation. Experimentally, we crop a patch of 120 × 120 pixels from each reconstructed SR image of these representative scenarios and compute PSNR and SSIM. Notably, the proposed method DDRN and its improved version DDRN+ surpass these state-of-the-art methods by a large margin. Moreover, the proposed modules produce the most accurate and realistic image details visually. Most of the comparison methods produce noticeable artifacts and blurred edges, whereas the proposed DDRN+ recovers sharper and clearer edges that are more faithful to the ground truth because of successful feature extraction and fusion. For example, as shown in Figure 9, only our proposed modules restore the clear court boundary in the tennis court scenario and the accurate and credible car outlines in the four other scenarios. Therefore, all of the proposed models exhibit solid performance improvements compared with the conventional direct- or skip-connection-based algorithms [23][24][25].
Objectively, Tables 1, 2 and 3 tabulate the detailed evaluation results in terms of PSNR, SSIM and AG at the magnification scales of ×2, ×3, and ×4, respectively. From these records, we learn that raw CNN-based or skip-connection methods, such as SRCNN [23] and VDSR [24], exhibit lower scores than the DDRN-based methods (e.g., in terms of PSNR, the proposed DDRN+ surpasses the retrained SRCNN and VDSR by approximately 2.16 and 1.14 dB, respectively, at the scale of 4 on the first test dataset). Among these comparison methods, the basic module DDRN shows the best performance because of its effective ultra-dense-connection-based framework for local spatial information extraction. In addition, through the compensated high-frequency details obtained from the MSPU, the improved version DDRN+ can produce fine detail textures. With regard to PSNR and SSIM, Figure 6 gives a more intuitive view that the proposed modules outperform these state-of-the-art methods [23][24][25] by a large margin. For the AG metric, the proposed DDRN and DDRN+ are also better than previous works on average. In particular, in the comparison results shown in the three tables, our methods exhibit remarkable advantages when the upsampling factor is large, as reported at the bottom of the three tables. These results indicate the advantages of the proposed ultra-dense connection manner in modeling the relationship between LR and HR images with larger magnification factors.
Another group of comparison experiments is conducted with the Jilin-1 satellite imagery to further illustrate the effectiveness and applicability of the proposed ultra-dense strategy and the distillation and compensation mechanism. Compared with the first dataset, the Kaggle Open Source Dataset, the test images obtained from Jilin-1 show lower quality (small ground objects and weak textures) but more realistic satellite imagery characteristics. Unlike the images in the training dataset, the test images have completely different imaging conditions, including ultra-high imaging distance, atmospheric scattering, relative motion between the satellite and moving ground targets, and compression distortion. These severe imaging conditions place substantial demands on SR networks.
With an operation similar to the previously presented preprocessing of the testing images, we crop the test images to the size of 480 × 204. The reconstruction results obtained from our proposed approaches and the comparison methods are shown in Figure 10. For the first and second images, most of the comparison methods produce noticeable artifacts and blurred edges. By contrast, the proposed DDRN and DDRN+ can recover sharp and clear edges because of fine feature expression that is faithful to the ground truth. At the bottom of the figure, only our proposed modules can reconstruct a clear outline of the warships and dock, whereas the other conventional methods fail to restore the realistic details. These results further indicate the effectiveness of the proposed method.
Furthermore, we perform a set of realistic SR reconstruction experiments for the unknown real degradation process (i.e., using the observed LR images instead of the downscaled LR images as input). These test images are randomly selected from Jilin-1 satellite imagery and undergo the same preprocessing to obtain test images with the size of 480 × 204. Then, the processed images are fed directly to the network as the LR input to obtain the reconstructed HR images. The comparison results with other state-of-the-art algorithms are shown in Figure 11 (we show only one example due to space constraints). Evidently, most of the compared methods [23,24] produce noticeable artifacts and blurred building outlines, whereas the proposed DDRN and DDRN+ yield better results with fewer jagged lines and ringing artifacts. Instead of the commonly used evaluation metrics PSNR and SSIM (because the original HR images are unavailable), we use the AG to measure the sharpness of the SR results. As shown in Figure 11, the proposed modules DDRN and DDRN+ achieve the second highest and highest AG scores, respectively. The results for real video satellite imagery indicate that our model is more robust than the comparison methods in super-resolving images with an unknown degradation process.
In brief, the SR reconstruction experiments on different test datasets and magnification scales show the advantages of feature expression and indicate the robustness of our modules against images of unknown degradation models.

Conclusions
In this study, we propose a simple but very effective technique for remote sensing image SR reconstruction. In particular, we present a multiple-path UDB for local feature extraction and fusion. Unlike in the conventional methods, rich dense connections between layers and units promote information interaction and improve reutilization. In addition, we further promote feature expression by advocating a distillation and compensation mechanism. The feature maps distilled from different stages with a special distillation ratio α are aggregated to compensate for the high-frequency details lost during information propagation in MSPU. Extensive experiments on the test datasets indicate that the proposed DDRN and its improved version DDRN+ outperform existing state-of-the-art feature extraction techniques, including conventional direct- and skip-connection-based methods. In particular, when the image degradation model is unknown, the proposed algorithm can still obtain competitive reconstruction results compared with the comparison algorithms.

Figure 1. Frameworks of the CNN-based modules. (a) Flat-net (e.g., SRCNN [23] and FSRCNN [43]): Direct connections are commonly used to learn the features. (b) Skip-net (e.g., VDSR [24]): An identity map connecting the input to the output is pioneered for SISR. (c) Dense-net (e.g., DenseNet [26] and SRDenseNet [27]): The feature maps are directly passed from the preceding layers to the current layers through the identity function with much richer connections. (d) UDB: Interacting multiple-path units are embedded for extracting local feature maps with richer ultra-dense connections. "C" and "+" denote the concatenation and addition operations, respectively.

Figure 2. Outline of the proposed deep distillation recursive network (DDRN). The red distillation symbol following the UDB represents the distillation operation with a special distillation ratio of α.

Figure 3. The distillation and compensation mechanism. The red components indicate that the distilled feature maps B_i × α in the current UDB are adaptively preserved. α denotes the distillation ratio for the current UDB output B_i. MSPU refers to the further purification operation.

Figure 4. The multi-scale purification unit (MSPU). The distillation components preserved from the different stages are fused to obtain the compensation information lost during the information delivery. X denotes the matrix multiplication.

Figure 5. Training process for different models at the scale of 4. At the top, the blue line denotes the convergence process of the basic module DDRN with a depth of 15, while the green and red lines at the bottom refer to VDSR and SRCNN. The competing algorithms marked by * denote the versions retrained with our dataset.

Figure 6. The SR performance comparisons for 30 different scenarios (denoted by label) from the Kaggle Open Source Dataset. The competing algorithms marked by * denote the versions retrained with our dataset.

Figure 7. Training process for different distillation ratios at the scale of 4. DDRN+ represents the improved module with MSPU embedded at different ratios on the basis of the basic module. In particular, DDRN denotes the improved module with the distillation ratio α of 0, which is actually the basic module.

Figure 8. Training process for different depths of DDRN+ at the scale of 4 with the distillation ratio α of 0.25. We set the UDB number m to 10, 15, 20, 25, 30 and 35 while keeping the other parameters consistent.

Figure 9. The reconstruction results on the Kaggle Open Source Dataset at the scale of 4. We select several different but representative scenarios, i.e., crossroads, factory, freeway, tennis court and parking lot, and then crop them into small image patches of 120 × 120 pixels for demonstration. Red and blue indicate the best and the second best performance, respectively.

Figure 10. The reconstruction results on the Jilin-1 dataset at the scale of 4. We select several different but representative scenarios, i.e., aircraft carrier, city suburb, and military harbour, to make comparisons. Red and blue indicate the best and the second best performance, respectively.

Figure 11. An example of the reconstruction results on Jilin-1 imagery at the scale of 4. The experiment is performed with real low-resolution satellite images rather than simulated degradation. Red and blue indicate the best and the second best performance in terms of AG, respectively. Note that the enlarged details are shown in the boxes on the bottom left and bottom right of each image.

Table 1. Quantitative evaluation of the proposed DDRN approach and its improved version DDRN+ against some state-of-the-art SISR algorithms on the Kaggle Open Source Dataset with 30 different scenarios for the scale factor of ×2. Bold indicates the best performance. In particular, * refers to the modules retrained by us with the Kaggle Open Source Dataset.

Table 2. Quantitative evaluation of the proposed DDRN approach and its improved version DDRN+ against some state-of-the-art SISR algorithms on the Kaggle Open Source Dataset with 30 different scenarios for the scale factor of ×3. Bold indicates the best performance. In particular, * refers to the modules retrained by us with the Kaggle Open Source Dataset.

Table 3. Comparison results of the proposed DDRN approach and its improved version DDRN+ with some state-of-the-art algorithms on the Kaggle Open Source Dataset for the scale factor of ×4. Bold indicates the best performance. In particular, * refers to the modules retrained by us with the Kaggle Open Source Dataset.