A Comparable Study of CNN-Based Single Image Super-Resolution for Space-Based Imaging Sensors

In space-based space surveillance (SBSS), images of target space objects captured by space-based imaging sensors usually suffer from low spatial resolution because of the extremely long distance between the target and the imaging sensor. Image super-resolution is an effective data processing operation for obtaining informative high-resolution images. In this paper, we comparatively study four recent, popular models for single image super-resolution based on convolutional neural networks (CNNs) with space applications in mind. We fine-tune super-resolution models designed for natural images using simulated images of space objects, and test the performance of the different CNN-based models under conditions that are typical for SBSS. The experimental results show the advantages and drawbacks of these models, which can help in choosing a suitable CNN-based super-resolution method for image data of space objects.


Introduction
The Space-Based Space Surveillance (SBSS) satellite [1], launched in September 2010, is a significant stepping stone towards a functional space-based space surveillance constellation. In February 2013, the Near-Earth Object Surveillance Satellite (NEOSSat) [2] was launched; it is the first space telescope dedicated to detecting and tracking asteroids and satellites. The optical imaging sensors of the vision systems aboard these satellites are our eyes for monitoring space. Previous research has translated the information provided by space-based imaging sensors into many practical applications, such as autonomous rendezvous and docking [3][4][5], vision-based landing [6], position and pose estimation [7][8][9][10], space robotics and on-orbit servicing [11][12][13][14], satellite recognition [15][16][17][18], and 3D structure reconstruction [19,20]. These works have shown that high-resolution images play an important role in such applications, because they contain the richer information needed to achieve better task performance. However, images of target space objects captured by space-based imaging sensors commonly suffer from low spatial resolution because of the extremely long distance between the target and the imaging sensor. This problem can typically be addressed by image super-resolution.
The goal of image super-resolution (SR) is to restore a visually pleasing high-resolution (HR) image from a low-resolution (LR) input image or video sequence. HR images have higher pixel densities and finer details than LR images. Image SR has proved to be of great significance in many applications, such as video surveillance [21][22][23], ultra-high definition TV [24], low-resolution face recognition [25][26][27][28][29] and remote sensing imaging [30,31]. Because of its broad application prospects, SR has attracted huge interest and is currently one of the most active research topics in image processing and computer vision. Early interpolation-based image SR methods [32][33][34] are extremely simple and fast. Unfortunately, severe aliasing and blurring make interpolation-based SR suboptimal for restoring fine texture details. Reconstruction-based image SR methods [35][36][37] combine elaborately designed image prior models with reconstruction constraints and can restore fine structures. However, these image priors are usually incapable of modeling the complex and varying contexts of natural images. In the past decade, most research has focused on learning-based image SR [38][39][40], which uses machine learning techniques to capture the relationships between LR image patches and their HR counterparts from training samples. Recently, owing to fast advances in deep learning, especially convolutional neural networks (CNNs), CNN-based SR [41][42][43][44][45] has shown promising performance in certain applications. However, many challenging topics in deep learning for image SR remain open, e.g., new objective functions, new architectures, large scale images, depth images, various types of corruption, and new applications.
Therefore, this paper focuses on the role of CNNs in single image SR for space applications. We comparatively study four recent, popular CNN-based models for single image super-resolution: SRCNN [41] (Super-Resolution Convolutional Neural Network), FSRCNN [42] (Fast Super-Resolution Convolutional Neural Network), VDSR [44] (Very Deep Super-resolution Convolutional Networks), and DRCN [43] (Deeply-Recursive Convolutional Networks). In view of the differences between natural images and images of space objects, we fine-tune these super-resolution models using simulated images of space objects, and test the performance of the different CNN-based models under typical conditions that are common for SBSS. Our experimental results show the advantages and disadvantages of these models and can therefore help in choosing a suitable CNN-based super-resolution method for image data from space-based sensors.
The rest of this paper is organized as follows. Section 2 describes the four CNN-based SR methods briefly and shows parameters used in this paper in detail to benefit researchers in this field. Section 3 demonstrates extensive experiments we have done to compare these four models comprehensively. Section 4 gives discussions about the experimental results. Section 5 concludes this paper.

SRCNN
SRCNN [41] (Super-Resolution Convolutional Neural Network) is the first deep learning method for single image super-resolution; it directly learns an end-to-end mapping between low- and high-resolution images. The network structure is simple, as shown in Figure 1. It contains only three layers, each consisting of a convolution layer with an activation function. The input of the network is the bicubic interpolation of a low-resolution image, with the same size as the output HR image. The first layer mainly extracts patches and representations from the low-resolution image. The second layer maps the $n_1$-dimensional representations (feature vectors) of several patches into $n_2$-dimensional ones, performing a non-linear mapping. The number of patches used in each mapping operation depends on the kernel size of the second convolution layer. The last layer then reconstructs the high-resolution image. The parameters of SRCNN used in this paper are shown in Table 1. They are chosen to achieve the best performance of SRCNN: because of gradient vanishing, increasing the number of network layers cannot improve its performance.

FSRCNN
FSRCNN [42] (Fast Super-Resolution Convolutional Neural Network) is an upgraded version of SRCNN that focuses on accelerating high-resolution reconstruction. The structure of FSRCNN is slightly more complicated and can be roughly divided into five parts, i.e., feature extraction, shrinking, mapping, expanding and deconvolution, as seen in Figure 2. The deconvolution layer is an important improvement: it makes it possible to learn the mapping directly from the original low-resolution image to the high-resolution one, without the initial interpolation step used by SRCNN. In this way, the input image does not need to be enlarged, which reduces the computation and improves the speed. The non-linear mapping of SRCNN operates in a high-dimensional space, which is complex and time-consuming. FSRCNN solves this problem by adding a shrinking layer before the mapping operation to reduce the feature dimension. In addition, an expanding layer is added after the mapping layer to better generate the HR image. FSRCNN is much faster than SRCNN, and its performance is better as well. Table 2 shows the parameters of FSRCNN used in this paper in detail. These parameters follow the original work.

VDSR
VDSR [44] (Very Deep Super-resolution Convolutional Networks), shown in Figure 3, uses 20 layers with small filters to obtain a larger receptive field. Convergence speed is greatly affected by network depth. To get better performance and faster training at the same time, learning residuals is a good choice, based on the fact that LR images and HR images share the same information to a large extent. The network learns the residual between the HR and LR images using an extremely high learning rate, and this residual is added to the LR image to generate the final HR image. Note that the input images are formed by bicubic interpolation, and all feature maps are kept at the same size by zero padding, so that image edges are predicted better. The parameters of VDSR are shown in Table 3.
According to the experimental results, we find that 12 filters in each convolution layer are enough to reconstruct space object images. Therefore, to train the model and reconstruct the images faster, we reduce the number of filters in each convolution layer from 64 in [44] to 12 in this paper.
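The effect of depth and filter count described for VDSR can be checked with a little arithmetic. The sketch below is only illustrative: it assumes 3 × 3 filters and stride-1 convolutions (the filter size is not stated in this section), and the function names are hypothetical. It computes the receptive field of a stack of layers and the number of weights in one intermediate layer for 64 and for 12 filters.

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Side length of the receptive field of stacked stride-1 convolutions."""
    # Each stride-1 layer with a k x k kernel widens the field by (k - 1) pixels.
    return 1 + num_layers * (kernel_size - 1)


def middle_layer_weights(num_filters: int, kernel_size: int = 3) -> int:
    """Weights (biases ignored) of a layer with num_filters inputs and outputs."""
    return kernel_size * kernel_size * num_filters * num_filters


if __name__ == "__main__":
    side = receptive_field(20)
    print(f"20 layers of 3x3 filters -> {side}x{side} receptive field")
    for filters in (64, 12):
        print(f"{filters} filters -> {middle_layer_weights(filters)} weights per layer")
```

Under these assumptions, reducing the filters from 64 to 12 shrinks each intermediate layer by a factor of about 28, which is consistent with the faster training and reconstruction reported above.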

DRCN
DRCN [43] (Deeply-Recursive Convolutional Networks) introduces a very deep recursive layer into SR reconstruction. Performance may improve as the depth of the recursive layer increases, but the number of parameters does not grow much, because all recursions share the same parameters, unlike stacked convolution layers. This is the main benefit of introducing recursive layers. The reconstruction result is obtained as a weighted average of the results of each recursive convolution layer, as shown in Figure 4. Bicubic interpolation is also a necessary step before training. The parameters of DRCN used in this paper are shown in Table 4. It should be noted that we reduced the number of recursive layers from 16 in [43] to 5 to accelerate training, because in our experiments the reconstructed results for space object images remain almost unchanged when the number of recursive layers increases beyond 5.
Figure 4. Network structure of DRCN used in this paper.
Table 4. Parameters of DRCN used in this paper.

Input: Bicubic interpolation of LR images
Number of layers: 9
Residual unit: No
Parameters of 1st layer:
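The two ideas behind DRCN, one shared filter applied recursively and a weighted average over the intermediate results, can be illustrated with a toy one-dimensional example. This is a minimal sketch, not the network used in the paper: the 1-D signal, the single shared kernel, the ReLU activation and the name `drcn_like_forward` are all assumptions made for illustration.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Stride-1 1-D correlation with zero padding; output has the input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out


def drcn_like_forward(x: Sequence[float], shared_kernel: Sequence[float],
                      recursions: int, weights: Sequence[float]) -> List[float]:
    """Apply one shared kernel `recursions` times and average the results."""
    if len(weights) != recursions:
        raise ValueError("one weight per recursion is required")
    total = float(sum(weights))
    state = list(x)
    prediction = [0.0] * len(x)
    for d in range(recursions):
        # The same kernel is reused at every recursion, so depth adds no parameters.
        state = [max(0.0, v) for v in conv1d_same(state, shared_kernel)]
        # Each recursion contributes one candidate, weighted in the final output.
        prediction = [p + (weights[d] / total) * s for p, s in zip(prediction, state)]
    return prediction


if __name__ == "__main__":
    print(drcn_like_forward([4.0, 2.0, 0.0], [0.0, 0.5, 0.0], 2, [1.0, 1.0]))
```

In this toy setting, going from 5 to 16 recursions changes only the loop count and the length of `weights`; the kernel itself stays the same size.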

Dataset
Our experiments use the space object dataset BUAA-SID 1.0 [15,17] to explore the ability of the above four CNN-based SR methods on space object images. BUAA-SID 1.0 contains 20 categories of space objects, and each category has 230 images of size 240 × 320, giving 4600 images in total. The images in each category are captured from different viewpoints.
We first divide all images in BUAA-SID 1.0, in order, into 460 parts of ten images each. In each part, nine images are selected randomly as training samples and the remaining one is used for testing or validation. For the validation set, we randomly choose one image for every space object category, i.e., 20 images in total. Thus the testing set contains 440 images. Since the images in BUAA-SID 1.0 have no background, we extract the region of interest (ROI), namely the bounding rectangle of the space object. To account for the possible impact of noise, we take the bounding rectangle of all pixels whose gray value is above ten, and increase the length and width of the rectangle by 30 pixels without exceeding the image boundary. Since the four CNN-based SR models in Section 2 place no restrictions on the size of the input image, the image sizes in our dataset can vary. Therefore, to get more training data, every image in the training set is rescaled by factors of 1, 0.95 and 0.9, generating 12,420 images in total (4140 × 3). Furthermore, four patches are randomly extracted from every image as training HR patches, and these are downsampled by factors of 2, 3 and 4 to obtain the corresponding LR patches. The numbers of image pairs in the training set, validation set and testing set are therefore 12,420, 20 and 440, respectively. It should be noted that the length and width of the bounding rectangles of the extracted ROIs in the testing set are 10 pixels larger than those in the training set.
In addition, for a more comprehensive comparison, the 91 images proposed by Yang et al. [46], which we name T91, are used as another independent training set, and two standard benchmark datasets, i.e., Set5 [47] and Set14 [48], are chosen as the corresponding testing sets. We train the four popular CNN-based SR networks using T91 and BUAA-SID 1.0, respectively, and test the networks trained on T91 on all three testing sets. In this way, we can not only compare our experimental results with the original papers to check their validity, but also explore how well these networks transfer between different datasets.
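The split sizes and the ROI extraction described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' code: the image is a nested list of gray values, the helper name `roi_bounds` is hypothetical, and the 30-pixel enlargement is assumed to be split evenly as 15 pixels on each side.

```python
from typing import List, Tuple

# Dataset bookkeeping taken from the figures quoted in the text.
TOTAL_IMAGES = 20 * 230            # 20 categories x 230 images
PARTS = TOTAL_IMAGES // 10         # parts of ten consecutive images
TRAIN_IMAGES = PARTS * 9           # nine training images per part
HELD_OUT = PARTS * 1               # one image per part for validation/testing
VALIDATION_IMAGES = 20             # one per category
TEST_IMAGES = HELD_OUT - VALIDATION_IMAGES
AUGMENTED_TRAIN = TRAIN_IMAGES * 3  # rescaling factors 1, 0.95 and 0.9


def roi_bounds(image: List[List[int]], threshold: int = 10,
               margin: int = 15) -> Tuple[int, int, int, int]:
    """Return inclusive (top, bottom, left, right) bounds of the object ROI."""
    height, width = len(image), len(image[0])
    rows = [r for r in range(height) if any(v > threshold for v in image[r])]
    cols = [c for c in range(width) if any(row[c] > threshold for row in image)]
    if not rows:
        raise ValueError("no pixel above the threshold")
    # Enlarge the bounding rectangle, clipped to the image boundary.
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, height - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, width - 1)
    return top, bottom, left, right


if __name__ == "__main__":
    print(TRAIN_IMAGES, VALIDATION_IMAGES, TEST_IMAGES, AUGMENTED_TRAIN)
```

The threshold keeps isolated dark noise pixels from inflating the rectangle, and the margin leaves some empty context around the object.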

Index for Evaluation
We use peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [49] as the metrics to evaluate the performance of each experiment.
Peak signal-to-noise ratio is widely used in image quality assessment. It is defined by the maximum possible pixel value (denoted as $L$) and the mean squared error (MSE) between images. Given the ground truth $X$ with a total of $N$ pixels and its corresponding reconstructed image $X_{SR}$, the MSE and the PSNR can be calculated by the following equations:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(X(i) - X_{SR}(i)\right)^2$$
$$\mathrm{PSNR} = 10\log_{10}\left(\frac{L^2}{\mathrm{MSE}}\right)$$
The structural similarity index (SSIM) [49] focuses on measuring the structural similarity between images. It incorporates three relatively independent elements: luminance, contrast and structure. The definition of SSIM is as follows:
$$\mathrm{SSIM}(X, X_{SR}) = \frac{\left(2\mu_X\mu_{X_{SR}} + C_1\right)\left(2\sigma_{XX_{SR}} + C_2\right)}{\left(\mu_X^2 + \mu_{X_{SR}}^2 + C_1\right)\left(\sigma_X^2 + \sigma_{X_{SR}}^2 + C_2\right)}$$
where $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$ are constants to avoid instability. The mean and the standard deviation of the ground truth $X$ are denoted as $\mu_X$ and $\sigma_X$, respectively, and the mean and the standard deviation of the reconstructed image $X_{SR}$ are denoted as $\mu_{X_{SR}}$ and $\sigma_{X_{SR}}$. $\sigma_{XX_{SR}}$ is the covariance between $X$ and $X_{SR}$.
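The two metrics described above can be sketched directly from their definitions. The code below is an illustrative, dependency-free version that treats an image as a flat list of pixel values and computes SSIM from global statistics of the whole image. The values $L = 255$, $k_1 = 0.01$ and $k_2 = 0.03$ are assumptions (common defaults, not stated in this section), and practical SSIM implementations usually average the index over local windows instead.

```python
import math
from typing import Sequence


def mse(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean squared error between two equally sized images (flat pixel lists)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def psnr(x: Sequence[float], y: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(x, y)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


def ssim_global(x: Sequence[float], y: Sequence[float], peak: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """SSIM computed once over the whole image (no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    truth = [0.0, 255.0, 0.0, 255.0]
    restored = [10.0, 245.0, 10.0, 245.0]
    print(f"PSNR = {psnr(truth, restored):.2f} dB, SSIM = {ssim_global(truth, restored):.4f}")
```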

Training with Natural Images in Fixed Scale
We first train SRCNN, FSRCNN, VDSR and DRCN on the T91 dataset, obtaining three models for each network with the scale fixed at 2, 3 and 4, respectively. Scale 2 means that the spatial resolution of the reconstructed image is twice that of the input image. The larger the scale factor is, the worse the reconstructed image is, because the input image contains relatively less information.
The experimental results are shown in Table 5. The best results are marked in red and the second best in blue. The results show that VDSR and DRCN perform better on natural images, while FSRCNN has the fastest reconstruction speed apart from the baseline bicubic method. When the testing data and the training data differ to a large extent, DRCN and VDSR also adapt well, since they still rank first and second, respectively. SRCNN, by contrast, does not work as well. FSRCNN works better than SRCNN, but worse than DRCN and VDSR. Figure 5 visualizes sample reconstruction results on the three testing sets.
To meet different requirements, we often need to train multiple networks, one per reconstruction scale, because a network trained at one fixed scale is only suited to reconstruction at that scale, i.e., fixed scale super-resolution. When the testing scale differs from the training scale, the reconstruction result is worse. In addition, training several networks multiplies the number of parameters and the training time. This problem cannot be ignored in practical applications.
Super-resolution results of "cobe" (BUAA-SID 1.0) with scale factor ×2 (T91 training set).

Training with Natural Images in Multiple Scales
In response to the problem mentioned in Section 3.3, we use a hybrid training strategy. That is, we train a single model that is universal across reconstruction scales by randomly selecting HR/LR image patches of all scales as input data. In this way, the number of parameters to be trained is greatly reduced. Images can be reconstructed at any scale using one set of model parameters, i.e., multiple scale super-resolution.
Because of the deconvolution layer in FSRCNN, the network structure differs when the training HR/LR patches have different scales, so FSRCNN cannot be trained with this strategy to reconstruct images at different scales. The multiple scale super-resolution results of the other three networks trained on the T91 dataset are shown in Table 6. PSNR- and SSIM- denote the differences between the multiple scale results and the fixed scale super-resolution results. The experimental results show that it is feasible to reconstruct images at any scale using this training strategy. VDSR and DRCN perform relatively well. Compared with the fixed scale super-resolution results in Table 5, the multiple scale results are not much different. Mixing HR/LR patches of different scales in the training set overcomes the shortcoming that each new SR scale requires a new model, and may greatly improve the efficiency of reconstruction.
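The hybrid strategy amounts to pooling the HR/LR pairs of all scales into one shuffled training set. The following sketch shows only that pooling step under simplifying assumptions: patches are nested lists, block averaging stands in for the bicubic resampling normally used, the function names are hypothetical, and the interpolation of each LR patch back to the HR size (needed by SRCNN, VDSR and DRCN) is omitted.

```python
import random
from typing import List, Sequence, Tuple

Patch = List[List[float]]


def downsample_mean(patch: Patch, scale: int) -> Patch:
    """Block-average downsampling; a simple stand-in for bicubic resampling."""
    rows, cols = len(patch) // scale, len(patch[0]) // scale
    return [
        [
            sum(patch[r * scale + i][c * scale + j]
                for i in range(scale) for j in range(scale)) / scale ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def mixed_scale_pairs(hr_patches: Sequence[Patch],
                      scales: Sequence[int] = (2, 3, 4),
                      seed: int = 0) -> List[Tuple[int, Patch, Patch]]:
    """Build (scale, LR, HR) pairs for every scale and shuffle them together."""
    pairs = [(s, downsample_mean(hr, s), hr) for hr in hr_patches for s in scales]
    # Shuffling mixes the scales, so one model sees all of them during training.
    random.Random(seed).shuffle(pairs)
    return pairs


if __name__ == "__main__":
    hr = [[float(r + c) for c in range(12)] for r in range(12)]
    for scale, lr, _ in mixed_scale_pairs([hr]):
        print(f"x{scale}: LR patch of {len(lr)}x{len(lr[0])}")
```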

Comparison of Fixed Scale and Multiple Scale
For further comparison and analysis on space objects, we perform more comprehensive experiments using the BUAA-SID 1.0 dataset. For each network, we design experiments to explore how models trained at a fixed scale or at mixed scales perform when tested at a given scale. That is, we test every trained model at all three scales, not just at the scale it was trained for. The experimental results of SRCNN, VDSR and DRCN are shown in Tables 7-9, respectively. Figure 6 shows the results of the comparison. In addition, to make the results statistically more reliable, we train three models for every experiment and report the means and standard deviations of PSNR and SSIM.
Table 7. Cross-scale experiments of SRCNN trained and tested on BUAA-SID 1.0 (mean ± standard deviation). The red font indicates the best performance, while the blue font indicates the second best.
Figure 6. Scale factor experiment for "glonas" in BUAA-SID 1.0. The method sm − sn means the method is trained for ×m SR and tested for ×n SR.
Table 9. Cross-scale experiments of DRCN trained and tested on BUAA-SID 1.0 (mean ± standard deviation). The red font indicates the best performance, while the blue font indicates the second best.
Analyzing the results of the above three methods trained with multiple scale and single scale image pairs leads to a consistent conclusion. Training the network with multiple scale image pairs makes it possible to reconstruct HR images at any scale, and the model performs well: it is only slightly worse than a model whose training and testing scales match exactly. In contrast, models trained at a fixed scale do not fit well when the testing scale does not match the training scale.
Moreover, the bigger the gap between the two scales is, the worse the result. In practical applications, it is often necessary to reconstruct space object images at arbitrary scales rather than at one fixed scale. Therefore, a single model that can reconstruct HR images at any scale is the better choice. As for the performance of each individual network, Table 10 shows that DRCN is the best, VDSR is the second and SRCNN is the worst.
Table 10. Multiple scale super-resolution results of networks trained and tested on BUAA-SID 1.0 (mean ± standard deviation). The red font indicates the best performance, while the blue font indicates the second best.
Figure 6 shows the scale factor experiment for "glonas" in BUAA-SID 1.0, which illustrates the results and conclusions above more clearly. The method sm − sn means the method is trained for scale ×m SR and tested for scale ×n SR. We can observe that if the training scales do not include the testing scale, the reconstructed image has poor quality. Specifically, if the testing scale is bigger than the training scale, i.e., $s_{test} > s_{train}$, the SR results are blurry and the high-frequency textures are significantly lost. In contrast, if $s_{test} < s_{train}$, the SR results show unnatural artifacts caused by over-enhanced high-frequency edges. In addition, if the network is trained at multiple scales, the reconstructed images have satisfactory quality at every scale.
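The cross-scale protocol (every trained model tested at every scale, repeated three times and reported as mean ± standard deviation) can be sketched as follows. The PSNR values in the example are made up for illustration only, the helper names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def cross_scale_labels(scales: Sequence[int] = (2, 3, 4)) -> List[str]:
    """Names of all train/test scale combinations in the sm-sn convention."""
    return [f"s{m}-s{n}" for m in scales for n in scales]


def summarize_runs(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated training runs."""
    return mean(values), stdev(values)


if __name__ == "__main__":
    print(cross_scale_labels())
    # Hypothetical PSNR values (dB) from three repeated runs of one setting.
    avg, spread = summarize_runs([30.0, 31.0, 32.0])
    print(f"{avg:.2f} ± {spread:.2f}")
```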

Comparison of Direct Training and Transfer Training
How the networks are trained is also an important factor that may affect the final results. Direct training and transfer training are two common choices. Direct training means training a randomly initialized network directly on the space object training set, while transfer training in our experiments means first pre-training the network parameters on the T91 training set and then fine-tuning the pre-trained network on space object data. We compare the effect of these two training methods on the task of reconstructing HR images of space objects.
The final results of the four networks are shown in Table 11. The difference between direct training and transfer training is small, with transfer training slightly better. That is, transfer training does not obviously improve the reconstruction quality of the networks on the space object dataset. However, the training process in Figure 7 shows that transfer training converges faster. The results indicate that transfer training helps accelerate network convergence, and that the features learned from natural images (T91 training set) are helpful for super-resolution of space object images. Figure 8 shows the reconstruction results of the different training methods. Notice that with either direct training or transfer training, the testing results are better than those of models trained on the natural image dataset. This illustrates that it is necessary and effective to train on images that are the same as or similar to the category of images to be reconstructed.

Computational Complexity
The computational complexity of the methods is also an important factor in their time efficiency and memory cost. To analyze the computational complexity theoretically, we compare the number of multiplications and the number of parameters of the four CNN-based networks. The results in Table 12 show that FSRCNN has the fewest theoretical calculations and parameters, so it should run faster and use less memory. Table 5 also confirms that FSRCNN needs the least running time when reconstructing images. The only inconsistency between Tables 5 and 12 is VDSR, which runs slowest although its theoretical computational complexity is the second best. This may be caused by GPU acceleration in the implementation of the CNN-based networks. Since all of the networks use the GPU for acceleration, the actual reconstruction time is not strictly linearly correlated with the theoretical calculations. In Table 5, the running speeds of SRCNN, VDSR and DRCN are not significantly different. This motivates implementing CNN-based SR networks on a programming platform with better hardware acceleration for CNNs.
In practice, the space object images to be reconstructed may have different levels of noise, and noise affects the reconstruction quality. It is therefore necessary to test the noise robustness of the four CNN-based networks experimentally. Gaussian noise with a standard deviation (std) of 1-10, as well as salt and pepper noise and Poisson noise, is added to the LR images of the testing set.
The super-resolution reconstruction results are compared with the noise-free HR images to obtain the PSNR/SSIM between them. Table 13 shows the detailed results, and Figure 9 makes them easier to compare and analyze. Table 13 and Figure 9 show that the reconstruction quality of all four networks degrades as the noise increases, and that SRCNN is less affected by noise than the other three networks. In our experiments, the SR networks are trained on a noise-free training set, so the trained networks may not have learned a suitable strategy for processing images with various kinds of noise. In general, SRCNN has better noise robustness than the other three networks. The reason may be that SRCNN has the simplest structure, so the model is less affected by noise. This indicates that SR reconstruction algorithms based on deep neural networks may not be robust to noise when trained on noise-free data, and that noise has a great impact on their performance. Noise robustness may be a new branch of CNN-based SR reconstruction that needs to be studied and improved.
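The three noise types used in the robustness test can be sketched with the standard library alone. This is an illustrative version, not the authors' procedure: pixels are a flat list of 8-bit gray values, the salt and pepper `density` parameter is hypothetical (the text gives no value), and the Poisson variant treats each pixel value as the expected count, which is one common convention among several.

```python
import math
import random
from typing import List, Sequence

PEAK = 255.0


def _clip(value: float) -> float:
    """Keep a pixel inside the valid 8-bit range."""
    return min(PEAK, max(0.0, value))


def add_gaussian_noise(pixels: Sequence[float], std: float,
                       rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return [_clip(p + rng.gauss(0.0, std)) for p in pixels]


def add_salt_and_pepper(pixels: Sequence[float], density: float,
                        rng: random.Random) -> List[float]:
    """Set a fraction `density` of pixels to black or white with equal odds."""
    noisy = []
    for p in pixels:
        u = rng.random()
        if u < density / 2:
            noisy.append(0.0)
        elif u < density:
            noisy.append(PEAK)
        else:
            noisy.append(p)
    return noisy


def _poisson_sample(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for means up to 255."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def add_poisson_noise(pixels: Sequence[float], rng: random.Random) -> List[float]:
    """Replace each pixel by a Poisson draw whose mean is the pixel value."""
    return [_clip(float(_poisson_sample(p, rng))) for p in pixels]


if __name__ == "__main__":
    rng = random.Random(0)
    lr_pixels = [100.0] * 8
    for std in (1, 5, 10):
        print(std, [round(v, 1) for v in add_gaussian_noise(lr_pixels, std, rng)])
```

Because the noise is added to the LR input while the metrics are computed against the noise-free HR image, any noise amplified by the network lowers PSNR and SSIM directly.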

Discussion
The analysis of the advantages and disadvantages of these four deep learning models can help choose the most suitable model for single image super-resolution of space objects.
When there are not enough space object images to train a deep learning model, a model trained on natural images can be used instead, as shown in Section 3.3. Table 5 shows that FSRCNN reconstructs HR images faster than the other three models. In terms of reconstruction quality, DRCN and VDSR rank first and second, respectively. SRCNN does not work as well. FSRCNN works better than SRCNN, but worse than DRCN and VDSR. If a single model trained on natural images is to be used for reconstruction at multiple scales, DRCN is the best choice, as it generalizes better to space object images.
Most previous work uses a single scale LR/HR training set to train the network. According to Tables 7-9, a network trained at one fixed scale is only suited to reconstruction at that scale, and it performs poorly when the testing scale does not match the training scale. This shortcoming limits the application of super-resolution to space object images. To overcome it, we use a hybrid training strategy. The experimental results in Tables 7-9 show that a multiple scale network can achieve results comparable to fixed scale ones, especially when the testing scale is high (3 and 4 in our experiments). This shows that it is feasible to reconstruct images at any scale using this training strategy. In addition, VDSR and DRCN are better suited to this strategy because their networks are complex enough to handle images at different scales. Therefore, the hybrid training strategy is meaningful for super-resolution of space object images. The trained network can process input images of all scales, i.e., it can reconstruct the input image to any size, and the results are much better than images generated by interpolation methods, e.g., bicubic.
We also design experiments on direct training and transfer training in Section 3.5.2. The results of transfer training are slightly better than those of direct training, and Figure 7 shows that transfer training converges faster. This indicates that transfer training helps accelerate network convergence and slightly improves the reconstructed results.
Furthermore, we analyze the computational complexity of these four deep learning models. According to Table 12, FSRCNN has the lowest theoretical computational complexity. However, to obtain better efficiency and lower memory cost in practice, software optimization and hardware acceleration should also be considered when implementing CNN-based SR models on a programming platform.
Finally, we analyze the noise robustness of the four networks. None of the four methods, when trained on noise-free data, can process noisy images effectively. In general, SRCNN has better noise robustness than the other three networks. If the image to be reconstructed contains strong noise, a feasible approach is to denoise the image first and then reconstruct it.
Overall, SRCNN has the simplest structure, but it does not reconstruct the main body and edges of the space target well, since its three layers limit its ability to express and reconstruct the features of space target images. FSRCNN contains eight layers and uses a deconvolution layer to raise the image resolution. Because the first seven convolutional layers operate on low-resolution images, FSRCNN runs faster than SRCNN, while its SR performance is unremarkable. VDSR reconstructs the residual image, making it easier to learn the difference between LR and HR images. The edges and textures of the space target reconstructed by VDSR are clearer. DRCN uses recursive convolution networks. Its output layer uses the information of the 3rd to 7th layers, so the main structure and edge details of the space target are super-resolved best among the four CNN-based models, at both fixed and multiple scales. As a result, we suggest using DRCN fine-tuned from a model pre-trained on a natural image dataset as the CNN-based SR model for space-based imaging sensors.
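The kind of theoretical count discussed here can be reproduced with a short calculation. The sketch below is an assumption-laden illustration, not a reproduction of Table 12: biases are ignored, all layers are taken to be stride-1 convolutions operating at the same spatial size, and the example layer list is the 9-1-5 configuration with 64 and 32 filters from the original SRCNN work, which may differ from the settings used in this paper.

```python
from typing import Sequence, Tuple

# (kernel_size, input_channels, output_channels) for one convolution layer.
Layer = Tuple[int, int, int]


def count_weights(layers: Sequence[Layer]) -> int:
    """Total number of convolution weights (biases ignored)."""
    return sum(k * k * c_in * c_out for k, c_in, c_out in layers)


def count_multiplications(layers: Sequence[Layer], height: int, width: int) -> int:
    """Multiplications for one forward pass over a height x width feature map."""
    # With stride 1 and 'same' padding, every weight is used once per pixel.
    return count_weights(layers) * height * width


if __name__ == "__main__":
    # Assumed example: the 9-1-5 SRCNN layout on a single-channel image.
    srcnn = [(9, 1, 64), (1, 64, 32), (5, 32, 1)]
    print("weights:", count_weights(srcnn))
    print("multiplications for 240x320:", count_multiplications(srcnn, 240, 320))
```

For a network such as FSRCNN, the per-layer spatial size would be that of the LR input for all layers before the deconvolution, which is why its count is lower than a same-resolution estimate suggests.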

Conclusions
To meet the needs of image super-resolution in space applications, we have comparatively studied four recent, popular models for single image super-resolution based on convolutional neural networks. We not only explore the differences in performance between these models, but also find some common properties that may be more important for inspiring further research. Firstly, a multiple scale training strategy has been shown to be an efficient way to obtain a single model that reconstructs HR images at any scale. Solving multiple scale SR tasks with one model is more valuable in practice. Secondly, transfer training makes the network converge more easily and gives slightly better results than training a randomly initialized network directly on space object data. Thirdly, testing results are better when the training set and the testing set are highly consistent. This is the key to success on a particular mission, but it is also an obstacle to extending a model to other tasks. Finally, noise is a killer for image super-resolution because it is also amplified during reconstruction. In general, DRCN is the best of the four models in this paper, since it performs best in super-resolution of space object images at both fixed and multiple scales. With this work, researchers may see the advantages and disadvantages of CNN-based super-resolution methods more clearly and promote the development of image super-resolution in space applications.