Video Satellite Imagery Super-Resolution via Model-Based Deep Neural Networks

Abstract: Video satellite imagery has become a hot research topic in Earth observation owing to its ability to capture dynamic information. However, its high temporal resolution comes at the expense of spatial resolution. In recent years, deep learning (DL) based super-resolution (SR) methods have played an essential role in improving the spatial resolution of video satellite images. However, instead of fully considering the degradation process, most existing DL-based methods attempt to directly learn the relationship between low-resolution (LR) satellite video frames and their corresponding high-resolution (HR) ones. In this paper, we propose a model-based deep neural network for video satellite imagery SR (VSSR). The VSSR is composed of three main modules: a degradation estimation module, an intermediate image generation module, and a multi-frame feature fusion module. First, the blur kernel and noise level of the LR video frames are flexibly estimated by the degradation estimation module. Second, the intermediate image generation module iteratively solves two optimization subproblems, and its outputs are intermediate SR frames. Third, a three-dimensional (3D) feature fusion subnetwork is leveraged to fuse the features from multiple video frames. Different from previous video satellite SR methods, the proposed VSSR is a multi-frame-based method that merges the advantages of both learning-based and model-based methods. Experiments on real-world Jilin-1 and OVS-1 video satellite images have been conducted, and the SR results demonstrate that the proposed VSSR achieves superior visual effects and quantitative performance compared with state-of-the-art methods.


Introduction
Over the past few years, video satellite imagery [1][2][3][4] has received considerable attention in the remote sensing and aerospace fields. Compared with traditional satellites that acquire static images [5][6][7][8], video satellites provide a novel way to capture continuous videos. They can acquire dynamic information about objects on the Earth's surface and thus have great advantages in dynamic monitoring tasks, such as moving ship detection [9], object tracking [10], and object detection [11]. However, owing to the increased temporal resolution and the degradation introduced during imaging, spatial resolution is lost to a certain extent, which hinders the further application of video satellites. Super-resolution (SR) [12][13][14] is an effective way to recover sharp and natural high-resolution (HR) images (or sequences) from their low-resolution (LR) counterparts. Since SR is a classical ill-posed inverse problem [15] that can increase the spatial resolution and clarity of low-quality images, it is an important but challenging task in video satellite imagery.
In the literature, much work has been devoted to improving the quality of images and videos by SR. From the perspective of the number of LR images used, SR can be categorized into single image SR (SISR) [16][17][18] and multi-image SR (MISR) [19][20][21]. Among recent model-driven SISR methods, USRNet [45] unfolds the maximum a posteriori (MAP) optimization with a half-quadratic splitting strategy. However, USRNet is a SISR method and therefore cannot fully use the spatial-temporal information of adjacent frames. In addition, both the blur kernel and the noise level in USRNet must be preset in advance.
To overcome the above-mentioned drawbacks, we propose a video satellite imagery SR method termed VSSR. The proposed VSSR is composed of three main modules, i.e., a degradation estimation module, an intermediate image generation module, and a multi-frame feature fusion module. The degradation estimation module is designed to estimate the blur kernel and noise level of the input LR frames. The intermediate image generation module unfolds the MAP framework and iteratively solves two optimization subproblems, while the multi-frame feature fusion module fuses the features from multiple adjacent video frames. To sum up, the main innovative contributions of this work lie in the following three aspects:
• We propose a novel VSSR method for video satellite imagery SR. To the best of our knowledge, it is the first attempt to combine DL-based and model-based methods in the field of video satellite SR.
• The proposed VSSR splits the SR problem into two sub-optimization problems under the umbrella of the MAP framework. One of the subproblems has an analytical solution, and the other is solved by subnetworks. By alternately optimizing the two subproblems, we obtain intermediate SR results.
• The proposed VSSR leverages the information from adjacent frames through a three-dimensional (3D) feature fusion subnetwork. Different from SISR methods or MISR methods based on optical flow estimation, the VSSR is a MISR method in which the features from multiple frames are effectively fused by 3D residual blocks.
The remainder of this paper is organized as follows. The data collection is described first, and the proposed VSSR is then presented in detail. Next, the experimental results and analysis are reported. Finally, after a discussion, the paper is concluded.

Data Collection
In this paper, we collect real-world Jilin-1 and OVS-1 video satellite data (see Table 1) as experimental materials. A detailed description of these two datasets is given as follows:

Proposed Method
In this section, we explain the proposed VSSR method in detail, including the network architecture (see Figure 1) and the loss function for model optimization.

Network Architecture of the VSSR
The overall architecture of the proposed VSSR method is shown in Figure 1. It takes $T = 2t + 1$ consecutive LR frames $\{L_{i-t}, \ldots, L_i, \ldots, L_{i+t}\}$ from the video satellite as input and outputs the SR result of the center frame $L_i$. For simplicity, Figure 1 shows the scenario with $t = 1$. To fully analyze the degradation process, we formulate the degradation model of an LR-HR pair as

$L = SKH + n$, (1)

where $L$ and $H$ denote the LR and HR images, respectively, $K$ represents the blur kernel, $S$ denotes downsampling, and $n$ is additive Gaussian noise with noise level $\sigma$. Equation (1) has been extensively discussed in model-based SR methods, in which the optimization objective can be expressed as the following combination of a data term and a prior term under the MAP framework:

$\hat{H} = \arg\min_{H} \frac{1}{2\sigma^2}\|L - SKH\|^2 + \alpha\Phi(H)$, (2)

where $\frac{1}{2\sigma^2}\|L - SKH\|^2$ is the data term, $\Phi(H)$ refers to the prior term, and $\alpha$ denotes the trade-off between the data term and the prior term.
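For illustration, the following minimal Python sketch simulates the degradation in Equation (1). The Gaussian form of $K$ anticipates the assumption made in the degradation estimation module below; the direct-subsampling form of $S$, the boundary handling, and all numeric values are our own assumptions rather than details from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, blur_std=1.5, s=4, noise_level=2.0, seed=0):
    """Synthesize an LR frame from an HR one following Equation (1):
    L = SKH + n, with K a Gaussian blur, S direct downsampling by the
    scale factor s, and n additive Gaussian noise with level sigma."""
    blurred = gaussian_filter(hr, sigma=blur_std)       # K H
    lr = blurred[::s, ::s]                              # S K H
    rng = np.random.default_rng(seed)
    return lr + rng.normal(0.0, noise_level, lr.shape)  # + n

hr = np.random.rand(256, 256) * 255.0                   # stand-in HR frame
lr = degrade(hr)                                        # 64 x 64 LR frame
```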
By adopting the half-quadratic splitting algorithm, we introduce an auxiliary variable $I$ and rewrite Equation (2) as

$\min_{H, I} \frac{1}{2\sigma^2}\|L - SKI\|^2 + \alpha\Phi(H) + \frac{\beta}{2}\|I - H\|^2$, (3)

where $\beta$ is the penalty parameter. Therefore, the problem in Equation (3) can be split into the following two subproblems:

$I_j = \arg\min_{I} \|L - SKI\|^2 + \beta\sigma^2\|I - H_{j-1}\|^2$, (4)

$H_j = \arg\min_{H} \frac{\beta}{2}\|I_j - H\|^2 + \alpha\Phi(H)$, (5)

where Equations (4) and (5) are associated with $I$ and $H$, respectively, and $I_j$ and $H_j$ are the solutions in the $j$-th iteration. Apparently, Equation (4) is a least-squares problem and therefore admits the analytical solution

$I_j = \left(K^{T}S^{T}SK + \beta\sigma^{2}E\right)^{-1}\left(K^{T}S^{T}L + \beta\sigma^{2}H_{j-1}\right)$, (6)

where $(\cdot)^{T}$ denotes the conjugate transpose and $E$ denotes the identity matrix. With $s$ the scale factor and $\mathcal{F}(\cdot)$ the Fourier transform, Equation (6) can be evaluated efficiently in the Fourier domain by partitioning the kernel spectrum into $s$ blocks $\Lambda = [\Lambda_1, \Lambda_2, \ldots, \Lambda_s]$ satisfying $\mathrm{diag}\{\Lambda_1, \Lambda_2, \ldots, \Lambda_s\} = \Lambda$, where the diagonal elements of the diagonal matrix $\Lambda$ are the Fourier coefficients of the first column of the blur kernel $K$. For Equation (5), it is actually a denoising problem with noise level $\psi = \alpha/\beta$. Motivated by [46,47], we propose a wavelet-based U-net to estimate the clean $H_j$.
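To make the alternation of Equations (4)-(6) concrete, the sketch below unfolds the iteration for the special case $s = 1$ (pure deblurring), where the closed-form solution of Equation (6) reduces to a single Fourier-domain division; the block-partitioned spectrum $\Lambda$ generalizes this to larger scale factors. The identity `denoiser` is a stand-in for WUnet, and the parameter values are arbitrary.

```python
import numpy as np

def hqs_iterations(L, k, sigma, alpha, beta, denoiser, n_iter=6):
    """Alternate Equations (4) and (5) for s = 1. L: observed image (H, W);
    k: blur kernel zero-padded to (H, W) and centered at the origin;
    denoiser(x, psi) plays the role of WUnet in Equation (5)."""
    Fk, FL = np.fft.fft2(k), np.fft.fft2(L)
    H = L.copy()                      # initialize with the observation
    rho = beta * sigma ** 2
    for _ in range(n_iter):
        # Equation (6) with s = 1: (K^T K + rho E)^{-1} (K^T L + rho H)
        FI = (np.conj(Fk) * FL + rho * np.fft.fft2(H)) / (np.abs(Fk) ** 2 + rho)
        I = np.real(np.fft.ifft2(FI))
        H = denoiser(I, alpha / beta)  # Equation (5), noise level psi
    return H

identity = lambda x, psi: x            # WUnet stand-in
L = np.random.rand(64, 64)
k = np.zeros((64, 64)); k[0, 0] = 1.0  # delta kernel for a quick check
H_hat = hqs_iterations(L, k, sigma=0.05, alpha=0.01, beta=1.0, denoiser=identity)
```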
In VSSR, each LR frame in $\{L_{i-t}, \ldots, L_i, \ldots, L_{i+t}\}$ can be represented by Equation (1), and the corresponding HR frame can be obtained by iteratively solving Equations (4) and (5). Subsequently, all the estimated adjacent HR frames are stacked and fed into a 3D feature fusion subnetwork, which can effectively use the features from multiple frames. In a nutshell, the VSSR is composed of three main modules, i.e., the degradation estimation module, the intermediate image generation module, and the multi-frame feature fusion module (see Figure 1), whose details are as follows.

Degradation Estimation Module
In the degradation estimation module, noise level estimation and blur kernel estimation are performed on the input LR frames. For the noise level estimation, we calculate the noise level $\sigma$ of the center frame $L_i$ following [48]. For the blur kernel estimation, we assume that the LR frames suffer from Gaussian blur and design a fine-grained classification-based subnetwork to estimate the standard deviation of the blur kernel. In greater detail, the standard deviation is divided into 6 classes, i.e., [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]. Since these categories are quite similar to each other, a fine-grained classification method is suitable for recognizing classes with small inter-category variations [49]. As illustrated in Figure 2, the blur kernel estimation subnetwork Knet accepts the blurred image $B$ as input and outputs the estimated standard deviation

$\rho = \mathrm{Knet}(B; \Theta_{\mathrm{Knet}})$, (7)

where Knet is the blur kernel estimation subnetwork, $\rho$ is the standard deviation, $B$ denotes the blurred image, and $\Theta_{\mathrm{Knet}}$ represents the set of parameters in Knet. As shown in Figure 2, the embedded features are extracted by the first 30 layers of VGG16 [50], and bilinear pooling is adopted to combine the pairwise interactions between the features extracted by the two subnetworks with shared weights. To optimize the parameters $\Theta_{\mathrm{Knet}}$, the following two loss terms are utilized:

$\mathcal{L}_{\mathrm{Knet}} = \mathcal{L}_1(x, class) + \lambda\,\mathcal{L}_2(\hat{k}, k_{gt})$, (8)

where $\mathcal{L}_1$ and $\mathcal{L}_2$ are the cross-entropy loss and the L1 loss, respectively, $\lambda$ is a trade-off parameter, $x$ is the output vector of the fully connected (FC) layer in Knet, $class$ refers to the ground-truth class of the standard deviation, $k_{gt}$ is the ground-truth blur kernel, and $\hat{k}$ denotes the estimated kernel.
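A minimal PyTorch sketch of such a bilinear-pooling classifier is given below. The signed-square-root and L2 normalization follow common bilinear-CNN practice and, like the layer shapes, are our assumptions, since Figure 2 is not reproduced here. Training would combine cross-entropy on the logits with the $\lambda$-weighted L1 kernel loss of Equation (8).

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class Knet(nn.Module):
    """Sketch of the fine-grained blur-kernel classifier of Figure 2: VGG16
    features feed a bilinear pooling layer whose pairwise channel
    interactions are classified into 6 standard-deviation classes."""
    def __init__(self, num_classes=6, pretrained=False):
        super().__init__()
        weights = "IMAGENET1K_V1" if pretrained else None
        self.backbone = vgg16(weights=weights).features[:30]  # first 30 VGG16 layers
        self.fc = nn.Linear(512 * 512, num_classes)

    def forward(self, blurred):
        f = self.backbone(blurred)                            # (B, 512, h, w)
        b, c, hw = f.shape[0], f.shape[1], f.shape[2] * f.shape[3]
        f = f.reshape(b, c, hw)
        x = torch.bmm(f, f.transpose(1, 2)).reshape(b, -1) / hw  # bilinear pooling
        x = nn.functional.normalize(torch.sign(x) * torch.sqrt(x.abs() + 1e-8))
        return self.fc(x)                                     # logits over the 6 classes

stds = torch.tensor([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
logits = Knet()(torch.randn(1, 3, 128, 128))
rho = stds[logits.argmax(dim=1)]                              # estimated blur std
```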

Intermediate Image Generation Module
In the intermediate image generation module, the auxiliary variable $I_j$ and the HR image $H_j$ ($j = 1, 2, \ldots, N$) are solved iteratively. The closed-form solution of $I_j$ is calculated by Equation (6), while $H_j$ is estimated by the wavelet-based U-net. In greater detail, as observed from Figure 1, the $I_j$ calculation block takes $k$, $s$, $\varphi$, and $H_{j-1}$ as input and outputs $I_j$. Moreover, we design a WUnet (see Figures 3 and 4), which utilizes $I_j$ and $\psi$ to estimate $H_j$. Specifically, WUnet can be formulated as

$H_j = \mathrm{WUnet}(I_j, \psi; \Theta_{\mathrm{WUnet}})$, (9)

where $\Theta_{\mathrm{WUnet}}$ refers to the parameters to be optimized in WUnet. The discrete wavelet transform (DWT) and inverse wavelet transform (IWT) are used as the downsampling and upsampling layers in WUnet to enlarge the receptive field. As for the parameters, the standard deviation of $k$ is obtained by Knet, $s$ represents the scale factor, and the parameters $\varphi$ and $\psi$ are generated by the hyper-parameter estimation subnetwork, which is simply composed of three fully connected layers with two rectified linear units (ReLU) [51] as the first two activation functions and Softplus [52] as the last.
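A sketch of the hyper-parameter estimation subnetwork is shown below, assuming it maps the estimated noise level $\sigma$, blur standard deviation $\rho$, and scale factor $s$ to one positive $(\varphi_j, \psi_j)$ pair per unfolded iteration; the input parametrization and hidden width are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class HyperNet(nn.Module):
    """Sketch of the hyper-parameter estimation subnetwork: three fully
    connected layers activated by ReLU, ReLU, and Softplus, so the
    per-iteration phi_j and psi_j are strictly positive."""
    def __init__(self, n_iter=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_iter), nn.Softplus(),
        )

    def forward(self, sigma, rho, s):
        out = self.net(torch.stack([sigma, rho, s], dim=-1))
        phi, psi = out.chunk(2, dim=-1)  # one (phi_j, psi_j) pair per iteration
        return phi, psi

phi, psi = HyperNet()(torch.tensor([2.0]), torch.tensor([0.5]), torch.tensor([4.0]))
```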

Multi-Frame Feature Fusion Module
In the multi-frame feature fusion module, all the estimated intermediate HR frames of the input LR frames are stacked and fed into a 3D feature fusion subnetwork Fnet (see Figure 5), which utilizes several 3D residual blocks to fuse the information of adjacent frames. The final SR result of the central frame is generated by Fnet, which can be modeled as

$\hat{H}_{\mathrm{Final}} = \mathrm{Fnet}\left(\left[H^{N}_{i-t}, \ldots, H^{N}_{i}, \ldots, H^{N}_{i+t}\right]; \Theta_{\mathrm{Fnet}}\right)$, (12)

where $\Theta_{\mathrm{Fnet}}$ refers to the parameters to be optimized in Fnet and $H^{N}$ denotes the intermediate HR frame after $N$ iterations.

Figure 3. Architecture of the wavelet-based denoising subnetwork WUnet, with the corresponding number of feature maps (n), padding size (p), and dilation size (d). The kernel size and stride of all 2D convolution layers are set to 3 and 1, respectively.

Figure 4. Architectures of the f0, f1, f2, h, g0, g1, and g2 blocks used in WUnet, with the corresponding number of feature maps (n), padding size (p), and dilation size (d). The kernel size and stride of all 2D convolution layers are set to 3 and 1, respectively.
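The fusion can be sketched with a single 3D residual block as below; the channel width, the number of blocks, and the input layout are assumptions, as Figure 5 is not reproduced here.

```python
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Sketch of a 3D residual block as used in Fnet to fuse the stacked
    intermediate HR frames along the temporal axis."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):   # x: (B, C, T, H, W)
        return x + self.body(x)

# Stacking T intermediate HR frames yields a (B, C, T, H, W) tensor whose
# temporal dimension the 3D convolutions fuse.
frames = torch.randn(1, 64, 3, 96, 96)
fused = ResBlock3D()(frames)
```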

Model Optimization
In VSSR, we use the following L1 loss to optimize the parameters $\Theta_{\mathrm{WUnet}}$, $\Theta_{\mathrm{Fnet}}$, and the parameters of the hyper-parameter estimation subnetwork:

$\mathcal{L}_{\mathrm{VSSR}} = \left\|\hat{H}_{\mathrm{Final}} - H_{\mathrm{Final}}\right\|_{1}$, (13)

where $\hat{H}_{\mathrm{Final}}$ and $H_{\mathrm{Final}}$ denote the final SR result obtained by the VSSR and the corresponding reference result, respectively. Moreover, we adopt the widely used adaptive moment estimation (Adam) solver [53] to minimize the loss term $\mathcal{L}_{\mathrm{VSSR}}$.

Results
In this section, we validate the effectiveness of the proposed VSSR by conducting a group of experiments on the real-world Jilin-1 and OVS-1 video satellite data (see Table 1). First, the implementation details are introduced. Second, the VSSR is compared against state-of-the-art SR methods, and the experimental results on the Jilin-1 and OVS-1 data are presented. Next, an ablation study is described to assess the contribution of each component. Finally, the sensitivity of different parameters is analyzed.

Implementation Details
In the experiments, we compare our VSSR method with several state-of-the-art SR methods, including bicubic interpolation (termed Bicubic), SRCNN [34], VDSR [35], EDSR [36], DBPN [37], SAN [38], USRNet [45], DBVSR [44], and M_D [40]. Notably, Bicubic, SRCNN, VDSR, EDSR, DBPN, SAN, USRNet, and M_D are SISR-based methods, which directly map a single LR image to an HR image, while DBVSR and VSSR are MISR-based methods, which can make use of the spatio-temporal information from neighboring LR frames. In both DBVSR and VSSR, 3 adjacent LR frames are fed into the network to reconstruct the SR result of the center frame. The Bicubic method generates the HR image by bicubic interpolation, while the rest are DL-based methods, which learn the mapping function from LR-HR pairs. M_D applies convolution layers and a deconvolution layer to enhance the resolution of video satellite images. USRNet, DBVSR, and VSSR perform SR by combining learning-based methods with model-based ones under the MAP framework. Besides the 10 videos from the Jilin-1 data (see Table 1), 30 additional videos downloaded from a video website (https://pixabay.com/videos, accessed on 1 October 2021) are leveraged to enlarge the training set. For consistency, we also extract 10 frames with 1280 × 1280 spatial pixels and 3 RGB bands from those additional videos. Using the Jilin-1 data as training samples helps the learning-based methods better capture the characteristics of satellite videos, while the 30 additional videos increase the quantity and diversity of the training set. Since the scale factor s = 4 is more challenging than s = 2 or s = 3, we only consider s = 4 in the experiments.
All the methods are run on a workstation with a dual Intel(R) Xeon(R) Gold 5218 CPU @ 2.30 GHz and an Nvidia GeForce RTX 2080Ti GPU. For fairness, all the DL-based methods are retrained on the same training set as VSSR, and the configurations of the competing methods follow their corresponding references. The detailed architecture of VSSR is described in the Proposed Method section. Equations (4) and (5) are iteratively solved 6 times, i.e., the variable N in Equation (12) equals 6. The learning rate is initialized to 0.0005, and the exponential decay rates of the Adam solver are set to 0.9 and 0.999, respectively. The patch size and batch size are set to 96 and 5, respectively, and training is stopped within 50 epochs, since more training epochs do not lead to further significant improvement.
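The training configuration described above can be summarized in the following sketch, in which the network and dataset are replaced by runnable stand-ins; the real VSSR network and Jilin-1 patch pairs would take their places.

```python
import torch
import torch.nn as nn

# Stand-ins so the sketch runs; these are NOT the paper's architecture.
model = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Upsample(scale_factor=4))
train_set = torch.utils.data.TensorDataset(
    torch.randn(20, 3, 24, 24), torch.randn(20, 3, 96, 96))

# Adam with lr = 0.0005, betas = (0.9, 0.999); 96x96 patches in batches of 5.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
loader = torch.utils.data.DataLoader(train_set, batch_size=5, shuffle=True)

for epoch in range(50):                # training stopped within 50 epochs
    for lr_patch, hr_patch in loader:
        loss = nn.functional.l1_loss(model(lr_patch), hr_patch)  # Eq. (13)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```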
To quantitatively compare the different SR methods, we apply 5 commonly used evaluation metrics: root mean square error (RMSE), peak signal-to-noise ratio (PSNR), correlation coefficient (CC), structural similarity index (SSIM), and erreur relative globale adimensionnelle de synthèse (ERGAS). Larger PSNR, CC, and SSIM indicate better SR results, while smaller RMSE and ERGAS are better.
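For reference, four of these metrics can be computed as below, assuming H × W × B image arrays, an 8-bit peak value, and the common 100/s ERGAS convention (conventions for the scaling factor vary); SSIM is available as structural_similarity in skimage.metrics.

```python
import numpy as np

def rmse(ref, est):
    return np.sqrt(np.mean((ref - est) ** 2))

def psnr(ref, est, peak=255.0):
    return 20.0 * np.log10(peak / rmse(ref, est))

def cc(ref, est):
    # Pearson correlation coefficient over all pixels and bands
    return np.corrcoef(ref.ravel(), est.ravel())[0, 1]

def ergas(ref, est, s=4):
    band_rmse = np.sqrt(((ref - est) ** 2).mean(axis=(0, 1)))  # per-band RMSE
    band_mean = ref.mean(axis=(0, 1))
    return 100.0 / s * np.sqrt(np.mean((band_rmse / band_mean) ** 2))
```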

Experimental Results on the Jilin-1 Data
To better understand the behavior of the VSSR, we first visualize part of the learned features in Figure 6, which displays 16 feature maps extracted by the eighth 3D residual block of Fnet (see Figure 5) for the San-ya data. Since 3D convolutions are used in Fnet, each feature map shown in Figure 6 is an RGB image. It can be observed that different structures and abstract features are learned and highlighted by different convolutional channels.

We then compare the qualitative and quantitative results of the various methods in Figures 7-9 and Table 2 to verify the effectiveness of the VSSR. Specifically, Figures 7-9 visualize the SR results of the third frame in the San-ya, San Diego, and Macao satellite videos, respectively. The maps shown in the third and fourth rows are the enlarged counterparts of the scenes marked by red boxes in the first and second rows. Table 2 reports the evaluation results of our proposed VSSR against the competing methods on the Jilin-1 data with a scale factor of 4. Bold and italic underline represent the best and second-best performance, respectively.

According to the experimental results on the Jilin-1 data, the DL-based methods consistently outperform bicubic interpolation. In most cases, Bicubic yields the lowest PSNR, CC, and SSIM and the highest RMSE and ERGAS among all SR methods. For instance, the RMSE of Bicubic is at least 0.4904 higher than that of the DL-based methods, while the gap between Bicubic and the DL-based methods is at least 0.3214 dB in PSNR. Intuitive comparisons are shown in Figures 7-9, which demonstrate that Bicubic exhibits the most inaccurate and blurred image details in terms of visual effect. The reason for the poor results of Bicubic is that the degradation process is represented by a predefined interpolation pattern, which cannot reflect the realistic relationship between LR frames and their HR counterparts in video satellite images.

As to the DL-based methods, SRCNN is inferior to the other methods, EDSR yields better results than M_D and VDSR, and DBPN performs slightly worse than or comparably to EDSR. SAN yields superior performance to EDSR and DBPN but worse SR results than USRNet and DBVSR, while VSSR consistently outperforms all the competing methods. For the San-ya data, the PSNR of M_D and VDSR surpasses that of SRCNN by about 0.2075 dB and 0.5806 dB, respectively, while the PSNR of EDSR is 0.4633 dB and 0.0902 dB higher than that of M_D and VDSR, respectively. In addition, the PSNR of DBPN is 0.0621 dB lower than that of EDSR. SAN surpasses EDSR and DBPN by 0.0683 dB and 0.1304 dB in PSNR, respectively, while the improvement of USRNet and DBVSR is at least 0.6004 dB compared with SAN. The PSNR of VSSR is the highest among all methods. The RMSE, CC, SSIM, and ERGAS of VSSR are also better than those of the other methods, while DBVSR achieves the second-best evaluation metrics. Similar properties can also be found in the San Diego and Macao data.
From the perspective of visual presentation, the DL-based methods produce more noticeable texture details than Bicubic. As shown in Figures 7-9, the SR results generated by SRCNN contain more blurred edges and artifacts than those of the other DL-based methods. Among all the methods, the SR results of our proposed VSSR show the clearest visual effect and the sharpest edges. For instance, the outlines of the domes, buildings, and roads in San-ya (see Figure 7) are clearer than those of other methods, and the streets, trees, and buildings in San Diego and Macao (see Figures 8 and 9) are also closer to the ground-truth images. These phenomena demonstrate the advantage of VSSR in reconstructing HR video satellite images.

Figure 10 compares the number of parameters and floating point operations (FLOPs) of the VSSR against the other DL-based approaches when processing a 1280 × 1280 video frame. In greater detail, Figure 10a compares the number of parameters, while Figure 10b compares the FLOPs. As shown in Figure 10a, the numbers of parameters of SRCNN, M_D, VDSR, USRNet, and EDSR are relatively low, all less than 2 M. DBPN and SAN are within 20 M, both having more parameters than the five methods mentioned above. The number of parameters of VSSR is 22.6 M, and that of DBVSR is 50.5 M, which is the largest among all methods. As depicted in Figure 10b, the FLOPs of SRCNN, M_D, and VDSR are all less than 70 G, while the FLOPs of USRNet, SAN, and DBVSR are larger. The VSSR has larger FLOPs than the other methods except DBPN, whose FLOPs exceed 9000 G, the largest among all methods.

Moreover, the average inference time of the various methods for reconstructing the San-ya scene is shown in Figure 11a, from which we can observe that Bicubic is the fastest of all the methods, while SRCNN and M_D cost less time than the other DL-based methods. The reason is that an efficient interpolation pattern is used in Bicubic, and the network structures of SRCNN and M_D are much simpler than those of the other methods. It can also be observed from Figure 11a that VDSR, EDSR, and USRNet spend more time than SRCNN and M_D but are much faster than DBVSR. This is because SRCNN and M_D use only 3 and 4 convolution layers, respectively, while VDSR, EDSR, and USRNet use multiple convolutional layers and residual connections, multiple residual blocks, and multiple U-net blocks, respectively. Moreover, DBPN is slower than the other methods, while the inference time of VSSR is moderate (i.e., less than 7 s) among all the comparison methods. Therefore, the proposed VSSR can be effectively adopted in practical applications. Furthermore, Figure 11b compares the average inference time of each module in the VSSR for reconstructing the San-ya scene with size 1280 × 1280. We can observe that the degradation estimation module consumes the least time (i.e., 0.0040 s), while the intermediate image generation module consumes the most (i.e., 4.7859 s). It should be noted that the total time of the three modules is less than the average inference time of VSSR shown in Figure 11a; this is because Figure 11b does not count the time for data import and for storing the results as images.

Experimental Results on the OVS-1 Data
To further examine the practicability of VSSR in real scenarios, we perform an additional group of experiments on the OVS-1 data. Different from the Jilin-1 experiments, in which 10 Jilin-1 videos are used to produce the LR-HR training pairs, the LR frames from the OVS-1 data are directly fed into networks trained without any OVS-1 data. The visual performance of the different methods is compared in Figures 12 and 13, while the detailed evaluation results are shown in Table 3.
From these results, we can observe that the quality of the SR results is better than that of the original LR frames. For instance, as shown in Figure 12, the details of the LR frame are very blurry, while the clarity of the HR frames obtained by the various SR methods is improved to a certain extent. A similar phenomenon can also be found in the Marseille video (see Figure 13). It is worth stressing that the lower the spatial resolution of the input images, the lower the accuracy obtained by the DL-based methods. For example, the spatial resolution of the OVS-1 data is 0.98 m coarser than that of the Jilin-1 data, and the PSNR obtained is about 8 dB lower. As displayed in Table 3, Bicubic provides worse results than the other methods. The DL-based methods that combine learning-based methods with model-based ones under the MAP framework (i.e., USRNet, DBVSR, and VSSR) outperform the comparison methods that do not fully consider the degradation process. Moreover, the MISR-based methods (i.e., DBVSR and VSSR) yield superior performance to the SISR-based methods (i.e., SRCNN, VDSR, EDSR, DBPN, SAN, USRNet, and M_D). For instance, it is notable from Table 3 that the evaluation indexes of DBVSR and VSSR achieve the second-best and best performance among all methods, respectively. As plotted in Figures 12 and 13, the visual results of DBVSR and VSSR are more realistic and clearer than those of the other methods. The reasons for the good results of VSSR are that it integrates the advantages of both learning-based and model-based methods by splitting the SR problem into two sub-optimization problems, that the blur kernel and noise level are flexibly estimated, and, last but not least, that the spatial-temporal information of neighboring frames is fully considered by the 3D residual blocks. In a nutshell, the VSSR outperforms the comparison methods in restoring realistic SR results.

Table 3. Quantitative evaluation of the proposed VSSR against different methods on the OVS-1 data with a scale factor of 4. Bold indicates the best and italic underline indicates the second-best performance.

Ablation Study
Ablation experiments are designed in this subsection to assess the advantages of the proposed VSSR. All the experiments are conducted on the San-ya video from the Jilin-1 data.
In our ablation study, the effectiveness of each module (i.e., the degradation estimation module, intermediate image generation module, and multi-frame feature fusion module) is examined. To that end, the modules in VSSR are removed one at a time to evaluate the change in PSNR and SSIM. In greater detail, when the degradation estimation module is deleted from VSSR, a fixed blur kernel and noise level are used instead of ones that change with the input LR frames; the fixed blur kernel and noise level are then fed into the intermediate image generation module, whose outputs are stacked and input into the Fnet to generate the final SR results. When the intermediate image generation module is removed, the degradation estimation module loses its efficacy; motivated by SRCNN and VDSR, each LR frame is then upsampled by bicubic interpolation, and the upsampled frames are fed into the multi-frame feature fusion module to produce the SR results. When the multi-frame feature fusion module is removed from VSSR, the H^N of the center frame is taken as the SR result.
Since the degradation estimation module is important in the VSSR, we visualize the estimated blur kernel and noise level obtained by this module in Figure 14. The estimated blur kernel is shown in Figure 14a, from which we can observe that all 10 frames have the same standard deviation value (i.e., 0.5). The estimated noise level is depicted in Figure 14b, which demonstrates that the noise level of different frames fluctuates in a small range between 1.92 and 1.97. Moreover, the PSNR and SSIM of the above-mentioned variants are compared in Table 4, from which one can observe that both PSNR and SSIM drop when one of the modules is removed, and that the performance is worst when the intermediate image generation module is discarded. Based on the aforementioned analysis, all three modules are important to the VSSR.

Table 4. Results of the ablation study evaluating the effectiveness of each module in VSSR. Bold indicates the best and italic underline indicates the second-best performance. "✓" denotes that the module is adopted, while "✗" denotes that it is not.

Sensitivity Analysis of Different Parameters
The sensitivity of several free parameters, including the patch size, batch size, length of adjacent LR frames, and number of training epochs, is discussed in this subsection. Figure 15 plots the influence of these parameters on the performance of VSSR on the San-ya video.
We first analyze the sensitivity to the patch size and batch size. The patch size is chosen from {48, 96, 144, 192}, while the batch size is selected from {1, 5, 10, 15}. As shown in Figure 15a, the performance of VSSR fluctuates with the patch size and batch size. As to the patch size, the PSNR increases when the patch size is larger than 48 and smaller than 144. As to the batch size, the SR results are satisfactory when it is equal to or larger than 5. In the experiments, we set the patch size to 96 and the batch size to 5 to trade off model effectiveness against efficiency.
We then discuss the influence of the length of neighboring LR frames T on the SR performance. The SR results are reconstructed with various lengths of LR sequences, i.e., {1, 3, 5, 7, 9}. In particular, it is observed from Figure 15b that when T = 1, VSSR degenerates to a SISR method, whose performance is inferior to that with T > 1. When T increases from 3 to 9, the PSNR fluctuates only slightly. Since a larger T incurs an extra computational burden, we set T = 3 in the experiments.
Finally, the impact of the number of epochs on the SR performance is analyzed. The loss versus training epochs is shown in Figure 15c, in which the epochs vary from 1 to 100 with an interval of 1. It is clearly visible in Figure 15c that the loss drops rapidly in the first couple of epochs, then decreases slowly as the number of epochs increases, and finally tends to be stable.
In the experiments, we train VSSR by using 50 epochs to generate a stable and effective model for SR of video satellite images.

Discussion
This paper proposes a model-based deep neural network for SR of video satellite images. The quality and clarity of video satellite images are thereby improved, which is conducive to a wider application of video satellites in dynamic monitoring. In this section, we discuss the following issues.
First, we explain and discuss how the modules are trained. In the training process, we first train the degradation estimation module using the loss term L_Knet in Equation (8); the trained degradation estimation module is then adopted to estimate the noise level σ and the standard deviation of k, and the intermediate image generation module and multi-frame feature fusion module are optimized using Equation (13). The reasons for fully training the first module before training the other two are: (1) the trained degradation estimation module can estimate a more accurate k than an untrained one; (2) the σ and k estimated by the trained degradation estimation module can be directly utilized to train the intermediate image generation module and the multi-frame feature fusion module. This means that we only need to optimize two modules simultaneously instead of three, which speeds up training. As to the effectiveness of the VSSR on the training and test sets, Figure 15c shows that the training loss tends to be stable after about 20 epochs and fluctuates in a narrow range around 0.03. Owing to the different characteristics of different videos, the evaluation metric values (e.g., PSNR and SSIM) obtained on different training videos differ. As to the test set, the results are shown in Tables 2 and 3; the accuracy also varies from video to video. For instance, the PSNR of San-ya, San Diego, and Macao is 27.0287, 26.1814, and 28.5170 dB, respectively.
Second, it should be noted that the proposed VSSR is an offline method rather than an online method. This is because it uses pre-acquired videos for training, while an online method continues to learn from live data. In the testing procedure, the whole trained VSSR is adopted to reconstruct the HR counterpart of the input LR frames. When new video data is acquired, the trained VSSR network can be treated as a pre-trained model, and then the pre-trained model is fine-tuned with the newly acquired data. Of course, the model can also be retrained with all the video data (i.e., pre-acquired and newly acquired data), but this will be more time-consuming.
Furthermore, there is still room for improvement. For instance, in our VSSR, a deep unfolding module based on a two-dimensional wavelet-based U-net is designed to obtain the intermediate HR results, and a 3D feature fusion subnetwork is subsequently used to fuse features from adjacent frames. In other words, VSSR uses a 2D wavelet-based U-net and a 3D ResNet separately instead of directly using a 3D wavelet-based U-net to obtain the final HR result. This is because, when the number of feature maps n increases to 128, 256, or 512, a 3D wavelet-based U-net consumes much more computation and storage than its 2D counterpart. However, if lightweight convolution models and wavelet-based U-net calculation methods can be developed, it would be worthwhile to directly adopt a 3D wavelet-based U-net to obtain the final HR results in the future.
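The cost gap can be made concrete by counting weights: at equal channel width, a 3 × 3 × 3 Conv3d kernel holds three times the parameters of a 3 × 3 Conv2d kernel, and its activations additionally carry a temporal axis. A quick check:

```python
import torch.nn as nn

c = 256  # a feature width at which the cost gap becomes pronounced
p2d = sum(p.numel() for p in nn.Conv2d(c, c, kernel_size=3, padding=1).parameters())
p3d = sum(p.numel() for p in nn.Conv3d(c, c, kernel_size=3, padding=1).parameters())
print(p2d, p3d, round(p3d / p2d, 2))  # ~0.59 M vs ~1.77 M weights, ratio ~3
```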

Conclusions
In this paper, a model-based deep neural network (i.e., VSSR) is designed to perform SR on video satellite images. In contrast to existing DL-based methods, which do not fully consider the degradation process when mapping the relationship between LR frames and their HR counterparts, the proposed VSSR integrates the advantages of learning-based methods with model-based ones by splitting the objective function of the degradation model into two sub-optimization problems. By constructing a degradation estimation module, both the blur kernel and the noise level are flexibly estimated and vary with the input data instead of being set manually in advance. The intermediate image generation module plays a vital role in solving the two sub-optimization problems, one of which has an analytical solution while the other is solved by a wavelet-based U-net (i.e., WUnet). The main contribution of the multi-frame feature fusion module is to fuse the spatial-temporal information from multiple video frames by using 3D residual blocks. Experiments are performed on both Jilin-1 and OVS-1 data. The visual results demonstrate that the HR frames generated by VSSR contain sharper edges, fewer artifacts, and clearer visual effects than those of the comparison methods. Moreover, the SR results are quantitatively evaluated by 5 evaluation metrics; it is worth underlining that the proposed VSSR yields higher PSNR, CC, and SSIM and lower RMSE and ERGAS than state-of-the-art methods. The proposed method could be of great interest for a wider application of video satellites. Future research can be conducted to design lightweight networks for SR of video satellite images. An in-depth study of the impact of resolution, bit depth, and file format on image quality is also promising, and additional studies are needed in the future to optimize the network structure by neural architecture search.

Data Availability Statement: The Jilin-1 data in this study are openly and freely available at http://charmingglobe.com/ (accessed on 10 September 2021). The OVS-1 data in this study are openly and freely available at https://www.myorbita.net/index.aspx (accessed on 10 September 2021). The 30 additional videos are openly and freely available at https://pixabay.com/videos (accessed on 1 October 2021).