Deep Gradient Prior Regularized Robust Video Super-Resolution

Abstract: This paper proposes a robust multi-frame video super-resolution (SR) scheme to obtain high SR performance under large upscaling factors. Although the reference low-resolution frames can provide complementary information for the high-resolution frame, an effective regularizer is required to rectify the unreliable information from the reference frames. As the high-frequency information is mostly contained in the image gradient field, we propose to learn the gradient-mapping function between the high-resolution (HR) and the low-resolution (LR) image to regularize the fusion of multiple frames. In contrast to the existing spatial-domain networks, we train a deep gradient-mapping network to learn the horizontal and vertical gradients. We found that adding the low-frequency information (mainly from the LR image) to the gradient-learning network can boost the performance of the network. A forward and backward motion field prior is used to regularize the estimation of the motion flow between frames. For robust SR reconstruction, a weighting scheme is proposed to exclude the outlier data. Visual and quantitative evaluations on benchmark datasets demonstrate that our method is superior to many state-of-the-art methods and can recover better details with fewer artifacts.


Introduction
Image/video super-resolution (SR) plays an important role in various applications such as computer vision, image recognition and high-definition display devices. The demand for high-performance SR algorithms is growing as high and ultra-high-definition displays have become prevalent. In general, video super-resolution can be divided into two categories: single image-based methods and multi-frame-based methods.
Bilinear, bicubic and spline interpolation are usually used for video super-resolution due to their low complexity. These methods use fixed interpolation kernels to estimate the unknown pixels on the HR grid. However, the fixed-kernel strategy produces visually annoying artifacts such as jagged edges, ringing effects and blurred details in the output image. Advanced interpolation methods [1][2][3][4][5], which take image structure into consideration, produce fewer jagged edges. However, these methods still tend to produce blurry images, especially for large upscaling ratios. Learning-based methods try to reconstruct the high-resolution images via the mapping between the LR and HR images [6][7][8][9]. Timofte et al. [7,10] propose to replace the LR patches by the most similar dictionary atoms with a pre-computed embedding matrix. Self-example approaches [11] exploit the fact that patches of similar pattern tend to recur in the image itself. More recently, deep neural networks have shown their potential to learn hierarchical representations of high-dimensional data. Convolutional neural network (CNN)-based methods have achieved impressive results [8][12][13][14][15][16][17][18][19] in image/video SR.
Multi-frame-based super-resolution methods [20][21][22][23][24][25][26][27][28][29][30][31][32][33] use multiple images that describe the same scene to generate one HR image. They assume that different frames contain complementary information about the high-resolution frame. Thus, the key steps of multi-frame SR are the registration and the fusion of the frames. Typical multi-frame SR methods [20,21,25,32] align the frames at the sub-pixel level and reconstruct the HR frame based on the observation model. These methods perform well if the motions between the LR frames are small and global. However, it is difficult for them to handle large scale factors and large motions. Learning-based multi-frame SR methods learn a mapping directly from low-resolution frames to high-resolution frames [27][28][29]. These methods use optical flow estimation to warp the frames toward the current frame and learn the multi-frame fusion process from an external database. Liao et al. [27] propose to handle large and complex motion in multi-frame SR by deep-draft ensemble learning based on convolutional neural networks. More advanced methods learn the sub-pixel registration and the fusion function simultaneously via deep neural networks [26,30]. However, complex motion makes the learning of multi-frame fusion difficult, and important image information may be eliminated by these methods.
In this paper, a robust multi-frame video super-resolution scheme is proposed to deal with large upscaling factors. Because of the ill-posedness of the SR problem, a gradient prior learning network is trained to regularize the reconstruction of the HR image. The gradient network takes the upsampled LR image as input and learns the gradient prior knowledge from an external dataset. The learned gradient prior then participates in the multi-frame fusion to predict the final HR image. Instead of directly learning the mapping from the LR gradients to the HR gradients, we add the low-frequency information to the input of the network to stabilize the gradient learning and boost the performance. The HR reconstruction branch takes the LR frames as inputs, which provide the complementary information for the high-resolution frame. In the fusion stage, the learned gradient prior regularizes the reconstructed HR image to be visually natural. Experimental results demonstrate that our method is superior to many state-of-the-art single and multi-frame super-resolution methods for large upscaling factors, especially in the edge and texture regions.
The contributions of the proposed scheme include:
(1) We propose a novel deep gradient-mapping network for video SR problems. The network learns the gradient prior from external datasets and regularizes the SR reconstructed image. The effectiveness of this prior is analyzed.
(2) To obtain the high-resolution motion fields, we propose to estimate the motions in the low-resolution scale and then interpolate them to the high resolution. The motion field is regularized by a forward-backward motion field prior, which brings in more accurate estimation around the motion boundary.
(3) A weighting scheme is proposed to exclude the outlier data for robust SR reconstruction.
The rest of this paper is organized as follows. Section 2 gives the background of this paper. Section 3 introduces the proposed gradient prior learning network. Section 4 studies the estimation of the motion field and the robust SR reconstruction using the reference LR frames. Experimental results are reported in Section 5 and Section 6 concludes the paper.

Framework of Multiple Frames SR Reconstruction
As shown in Figure 1, the degradation of video frames is usually caused by atmospheric turbulence, inappropriate camera settings, downscaling determined by the output resolution and noise produced by the sensor. Based on studies of camera sensor modeling, the commonly used video frame observation model describes the relationship between an HR frame and a sequence of LR frames: the LR frames are acquired from the corresponding HR frame through motion, blurring and down-sampling. In this process, the LR frames may be disturbed by noise. Thus, the video frame observation model can be formulated as follows:

$$y_k = D H F(u_k, v_k)\, x + n_k, \quad k = -M, \ldots, M, \tag{1}$$

where $y_k$ represents the $k$-th low-resolution (LR) frame of size $PQ \times 1$, $x$ denotes the vectorized HR frame of size $s^2PQ \times 1$, $s$ is the down-sampling factor and $2M + 1$ is the number of LR frames. $F(u_k, v_k)$ represents the geometric warping matrix between the HR frame and the $k$-th LR frame, where $u_k$ and $v_k$ represent the horizontal and vertical displacement fields, respectively. $H$ is the blurring matrix of size $s^2PQ \times s^2PQ$ and $D$ denotes the down-sampling matrix of size $PQ \times s^2PQ$. $n_k$ represents the additive noise of the $k$-th LR frame, of size $PQ \times 1$. Here, we define $y_0$ as the current LR frame, and the neighboring LR frames $\{y_k\}_{k \neq 0}$ are the reference frames.

Assuming that the neighboring LR frames in the temporal domain describe the same HR scene and carry complementary information, we intend to estimate the HR frame from the LR frames. In this paper, we cast multi-frame video super-resolution as an inverse problem. Given multiple LR frames $\{y_k\}_{k=-M}^{M}$, the original HR frame $x$ can be estimated via the maximum a posteriori (MAP) estimator:

$$\tilde{x} = \arg\max_{x} \sum_{k=-M}^{M} \log \Pr(y_k \mid x) + \log \Pr(x), \tag{2}$$

where $\log \Pr(y_k \mid x)$ indicates the likelihood of $x$ and $\log \Pr(x)$ corresponds to the image prior knowledge. As $\Pr(y_k \mid x)$ characterizes the relationship between $y_k$ and $x$, the noise probability model should be established.
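The observation model can be illustrated with a small simulation. The following Python sketch is an illustration under stated assumptions only: the warp F is replaced by a circular integer translation rather than a dense flow field, the blur H is a separable Gaussian kernel with edge padding, and D keeps every s-th pixel. The function names (`gaussian_kernel`, `blur`, `observe`) and the parameter values are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def gaussian_kernel(sigma: float, radius: int) -> np.ndarray:
    """1-D Gaussian kernel normalized to sum to one."""
    ax = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Blur operator H: separable Gaussian filtering with edge padding."""
    radius = int(np.ceil(3 * sigma))
    k = gaussian_kernel(sigma, radius)
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)


def observe(x, shift, s, sigma, noise_std, rng):
    """Simulate y_k = D H F x + n_k for one LR frame.

    Assumption: F is a circular integer shift (rows, cols), used here as a
    simple stand-in for the dense warping matrix F(u_k, v_k).
    """
    warped = np.roll(x, shift, axis=(0, 1))      # F: warp the HR frame
    blurred = blur(warped, sigma)                # H: blur
    down = blurred[::s, ::s]                     # D: down-sample by factor s
    return down + rng.normal(0.0, noise_std, size=down.shape)  # + n_k


rng = np.random.default_rng(0)
x = rng.random((32, 48))                         # toy HR frame
frames = [observe(x, (k, -k), s=4, sigma=1.4, noise_std=0.01, rng=rng)
          for k in range(-2, 3)]                 # 2M + 1 = 5 LR frames
```

Each simulated LR frame has 1/s of the HR height and width, which is the size relation between $y_k$ and $x$ stated above.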

Gradient-Based Super-Resolution
During the image acquisition process, the LR images lose part of their visual details compared with the original HR images. The lost visual details are high-frequency in nature and are believed to be mostly contained in the image gradient field. Many approaches try to recover the high-frequency image details by modeling and estimating the image gradients.
The SR framework of the gradient-based methods is illustrated in Figure 2. The LR image $y$ is first upsampled to the high resolution using a simple interpolation method. This upsampled LR image $y^u$ usually contains visual artifacts due to the loss of high-frequency information. The lost image details, such as edges and textures, are mainly contained in the image gradients. Therefore, the framework extracts the gradient field $G_{y^u}$ from $y^u$ and processes it with a gradient recovery operation, say $P(\cdot)$:

$$\tilde{G}_x = P(G_{y^u}), \tag{3}$$

where $\tilde{G}_x$ is the estimated HR gradient field. $\tilde{G}_x$ is supposed to contain more accurate information about the image details. Finally, the HR image $\tilde{x}$ is reconstructed by fusing the LR image $y$ with the obtained HR gradient field $\tilde{G}_x$:

$$\tilde{x} = F(y, \tilde{G}_x), \tag{4}$$

where $F(\cdot)$ is the fusion operation. For the reconstruction-based SR methods, the fusion operation $F(\cdot)$ is usually formulated as the MAP estimator (2). Sun et al. [44] model the gradient-mapping function from the LR image to the HR image with a statistical, parametric model. As the sharp edges in a natural image are related to the concentration of gradients perpendicular to the edge, Sun et al. [44] develop a gradient transform to convert the LR gradients to the HR gradients. However, it is rather difficult to model the gradients of an image with only a few parameters. Thus, the obtained HR images are usually over-sharpened or suffer from false artifacts due to the incorrect estimation of gradients. Zhu et al. [47] propose a deformable gradient compositional model to represent the non-singular primitives as compositions of singular ones. They then use external gradient pattern information to predict the HR gradients. Although this model is more expressive than the parametric gradient prior models, its performance is still limited, especially in areas with complex details.
Recently, deep neural networks have shown their power in learning representations of high-dimensional data. Convolutional neural networks (CNNs) have already been used for many low-level vision applications such as denoising, super-resolution and deraining. Dong et al. [8] first develop a three-layer neural network named SRCNN to learn the non-linear mapping between the LR image and the corresponding HR image. Later, Kim et al. [13] propose a very deep CNN with a residual architecture to achieve outstanding SR performance, which can use broader contextual information with larger model capacity. Kim et al. [12] also design another network, which contains recursive architectures with skip connections to boost image SR performance while using only a small number of model parameters. However, these methods seldom impose any prior constraints on the recovered HR image. Yang et al. [48] introduce a deep edge guided recurrent residual (DEGREE) network to progressively perform image SR by imposing properly modeled edge priors. However, the edge priors only contain a small part of the high-frequency information and limited performance improvements are reported.
In contrast to the existing CNN-based methods, we develop an end-to-end network that learns the gradient recovery operation $P(\cdot)$ and then combines it with the MAP estimator $F(\cdot)$ for multi-frame SR. An overview of the framework of the proposed method is shown in Figure 3. As illustrated, our SR framework conceptually contains the following two branches: the gradient branch learns the gradient priors and the reconstruction branch estimates the HR image by fusing multiple frames regularized by the learned gradient prior knowledge.

Figure 3. The architecture of the proposed multi-frame video SR. The framework contains two branches: the gradient prior learning branch and the HR image reconstruction branch. The gradient branch aims to predict the accurate gradient information while the image reconstruction branch fuses multiple LR frames and the gradient prior information to predict the final HR image. The motion field estimation is performed on the LR frames followed by the interpolation of the motion field to the high resolution.

Deep Gradient Prior Learning Network
In this section, we present the technical parts of our gradient-learning network in detail. In the framework, the LR image $y$ is first upsampled by bicubic interpolation to the desired size, giving $y^u$. The horizontal gradients $G^h_{y^u}$ and the vertical gradients $G^v_{y^u}$ are then extracted by convolving the image with the discrete gradient operators $[-1/2, 0, 1/2]$ and $[-1/2, 0, 1/2]^T$, respectively. The gradients $G^h_{y^u}$, $G^v_{y^u}$ and the upsampled image $y^u$ are combined and fed into the network. The network performs convolutions on the input data to estimate the HR gradients $\tilde{G}^h_x$, $\tilde{G}^v_x$. The estimated image gradients are treated as image priors to regularize the high-frequency information of the reconstructed HR image $\tilde{x}$.
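The gradient extraction step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the operator $[-1/2, 0, 1/2]$ is applied as a correlation (so the horizontal gradient at a pixel is half the difference between its right and left neighbors), it assumes replicated borders, and the helper names `gradients` and `network_input` are hypothetical.

```python
import numpy as np


def gradients(img: np.ndarray):
    """Central-difference gradients with the operator [-1/2, 0, 1/2].

    Assumptions: the operator is applied as a correlation and the image
    border is replicated before filtering.
    """
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    gh = 0.5 * (p[1:-1, 2:] - p[1:-1, :-2])   # horizontal gradient
    gv = 0.5 * (p[2:, 1:-1] - p[:-2, 1:-1])   # vertical gradient
    return gh, gv


def network_input(y_up: np.ndarray) -> np.ndarray:
    """Stack the upsampled image and its two gradient maps as channels."""
    gh, gv = gradients(y_up)
    return np.stack([y_up.astype(np.float64), gh, gv], axis=0)
```

For a linear ramp image the interior gradients are constant, which gives a quick sanity check of the sign and scale conventions.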

Gradient-Learning Network
As stated before, the gradient branch aims to learn the mapping:

$$\tilde{G}_x = P(G_{y^u}). \tag{5}$$

Due to the high-frequency nature of image gradients, $P$ in Equation (5) is actually a high-frequency to high-frequency mapping function. During image degradation, the high-frequency components are corrupted and become more unstable than the low-frequency components. Thus, most existing SR methods learn a low-frequency to high-frequency mapping instead. In this paper, we stabilize the learning process using the upsampled image $y^u$. In contrast to existing works that learn the gradient-mapping operation $P(\cdot)$ from the upsampled LR gradient $G_{y^u}$ to the HR gradient $\tilde{G}_x$, we propose to learn the mapping from the upsampled LR image to the HR gradient. The learning of the HR gradients then becomes:

$$\tilde{G}_x = P(y^u). \tag{6}$$

Similar to [49,50], we can transpose the vertical gradients so that the vertical and horizontal gradients share the weights in the training process. Learning the vertical and horizontal gradients in one network makes it possible to exploit the correlation between them.
Residual structures exhibit excellent performance in computer vision problems, from low-level to high-level tasks. As shown in Figure 2, the gradients $G^h_{y^u}$, $G^v_{y^u}$ and the gradients $G^h_x$, $G^v_x$ are similar in value. Thus, it is efficient to let the network learn only the difference between them, i.e., the residual between the HR gradients and the gradients of the upsampled LR image. The gradient-learning network is a cascade of convolutional layers, where $H_i$ is the output of the $i$-th layer, $W_{H_i}$ and $B_{H_i}$ are the filter and the bias of that layer, ReLU denotes the rectified linear unit and $M$ denotes the index of the final layer. In other words, the proposed network maps the low-frequency features to the high-frequency residual features.
The proposed network has 20 convolutional layers. All the convolutional layers except the first and the last layers are followed by a ReLU layer. We simply pad zeros around the boundaries before applying convolution to keep the size of all feature maps the same as the input of each level. We add a batch norm (BN) layer after the ReLU layer. We initialize the network using the method of He et al. [51].

Training Loss Function
The final layer of the end-to-end gradient-learning network is the loss layer. Given a set of HR images $\{x_i\}$ and the corresponding LR images $\{y_i\}$, we upsample the LR images $\{y_i\}$ by bicubic interpolation to obtain $\{y^u_i\}$ and extract the horizontal and vertical gradients. Usually, the Mean Square Error (MSE) is adopted for training the network to guarantee a high PSNR (Peak Signal to Noise Ratio) of the output HR image. However, natural image gradients are known to exhibit a heavy-tailed distribution. For this reason, statistical gradient priors such as total variation (TV) adopt the Laplacian distribution and the $L_1$ norm in the regularization term. Motivated by this, the $L_1$ norm is adopted for training the gradients to impose sparsity on them. The network is trained by minimizing the total $L_1$ loss between the predicted and the ground-truth gradients over all the parameters of the network, where $P(\cdot)$ denotes the gradient predictor.

Further Study of The Gradient Prior Learning Network
As shown above, one of the key components of the proposed multi-frame SR scheme is the gradient prediction branch. It regularizes the recovered high-frequency information to be closer to that of natural images. As gradients reveal the local variation of the image intensity, we choose to learn the mapping function from the upsampled LR image to the gradient residual of the HR image in this paper. We intend to add the reliable low-frequency information from the LR image to stabilize the learning. In this section, some experiments are conducted to support this design.
The straightforward strategy mentioned above is to learn the mapping function between the upsampled LR gradient and the HR gradient residual. The two networks, shown in Figure 4, are referred to as Scheme#1 and Scheme#2. Except for the different network input data, the network structure and the output of the networks are identical. To evaluate the learning ability of the networks, we define the gradient mean square error (GMSE) between the predicted gradients $\tilde{G}$ and the ground truth $G$ to measure the horizontal and vertical gradient prediction accuracy. We test the models on four commonly used test datasets and the average GMSE results are shown in Table 1.
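A gradient-domain error of this kind can be computed as in the sketch below. The exact normalization of the GMSE is not reproduced in the text above, so the form used here (the average of the per-pixel mean squared errors of the horizontal and vertical gradient maps) is an assumption, and the function name `gmse` is hypothetical.

```python
import numpy as np


def gmse(pred_h, pred_v, gt_h, gt_v) -> float:
    """Gradient mean square error over both gradient directions.

    Assumption: the two directional errors are weighted equally and each
    is averaged over all pixels.
    """
    err_h = np.mean((np.asarray(pred_h, dtype=np.float64) - gt_h) ** 2)
    err_v = np.mean((np.asarray(pred_v, dtype=np.float64) - gt_v) ** 2)
    return float(0.5 * (err_h + err_v))
```

A lower value indicates that the predicted gradient maps are closer to the ground-truth gradients.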
We can see that learning the gradient prior directly from the low-resolution gradients is hard for the network as only the high-frequency information is given to the network. On the contrary, learning from the intensity image itself is simpler as it can provide the low-frequency information as well.

Robust Super-Resolution Reconstruction from Multiple Frames
In the literature, most works model $n_k$ as signal-independent Gaussian noise [20,32]. The Gaussian model converges to the mean estimation and is not robust to data outliers caused by brightness inconsistency and occlusions. In this paper, we model the data errors from the reference frames as Laplacian instead. Therefore, the first term in Equation (2) can be formulated as an $L_1$ norm, which converges to the median estimation [20]. Up to an additive constant,

$$-\log \Pr(y_k \mid x) = \frac{1}{\sigma_k}\,\| y_k - D H F(u_k, v_k)\, x \|_1,$$

where $\sigma_k$ denotes the noise/error variance. Thus, the optimization problem (2) can be generally reformulated as:

$$\tilde{x} = \arg\min_{x} \sum_{k=-M}^{M} \frac{1}{\sigma_k}\,\| y_k - D H F(u_k, v_k)\, x \|_1 + \lambda\, \Upsilon(x),$$

where $\Upsilon(x)$ represents the regularization term and $\lambda$ is the regularization parameter. In past decades, many image prior models have been proposed. The most prominent approach in this line is total variation (TV) regularization [34], which describes the piecewise smooth structures in images well. From a statistical point of view, TV assumes a zero-mean Laplacian distribution as the statistical model for the image gradient at all pixel locations. However, using zero as the mean prediction for the gradients of all pixels is misleading, as natural images are typically non-stationary, especially in the edge and texture regions. Although the original HR image $x$ is not available, we can obtain a good estimate of the gradient mean from the LR frames using the learned gradient-mapping network. In other words, we want the gradient field of the reconstructed HR image to be close to $\tilde{G}$. Thus, $\Upsilon(x)$ can be formulated as:

$$\Upsilon(x) = \sum_{i} \| \nabla_i x - \tilde{G}_i \|_p^p.$$

Here, $\nabla_i$ denotes the discrete gradient operator at pixel location $i$ and $\tilde{G}_i$ is the expectation of the gradient at location $i$. In this paper, we assume that the gradients follow a non-zero-mean Laplacian distribution and set $p = 1$.
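The difference between the two noise models can be seen on a toy example. The sketch below is purely illustrative (the sample values are made up and the function names are hypothetical): it fuses five observations of one pixel, one of which is an outlier such as an occluded sample, and compares the minimizer of the squared-error cost (the mean) with the minimizer of the absolute-error cost (the median).

```python
import numpy as np


def l2_estimate(samples) -> float:
    """Minimizer of sum_k (s_k - x)^2, i.e., the mean."""
    return float(np.mean(samples))


def l1_estimate(samples) -> float:
    """Minimizer of sum_k |s_k - x|, i.e., the median."""
    return float(np.median(samples))


def l1_cost(x: float, samples) -> float:
    """Absolute-error data cost for a candidate value x."""
    return float(np.sum(np.abs(np.asarray(samples) - x)))


# Four consistent observations and one outlier (e.g., an occluded pixel).
samples = np.array([10.0, 10.2, 9.9, 10.1, 50.0])
mean_value = l2_estimate(samples)     # pulled toward the outlier
median_value = l1_estimate(samples)   # stays with the consistent samples

# Brute-force check that the median minimizes the L1 cost on a fine grid.
grid = np.linspace(0.0, 60.0, 6001)
best_on_grid = float(grid[np.argmin([l1_cost(g, samples) for g in grid])])
```

The Gaussian (squared-error) estimate is dragged far from the consistent samples by a single outlier, while the Laplacian (absolute-error) estimate is not, which is the motivation for the $L_1$ data term.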

Displacement Estimation for the Warping Operator
One of the key problems of multi-frame video super-resolution is to construct the warping matrix $F(u_k, v_k)$. To obtain $F(u_k, v_k)$, we need to align the reference LR frames $\{y_k\}_{k \neq 0}$ to the current HR frame $x$. In the literature, various motion estimation/registration techniques have been proposed. For the multi-frame video SR problem, the sub-pixel motion should be accurately estimated. Optical flow estimation, which concerns dense and sub-pixel matching between frames, has been studied for decades. However, different from standard optical flow estimation at a single scale, we intend to estimate the motion between the LR scale and the HR scale. The displacement between the LR and the HR image can be estimated by minimizing an energy function composed of a fidelity term and a regularization term, where $[a_L, b_L]$ denotes the horizontal and vertical coordinates of the LR frame, $[a_H, b_H]$ denotes the coordinates of the HR frame and $\lambda$ is a regularization parameter. The fidelity term measures the matching error between $x$ and $y_k$. As optical flow estimation between different scales is an ill-posed problem, prior knowledge of the displacement field should be imposed to regularize the flow field. Assuming local motion consistency, the widely used TV model is adopted to penalize the deviation of the flow field in the two directions while preserving the motion discontinuities. To reduce the computational cost, we adopt a simple approximation using the interpolated flow field on the LR frames. To better deal with outliers and occlusions, we use the $L_1$ norm in this paper, and the forward and backward flows are expected to be opposite to each other. In practice, we compute the forward and backward flows separately and then fuse them to obtain the final flow estimation, where $w_f$ and $w_b$ are weight matrices computed from div(·), the divergence of the optical flow field, which measures the occlusion for each pixel.
Finally, the high-resolution optical flow field is obtained by interpolating the low-resolution optical flow field via a simple interpolation method (e.g., bicubic). Figure 5 shows the optical flow fields estimated by different schemes. We can see that the flow field produced by our method is better estimated, especially around the motion boundary. Once we obtain the optical flow fields, the warping matrix can be constructed. In this paper, we use the bilinear interpolation kernel to estimate the sub-pixel values of the reference HR frame $x$. Concretely, the warping matrix $F(u_k^f, v_k^f)$ is a sparse matrix whose non-zero entries are the kernel weights $\omega_j^i$. Here $j$ denotes an integer location of $x$ and $i$ is the sub-pixel location computed from the flow field $(u_k^f, v_k^f)$. The kernel weight $\omega_j^i$ is determined by the distance between $i$ and $j$: as in standard bilinear interpolation, it decreases linearly as the distance grows.
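Applying such a warping matrix to an image is equivalent to sampling the image at flow-displaced positions with bilinear weights. The sketch below shows this matrix-free form; it is an illustration under assumptions, not the authors' code: `u` and `v` are taken to be horizontal and vertical displacements in pixels, sampling positions are clamped to the image border, and the function name `warp_bilinear` is hypothetical.

```python
import numpy as np


def warp_bilinear(x: np.ndarray, u, v) -> np.ndarray:
    """Sample x at (row + v, col + u) with bilinear weights.

    Each output pixel is a weighted sum of the four integer neighbors of
    its sub-pixel source location; a weight shrinks linearly as the
    neighbor gets farther away. Positions are clamped to the image.
    """
    h, w = x.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    r = np.clip(rows + v, 0, h - 1)
    c = np.clip(cols + u, 0, w - 1)
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    fr = r - r0                                   # fractional row offset
    fc = c - c0                                   # fractional column offset
    top = (1 - fc) * x[r0, c0] + fc * x[r0, c1]
    bottom = (1 - fc) * x[r1, c0] + fc * x[r1, c1]
    return (1 - fr) * top + fr * bottom
```

Because bilinear interpolation reproduces linear functions exactly, warping a ramp image by a constant sub-pixel flow shifts its values by a predictable constant, which is a convenient correctness check.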

Robust SR Reconstruction
Given the estimated HR gradient field $\tilde{G}$ and the warping matrix $F$, we study in this section how to reconstruct the HR image from the multiple LR frames. The HR frame is estimated by solving the Bayesian-based optimization problem (24). Here we assume that the noise on the current frame is Gaussian. As described previously, the errors caused by noise, outliers and occlusions in the reference frames are modeled as Laplacian noise. We simultaneously estimate the HR frame and the noise/error variance in an overall framework. Although the $L_1$ norm is robust to outliers to some extent, we add a weight matrix $W_k$ inside the $L_1$ norm to further exclude unreliable reference frame data from the reconstruction. Figure 6 illustrates the effectiveness of the proposed weighting strategy. It can be seen that adding $W_k$ to the framework excludes the contribution of the outlier data from the reconstructed HR image. To solve the optimization problem (24), we replace the $L_1$ norm with the Generalized Charbonnier (GC) function $(x^2 + \epsilon^2)^{\alpha}$ with $\alpha = 0.55$ as an approximation. The objective function can then be solved efficiently by alternately updating the HR frame and the noise level via the gradient descent algorithm, where the penalty $\left((W_k(y_k - DHF_k x))^2 + \epsilon^2\right)^{0.55}$ is now applied pixel-wise.
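The Generalized Charbonnier penalty and the derivative needed by a gradient descent step can be written as follows. This is a minimal sketch: the function names are hypothetical, the value `EPS = 1e-3` is an assumed smoothing constant, and only the scalar penalty is shown rather than the full reconstruction objective.

```python
import numpy as np

ALPHA = 0.55   # exponent stated in the text
EPS = 1e-3     # assumed value of the small smoothing constant


def charbonnier(r, alpha: float = ALPHA, eps: float = EPS):
    """Generalized Charbonnier penalty (r^2 + eps^2)^alpha, element-wise."""
    r = np.asarray(r, dtype=np.float64)
    return (r ** 2 + eps ** 2) ** alpha


def charbonnier_grad(r, alpha: float = ALPHA, eps: float = EPS):
    """Derivative of the penalty: 2 * alpha * r * (r^2 + eps^2)^(alpha - 1)."""
    r = np.asarray(r, dtype=np.float64)
    return 2.0 * alpha * r * (r ** 2 + eps ** 2) ** (alpha - 1.0)
```

Unlike the absolute value, this penalty is differentiable at zero, so a plain gradient descent update can be applied to every pixel of the residual; for residuals much larger than the smoothing constant it behaves like $|r|^{1.1}$, which is close to the $L_1$ norm.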

Experimental Results
In this section, experiments are conducted to evaluate the proposed method. Color RGB frames are converted to YCbCr color space and the proposed method is applied only on the luminance component. Bicubic interpolation is used for the other components. Both visual quality and quantitative quality comparisons are used for evaluation.
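The luminance extraction can be sketched as below. The text does not state which YCbCr variant is used, so this sketch assumes the ITU-R BT.601 "studio swing" conversion (the convention used by MATLAB's `rgb2ycbcr`), with RGB values in [0, 1] and luminance in [16, 235]; the function name `rgb_to_luma` is hypothetical.

```python
import numpy as np


def rgb_to_luma(rgb) -> np.ndarray:
    """Y channel of YCbCr for RGB values in [0, 1].

    Assumption: ITU-R BT.601 coefficients with a 16..235 output range.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    coeffs = np.array([65.481, 128.553, 24.966])
    return 16.0 + rgb @ coeffs
```

Under this assumption, super-resolution would be applied to the array returned by `rgb_to_luma`, while the two chroma channels would simply be upscaled by bicubic interpolation.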

Experimental Settings
In our experiments, we focus on upscaling the input LR frames by a factor of 4, which is usually the most challenging case in super-resolution. Two commonly used degradation models are evaluated in this paper: (1) the LR frames are generated by first applying a Gaussian kernel with standard deviation 1.4 to the original image and then down-sampling; (2) the LR frames are generated by down-sampling using the Matlab function imresize with the bicubic kernel. In our implementation, the frame number $M$ is set to 15. In other words, we fuse 30 reference LR frames with the current LR frame to reconstruct one HR frame. For the estimation of the optical flow, $\beta$ is set to 0.3 and $h$ is set to 0.18. $\lambda$ is set to 0.0002. We set the maximum outer iteration number to 8 and the maximum inner iteration number to 15. $\epsilon$ is set to 0.001. In the SR reconstruction process, the step size of the gradient descent algorithm is set to 0.03 to achieve good results.

Training Details
We use 91 images from Yang et al. [6] and 200 images from the training set of the Berkeley Segmentation Dataset as our training data. The validation data are 19 images from Set5 and Set14. For the network training, data augmentation is first conducted on the training dataset by (1) flipping the images horizontally and vertically and (2) rotating the images by 90°, 180° and 270°. Thus, eight different versions are obtained for every image. Training images are split into patches of size 48 × 48. We use the ADAM optimizer to train our model and set $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. The training mini-batch size is set to 32. The learning rate is initialized as $10^{-4}$ and decreased by a factor of 10 every 10 epochs. The proposed network is trained with the MatConvNet package on a PC with an NVIDIA GTX1080Ti GPU, 64 GB memory and an Intel Core i7 CPU.
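The augmentation and the learning-rate schedule can be sketched as follows. This is an illustration under assumptions rather than the authors' training code: the eight versions are enumerated as the four rotations by multiples of 90° with and without a horizontal flip (a vertical flip is the same as a horizontal flip combined with a 180° rotation), and the function names are hypothetical.

```python
import numpy as np


def eight_versions(img: np.ndarray) -> list:
    """Return the eight flip/rotation variants of an image."""
    versions = []
    for k in range(4):
        rotated = np.rot90(img, k)            # rotate by k * 90 degrees
        versions.append(rotated)
        versions.append(np.fliplr(rotated))   # same rotation, mirrored
    return versions


def learning_rate(epoch: int, base: float = 1e-4, drop_every: int = 10) -> float:
    """Step schedule: divide the learning rate by 10 every `drop_every` epochs."""
    return base * (0.1 ** (epoch // drop_every))
```

Enumerating the variants this way avoids generating duplicates, which would happen if flips and rotations were combined independently.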

Comparisons with the State-of-the-Art Methods
In this section, both quantitative and qualitative results are given. We test our method on seven videos: calendar (720 × 576), city (704 × 576), foliage (720 × 480), walk (720 × 480), jvc009001 (960 × 540), jvc004001 (960 × 540) and AMVTG004 (960 × 540), each of which contains 31 frames. We compare our method with several recent image and video SR methods: SRCNN [8], VDSR [13], Bayesian [23], Draft [27] and LTD [29] on the seven test sequences, which include both deep learning-based methods and non-deep learning-based methods. For SRCNN [8], VDSR [13] and Draft [27], we use the models provided by the authors to generate the corresponding results, respectively. For Bayesian [23] and LTD [29], the source code is not available. The results of calendar, city, foliage and walk are downloaded from the authors' websites. We use the re-implementation provided by Ma et al. [52] to generate the rest of the test videos for Bayesian [23]. Only the center frames (# 15) of each video sequence are reported in the paper. For fair comparison, we crop the image boundary pixels before evaluation.
In Figures 7-10, visual results are shown to compare our SR method with other video SR methods. Details of the output HR images are given for better illustration. We can see that our method produces more visually pleasant texture regions with fewer artifacts around the edge regions. In Figure 7, only our method reconstructs the letters and digits clearly and with fewer artifacts. In Figure 9, most of the textures are smoothed out by the compared methods. In contrast, our method can reconstruct more textures. Although the outputs of the Bayesian [23] method look sharper than ours, they suffer from severe visual artifacts. In Figure 10, the proposed method reconstructs better edge and texture regions than the state-of-the-art single image SR methods. The most challenging video, city, is shown in Figure 8. We can see that all the compared methods fail to recover the details of the building, while our method recovers most of the textures.
To evaluate the quantitative quality, PSNR and SSIM are adopted here. PSNR and SSIM [53] results on the seven tested videos are respectively reported in Tables 2 and 3, with the best results highlighted in bold. B and G refer to the bicubic kernel and Gaussian + downsample kernel, respectively. It can be seen that the proposed method achieves the highest PSNR and SSIM among the compared SR algorithms over almost all the benchmark videos. Our SR network significantly outperforms the state-of-the-art methods Bayesian [23], Draft [27] and LTD [29], especially on the video city. Specifically, for video calendar, the proposed method obtains 1.96 dB, 1.08 dB, 0.80 dB, 0.27 dB, 0.28 dB and 0.41 dB PSNR gains over bicubic, SRCNN [8], VDSR [13], Bayesian [23], Draft [27] and LTD [29]. For video city, the proposed method obtains 1.72 dB, 1.31 dB, 1.11 dB, 1.36 dB, 0.47 dB and 0.53 dB PSNR gains over bicubic, SRCNN [8], VDSR [13], Bayesian [23], Draft [27] and LTD [29]. For video jvc009001, the proposed method obtains 2.95 dB, 1.63 dB and 1.21 dB PSNR gains over bicubic, SRCNN [8], VDSR [13].
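For reference, PSNR on the luminance channel can be computed as in the sketch below. It assumes 8-bit images with a peak value of 255; the optional `border` parameter is a hypothetical stand-in for the boundary cropping mentioned above, whose width is not specified in the text.

```python
import numpy as np


def psnr(ref, est, peak: float = 255.0, border: int = 0) -> float:
    """Peak signal-to-noise ratio in dB between two images.

    Assumptions: `peak` is the maximum possible pixel value and `border`
    pixels are cropped from every side before the comparison.
    """
    ref = np.asarray(ref, dtype=np.float64)
    est = np.asarray(est, dtype=np.float64)
    if border > 0:
        ref = ref[border:-border, border:-border]
        est = est[border:-border, border:-border]
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```

A PSNR gain of 1 dB corresponds to a reduction of the mean squared error by a factor of about 1.26.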

Comparisons on Running Time
In this section, the computation time of the SR algorithms is evaluated. The experiments are conducted with Matlab R2016b on a PC with an Intel Core i7 3.6 GHz CPU, 16 GB memory and a GTX 760Ti GPU. Our scheme mainly includes two parts: (1) motion field estimation; (2) HR image reconstruction. In Table 4, the running times of the compared SR methods on the video calendar with a scaling factor of 4 are listed. Please note that VDSR [13] and the gradient network of our method run on the GPU. The rest of the compared methods and the reconstruction branch of the proposed method run on the CPU. As illustrated, the running time of the proposed method is relatively low compared with other multi-frame video super-resolution methods. The proposed method can be implemented in C code to further accelerate it.

Figure 9. Super-resolution results of "walk" with scaling factor of x4. (a) Bicubic, (b) SRCNN [8], (c) VDSR [13], (d) Bayesian [23], (e) Draft [27], (f) LTD [29], (g) proposed method and (h) the groundtruth. Please enlarge the figure for better comparison.

Ablation Study
The proposed SR scheme (24) in Section 4.2 contains two components: the learned gradient prior $\tilde{G}_i$ and the robustness weights $W_k$. To verify the effectiveness of the two components, we compare the proposed scheme with its variants on the 7 test videos. The comparison results are listed in Table 5. Base refers to our full baseline. Base-1 refers to Base without $\tilde{G}_i$. Base-2 refers to Base without $W_k$. We can see that the full baseline achieves the best SR performance.

Conclusions
This paper presents a robust multi-frame video super-resolution scheme. A deep gradient-mapping network is trained to learn the horizontal and vertical gradients from an external dataset. The learned gradients are then used to assist the reconstruction of the HR image. Instead of directly learning the mapping from the LR gradients to the HR gradients, we add the low-frequency information to the input of the network to stabilize the gradient learning and boost the performance. The HR reconstruction branch takes the LR frames as input, which provide the complementary information for the high-resolution frame. In the fusion stage, the learned gradient field regularizes the reconstructed HR image to be close to natural images. Experimental results show that our method outperforms many state-of-the-art methods by a large margin on many benchmark datasets. For future work, apart from the current frame, we could also use the reference frames to learn the HR gradients, as the reference frames contain complementary information to the current frame. Furthermore, deep learning-based optical flow algorithms can be considered to better deal with occlusion and fast-moving scenes, which our work cannot handle very well. The code is available at https://github.com/KevinLuckyPKU/VSR.