DSTnet: Deformable Spatio-Temporal Convolutional Residual Network for Video Super-Resolution

Abstract: Video super-resolution (VSR) aims to generate high-resolution (HR) video frames with plausible and temporally consistent details from their low-resolution (LR) counterparts and neighboring frames. The key challenge for VSR lies in effectively exploiting the intra-frame spatial relations and the temporal dependencies between consecutive frames. Many existing techniques use spatial and temporal information separately and compensate for motion via alignment. These methods cannot fully exploit the spatio-temporal information that significantly affects the quality of the resulting HR videos. In this work, a novel deformable spatio-temporal convolutional residual network (DSTnet) is proposed to overcome the issues of separate motion estimation and compensation methods for VSR. The proposed framework consists of 3D convolutional residual blocks decomposed into spatial and temporal (2+1)D streams. This decomposition can simultaneously use the spatial and temporal features of the input video without a separate motion estimation and compensation module. Furthermore, deformable convolution layers are used in the proposed model to enhance its motion-awareness capability. Our contribution is twofold. First, the proposed approach can overcome the challenges in modeling complex motions by efficiently using spatio-temporal information. Second, the proposed model has fewer parameters to learn than state-of-the-art methods, making it a computationally lean and efficient framework for VSR. Experiments are conducted on the benchmark Vid4 dataset to evaluate the efficacy of the proposed approach. The results demonstrate that the proposed approach achieves superior quantitative and qualitative performance compared to state-of-the-art methods.


Introduction
In recent years, image and video super-resolution have attracted a lot of attention due to their wide range of applications, including, but not limited to, medical image reconstruction, remote sensing, panorama video super-resolution, UAV surveillance, and high-definition television (HDTV) [1][2][3]. Because video is one of the most often used forms of multimedia in our everyday lives, super-resolution of low-resolution videos has become critical. In general, image super-resolution (ISR) algorithms process a single image at a time. In contrast, video super-resolution algorithms handle many successive video frames at a time to reconstruct the target HR frame using the relationship between the frames. Video super-resolution (VSR) may be considered a subset of image super-resolution since it can be processed frame by frame by image super-resolution methods. However, performance is not always satisfactory due to the artefacts and jams that may be introduced, resulting in unreliable temporal coherence across frames [4].
Earlier VSR methods estimated motion with multiple separate optical flow algorithms [5] and enabled an end-to-end trainable process for VSR. However, this made VSR methods highly dependent on externally estimated optical flow between frames. [...] based on the regression method that further improved the mapping between LR and HR patches and its accuracy. However, all these methods required extensive pre- and post-processing to reconstruct the HR image. On the other hand, learning-based methods used low- and high-resolution exemplar pairs to learn the mapping function. The sparse-coding-based method [15] is an example of the learning-based methods. Earlier learning-based SR methods extracted patches from input LR images to learn a representation and reconstructed LR images in the same fashion. Hence, in all the aforementioned methods, patches were the focus of optimization. Different techniques were used to extract LR patches before the main SR process and to aggregate the HR output, which led to considerable computational overhead. With the introduction of deep learning, features can be learned directly from the raw data, which eliminates the need for any hard-coded pre- or post-processing of the input data. These deep representation-based approaches are generally called end-to-end learning-based methods. In the subsequent sections, state-of-the-art deep learning-based methods for image and video SR are discussed.

Single Image Super-Resolution (SSIR)
The single image super-resolution (SSIR) takes a single LR image as input to reconstruct the HR image. In this direction, Dong et al. proposed SRCNN [16], a CNN-based architecture that learns the non-linear mapping between LR and HR images for image super-resolution, and reported superior results. Then, Kim et al. [17] proposed a CNN-based architecture with increased network depth and residual learning as an extension of SRCNN. Later, Tai et al. [18] introduced a CNN-based network with residual connections and reported better efficiency. The major limitation of these methods was a pre-processing step that up-scaled the LR images to the desired output size before training the model, which increased computational complexity and altered the details of the input LR frames. Shi et al. [19] avoided this problem by using an efficient sub-pixel CNN layer instead of a deconvolution layer to upscale the LR feature map to HR at the end of the network, and achieved better results than the previous methods for SR. In the NTIRE 2017 challenge, Timofte et al. [20] provided a large dataset of diverse 2K-resolution images (DIV2K). This dataset enabled researchers to develop deeper and more effective models for SR, such as EDSR [21], RCAN [22], and RDN [23].

Video Super-Resolution (VSR)
VSR methods generally divide the video super-resolution task into three steps: feature extraction and motion compensation, feature fusion, and SR reconstruction. In this direction, Kappeler et al. [24] extended SRCNN [16] to VSR and used multiple consecutive frames as input to predict HR frames. Later, Wang et al. [25] used optical flow information for inter-frame motion compensation and temporal alignment and reported a better temporal alignment method. Liao et al. [5] used two different classical optical flow algorithms for motion compensation and then used a CNN-based model to reconstruct SR video frames. Later, Liu et al. [26] proposed an improved optical-flow alignment method that generated HR frames in temporal scales by a temporal adaptive method. Caballero et al. [27] proposed an end-to-end efficient sub-pixel convolutional neural network for video (VESPCN), composed of three modules: a spatial transformer for motion compensation and feature extraction, feature fusion, and HR reconstruction. Tao et al. [28] emphasized the importance of accurate inter-frame alignment and motion compensation for VSR. They used a sub-pixel motion compensation (SPMC) layer in their method to achieve motion compensation and super-resolution simultaneously. All these methods revealed that precise optical flow prediction is crucial for VSR: errors in the optical flow computation or in the image-level warping operation can introduce artefacts in the resulting super-resolved video.
Alternative techniques were proposed for VSR to capture temporal relations without explicit motion compensation. For example, Huang et al. [29] proposed bidirectional recurrent convolutional networks to capture long-term spatio-temporal relations between frames for VSR. Another method, frame-recurrent video super-resolution (FRVSR) [30], recurrently used two deep CNN models, taking the previously estimated HR feature values as input to reconstruct subsequent HR frames. Some other deep learning-based methods, such as feed-forward networks [31] and generative adversarial networks (GANs) [32], were also proposed. Although these methods improved visual quality, they are slower than many CNN-based VSR methods. To overcome this issue, Zhang et al. [33] used pixel correlations extracted by compression algorithms to exploit dense representation of the network; by transferring the SR result between adjacent frames, they accelerated the VSR process by almost 15 times with little performance loss. Another method, proposed by Xue et al. [6], was a turning point for optical flow-based VSR methods. It concluded that traditional optical flow is not an ideal motion representation for video restoration tasks, including VSR. To circumvent the problem, Jo et al. [34] proposed an implicit motion compensation model that generated dynamic up-sampling filters and an HR residual image using each pixel's local spatio-temporal neighborhood. Tian et al. [8] proposed a temporally deformable alignment network (TDAN) to avoid explicit motion compensation problems using a one-stage temporal alignment process at the feature level. This method used features from both the reference frame and neighboring frames to learn deformable convolution kernels, and then applied these learned kernels to perform the frame alignment. More recently, Bao et al. [35] proposed an end-to-end trainable motion estimation and compensation network that combines kernel-based and flow-based methods for frame interpolation. They designed an adaptive warping layer that integrates both estimated optical flow and interpolation kernels to synthesize target HR frame pixels, and achieved encouraging results for video enhancement tasks including VSR.

Deformable Convolution-Based Methods
Dai et al. [36] proposed the deformable convolution (Dconv), which enhanced the capability of traditional CNN-based methods to learn geometric transformations. Deformable convolution adds learned offsets to the sampling grid of a regular convolution kernel, which enables it to gather information from beyond its local neighborhood. Deformable convolutions are widely used in high-level computer vision tasks, such as action recognition and semantic video segmentation [37]. For example, Zhang et al. [38] proposed a deep deformable 3D convolutional neural network for the task of gesture recognition that not only achieved excellent accuracy but also met the demands of real-time processing. Tian et al. [8] were the first to introduce a deformable convolution-based method for VSR and successfully achieved frame alignment without explicitly computing optical flow. They reported superior results compared to state-of-the-art VSR methods. Wang et al. [7] also used an enhanced deformable convolution network for video restoration tasks, including VSR. Their proposed architecture consists of two modules: (1) a pyramid and cascading alignment module based on TDAN [8] and (2) a temporal and spatial attention-based fusion module. Furthermore, deformable convolution layers have been integrated with the convolutional LSTM [39] and the recurrent convolutional network [40], which enhanced the performance of VSR methods. Recently, López-Tapia et al. [41] proposed gated recurrent neural networks for VSR that incorporate some of the key components of a gated recurrent unit and deformable convolution.

3D Convolution-Based Methods
The most straightforward way to learn spatio-temporal information from the input video sequence is to employ 3D convolution (Conv3D). Furthermore, because the LR video and the desired HR video are highly similar, residual connections are widely used in VSR methods. Li et al. [11] used residual connections and proposed a model termed the fast spatio-temporal residual network (FSTRN) for VSR, which uses factorized Conv3D to learn spatio-temporal features. This method used spatial and temporal kernels in different layers and effectively reduced computational complexity at training time. Other methods, such as [11,42], used C3D [12] as their backbone architecture. Similarly, some other methods, such as [21,43], gained much success in image SR tasks by efficiently using ResNet [44] as a backbone architecture. However, these techniques have not been fully explored and utilized for VSR [11].

Methodology
In this section, the details of the proposed architecture are presented. As shown in Figure 1, the proposed architecture consists of four modules: (1) spatio-temporal convolutional residual blocks (resST), (2) deformable spatio-temporal convolutional residual blocks (Deformable resST), (3) feature fusion, and (4) SR reconstruction. Let us denote the input and output of the proposed DSTnet function $F_{DSTnet}$ by $X_{LR}$ and $Y_{HR}$. First, $N$ LR frames $X_{LR}$ are fed to a spatio-temporal (2+1)D convolutional block with a convolution kernel of size 3 × 3 × 3 to extract features, which can be expressed as:
$$P^{L}_{0} = F_{(2+1)D}(X_{LR})$$
where $F_{(2+1)D}$ represents the (2+1)D convolution function used to obtain the initial feature map $P^{L}_{0}$. This feature map is later used as input to the spatio-temporal convolutional (2+1)D residual (resST) blocks to learn in-depth spatio-temporal features. Assume $K$ resST blocks are used; the first residual block takes $P^{L}_{0}$ as input. The next resST block further learns features from the previous residual block's output, and so forth; more details about the resST blocks are presented in Section 3.1. The output of the $K$th resST block, $P^{L}_{K}$, can then be obtained by:
$$P^{L}_{K} = F_{resST,K}(P^{L}_{K-1})$$
where $F_{resST,K}$ denotes the operation of the $K$th resST block and $P^{L}_{K-1}$ is the output of the preceding block. $P^{L}_{K}$ is further used as input to the special deformable spatio-temporal convolutional residual blocks. These blocks are designed to enhance the proposed network's ability to learn complex motion. In this module, each 2D spatial convolution layer of the resST blocks is replaced with a deformable convolution layer [38]. More details are presented in Section 3.2. Assume $D$ such residual blocks are used in the model; the output of the $D$th block is denoted by $P^{L}_{D}$. To combine the estimated $P^{L}_{D}$ across time and space, the temporal fusion module is used; details of the module are presented in Section 3.3. The fused deep feature map is denoted $P^{L}_{F}$ in Figure 1 and is used as input to the SR reconstruction module. The output SR feature map is denoted by $P^{L}_{SR}$. Finally, the output of the network is composed of the SR mapping from LR space, termed $P^{L}_{SR}$, produced by the SR reconstruction module, and the mapping of the reference LR frame into HR space, termed $P^{H}_{SR}$. The estimated HR frame $Y_{HR}$ is obtained by combining $P^{L}_{SR}$ and $P^{H}_{SR}$ through a global residual connection. The overall output of DSTnet is as follows:
$$Y_{HR} = F_{DSTnet}(X_{LR})$$
where $F_{DSTnet}$ represents the overall operations performed by the proposed DSTnet to reconstruct the HR video frames $Y_{HR}$.

Spatio-Temporal Convolutional Residual Blocks
Residual blocks have gained much success in computer vision tasks by ensuring excellent performance [43,45]. Lim et al. [21] proposed modified residual blocks for SR by removing the batch-normalization layer from the residual blocks, as shown in Figure 2b. To apply residual neural networks to videos, 2D convolutions are generally replaced with 3D convolutions to utilize the spatial and temporal relations between frames. However, Li et al. [11] proposed decomposing the 3D convolution into a 2D convolution followed by a 1D convolution for the VSR task, as shown in Figure 2a. This section presents details of the proposed spatio-temporal convolutional residual block (resST), shown in Figure 2c. In the proposed residual block, the N 3D convolution filters of dimension t × s × s are replaced by a (2+1)D convolutional block consisting of M spatial convolution filters of dimension 1 × s × s followed by N temporal filters of dimension t × 1 × 1, where M determines the dimensionality of the intermediate subspace between the two filter banks. The value of M is calculated in a way similar to [46]. This ensures that the learnable parameters of the (2+1)D block do not exceed those required for a 3D convolution kernel. The proposed residual module consists of two (2+1)D convolutional blocks with ReLU as the activation function. Additionally, the same-padding strategy is adopted in every spatio-temporal convolutional block to avoid dimensionality reduction. The use of (2+1)D convolutional blocks has two major advantages over 3D convolution. First, it can model more complex functions owing to the additional non-linearity between the 2D and 1D convolutions in each layer. Second, it is easier to optimize and gives superior performance at training and test time without additional computation.
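To make the parameter argument concrete, the following minimal sketch compares the weight count of a full 3D convolution with that of a (2+1)D factorization. It assumes the rule commonly used for (2+1)D factorization, in which the spatial convolution produces M intermediate channels and M is the largest integer that keeps the factorized block within the 3D budget; the text only states that M is computed "in a way similar to [46]", so the exact formula, the function names, and the 64-channel setting in the usage lines are illustrative assumptions, not the authors' implementation.

```python
def conv3d_params(n_in, n_out, t, s):
    """Weights in n_out full 3D filters of size t x s x s (biases ignored)."""
    return n_out * n_in * t * s * s


def mid_channels(n_in, n_out, t, s):
    """Intermediate width M (assumed rule): largest M such that the
    factorized block has no more weights than the 3D convolution."""
    return (t * s * s * n_in * n_out) // (s * s * n_in + t * n_out)


def conv2plus1d_params(n_in, n_out, t, s):
    """Weights in M spatial 1 x s x s filters followed by n_out temporal
    t x 1 x 1 filters (biases ignored)."""
    m = mid_channels(n_in, n_out, t, s)
    spatial = m * n_in * s * s   # n_in -> M channels, 1 x s x s kernels
    temporal = n_out * m * t     # M -> n_out channels, t x 1 x 1 kernels
    return spatial + temporal


if __name__ == "__main__":
    n_in, n_out, t, s = 64, 64, 3, 3
    print("M =", mid_channels(n_in, n_out, t, s))
    print("3D weights:    ", conv3d_params(n_in, n_out, t, s))
    print("(2+1)D weights:", conv2plus1d_params(n_in, n_out, t, s))
```

Because M is obtained by integer division of the 3D budget by the per-channel cost of the factorized block, the (2+1)D weight count can never exceed the 3D one under this rule.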

Deformable Spatio-Temporal Convolutional Residual Block
Dai et al. [36] proposed a deformable convolution (Dconv) that achieved much success in the field of computer vision. Let us consider a simple convolution operation $Y$ with stride 1, summarized as follows:
$$Y(P_0) = \sum_{P_k \in \mathcal{R}} w(P_k) \, x(P_0 + P_k)$$
where $P_0$ represents a location in the output feature map, $P_k$ enumerates the locations of the convolution sampling grid $\mathcal{R}$, $x$ is the input feature map, and $w$ holds the kernel weights. As the equation above shows, convolution is a weighted summation of sampled input features using a convolution kernel for a fixed location $P_0$. In contrast, a deformable convolution kernel can augment the sampling grid by learning an additional offset $\Delta P_k$ for each sampling location. Thus, it can enlarge the spatial receptive field. Deformable convolution has shown superiority in many high-level computer vision tasks. Inspired by this success, it was recently used for temporal frame alignment in the state-of-the-art VSR method [8].
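The difference between regular and deformable sampling can be illustrated with a small, framework-free sketch. In the formulation commonly given for deformable convolution, each grid location is shifted by a learned offset, $Y(P_0) = \sum_{P_k} w(P_k)\, x(P_0 + P_k + \Delta P_k)$, and because the shifted location is generally fractional, the input is read with bilinear interpolation. The code below evaluates this sum at a single output location of a single-channel feature map with a 3 × 3 kernel. The offsets are passed in by hand here, whereas in a network they would be predicted by an additional convolution; all function names are illustrative and not taken from the paper.

```python
import math


def bilinear(feature, y, x):
    """Sample a 2D list `feature` at fractional (y, x); outside is zero."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value


def deformable_conv_at(feature, kernel, p0, offsets):
    """Output at location p0 of a 3x3 deformable convolution.

    kernel:  3x3 weights w(P_k)
    offsets: 3x3 grid of (dy, dx) pairs, one offset per grid location P_k
    """
    y0, x0 = p0
    out = 0.0
    for i, gy in enumerate((-1, 0, 1)):
        for j, gx in enumerate((-1, 0, 1)):
            dy, dx = offsets[i][j]
            out += kernel[i][j] * bilinear(feature, y0 + gy + dy, x0 + gx + dx)
    return out


if __name__ == "__main__":
    feature = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    ones = [[1.0] * 3 for _ in range(3)]
    zero = [[(0.0, 0.0)] * 3 for _ in range(3)]
    shift = [[(0.0, 0.5)] * 3 for _ in range(3)]
    print(deformable_conv_at(feature, ones, (1, 1), zero))
    print(deformable_conv_at(feature, ones, (1, 1), shift))
```

With all offsets set to zero the function reduces to an ordinary 3 × 3 convolution at that location, which is a convenient sanity check for any deformable implementation.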
The proposed variant of 3D convolution can learn and model spatio-temporal relations simultaneously. However, convolutional neural networks (CNNs) have an inherent limitation in learning complex geometric transformations and motion. To overcome this issue, deformable convolution is used in this work to enhance the model's capability to estimate and compensate for complex motion between input LR frames. In the proposed network, deformable convolution can easily be integrated into every spatio-temporal convolutional block of each resST block. However, it is evident from the literature that deformable convolution requires high-level semantic information to perform best [38]. Integrating it in every layer, especially in the early layers of the network, would only add computational complexity in the form of the learned offsets for each deformable convolution. Hence, to keep the model architecture robust and the training process efficient, deformable convolution layers are used only in the high-level layers of the network (i.e., towards the end of the network).

Temporal Fusion
The main objective of this module is to combine, along the temporal dimension, the spatial and temporal features in the feature maps produced by the residual blocks. In the proposed feature fusion module, a 2D convolution layer is used as a bottleneck layer to fuse the residual blocks' output feature maps temporally. The fused feature maps are then passed to 2D convolutional residual blocks that further refine the fused features.

SR Reconstruction
The SR reconstruction module is used to obtain the estimated super-resolved video in HR space after efficiently extracting deep features in LR space. In this module, the sub-pixel convolution layer proposed by Shi et al. [19] is used for HR frame reconstruction. The module consists of simple convolution layers followed by a rearrangement of pixels with the pixel-shuffle operation. This upscales the feature map to the desired resolution using values from all learned feature maps.
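The pixel-shuffle rearrangement can be written in a few lines of plain Python. The sketch below takes a feature map with C·r² channels of size H × W and rearranges it into C channels of size rH × rW, so that every output pixel is read from one of the learned LR feature maps. The channel ordering follows the convention used by common deep-learning frameworks, which is an assumption here since the paper does not spell it out; the function name and nested-list layout are illustrative only.

```python
def pixel_shuffle(features, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    c_in, h, w = len(features), len(features[0]), len(features[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for x in range(w * r):
                # Sub-pixel position (y % r, x % r) selects the source channel.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = features[src][y // r][x // r]
    return out


if __name__ == "__main__":
    # Four 1x1 channels become one 2x2 channel for an upscaling factor of 2.
    lr_features = [[[0]], [[1]], [[2]], [[3]]]
    print(pixel_shuffle(lr_features, 2))
```

For a ×4 upscaling factor and a single-channel (luminance) output, the last convolution before the shuffle would therefore need 16 output channels under this convention.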

Dataset
Vimeo-90K [6] and Vid4 [47] are two benchmark datasets explicitly developed for video super-resolution. Vimeo-90K comprises 89,800 video clips, each presenting different content, diverse scenes, and motions. These videos were extracted from a popular video-sharing website, vimeo.com. Vid4 comprises four video sequences: city, calendar, foliage, and walk. Each video sequence has a different motion pattern and is recorded at a different resolution under different circumstances.
The Vimeo-90K dataset is comparatively more challenging because of the diverse camera and object motion recorded in high-quality videos under different circumstances. In this research work, the Vimeo-90K dataset is used to train the proposed model, and the Vid4 benchmark dataset is used to evaluate the performance of the proposed VSR method. This strategy is in line with state-of-the-art methods and ensures a fair comparison.

Experiment Details
Video super-resolution is a supervised learning task that aims to infer the degradation function between the input LR video and the corresponding HR ground truth. Following other VSR methods, we use MATLAB's imresize function as the degradation method to generate LR video frames from the HR frames of both datasets. Hence, the final training dataset has 89,900 pairs of HR frames and their corresponding LR frames. For the evaluation of the proposed VSR method, the Vid4 dataset is used.
All experiments are conducted using PyTorch [48] on an NVIDIA 2080Ti GPU with a batch size of 32. During training, Adam [49] is used as the optimization algorithm with β1 = 0.9 and β2 = 0.999. The initial learning rate was set to 1 × 10⁻⁴ and halved after every 30 epochs. LReLU [50] is the default activation function; in some modules of DSTnet, ReLU is selected instead, motivated by experimental results. Mean square error (MSE) is the loss function used to train the proposed network. A sequence of 7 frames is used as input to the proposed model. Following previous methods [6,26], we consider only the luminance (Y) channel of the input frames in the YCbCr color space during training.
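Two of the training settings above lend themselves to a short worked sketch: the step-wise learning-rate schedule and the extraction of the luminance channel. The code assumes a base rate of 1e-4 that is halved once every 30 epochs, and it uses the ITU-R BT.601 studio-range luma formula (the one applied by MATLAB's rgb2ycbcr) for the Y channel. The paper does not state which RGB-to-YCbCr variant was used, so the coefficients and both function names are assumptions made for illustration.

```python
def learning_rate(epoch, base_lr=1e-4, step=30, factor=0.5):
    """Step schedule: multiply the rate by `factor` after every `step` epochs."""
    return base_lr * factor ** (epoch // step)


def rgb_to_y(r, g, b):
    """BT.601 studio-range luma for RGB values in [0, 1]; result in [16, 235].

    Assumed conversion; the paper only says the Y channel of YCbCr is used.
    """
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b


if __name__ == "__main__":
    for epoch in (0, 29, 30, 60, 90):
        print(epoch, learning_rate(epoch))
    print(rgb_to_y(0.0, 0.0, 0.0), rgb_to_y(1.0, 1.0, 1.0))
```

In a PyTorch training loop the same schedule is usually obtained from a built-in step scheduler rather than computed by hand; the explicit function is shown only to make the arithmetic visible.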
If not specified otherwise, each convolution layer has 64 filters, and the kernel size is set to 3 × 3 × 3. The proposed model consists of three spatio-temporal convolutional residual blocks followed by five residual blocks with integrated deformable convolution. Further details of the proposed network's modules, the output shape of each module, and their order in the model architecture are given in Table 1. For quantitative comparison of the proposed method with existing methods, both PSNR and SSIM [51] are used as evaluation metrics.
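For reference, PSNR follows directly from the mean square error that also serves as the training loss: $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where MAX is the peak pixel value. The minimal sketch below computes it for two single-channel frames stored as nested lists; the peak value of 255 and the function names are assumptions for illustration, and SSIM is omitted because it requires a windowed computation that is better taken from an established library.

```python
import math


def mse(frame_a, frame_b):
    """Mean square error between two equally sized 2D frames."""
    total, count = 0.0, 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def psnr(frame_a, frame_b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(frame_a, frame_b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / err)


if __name__ == "__main__":
    ground_truth = [[100.0, 120.0], [140.0, 160.0]]
    estimate = [[101.0, 121.0], [141.0, 161.0]]
    print(round(psnr(ground_truth, estimate), 2))
```

A per-video score is typically the average of the per-frame values, computed on the Y channel when that is the channel the model was trained on.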

Results
The proposed method is compared with several single image super-resolution methods and VSR methods, including VESPCN [27], RCAN [22], VSRnet [24], SOF-VSR [25], BRCN [29], DBPN [31], VSRResNet [32], TOFlow [6], and TDAN [8], on the benchmark VSR dataset Vid4 [47]. Quantitative results are shown in Tables 2 and 3; following TDAN [8], the first and last two frames of each video are not considered in the evaluation, for a fair comparison. Additionally, note that most of the aforementioned methods are trained on different datasets, and the comparison is made on the results reported in their respective publications. However, the same training setup as reported in TOFlow [6] and TDAN [8] is used in this work. The proposed method achieved the highest Structural Similarity Index Measure (SSIM) value on the benchmark dataset and a comparable PSNR. Qualitative results for all four videos of the Vid4 dataset are shown in Figures 3-6. It can be observed that the proposed method produces more visually appealing results, with less blur and fewer motion artefacts around objects. As shown in Figure 3, the name is more legible in the results generated by the proposed method. In Figure 4, the texture and building details are reconstructed only by the proposed model. Similarly, in Figures 5 and 6, the texture and scene details of the roof of the car and of the bags, respectively, are preserved by the proposed method, while all other methods produced overly smoothed and distorted results. In conclusion, the proposed approach outperforms the state-of-the-art VSR methods in terms of PSNR and SSIM on the benchmark dataset. The residual connections and spatio-temporal convolutional blocks have played an important role in learning deep generalized representations. DSTnet can also fully utilize and model spatio-temporal relations, with the ability to model complex motion, using the novel deformable convolution module for super-resolution of video clips with dynamic scenes and motion.
As illustrated in the results, the proposed approach outperforms state-of-the-art VSR approaches on the benchmark dataset in terms of structural similarity (SSIM) and PSNR, with fewer parameters. Furthermore, our model has 3.91 million learnable parameters, which determine the model's size. Table 4 lists the number of learnable parameters of several VSR networks. It shows that the proposed model is small compared with the networks that have leading VSR performance, namely TOFlow [6], RDN [23], and RCAN [22], with the exception of TDAN [8]. This indicates that the proposed method is computationally efficient owing to its small model size. Even though it is lightweight, the proposed model is deep in comparison with other deep-learning-based models that use 3D convolution for VSR, and it still achieves encouraging VSR performance and outperforms the state-of-the-art methods, as shown in Tables 2 and 3.
[Figure: ... [24], (c) SRCNN [16], (d) VESPCN [27], (e) proposed, and (f) ground truth on "walk" from Vid4 with a scaling factor of ×4.]

Conclusions and Future Work
In this paper, a novel deformable spatio-temporal convolutional residual network (DSTnet) is proposed for video super-resolution. The method consists of spatio-temporal (2+1)D convolutional residual blocks with deformable convolution layers that simultaneously utilize spatial and temporal information. Experiments confirm that DSTnet can effectively capture and model complex motion between frames and outperform state-of-the-art methods on the benchmark Vid4 dataset. The proposed method is evaluated using two well-known and widely used metrics for VSR methods, i.e., SSIM and PSNR. It achieves an SSIM of 0.795 and a PSNR of 26.39 dB, which are higher than those of state-of-the-art VSR methods. Moreover, the proposed method has fewer parameters to learn during training, which makes it computationally lean. As a future research direction, we would like to extend this method to handle various complex motions by improving its feature-fusion module.

Conflicts of Interest:
The authors declare no conflict of interest.