Bidirectional Temporal-Recurrent Propagation Networks for Video Super-Resolution

Recently, convolutional neural networks have achieved remarkable performance in video super-resolution. However, how to exploit the spatial and temporal information of video efficiently and effectively remains challenging. In this work, we design a bidirectional temporal-recurrent propagation unit, which allows temporal information to flow from frame to frame in an RNN-like manner and avoids complex motion estimation and motion compensation modeling. To better fuse the information of the two temporal-recurrent propagation units, we use a channel attention mechanism. Additionally, we recommend a progressive up-sampling method instead of one-step up-sampling, and find that progressive up-sampling yields better experimental results than one-step up-sampling. Extensive experiments show that our algorithm outperforms several recent state-of-the-art video super-resolution (VSR) methods with a smaller model size.


Introduction
Super-resolution (SR) is a class of image processing techniques that generates a high-resolution (HR) image or video from its corresponding low-resolution (LR) image or video. SR is widely used in various fields, such as surveillance imaging [1], medical imaging [2], and satellite imaging [3]. With the improvement of display technology, video super-resolution (VSR) becomes increasingly important for LR video.
Recently, neural networks have made remarkable achievements in single-image super-resolution (SISR) [4][5][6][7][8]. One way to perform VSR is to run SISR frame by frame. However, SISR methods do not consider the inter-frame temporal relationship, so the output HR videos usually lack temporal consistency, which results in flickering artifacts [9]. Most existing VSR methods [10][11][12][13][14] consist of similar steps: motion estimation and compensation, feature fusion, and up-sampling. They usually use optical flow to estimate the motion between the reference frame and supporting frames, and then align all other frames to the reference with warping operations. Therefore, the results of these methods depend heavily on the accuracy of optical flow estimation. Inaccurate motion estimation and alignment may introduce artifacts around image structures in the aligned supporting frames. Furthermore, computing the optical flow for every pixel between frames consumes substantial computational resources.
To alleviate the above issues, we propose an end-to-end bidirectional temporal-recurrent propagation network (BTRPN), built on a bidirectional temporal-recurrent propagation unit (BTRP unit). The BTRP unit can implicitly utilize motion information without explicit estimation and alignment, so the reconstructed HR video frames suffer fewer artifacts from inaccurate motion estimation and alignment. In addition, instead of taking multiple consecutive video frames as input for every output frame, the network propagates temporal information recurrently from frame to frame. Our contributions are summarized as follows:
1. We propose a novel end-to-end bidirectional temporal-recurrent propagation network, which avoids the complicated combination of optical flow estimation and super-resolution networks. To better integrate the two subnetworks, we adopt a channel attention mechanism to fuse the extracted temporal and spatial information.
2. We propose a progressive up-sampling version of BTRPN. Compared to one-step up-sampling, progressive up-sampling solves the SR optimization problem in a smaller solution space, which decreases the difficulty of learning and boosts the quality of the reconstructed images.

Video Super-Resolution
Temporal alignment, either explicit or implicit, plays an essential role in the performance of VSR. Previous explicit methods, such as [10], split temporal alignment into two stages: they compute optical flow in the first stage and perform motion compensation in the second stage. VESPCN [31] is the first end-to-end VSR network that jointly trains optical flow estimation and spatial-temporal networks. SPMC [11] proposed a new sub-pixel motion compensation (SPMC) layer, which can simultaneously achieve sub-pixel motion compensation and resolution enhancement. FRVSR [14] introduced a frame-recurrent structure for video super-resolution reconstruction, which avoids the redundant repeated processing of the same frame found in some multiple-input VSR methods and improves the computing efficiency of the network. Reference [12] achieved temporal alignment through a proposed task-oriented flow (ToFlow), which yields better VSR results than fixed flow algorithms. However, all these methods rely on the accuracy of optical flow estimation. At present, even state-of-the-art optical flow algorithms cannot easily obtain sufficiently high-quality motion estimation. Even with accurate motion fields, the image warping used for motion compensation will produce artifacts around image structures in the LR frames, which may affect the final reconstructed HR frames. Our proposed BTRPN performs an implicit temporal alignment without depending on optical flow, which alleviates the issues of optical flow based methods.
Recently, some implicit algorithms have been proposed. Reference [32] exploited a 3D convolution-based residual network for VSR instead of explicit motion alignment. Reference [19] proposed a dynamic filter network for VSR. References [30,33] utilized deformable convolution to perform temporal alignment. These methods use implicit temporal alignment and avoid the issues of optical flow. However, they all use seven consecutive input frames to predict an intermediate frame, which leads to huge training costs.
The work most related to ours is FRVSR [14], which also uses a frame-recurrent structure. However, in [14], optical flow is used for explicit motion estimation, which may introduce artifacts around image structures. In addition, our BTRPN uses a bidirectional structure, which ensures full utilization of temporal information. Compared to [14], our method achieves better VSR results with a smaller network.

Network Architecture
The overall network framework is shown in Figure 1. The two subnetworks of BTRPN have exactly the same structure and weights, and they process the forward and reverse LR input video on the time axis, respectively, thus allowing bidirectional temporal information flow. In the TRP unit, progressive up-sampling is adopted to avoid a large one-step scale mapping. At the end of BTRPN is a fusion module with a channel attention mechanism, through which the features of the two subnetworks are combined and the reconstructed video frames are output.

TRP Unit
The TRP unit is illustrated in Figure 2. The input of the TRP unit is a concatenation of three parts: the consecutive video frames X_{t−1:t+1} (with the current frame in the middle), the temporal state of the previous moment S_{t−1}, and the Space-to-Depth result of the reconstructed output from the previous moment y_{t−1}. The output of the TRP unit is the temporal state of the current moment S_t and the SR result of the current frame y_t.
The TRP unit is composed of two branches, which output the temporal state S_t and the SR reconstruction result y_t, respectively. The two branches share a feature extraction module, which consists of multiple convolutional layers each followed by a Rectified Linear Unit (ReLU) activation layer. The ReLU activations make convergence much faster while still preserving good image quality. The branch that outputs y_t can be regarded as a residual network with r^2 output channels. The output of the residual network is up-sampled through Depth to Space to obtain the reconstructed frame y_t. Space to Depth is the inverse of the Depth to Space operation proposed by FRVSR [14]. It is illustrated in Figure 3.
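As a concrete illustration, the Space-to-Depth and Depth-to-Space rearrangements can be sketched in PyTorch as follows (a minimal sketch; the scale factor r and tensor shapes here are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def space_to_depth(x, r):
    """Rearrange (N, C, H, W) into (N, C*r^2, H/r, W/r) by moving each
    r x r spatial block into the channel dimension (inverse of pixel_shuffle)."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // r, r, w // r, r)        # split H and W into r-blocks
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()  # move block offsets next to C
    return x.view(n, c * r * r, h // r, w // r)

def depth_to_space(x, r):
    """Rearrange (N, C*r^2, H, W) into (N, C, H*r, W*r); this is exactly
    PyTorch's pixel_shuffle."""
    return F.pixel_shuffle(x, r)

# Round trip: Space to Depth followed by Depth to Space recovers the input.
x = torch.randn(1, 3, 8, 8)
y = space_to_depth(x, 2)   # shape (1, 12, 4, 4)
z = depth_to_space(y, 2)   # shape (1, 3, 8, 8), equal to x
```

This also explains why the residual branch has r^2 output channels: Depth to Space converts r^2 feature channels into one channel at r-times larger spatial resolution.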

Bidirectional Network
We found that the VSR results obtained from forward and reverse sequence inputs are different, as Table 1 shows. We set up a small FRVSR [14] network, FRVSR 3-64: the optical flow estimation is obtained by stacking three residual blocks, and the number of channels in the hidden layer is 64. After FRVSR 3-64 was trained to convergence, we tested the model on the VID4 dataset and recorded the results. Then we reversed the four videos of the VID4 dataset along the timeline and tested the model on the reversed VID4 dataset. From Table 1, we can see the difference between the forward and reverse processing of the same video: a unilateral information flow can provide only limited temporal information. So we designed a bidirectional network to fully extract inter-frame temporal information. As shown in Figure 4, the network consists of two subnetworks, which take as input the video on the forward and backward time axis, respectively. The two subnetworks are then combined by a fusion module to obtain the final output. The two subnetworks are identical in structure and parameters, distinguished only by the timeline order of the input video.
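The bidirectional scheme can be sketched as follows; `subnet` is a hypothetical stand-in for one shared recurrent subnetwork, and the fusion step is reduced here to a per-frame concatenation:

```python
import torch

def bidirectional_features(frames, subnet):
    """Run one shared recurrent subnetwork over a clip in both temporal
    directions and return per-frame feature pairs for the fusion module.

    frames: list of (N, C, H, W) tensors ordered in time.
    subnet: callable(frame, state) -> (feature, new_state); the same weights
            serve both passes, only the input order differs.
    """
    def run(seq):
        feats, state = [], None
        for f in seq:
            feat, state = subnet(f, state)
            feats.append(feat)
        return feats

    forward = run(frames)
    backward = run(frames[::-1])[::-1]   # re-reverse so indices line up
    # Concatenate forward/backward features per frame for the fusion module.
    return [torch.cat([f, b], dim=1) for f, b in zip(forward, backward)]
```

Because the two passes share weights, the bidirectional structure adds temporal context from both directions without doubling the parameter count.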

Attentional Mechanism
We use a channel attention fusion module [15], shown in Figure 5, to fuse features from the two subnetworks. The attention mechanism [15] has been applied to SISR and has improved SR performance. Attention can be viewed as guidance that biases the allocation of available processing resources towards the most informative components of an input. In our network, the output features of the two subnetworks can be regarded as sets of vectors based on local priors. Different channels of these feature vectors carry different information and contribute differently to the SR result. Adding channel attention to the fusion module in Figure 6 helps the network adaptively rescale channel features so that it can focus on the more informative ones and obtain better SR results. In the fusion module, first, the outputs of the two subnetworks are concatenated and passed through two 2D convolutional layers with 3 × 3 kernels, after which the features are scaled by the attention mechanism. Second, the scaled features are added to the original input pixel-wise. Third, the result of the second step is channel-compressed through a convolutional layer with a 1 × 1 kernel.
Finally, the channel-compressed result is passed through Depth to Space to obtain the final SR result. The whole process can be formulated as:
I_SR = D2S( W_3^{1×1} ∗ ( CA( W_2^{3×3} ∗ ReLU( W_1^{3×3} ∗ F_input ) ) + F_input ) )
In the formula, F_input and I_SR represent, respectively, the input feature vectors of the fusion module and the final reconstruction result. CA(·) represents the channel attention mechanism, ReLU(·) the ReLU nonlinear activation unit, and D2S(·) Depth to Space. W represents a convolution weight matrix; subscripts 1, 2 and 3 index the three convolutional layers from shallow to deep in the fusion module, and the superscript gives the size of the convolution kernel.
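The fusion module described above can be sketched as a PyTorch module (a minimal sketch under assumed channel counts; the channel attention follows the usual squeeze-and-excitation form of [15]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """SE-style channel attention: global pooling, bottleneck, sigmoid gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)      # rescale each channel by its weight

class FusionModule(nn.Module):
    """Fuse concatenated subnetwork features, then apply Depth to Space."""
    def __init__(self, channels, scale):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)   # W_1, 3x3
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)   # W_2, 3x3
        self.ca = ChannelAttention(channels)
        self.compress = nn.Conv2d(channels, scale * scale, 1)      # W_3, 1x1
        self.scale = scale

    def forward(self, f_input):
        f = self.conv2(F.relu(self.conv1(f_input)))
        f = self.ca(f) + f_input                  # attention-scaled residual add
        f = self.compress(f)                      # channel compression
        return F.pixel_shuffle(f, self.scale)     # Depth to Space -> SR frame
```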

Progressive Up-Sampling
For image SR, a one-step mapping at a large scale factor means that the optimization is carried out in a more extensive solution space than at small scale factors, which increases the difficulty of model learning and degrades the final image. The network design should therefore avoid a one-time large-scale mapping and instead use multiple small-scale (2×) mappings as much as possible. We thus propose a progressive improved TRP version, as Figure 7 shows. For 4× enhancement, we use two 2× TRP units, level 1 and level 2. TRP Unit level 1 is the same as Section 3.2 describes. The input of TRP Unit level 2 is the consecutive video frames X_{t−1:t+1} after interpolation, the temporal state of the previous moment S_{t−1} after interpolation, and the Space-to-Depth result of the 2× SR output of TRP Unit level 1. The final output is the 4× SR result y_t.
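The two-level cascade can be sketched as below; `unit1` and `unit2` are hypothetical stand-ins for the level-1 and level-2 TRP units, each mapping its input and temporal state to a 2× SR output and an updated state:

```python
import torch
import torch.nn.functional as F

def progressive_4x(lr_frames, state, unit1, unit2):
    """Cascade two 2x stages instead of one 4x mapping.

    lr_frames:   (N, C, H, W) concatenated consecutive LR frames.
    state:       temporal state tensor from the previous time step.
    unit1/unit2: callables (input, state) -> (2x SR output, new state).
    """
    # Level 1: 2x SR at the original LR resolution.
    sr_2x, state1 = unit1(lr_frames, state)
    # Level 2 operates at 2x resolution, so its inputs are interpolated.
    frames_up = F.interpolate(lr_frames, scale_factor=2,
                              mode='bilinear', align_corners=False)
    state_up = F.interpolate(state, scale_factor=2,
                             mode='bilinear', align_corners=False)
    sr_4x, state2 = unit2(torch.cat([frames_up, sr_2x], dim=1), state_up)
    return sr_4x, (state1, state2)
```

Each stage only needs to learn a 2× mapping, so the optimization at every level runs in a smaller solution space than a direct 4× mapping.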

Datasets and Training Details
We train all the networks using videos from the REDS [34] dataset proposed by NTIRE2019. REDS consists of 300 high-quality (720p) clips: 240 training clips, 30 validation clips, and 30 testing clips (each with 100 consecutive frames). We use the training and validation clips as our training dataset (270 clips). Due to storage limitations of our device, the raw REDS data need to be compressed and randomly sampled before training. Under the 4× VSR task, we first downscale all the original 1280 × 720 videos to 960 × 540 as the training HR videos. The videos are then down-sampled by a factor of 4 to obtain the 240 × 135 input LR videos. The image resize function in Matlab (imresize) performs these sampling operations. Furthermore, of the original 100 consecutive video frames per clip, we use only the first 30 frames for training to further save space.
We set the batch size to 8 with 128 × 128 HR patches. We set the learning rate to 1 × 10^−4 and decrease it by a factor of 10 every 200 K iterations, for a total of 400 K iterations. We initialize all the weights with Xavier initialization. For all activation units following the convolutional layers, we use ReLU. We use Adam [35] with a momentum of 0.9 and a weight decay of 1 × 10^−4 for optimization. Following DUF-VSR [19], we use the Huber loss [36] as the loss function for training BTRPN:
L_δ(x) = 0.5 x^2, if |x| ≤ δ; δ(|x| − 0.5δ), otherwise,
where x is the residual between the reconstructed and ground-truth pixels. During training, δ is set to 0.01. All experiments are performed using Python 3.7 and PyTorch 1.1.0 on a 2.1 GHz CPU and an NVIDIA 1080Ti GPU. All tensors involved in training and testing are interpolated with the bilinear interpolation function provided by PyTorch. Following mainstream practice, training and testing are conducted only on the Y channel of the YCbCr space, and PSNR and SSIM are calculated only on the Y channel.
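The loss above can be written in PyTorch as follows (a minimal sketch of the standard Huber form with the paper's δ = 0.01):

```python
import torch

def huber_loss(pred, target, delta=0.01):
    """Huber loss: quadratic for residuals up to delta, linear beyond it,
    so it is less sensitive to outliers than a plain L2 loss."""
    diff = torch.abs(pred - target)
    quadratic = 0.5 * diff ** 2
    linear = delta * (diff - 0.5 * delta)
    return torch.where(diff <= delta, quadratic, linear).mean()
```

The two branches meet smoothly at |x| = δ, which keeps gradients bounded for large residuals while remaining L2-like near zero.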

Depth and Channel Analysis
We construct multiple BTRPN networks with different depths and channel numbers: BTRPN10-64, BTRPN10-128, BTRPN20-64, and BTRPN20-128. Here 10/20 means the network has 10/20 convolutional layers, and 64/128 means each convolutional layer has 64/128 channels. Table 2 shows the performance of the four models for 4× VSR. We can see that BTRPN20-128 performs best. BTRPN20-64 has double the depth of BTRPN10-64, but its performance is not significantly improved. However, BTRPN10-128, with double the number of channels of BTRPN10-64, shows a significant performance improvement. This indicates that, for a shallow network, increasing the number of channels in each layer is more useful than deepening the network. Table 3 records the training time of the four models, and Table 4 records their test time. Figure 8 shows the convergence rates of the different models. Since the BTRPN network is not large, almost all BTRPN models converge at 250 K iterations.

Bidirectional Model Analysis
We test the bidirectional and unidirectional models on the forward and reversed VID4 datasets. To simplify the experiment and keep the experimental conditions of the control groups identical, we do not use the progressive up-sampling TRP unit in Table 5. BTRPN-5L consists of 5 convolutional layers, and its fusion module is simplified to a concatenation followed by convolutional layers. Since TRPN lacks a fusion module, we use seven convolutional layers (denoted TRPN-7L) to keep its parameter count consistent with BTRPN-5L; both models have around 1070 K parameters. Experiments in Table 5 show a large difference between the VSR results of the forward and reversed video sequences for the unidirectional TRPN-7L: the average PSNR difference on the VID4 dataset reaches 0.1 dB, and the maximum PSNR difference on a single video reaches 0.37 dB. However, the VSR results of the forward and reversed video sequences for the bidirectional BTRPN-5L are almost identical. Furthermore, the results of BTRPN-5L on the VID4 dataset are better than those of TRPN-7L in both the forward and reversed order. These results indicate that the bidirectional temporal network makes fuller use of temporal information and achieves better SR reconstruction.

Attention Mechanism
To demonstrate the effect of the attention mechanism, we process the outputs of the two subnetworks with a plain concatenation and with the channel attention fusion module, respectively. Experiments in Table 6 show that channel attention boosts the PSNR from 26.36 dB to 26.78 dB, indicating that channel attention can direct the network to focus on more informative features and improve network performance. We also test BTRPN networks using one-step up-sampling and progressive up-sampling. The models in Table 7 both contain ten convolutional layers and are of the same model size. Experiments show that the result of progressive up-sampling is better than that of one-step up-sampling, confirming that progressive up-sampling can indeed help the network achieve better SR performance.

Quantitive and Qualitative Comparison
We compare the proposed BTRPN20-128 (referred to as BTRPN in the following) with several state-of-the-art VSR algorithms on the VID4 dataset: VSRNet [10], VESPCN [31], DRVSR [11], Bayesian [17], B1,2,3+T [13], BRCN [37], SOF-VSR [38], FRVSR [14], DUF-16L [19], RBPN [18], RCAN [25]. Table 8 shows that our BTRPN network achieves the best average PSNR and the best average SSIM on the VID4 dataset. The qualitative results in Figures 9 and 10 also validate the superiority of the proposed method. In the city clip, BTRPN restores the clearest building edge lines, which reflects the strong ability of the BTRPN network with progressive up-sampling to reconstruct regular patterns. In the foliage clip, compared with other methods, BTRPN accurately captures the motion trajectory of the white car and achieves a good motion compensation effect, which again proves the effectiveness of BTRPN's temporal propagation mechanism.

Parameters and Test Time Comparison
We compare the parameters and test times of BTRPN and other networks. We take the calendar clip in the VID4 dataset, with 180 × 135 input images and 720 × 540 HR output images, to record the test time for 4× enlargement. Figure 11 shows that BTRPN achieves a good trade-off between model size and reconstruction quality: BTRPN delivers an excellent reconstruction at only one-third the size of the RBPN model. Compared with FRVSR, a method of the same frame-recurrent type, BTRPN also obtains better video reconstruction quality with a smaller network capacity. Table 9 shows that BTRPN has a distinct speed advantage over other methods.
Figure 11. Parameters of different models for 4× VSR on the VID4 dataset.

Conclusions
In this paper, we propose a novel bidirectional neural network that integrates temporal information between frames. To better fuse the two subnetworks of the bidirectional network, we use channel attention. We also find that progressive up-sampling outperforms one-step up-sampling. Extensive experiments on the VID4 dataset demonstrate the effectiveness of the proposed method.