BFRVSR: A Bidirectional Frame Recurrent Method for Video Super-Resolution

Abstract: Video super-resolution is a challenging task. One possible solution, called the sliding window method, tries to divide the generation of high-resolution video sequences into independent subtasks. Another popular method, named the recurrent algorithm, utilizes the generated high-resolution images of previous frames to generate the high-resolution image. However, both methods have some unavoidable disadvantages. The former method usually leads to bad temporal consistency and has higher computational cost, while the latter method cannot always make full use of information contained by optical flow or any other calculated features. Thus, more investigations need to be done to explore the balance between these two methods. In this work, a bidirectional frame recurrent video super-resolution method is proposed. To be specific, reverse training is proposed that also utilizes a generated high-resolution frame to help estimate the high-resolution version of the former frame. The bidirectional recurrent method guarantees temporal consistency and also makes full use of the adjacent information due to the bidirectional training operation, while the computational cost is acceptable. Experimental results demonstrate that the bidirectional super-resolution framework gives remarkable performance and it solves time-related problems.


Introduction
Video super-resolution, which solves the problem of reconstructing high-resolution images from low-resolution images, is a classic problem in image processing. It is widely used in security, entertainment, video transmission, and other fields [1][2][3]. As compared with single image super-resolution, video super-resolution can use more information to output better high-resolution images, such as the feature information of adjacent frames. However, the reconstruction of video super-resolution images is generally difficult because of various issues, such as occlusion, adjacent frame information utilization, and computational cost.
With the rise of deep learning, video super-resolution has received significant attention from the research community over the past few years. The sliding window method and recurrent method are two of the latest state-of-the-art methods based on deep learning. Specifically, the sliding window video super-resolution (SWVSR) method combines a batch of low-resolution images to reconstruct a single high-resolution frame and divides the video super-resolution task into multiple independent super-resolution subtasks [4]. Each input frame is processed several times, which wastes computation. In addition, each frame is generated in an independent subtask, which may reduce temporal consistency, resulting in flickering and artifacts. Unlike the SWVSR method, the recurrent video super-resolution (RVSR) method generates the current high-resolution image from the previous high-resolution image, the previous low-resolution image, and the current low-resolution image. However, the image quality of RVSR is correlated with time, as shown in Figure 1.
Figure 1. The recurrent video super-resolution (RVSR) method exhibits a correlation between time (frame number) and peak signal-to-noise ratio (PSNR), the metric used to evaluate image quality.
In our work, we propose an end-to-end trainable bidirectional frame recurrent video super-resolution (BFRVSR) framework to address the above issues. We adopt forward training and reverse training to solve the problem of insufficient utilization of information and to preserve temporal consistency, as shown in Figure 2. BFRVSR achieves a balance between RVSR and SWVSR: each input frame is processed no more than twice, while each output frame makes full use of the information contained in optical flow or any other calculated features. In addition, passing the previous high-resolution estimate directly to the next step helps the model to recreate fine details and produce temporally consistent videos. Our work on BFRVSR is available at https://github.com/IlikethisID/BFRVSR.
Our contributions are mainly reflected in the following: (a) we propose a bidirectional frame recurrent video super-resolution method that requires no pretraining step; and (b) we address the correlation between image quality and time and preserve temporal consistency.

Related Work
With the rise of deep learning, computer vision, including image super-resolution and video security [7][8][9], has received significant attention from the research community over the past few years.
Image super-resolution (ISR) is a classic ill-posed problem. Specifically, in most cases several possible output images correspond to one given input image; thus, the problem can be seen as the task of selecting the most appropriate one from all the possible outputs. The methods are divided into interpolation methods such as nearest, bilinear, and bicubic; dictionary learning [10,11]; example-based methods [12][13][14][15][16]; and self-similarity approaches [17][18][19][20]. We refer the reader to three review documents [21][22][23] for extensive overviews of prior work up to recent years.
The recent progress in deep learning, especially in convolutional neural networks, has shaken up the field of ISR. Single image super-resolution (SISR) and video super-resolution are two categories based on ISR.
SISR uses a single low-resolution image to estimate a high-resolution image. Dong et al. [24] introduced deep learning into the field of super-resolution. They imitated the classic super-resolution solution method and proposed three steps, i.e., feature extraction, feature fusion, and feature reconstruction, to complete the SISR. Then, K. Zhang et al. [25] reached state-of-the-art results with deep CNN networks. Many excellent results have since emerged [26][27][28][29][30]. In addition, the loss function also affects the result of image super-resolution; thus, some parallel efforts have studied the loss function [31][32][33].
Video super-resolution combines information from multiple low-resolution (LR) frames to reconstruct a single high-resolution frame. The sliding window method and recurrent method are two of the latest state-of-the-art methods.
The sliding window method divides the video super-resolution task into multiple independent subtasks, and each subtask generates a single high-resolution output frame from multiple low-resolution input frames [4,[34][35][36]. The input is $2N + 1$ adjacent low-resolution frames $\{I^{LR}_{t-N}, I^{LR}_{t-N+1}, \ldots, I^{LR}_t, \ldots, I^{LR}_{t+N-1}, I^{LR}_{t+N}\}$. Then, an alignment module is used to align $\{I^{LR}_{t-N}, I^{LR}_{t-N+1}, \ldots, I^{LR}_{t+N-1}, I^{LR}_{t+N}\}$ with $I^{LR}_t$. Finally, $I^{HR}_t$ is estimated from the aligned $2N + 1$ low-resolution frames. Drulea and Nedevschi et al. [29] used the optical flow method to align $I^{LR}_{t-1}$ and $I^{LR}_{t+1}$ with $I^{LR}_t$ and used them to estimate $I^{HR}_t$.
The recurrent method generates a high-resolution image from the previous high-resolution image, the previous low-resolution image, and the current low-resolution image. Huang et al. [37] used a bidirectional recurrent architecture but did not use any explicit motion compensation in their model. Recurrent structures are also used for other video tasks, such as deblurring [38] and stylization [39,40]. Kim et al. [38] and Chen et al. [39] passed the feature representation to the next step, and Gupta et al. [40], in parallel work, passed the previous output frame to the next step, generating temporally consistent stylized videos. Sajjadi et al. [6] proposed a recurrent algorithm for video super-resolution. The FRVSR [6] network estimates the optical flow $F^{LR}_{t \to t-1}$ between $I^{LR}_{t-1}$ and $I^{LR}_t$, uses $I^{HR}_{t-1}$ and $F^{LR}_{t \to t-1}$ to generate the warped estimate $\tilde{I}^{HR}_t$, and finally sends $\tilde{I}^{HR}_t$ and $I^{LR}_t$ to the network for reconstruction to obtain $I^{HR}_t$. However, FRVSR makes insufficient use of information, which leads to the correlation between image quality and time.

Methods
The framework of BFRVSR is shown in Figure 2. All network modules can be replaced. For example, the optical flow module can use an existing pretrained method instead of a network that is built and trained from scratch. A deformable convolution module [41] could also replace the optical flow module.
After presenting an overview of the BFRVSR framework in Section 3.1, we define the loss functions used for training in Section 3.2.

Bidirectional Frame Recurrent Video Super-Resolution (BFRVSR)
The proposed model is shown in Figure 3. The trainable modules are the optical flow estimation network, FlowNet, and the super-resolution network, SuperNet. The inputs of our model are the low-resolution image of the current frame $I^{LR}_t$, the low-resolution image of the previous frame $I^{LR}_{t-1}$, and the high-resolution estimate of the previous frame $I^{HR}_{t-1}$. The output of our model is the high-resolution estimate of the current frame $I^{HR}_t$.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 4 of 11
Figure 2. Overview of the proposed bidirectional frame recurrent video super-resolution (BFRVSR) framework. N frames are input as a group. Within the group, each step takes two low-resolution frames and one high-resolution frame as input and outputs one high-resolution frame. Forward estimation generates N high-resolution frames. Reverse estimation generates N − 1 high-resolution frames.
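The forward and reverse passes over one group of frames can be sketched as the loop below. This is a minimal illustration under stated assumptions, not the released implementation: `step` is a hypothetical callable standing in for one FlowNet plus SuperNet evaluation, `black` is the initial high-resolution estimate, and reusing the first frame as its own reference is an assumption made only so that the first step has two low-resolution inputs.

```python
def bidirectional_pass(lr_frames, step, black):
    """Run a forward and then a reverse recurrent pass over one group of LR frames.

    step(lr_cur, lr_ref, hr_ref) -> hr_cur is a hypothetical stand-in for one
    FlowNet + SuperNet evaluation; black is the initial HR estimate.
    Returns (forward, reverse): N forward estimates and N - 1 reverse estimates.
    """
    if not lr_frames:
        raise ValueError("need at least one frame")
    n = len(lr_frames)

    # Forward estimation: N steps, each reusing the previous HR estimate.
    # Assumption: the first frame serves as its own reference frame.
    forward = [step(lr_frames[0], lr_frames[0], black)]
    for t in range(1, n):
        forward.append(step(lr_frames[t], lr_frames[t - 1], forward[t - 1]))

    # Reverse estimation: N - 1 steps, from frame N - 2 back to frame 0,
    # each reusing the HR estimate of the following frame.
    reverse = [None] * (n - 1)
    hr_next = forward[-1]
    for t in range(n - 2, -1, -1):
        reverse[t] = step(lr_frames[t], lr_frames[t + 1], hr_next)
        hr_next = reverse[t]
    return forward, reverse
```

With N frames this makes 2N − 1 calls to `step`, and no frame is the current frame of more than two calls.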

Flow Estimation
The network structure of FlowNet is shown in Figure 4. First, the optical flow estimation module takes the low-resolution image of the previous frame $I^{LR}_{t-1}$ and the low-resolution image of the current frame $I^{LR}_t$ and estimates a low-resolution motion vector map $F^{LR}_{t \to t-1}$. Our FlowNet is similar to the one used in FRVSR [6].
$F^{LR}_{t \to t-1}$ gives the position information from the current frame to the previous frame.

Upscaling Flow
In this step, we process the low-resolution optical flow map that has been obtained, and we use bilinear interpolation with scaling factor s for upsampling to obtain the high-resolution optical flow map.
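The upscaling step can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the half-pixel-centre sampling convention and the optional multiplication of the flow vectors by s (so that displacements are expressed in high-resolution pixels) are assumptions, since the text only specifies bilinear interpolation with scaling factor s.

```python
def _lerp_axis(i_hr, s, size):
    """Map an HR index to (lower LR index, upper LR index, weight of the upper one)."""
    # Half-pixel-centre convention, clamped to the valid range (an assumption).
    pos = min(max((i_hr + 0.5) / s - 0.5, 0.0), size - 1.0)
    lo = int(pos)
    hi = min(lo + 1, size - 1)
    return lo, hi, pos - lo


def upscale_flow(flow, s, scale_vectors=True):
    """Bilinearly upsample flow[y][x] = (dx, dy) by an integer factor s."""
    h, w = len(flow), len(flow[0])
    # Assumption: displacements are measured in pixels, so they grow with s.
    gain = float(s) if scale_vectors else 1.0
    out = []
    for y_hr in range(h * s):
        y0, y1, wy = _lerp_axis(y_hr, s, h)
        row = []
        for x_hr in range(w * s):
            x0, x1, wx = _lerp_axis(x_hr, s, w)
            vec = []
            for c in range(2):  # c = 0 is dx, c = 1 is dy
                top = (1 - wx) * flow[y0][x0][c] + wx * flow[y0][x1][c]
                bot = (1 - wx) * flow[y1][x0][c] + wx * flow[y1][x1][c]
                vec.append(gain * ((1 - wy) * top + wy * bot))
            row.append(tuple(vec))
        out.append(row)
    return out
```

In a deep learning framework the same operation would normally be a single call to a built-in bilinear resize; the explicit loops are only meant to make the interpolation weights visible.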

Warping HR Image
We use the high-resolution optical flow map and the high-resolution image of the previous frame to estimate the high-resolution image of the current frame.
We implemented warping as a differentiable function using bilinear interpolation similar to Jaderberg et al. [42].
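A minimal sketch of such a bilinear warp is given below for a single-channel image stored as nested lists. It is an illustration rather than the paper's implementation: the backward-sampling convention (each output pixel reads the previous frame at its own position plus the flow vector) and the clamping at image borders are assumptions. The output is a piecewise-linear function of the flow through the interpolation weights, which is what allows gradients to pass through this step in an automatic differentiation framework.

```python
import math


def warp(prev, flow):
    """Backward-warp an image: out[y][x] = prev sampled at (x + dx, y + dy)."""
    h, w = len(prev), len(prev[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Clamp the sampling position to the image (border handling is an assumption).
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            wx, wy = sx - x0, sy - y0
            # Bilinear blend of the four neighbouring pixels.
            top = (1 - wx) * prev[y0][x0] + wx * prev[y0][x1]
            bot = (1 - wx) * prev[y1][x0] + wx * prev[y1][x1]
            out[y][x] = (1 - wy) * top + wy * bot
    return out
```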

Mapping to Low Resolution (LR) Space
We use the space-to-depth transformation to move the spatial information of the high-resolution image into the depth (channel) dimension of a low-resolution map.
Our method of mapping to LR space is similar to the method in FRVSR [6]. The operation is shown in Figure 5.
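The space-to-depth rearrangement can be illustrated with a few lines of Python. This sketch assumes a single-channel image stored as nested lists and a row-major ordering of the s × s block inside the depth dimension; the ordering is an assumption, and any fixed ordering works as long as it is used consistently.

```python
def space_to_depth(img, s):
    """Rearrange an (s*H) x (s*W) image into an H x W map with s*s depth values."""
    hs, ws = len(img), len(img[0])
    if hs % s or ws % s:
        raise ValueError("image size must be divisible by s")
    # Each s x s block of HR pixels becomes the depth vector of one LR position.
    return [[[img[y * s + i][x * s + j] for i in range(s) for j in range(s)]
             for x in range(ws // s)]
            for y in range(hs // s)]
```

No pixel values are discarded: a 4 × 4 image with s = 2 becomes a 2 × 2 map with 4 values per position.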

Figure 5. Space-to-depth module. Compress the spatial information of high-resolution images into low-resolution image depth information.

Super-Resolution
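In this step, the low-dimensional depth map of the estimated high-resolution image of the current frame, $H^{depth}_t$, and the low-resolution image of the current frame, $I^{LR}_t$, are sent to SuperNet to obtain the final high-resolution frame. The network structure of SuperNet is shown in Figure 6. In summary, the network chains five steps: flow estimation, flow upscaling, warping of the previous high-resolution estimate, mapping to LR space, and super-resolution.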
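The five steps can be written compactly as the chain below. This is a sketch reconstructed from the step descriptions in this section: FlowNet and SuperNet are the modules named in the text, while the operator names $\mathrm{UP}_s$ (bilinear upscaling by factor $s$), $\mathrm{WP}$ (warping), and $\mathrm{S2D}_s$ (space-to-depth) are shorthand introduced here for illustration.

```latex
\begin{aligned}
F^{LR}_{t \to t-1} &= \mathrm{FlowNet}\left(I^{LR}_{t-1},\, I^{LR}_{t}\right) \\
F^{HR}_{t \to t-1} &= \mathrm{UP}_s\left(F^{LR}_{t \to t-1}\right) \\
\tilde{I}^{HR}_{t} &= \mathrm{WP}\left(I^{HR}_{t-1},\, F^{HR}_{t \to t-1}\right) \\
H^{depth}_{t} &= \mathrm{S2D}_s\left(\tilde{I}^{HR}_{t}\right) \\
I^{HR}_{t} &= \mathrm{SuperNet}\left(H^{depth}_{t},\, I^{LR}_{t}\right)
\end{aligned}
```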

Loss Functions
In our network architecture, the optical flow estimation module and the super-resolution module are trainable; therefore, two loss functions are used during training to optimize the results.
The first loss function, $L_1$, is the error between the high-resolution image generated by the super-resolution module and the real image label $I^{label}_t$. Because the dataset does not have ground-truth optical flow, we use a method similar to FRVSR [6] for the second loss function, $L_2$: the spatial mean squared error computed on the warped LR input frame, which optimizes the optical flow estimation module. The loss function used for the final backpropagation in training is $L_{total} = L_1 + L_2$.
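One plausible form of the two losses, following the FRVSR [6] formulation that the text refers to, is sketched below. The squared $\ell_2$ norm for $L_1$ is an assumption (the text only speaks of an "error"), and $\mathrm{WP}$ is shorthand introduced here for the warping operation.

```latex
\begin{aligned}
L_1 &= \left\lVert I^{HR}_{t} - I^{label}_{t} \right\rVert_2^2 \\
L_2 &= \left\lVert \mathrm{WP}\left(I^{LR}_{t-1},\, F^{LR}_{t \to t-1}\right) - I^{LR}_{t} \right\rVert_2^2 \\
L_{total} &= L_1 + L_2
\end{aligned}
```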

Training Datasets
Vimeo-90k [43] is our training and testing dataset. We abbreviate the Vimeo-90k test dataset as Vimeo-Test and the Vimeo-90k train dataset as Vimeo-Train. The Vimeo-90k dataset contains 91,701 sequences of 7 consecutive frames and is divided into Vimeo-Train and Vimeo-Test. In Vimeo-Train, we randomly cropped the original 448 × 256 image to a 256 × 256 real label image. To generate LR images, we applied a Gaussian blur with standard deviation σ = 2.0 to the real label image and then downsampled it.

Training Details
Our network is end-to-end trainable, and there are no modules that need to be pretrained. The Xavier method is used for initialization. We train for 600 epochs with a batch size of 4 using the Adam optimizer. The initial learning rate is $10^{-4}$ and is multiplied by 0.1 every 100 epochs. In a batch, each sample is 7 consecutive images. We conduct video super-resolution experiments at a 4× factor.
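The step decay of the learning rate can be written as a one-line function. The sketch below uses the values stated above; the function and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, gamma=0.1, step=100):
    """Learning rate at a given epoch: base_lr multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```

Under this schedule the rate is 1e-4 for epochs 0 to 99 and 1e-5 for epochs 100 to 199, and it has been multiplied by 0.1 five times by the last of the 600 epochs.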
In order to obtain the first high-resolution image $I^{HR}_1$, two methods can be used. In the first method, we set $I^{HR}_0$ to a completely black image. This can force the network to learn detailed information from low-resolution images. In the second method, we upsample $I^{LR}_1$ to $I^{HR}_1$ through the bicubic interpolation method and estimate $I^{HR}_2$ from $I^{LR}_2$, $I^{LR}_1$, and $I^{HR}_1$. In order to compare with the RVSR method, we used the first method for experimentation.

Baselines
For a fair evaluation of the proposed framework on equal ground, we compare our model with the following three baselines that use the same optical flow and super-resolution networks:
SISR: Only a single low-resolution image is used to estimate a high-resolution image, without relying on temporal information. The input is $I^{LR}_t$ and the output is $I^{HR}_t$.
VSR: $I^{HR}_t$ is obtained from $I^{HR}_{t-1}$, $I^{LR}_{t-1}$, and $I^{LR}_t$ without optical flow estimation, relying on the ability of the convolution operation itself to learn spatial deformation.
RVSR: $I^{HR}_t$ is obtained from $I^{HR}_{t-1}$, $I^{LR}_{t-1}$, and $I^{LR}_t$ with optical flow estimation, and the result is then sent to SuperNet. The operation process is the same as the forward propagation in the BFRVSR network.
We ensure that the network model is consistent during the evaluation, and the key training parameters are the same. Initialization uses the Xavier method, and the optimizer is Adam. The initial learning rate is $10^{-4}$, which is reduced to 0.1 times its value every 100 epochs. All networks are trained with the same training set, and the standard deviation of the Gaussian blur is 2.0.

Analysis
We train the baselines and BFRVSR to convergence under the same parameter conditions and test the trained models on Vimeo-Test. Table 1 compares the PSNR results of the baselines and BFRVSR. Compared with the baselines, our proposed framework performs best on continuous 7-frame video sequences, and it is 0.39 dB higher than the RVSR method.
The PSNR of BICUBIC and SISR depends only on the current low-resolution image, so there is no correlation between high-resolution images. The PSNR of VSR and RVSR is correlated with time. Because of the motion compensation provided by the optical flow network, RVSR performs better than VSR.
BFRVSR has two processes, i.e., forward estimation and reverse estimation. In forward estimation, BFRVSR is equivalent to an RVSR network: $I^{HR}_{t-1}$ is used to transmit global detail information and to perform temporal alignment. The problem in forward estimation is obvious. For a video sequence with frames $1, \ldots, i, \ldots, j, \ldots, N$, the forward estimate of frame $i$ receives reference information only from earlier frames, so the details of $I^{LR}_j$ cannot be obtained for frame $i$ to optimize the image ($i < j$). Reverse estimation solves this problem: it uses $I^{HR}_t$, $I^{LR}_{t-1}$, and $I^{LR}_t$ to generate $I^{HR}_{t-1}$, so each frame implicitly uses all the information to estimate its high-resolution image.
RVSR can be trained on video clips of any length. However, if the video clip is too long, RVSR suffers from the correlation between image quality and time. In fact, RVSR has this problem on shorter video clips as well. BFRVSR solves this problem, as shown in Figure 7.
Figure 7. We show the quality of each frame in the forward propagation of BFRVSR and the quality of each frame in the reverse propagation. We found that global information is implicitly used in the reverse propagation to generate high-resolution images.
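PSNR, the per-frame quality metric used in this comparison, can be computed as in the sketch below. It assumes single-channel images stored as nested lists and an 8-bit peak value of 255; the evaluation protocol actually used for Table 1 (color space, cropping) is not specified in the text.

```python
import math


def psnr(ref, est, peak=255.0):
    """PSNR in dB between two equally sized single-channel images (lists of rows)."""
    sq_err = [(r - e) ** 2 for ref_row, est_row in zip(ref, est)
              for r, e in zip(ref_row, est_row)]
    mse = sum(sq_err) / len(sq_err)
    # Identical images have zero error, i.e., infinite PSNR.
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```

Computing this value for each of the 7 frames of a test sequence gives the kind of PSNR-versus-frame-number curve discussed for Figures 1 and 7.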
Video super-resolution based on the sliding window method processes each frame 2N + 1 times, video super-resolution based on the recurrent method processes each frame once, and BFRVSR processes each frame at most twice.
On an RTX 2080 Ti, 4× super-resolution of a single Full HD frame takes 291 ms.
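The per-frame counts above translate into the following rough totals for a clip. This is a back-of-the-envelope sketch that ignores boundary effects for the sliding window; the function and parameter names are illustrative, and `half_window` plays the role of N in the 2N + 1 window.

```python
def frame_evaluations(method, num_frames, half_window=None):
    """Approximate number of frame-processing steps needed for one clip."""
    if method == "sliding_window":
        # Every frame appears in 2N + 1 windows (boundary effects ignored).
        return num_frames * (2 * half_window + 1)
    if method == "recurrent":
        return num_frames
    if method == "bidirectional":
        # One forward step per frame plus one reverse step for all but the last frame.
        return 2 * num_frames - 1
    raise ValueError(f"unknown method: {method}")
```

For a 7-frame clip and a window of 7 frames (half_window = 3), this gives 49 steps for the sliding window, 7 for the recurrent method, and 13 for the bidirectional method.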

Conclusions
We propose an end-to-end trainable bidirectional frame recurrent video super-resolution method. Thanks to bidirectional training, more information is fed to the model to deal with the correlation between image quality and time, and BFRVSR successfully solves the problem shown in Figure 1. To be specific, it decouples image quality from time. In addition, the proposed method achieves better image quality, while the computational cost is lower than that of the sliding window method.

Future Work
There is still room for improvement in the field of video super-resolution. If the problems of occlusion and blur are considered, much more computational cost would be required. We can deal with this problem by adding cross connections. In addition, the deformable convolution module, which has been frequently investigated recently, shows enormous potential in fields such as image classification and semantic segmentation. Thus, replacing the optical flow module with a deformable convolution module may achieve better results. Furthermore, video super-resolution and frame insertion are believed to have considerable similarities; thus, we may try to utilize BFRVSR to perform these two tasks simultaneously.