Video Super-Resolution Network with Gated High-Low Resolution Frames
Abstract
1. Introduction
- (1) A gated motion compensation approach for high- and low-resolution frames, which adaptively selects useful information from high- and low-resolution neighboring frames to mitigate the effect of motion estimation errors;
- (2) A pre-initial hidden state network, which reduces the frame-imbalance effect introduced by adopting a unidirectional recurrent framework;
- (3) A local-scale hierarchical salient feature fusion network, which attends to features at different scales and in different regions to obtain locally salient feature information;
- (4) A plug-and-play hierarchical hybrid attention module, which filters useful information from different feature levels and recovers better localized high-frequency detail.
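Contribution (1) rests on a soft gate that blends the two aligned neighbor-feature branches so that regions with unreliable motion estimates can fall back on the other source. As a loose illustrative sketch of this kind of gating (a NumPy stand-in, not the paper's implementation; the function name `gated_fusion` and the projection parameters `w`, `b` are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(feat_hr, feat_lr, w, b):
    """Blend aligned high- and low-resolution neighbor features.

    A pixel-wise gate (sigmoid of a linear projection of the
    concatenated features) decides how much to trust each branch.
    feat_hr, feat_lr: (H, W, C) aligned feature maps
    w: (2C, C) projection weights, b: (C,) bias -- placeholders here.
    """
    stacked = np.concatenate([feat_hr, feat_lr], axis=-1)  # (H, W, 2C)
    gate = sigmoid(stacked @ w + b)                        # (H, W, C), in (0, 1)
    # Per-element convex combination of the two branches.
    return gate * feat_hr + (1.0 - gate) * feat_lr
```

Because the output is a per-element convex combination, pixels where the high-resolution branch was misaligned can lean on the low-resolution branch instead, which is the intuition behind avoiding motion-estimation artifacts.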
2. Related Works
3. The Gated High-Low Resolution Frames Network
3.1. Overall Structure
3.2. Pre-Initial Hidden State Network
3.3. Gate-Guided Adjacent High-Low Resolution Frame Alignment Network
- 1. Feature extraction module
- 2. Optical flow estimation module (SpyNet)
- 3. Gated adjacent high-low resolution frame motion compensation module
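The motion compensation step warps each neighboring frame toward the reference frame using the flow estimated by SpyNet. A minimal backward-warping sketch with bilinear sampling (an illustrative stand-in for the actual module; the name `warp_backward` and the border-clamping behavior are assumptions):

```python
import numpy as np

def warp_backward(img, flow):
    """Warp a neighboring frame toward the reference frame.

    For each reference pixel (x, y), sample the neighbor at
    (x + u, y + v) with bilinear interpolation, where (u, v) is
    the estimated optical flow. Out-of-bounds samples are clamped.
    img:  (H, W) grayscale frame
    flow: (H, W, 2) per-pixel displacements (u, v)
    """
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x_src = np.clip(xs + flow[..., 0], 0, w - 1)
    y_src = np.clip(ys + flow[..., 1], 0, h - 1)
    # Integer corners and fractional weights for bilinear sampling.
    x0 = np.floor(x_src).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    y0 = np.floor(y_src).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    wx = x_src - x0; wy = y_src - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

Warping errors from inaccurate flow are exactly what the gated module downstream is meant to compensate for: the gate can discount warped features where the sampled content does not match the reference.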
3.4. Scale Local Level Significant Feature Fusion Network
3.5. Reconstruction Network
4. Experimental Results
4.1. Implementation Details
4.2. Ablation Study
- Ablation Experiment I
- Ablation Experiment II
- Ablation Experiment III
4.3. Comparisons with State-of-the-Arts
4.3.1. Comparison of Experimental Results on the Vid4 Test Set
4.3.2. Comparison of Experimental Results on the SPMCS-11 Test Set
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Caballero, J.; Ledig, C.; Aitken, A.; Acosta, A.; Totz, J.; Wang, Z.; Shi, W. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Chan, K.C.; Wang, X.; Yu, K.; Dong, C.; Loy, C.C. BasicVSR: The search for essential components in video super-resolution and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
- Sajjadi, M.S.; Vemulapalli, R.; Brown, M. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video enhancement with task-oriented flow. Int. J. Comput. Vis. 2019, 127, 1106–1125.
- Wang, L.; Guo, Y.; Lin, Z.; Deng, X.; An, W. Learning for video super-resolution through HR optical flow estimation. In Proceedings of Computer Vision—ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Revised Selected Papers, Part I; Springer: Berlin/Heidelberg, Germany, 2019.
- Wang, L.; Guo, Y.; Liu, L.; Lin, Z.; Deng, X.; An, W. Deep video super-resolution using HR optical flow estimation. IEEE Trans. Image Process. 2020, 29, 4323–4336.
- Li, H.; Xu, J.; Hou, S. Optical flow enhancement and effect research in action recognition. In Proceedings of the 2021 IEEE 13th International Conference on Computer Research and Development (ICCRD), Beijing, China, 5–7 January 2021.
- Tao, X.; Gao, H.; Liao, R.; Wang, J.; Jia, J. Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
- Haris, M.; Shakhnarovich, G.; Ukita, N. Recurrent back-projection network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Chan, K.C.; Zhou, S.; Xu, X.; Loy, C.C. BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
- Liang, J.; Fan, Y.; Xiang, X.; Ranjan, R.; Ilg, E.; Green, S.; Cao, J.; Zhang, K.; Van Gool, L. Recurrent video restoration transformer with guided deformable attention. Adv. Neural Inf. Process. Syst. 2022, 35, 378–393.
- Wang, P.; Sertel, E. Multi-frame super-resolution of remote sensing images using attention-based GAN models. Knowl.-Based Syst. 2023, 266, 110387.
- Chiche, B.N.; Woiselle, A.; Frontera-Pons, J.; Starck, J.-L. Stable long-term recurrent video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
- Fuoli, D.; Gu, S.; Timofte, R. Efficient video super-resolution through recurrent latent space propagation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–29 October 2019.
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
- Ghasemi-Falavarjani, N.; Moallem, P.; Rahimi, A. Particle filter based multi-frame image super resolution. Signal Image Video Process. 2022, 6, 1–8.
- Wang, X.; Chan, K.C.; Yu, K.; Dong, C.; Loy, C.C. EDVR: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019.
- Ranjan, A.; Black, M.J. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Li, F.; Bai, H.; Zhao, Y. Learning a deep dual attention network for video super-resolution. IEEE Trans. Image Process. 2020, 29, 4474–4488.
- Wang, Z.; Yi, P.; Jiang, K.; Jiang, J.; Han, Z.; Lu, T.; Ma, J.; Yi, H. Multi-memory convolutional neural network for video super-resolution. IEEE Trans. Image Process. 2018, 28, 2530–2544.
- Isobe, T.; Zhu, F.; Jia, X.; Wang, S. Revisiting temporal modeling for video super-resolution. arXiv 2020, arXiv:2008.05765.
- Jo, Y.; Oh, S.W.; Kang, J.; Kim, S.J. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Khattab, M.M.; Zeki, A.M.; Alwan, A.A.; Badawy, A.S. Regularization-based multi-frame super-resolution: A systematic review. J. King Saud Univ. Comput. Inf. Sci. 2020, 32, 755–762.
- Yi, P.; Wang, Z.; Jiang, K.; Jiang, J.; Ma, J. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–29 October 2019.
- Yi, P.; Wang, Z.; Jiang, K.; Jiang, J.; Lu, T.; Tian, X.; Ma, J. Omniscient video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 10–14 September 2018.
- Liu, C.; Sun, D. On Bayesian adaptive video super resolution. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 346–360.
- Isobe, T.; Li, S.; Jia, X.; Yuan, S.; Slabaugh, G.; Xu, C.; Ma, Y. Video super-resolution with temporal group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
| IHSNet | GAFANet | SLLFNet | PSNR/SSIM |
|---|---|---|---|
| √ | √ |  | 27.42/0.8381 |
| √ |  | √ | 27.25/0.8336 |
|  | √ | √ | 26.25/0.7886 |
| √ | √ | √ | 27.47/0.8372 |
| Method | PSNR/SSIM |
|---|---|
| GHLRF-o | 27.28/0.8346 |
| GHLRF-w | 27.47/0.8372 |
| Method | PSNR/SSIM |
|---|---|
| HRA | 27.40/0.8375 |
| LRA | 27.36/0.8368 |
| LHRA | 27.47/0.8372 |
| Method | Params (M) | Runtime (ms) | Calendar | City | Foliage | Walk | Average |
|---|---|---|---|---|---|---|---|
| Bicubic | - | - | 18.80/0.4794 | 23.78/0.5129 | 21.39/0.4227 | 22.88/0.6976 | 21.71/0.5280 |
| FRVSR | 5.1 | 137 | 23.46/0.7854 | 27.70/0.8099 | 25.96/0.7560 | 29.69/0.8990 | 26.70/0.8126 |
| DUF | 5.8 | 974 | 24.04/0.8110 | 28.27/0.8313 | 26.41/0.7709 | 30.30/0.9141 | 27.33/0.8318 |
| PFNL | 3.0 | 295 | - | - | - | - | 27.16/0.8355 |
| RBPN | 12.2 | 1507 | 23.95/0.8070 | 27.70/0.8036 | 26.22/0.7575 | 30.69/0.9104 | 27.14/0.8196 |
| RLSP | 4.2 | 49 | 24.60/0.8355 | 28.14/0.8453 | 26.75/0.7983 | 30.88/0.9192 | 27.60/0.8476 |
| TGA | 5.8 | 441 | 24.50/0.8290 | 28.50/0.8420 | 26.59/0.7793 | 30.95/0.9171 | 27.63/0.8419 |
| RRN | 3.4 | 45 | 24.57/0.8342 | 28.51/0.8467 | 26.94/0.7979 | 30.74/0.9164 | 27.69/0.8488 |
| BasicVSR | 6.3 | 63 | - | - | - | - | 27.96/0.8553 |
| GHLRF | 20.4 | 110 | 25.02/0.8482 | 30.09/0.8822 | 26.95/0.7961 | 31.34/0.9253 | 28.60/0.8630 |
| Clip Name | Bicubic | RBPN | RRN | TGA | GHLRF |
|---|---|---|---|---|---|
| car05_001 | 24.94/0.6799 | 31.92/0.9016 | 31.79/0.9048 | 31.84/0.8987 | 32.70/0.9192 |
| hdclub_003_001 | 17.54/0.4186 | 21.89/0.7246 | 22.42/0.7633 | 22.31/0.7537 | 22.74/0.7764 |
| hitachi_isee5_001 | 17.44/0.4283 | 26.25/0.9042 | 26.45/0.9058 | 26.47/0.9059 | 27.88/0.9327 |
| hk004_001 | 25.93/0.7221 | 33.34/0.9010 | 33.48/0.9138 | 33.76/0.9136 | 34.23/0.9213 |
| HKVTG_004 | 25.29/0.4765 | 29.50/0.7975 | 29.70/0.8090 | 29.68/0.8088 | 29.94/0.8187 |
| jvc_009_001 | 23.15/0.6713 | 29.99/0.9093 | 29.52/0.9038 | 30.19/0.9140 | 31.27/0.9326 |
| NYVTG_006 | 25.69/0.7291 | 33.18/0.9227 | 33.02/0.9222 | 33.75/0.9334 | 34.00/0.9377 |
| PYVTG_012 | 23.74/0.6249 | 27.56/0.8231 | 27.98/0.8403 | 27.80/0.8350 | 27.94/0.8434 |
| RMVTG_011 | 20.98/0.5137 | 27.59/0.8157 | 28.20/0.8351 | 28.35/0.8409 | 28.45/0.8461 |
| veni3_011 | 25.73/0.8312 | 36.58/0.9735 | 34.17/0.9695 | 36.40/0.9740 | 37.15/0.9774 |
| veni5_015 | 24.59/0.7823 | 32.92/0.9443 | 31.50/0.9411 | 33.24/0.9486 | 33.54/0.9517 |
| Average | 23.18/0.6253 | 30.06/0.8743 | 29.84/0.8827 | 30.34/0.8842 | 30.90/0.8961 |
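The PSNR figures reported in these tables follow the standard definition in dB for 8-bit images. For reference, a generic sketch (not the authors' evaluation code, which may crop borders or evaluate on the Y channel only):

```python
import numpy as np

def psnr(ref, deg, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a reference image
    and a degraded/reconstructed image of the same shape."""
    mse = np.mean((ref.astype(np.float64) - deg.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Higher is better: an average gain of roughly 0.6 dB over the next-best method, as in the last row above, corresponds to a noticeably lower mean squared error.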
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ouyang, N.; Ou, Z.; Lin, L. Video Super-Resolution Network with Gated High-Low Resolution Frames. Appl. Sci. 2023, 13, 8299. https://doi.org/10.3390/app13148299