Real-Time Video Super-Resolution with Spatio-Temporal Modeling and Redundancy-Aware Inference

Video super-resolution aims to generate high-resolution frames from low-resolution counterparts. It can be regarded as a specialized application of image super-resolution, serving purposes such as video display and surveillance. This paper proposes a novel method for real-time video super-resolution. It effectively exploits spatial information by utilizing the capabilities of an image super-resolution model and leverages the temporal information inherent in videos. Specifically, the method incorporates a pre-trained image super-resolution network as its foundational framework, allowing it to leverage existing expertise for super-resolution. A fast temporal information aggregation module is presented to further aggregate temporal cues across frames. By using deformable convolution to align features of neighboring frames, this module takes advantage of inter-frame dependency. In addition, it employs hierarchical fast spatial offset feature extraction and channel attention-based temporal fusion. A redundancy-aware inference algorithm is developed to reduce computational redundancy by reusing intermediate features, achieving real-time inference speed. Extensive experiments on several benchmarks demonstrate that the proposed method reconstructs satisfactory results with strong quantitative performance and visual quality. Its real-time inference capability makes it suitable for real-world deployment.


Introduction
Video is a widely used multimedia format combining image frames with audio. However, video quality is often limited by factors such as capture, storage, and transmission [1]. Video super-resolution (SR) techniques aim to reconstruct high-resolution (HR) frames from low-resolution (LR) counterparts. Similarly, image SR models focus on enhancing the resolution of LR images. Video SR can be seen as an extension of single-image SR that leverages spatial information along with temporal information from LR frames. It has diverse applications in video display [2], video surveillance [3], and satellite imagery [4].
Recently, deep learning-based methods have shown promising performance in both video SR [2] and image SR [5]. Video SR models can be categorized into two groups: (1) models without image SR techniques and (2) models incorporating image SR techniques. The first category has to explore alternative approaches to spatial information, such as estimating upsampling filters [6] or task-specific optical flow [7]. Although these methods achieve good performance, they have limited spatial information modeling capacity. In contrast, the second category benefits from image SR insights for spatial reconstruction [8-10]. However, these models only incorporate specific components from image SR models, which creates a barrier to fully harnessing the potential of well-trained parameters; thus, there is room for performance improvement. Unlike existing video SR models [11,12] that only borrow specific components from an image SR model, the proposed method employs a full image SR model for better spatial feature extraction and SR reconstruction. Unlike Kappeler et al. [1] and Bao et al. [13], the proposed method pre-trains only the image SR model.
Furthermore, numerous video SR models [8,10,14-16] focus on performance improvement, while only a few [11,17,18] take time consumption into account, and fewer still [19,20] are capable of real-time inference. However, real-time inference is important for online applications, such as live video display. Unlike previous work [18] that prunes unimportant filters, the proposed redundancy-aware inference algorithm reduces time consumption while keeping all filters in the video SR model.
In this work, a novel video SR method is proposed to address these limitations. To exploit spatial information, the proposed method incorporates the architecture and well-trained weights of an image SR model as its foundational framework. A fast temporal information aggregation module is introduced to effectively leverage inter-frame dependency. Since moving objects appear at different positions across frames, deformable convolution [21] can effectively extract information from adjacent frames. Considering the differences between neighboring frames, the channel attention mechanism [22] can adaptively rescale important features, resulting in effective temporal aggregation. Furthermore, a redundancy-aware inference algorithm is developed to reduce repetitive feature extraction, allowing the proposed method to achieve real-time inference while providing high-quality SR results. Experiments on popular benchmarks show that the proposed method delivers solid quantitative performance and visual quality. On the one hand, the use of the pre-trained image SR model reduces the difficulty of training a video SR model; on the other hand, it allows the remaining modules to focus on temporal information aggregation. The redundancy-aware inference algorithm significantly reduces inference latency, making the method suitable for applications that require live video SR reconstruction.
The main contributions of this paper are as follows: (1) A novel video SR model is proposed that fully incorporates a pre-trained image SR model, achieving a trade-off between accuracy and efficiency and providing high-quality SR video frames in real time. (2) A fast temporal information aggregation module is introduced, in which deformable convolution extracts the information of moving objects and channel attention adaptively captures important information. (3) A redundancy-aware inference algorithm is developed for video SR; by avoiding repetitive feature extraction, it significantly reduces the computational cost.
The remainder of this paper is organized as follows: Section 2 discusses related work. Section 3 provides a detailed description of the network architecture and the redundancy-aware inference. Section 4 presents datasets, implementation details, experimental results, and analysis. Finally, Section 5 concludes this paper.

Image Super-Resolution
The image SR problem is a typical ill-posed problem. In 2014, Dong et al. [23] were the first to introduce deep learning into this field. Since then, image SR methods have experienced noteworthy advancements [5]. In 2017, Lim et al. [24] proposed the representative EDSR, which made use of residual learning, eliminated unnecessary batch normalization, and expanded the number of parameters while ensuring stable training. To adaptively rescale features, Zhang et al. [22] developed the channel attention mechanism, which has been successfully employed in RCAN. In 2019, Hui et al. [25] presented IMDN, a lightweight model with a small memory footprint that yielded competitive accuracy and enabled quick inference. More recently, the Transformer, originally introduced in natural language processing [26], has been brought into computer vision [27]. Consequently, the enhanced Swin Transformer [28] has been adopted in SwinIR [29]. By combining convolutional layers and Swin Transformer modules, SwinIR captures both local and global dependencies simultaneously, resulting in state-of-the-art performance.
In this study, IMDN [25] is employed as the foundational framework for the following reasons. A real-time video system must deliver a minimum of 24 frames per second, i.e., a per-frame latency budget of roughly 41.7 ms, which is important for ensuring a seamless user experience. IMDN [25] has proven its capability of effectively leveraging spatial information for SR reconstruction with a lightweight design.

Video Super-Resolution
Recently, there has been growing interest in the video SR problem, leading to the proposal of numerous deep learning-based models [2]. Given the need to leverage both spatial and temporal information, effectively handling the input LR frames becomes crucial. Existing methods can be categorized into the following groups.
The first category includes methods that utilize optical flow to align neighboring frames or features. For instance, VESPCN [11] aligns neighboring frames in a coarse-to-fine manner, while TOF [7] learns a task-specific optical flow. Additionally, DRVSR [14] introduces a carefully designed SPMC layer to register pixels in high resolution, and Wang et al. [30] directly estimate HR optical flow from LR frames. BasicVSR [12] propagates neighboring features via optical flow. Although these methods have demonstrated promising results, they suffer from high computational complexity. Moreover, inaccurate optical flow estimation can negatively impact the quality of SR results.
The second category contains methods based on 3D convolutions, which are capable of extracting spatial and temporal information simultaneously from multiple input frames. For example, Kim et al. [31] applied 3D convolutions to capture spatio-temporal dependencies in an end-to-end manner, while DUF [6] incorporates 3D convolutions in densely connected blocks. Isobe et al. [32] fused information from neighboring frames using 3D convolutions, and Li et al. [17] proposed fast spatio-temporal residual blocks for reduced latency. The introduction of 3D convolutions alleviates the reliance on inaccurate optical flow and enables end-to-end training. However, the choice of kernel size in 3D convolutions requires a trade-off between performance under large motion and computational cost.
The third category consists of methods employing deformable convolutions, which have gained popularity recently. Deformable convolutions were proposed in [21]; their learnable offsets enable video SR models to capture objects in motion. For instance, Tian et al. [33] employed deformable convolutions to align neighboring frames, while D3Dnet [34] extends deformable convolutions from 2D to 3D for motion adaptivity and spatio-temporal information modeling. EDVR [8] introduces the Pyramid, Cascading, and Deformable convolutions module for neighboring feature alignment. Unlike optical flow-based methods, deformable convolution-based algorithms do not require optical flow estimation, thereby reducing computational cost and enabling end-to-end training.
In addition, there are attention-based approaches, which extract spatio-temporal information via various attention mechanisms. For example, Yi et al. [15] and Li et al. [16] adopted non-local attention. Xiao et al. [35] exploited temporal difference attention. Wang et al. [36] and Xiao et al. [37] made use of deformable attention. Further, some studies [10,38] have employed self-attention mechanisms for video restoration. The attention mechanism can weigh different features according to the input, allowing a model to pay more attention to key information and thereby improving its accuracy.
For better performance on video SR reconstruction, the proposed method incorporates both deformable convolution and channel attention. The proposed fast temporal information aggregation proceeds in two stages: spatial aggregation followed by temporal aggregation. In the spatial aggregation stage, deformable convolution is employed to align neighboring features. To effectively aggregate information from neighboring video frames, channel attention is then used. Both stages contribute significantly to reconstruction performance.

Overall Architecture
The overall architecture of the proposed method is shown in Figure 1. It takes 2n + 1 LR frames as input, centered around the target frame to be reconstructed at t = 0, where 2n is the number of neighboring frames and t denotes the relative frame index. The model consists of three key components, i.e., the spatial feature extraction module, the fast temporal information aggregation module, and the upsampler module. The spatial feature extraction module is based on a pre-trained image SR model, IMDN [25]. The fast temporal information aggregation module aligns and fuses neighboring frame features to exploit inter-frame dependencies. Finally, the upsampler module upscales the fused spatio-temporal representation to generate the SR output frame. Figure 2a illustrates the spatial feature extraction module, comprising three convolutional layers with varying kernel sizes and six information multi-distillation blocks (IMDB) from IMDN [25]. Conv-3 and Conv-1 refer to convolutional layers with kernel sizes of three and one, respectively. Additionally, the module incorporates global residual learning and hierarchical feature exploitation. It is the foundational framework of the proposed method and is responsible for capturing effective spatial details from the input LR frames.
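The three-module composition described above can be sketched in PyTorch as follows. The sub-modules here are simplified placeholders (the real model uses the IMDN backbone and the aggregation module of Section 3.2), so only the data flow matches the paper; layer choices inside each placeholder are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VSRPipeline(nn.Module):
    """Skeleton of the three-module design: per-frame spatial feature extraction
    (weights shared across frames), temporal aggregation, and sub-pixel upsampling.
    The internals of each module are simplified stand-ins, not the paper's layers."""
    def __init__(self, channels=64, scale=4, n_neighbors=2):
        super().__init__()
        self.n = n_neighbors
        # Placeholder for the pre-trained spatial feature extraction module.
        self.spatial = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        # Placeholder for the fast temporal information aggregation module.
        self.aggregate = nn.Conv2d(channels * (2 * n_neighbors + 1), channels, 3, padding=1)
        # Upsampler: channel adjustment followed by sub-pixel rearrangement.
        self.upsampler = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, frames):  # frames: (B, 2n+1, 3, H, W)
        # The same spatial module processes every timestamp (shared parameters).
        feats = [self.spatial(frames[:, t]) for t in range(frames.shape[1])]
        fused = self.aggregate(torch.cat(feats, dim=1))
        return self.upsampler(fused)
```

With five 16 × 16 LR input frames and a scale factor of four, the skeleton produces a single 64 × 64 SR frame.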
As shown in Figure 3, the IMDB consists of two parts. The first part contains four convolutional layers; the first three are each followed by a leaky ReLU and a channel split layer. The channel split layer divides the feature into two parts holding 1/4 and 3/4 of the input channels, respectively. The 1/4-channel feature is fed to the concatenation, while the 3/4-channel feature is processed by the following convolutional layer. After the concatenation comes the second part, contrast-aware channel attention. This is a more advanced channel attention module that takes into consideration not only the average value but also the standard deviation of each feature channel. The fast temporal information aggregation module is the key component that allows the model to leverage inter-frame dependencies. It consists of two stages, i.e., spatial aggregation and temporal aggregation. The spatial aggregation stage gathers information about the same object and aligns it to the center frame, and the subsequent temporal aggregation stage fuses information temporally. The details of this module are described in Section 3.2.
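The progressive 1/4 vs. 3/4 channel splitting inside an IMDB can be sketched as follows. This is a simplified reading of the block (the contrast-aware channel attention is replaced by a plain residual connection and a 1 × 1 fusion convolution), not IMDN's reference implementation.

```python
import torch
import torch.nn as nn

class IMDBSketch(nn.Module):
    """Simplified IMDB: three conv + split steps retain 1/4 of the channels each,
    the last conv keeps 1/4 directly, and the four distilled parts are fused."""
    def __init__(self, c=64):
        super().__init__()
        d, r = c // 4, c - c // 4          # distilled (1/4) and remaining (3/4) widths
        self.c1 = nn.Conv2d(c, c, 3, padding=1)
        self.c2 = nn.Conv2d(r, c, 3, padding=1)
        self.c3 = nn.Conv2d(r, c, 3, padding=1)
        self.c4 = nn.Conv2d(r, d, 3, padding=1)  # final step keeps 1/4 channels only
        self.act = nn.LeakyReLU(0.05)
        self.fuse = nn.Conv2d(4 * d, c, 1)       # 1x1 conv after the concatenation
        self.d = d

    def _split(self, y):
        return torch.split(y, [self.d, y.shape[1] - self.d], dim=1)

    def forward(self, x):
        d1, r1 = self._split(self.act(self.c1(x)))
        d2, r2 = self._split(self.act(self.c2(r1)))
        d3, r3 = self._split(self.act(self.c3(r2)))
        d4 = self.c4(r3)
        out = self.fuse(torch.cat([d1, d2, d3, d4], dim=1))
        # The real block applies contrast-aware channel attention here instead
        # of this plain residual addition.
        return out + x
```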
Figure 2b shows the upsampler module, the final component that converts the fused spatio-temporal features into SR output frames. It contains a convolutional layer and a sub-pixel layer. The convolutional layer adjusts the number of channels; the sub-pixel layer then upscales the features to the target spatial resolution by rearranging elements from the channel dimension into the spatial dimension.
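A minimal sketch of this upsampler: a convolution expands the channel count to out_channels × scale², and PixelShuffle performs the channel-to-space rearrangement. The channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

scale = 4
# Convolution sets channels to 3 * scale^2, then PixelShuffle rearranges
# each group of scale^2 channels into a scale x scale spatial patch.
upsampler = nn.Sequential(
    nn.Conv2d(64, 3 * scale ** 2, 3, padding=1),
    nn.PixelShuffle(scale),
)

fused = torch.randn(1, 64, 32, 32)   # fused spatio-temporal feature
sr = upsampler(fused)                # (1, 3, 128, 128)
```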
This specific design allows the spatial feature extraction module to extract information in a manner consistent with an image SR model. Consequently, the parameters of the spatial feature extraction module and the upsampler module can be initialized with well-trained parameters from an image SR model. By leveraging the spatial information extraction abilities learned by the image SR model, these well-trained parameters enable the proposed model to make more effective use of spatial information from LR frames. Furthermore, the spatial feature extraction module and upsampler module can easily be replaced by any other image SR model.
Given 2n + 1 LR frames I^LR_t, the corresponding target HR frame at t = 0 is denoted as I^HR. The super-resolved frame at t = 0, I^SR, can be produced by

I^SR = Net({I^LR_t}, t = -n, ..., n),   (1)

where Net(·) represents the proposed model. As illustrated in Figure 1, there are three modules in the proposed model, i.e., the spatial feature extraction module, the fast temporal information aggregation module, and the upsampler module. The proposed model can be further given by:

F^S_t = FE_spatial(I^LR_t),  F^T = FE_aggregation({F^S_t}, t = -n, ..., n),  I^SR = U(F^T),   (2)

where FE_spatial(·), FE_aggregation(·), and U(·) denote the spatial feature extraction module, fast temporal information aggregation module, and upsampler module, respectively. To optimize memory usage, the parameters of FE_spatial(·) are shared across inputs with different timestamps. The spatial feature and the temporally aggregated feature are represented as F^S_t and F^T, respectively. Following previous work [33], the mean square error (MSE) is applied as the loss function for parameter optimization. For a sample from the training set, the loss function of the proposed model is defined as:

L(Θ) = (1/B) Σ_{i=1..B} ||I^SR_i - I^HR_i||_2^2,   (3)

where Θ denotes the learnable parameters of the proposed model, ||·||_2 is the L2 norm, and i indexes the samples in a mini-batch of size B.

Fast Temporal Information Aggregation Module
Figure 4 illustrates the architecture of the proposed fast temporal information aggregation module, which aligns and fuses spatial features from the 2n + 1 input frames to generate an enriched spatio-temporal feature. It has two stages, i.e., the spatial aggregation stage and the temporal aggregation stage. Thus, the fast temporal information aggregation module can be formulated as:

F^A_t = Aggregate_spatial(F^S_t, F^S_0),   (4)

F^T = Aggregate_temporal({F^A_t}, t = -n, ..., n),   (5)

where Aggregate_spatial(·) and Aggregate_temporal(·) denote the spatial and temporal aggregation stages, respectively. The intermediate spatially aggregated feature is denoted as F^A_t, and F^T represents the output of this module.
The spatial aggregation stage includes the fast spatial offset feature extraction (FSOFE), the spatial feature alignment, and the spatial feature refinement. The FSOFE is conducted on the spatial feature F^S_t to obtain the spatial offset feature F^SO_t. Then, for the spatial feature alignment, the offset feature F^O_t is estimated using a 3 × 3 convolutional layer. Following this, a deformable convolution is employed for alignment. Unlike conventional deformable convolution, this variant incorporates additional features for offset estimation, utilizing F^S_t for feature extraction and F^O_t for offset information. Finally, another deformable convolution is applied to refine the results into the aligned feature F^A_t. Note that the spatial feature alignment and refinement are skipped for the center spatial feature. The spatial aggregation stage can be expressed as:

F^O_t = Conv_3x3(Concat(F^SO_0, F^SO_t)),   (6)

F^A_t = DConv(AlignDConv(F^S_t, F^O_t)) for t ≠ 0, and F^A_0 = F^S_0,   (7)

where the spatial offset features are produced by the FSOFE:

F^SO_t = FSOFE(F^S_t),   (8)

and FSOFE(·), Concat(·), Conv_3x3(·), AlignDConv(·), and DConv(·) represent the FSOFE, concatenation, convolution with a kernel size of three, deformable convolution for alignment, and deformable convolution, respectively. The parameters of FSOFE(·), Conv_3x3(·), AlignDConv(·), and DConv(·) are shared across timestamps to optimize memory consumption. t = 0 and t ≠ 0 denote the timestamps of the input center frame and its neighboring frames, respectively.

The FSOFE is responsible for extracting spatial offset features to guide the alignment by deformable convolution. As shown in Figure 5, it adopts a compact two-level hierarchical structure to extract offsets efficiently. In the first level, the spatial feature is extracted by a residual block as in [24]. In the second level, a 3 × 3 convolution with stride 2 is applied to reduce the spatial dimensions. The features from these two levels are fused by an element-wise addition and two residual blocks. The output features contain useful offset cues extracted from the spatial features and provide guidance for the deformable convolution to adaptively aggregate and align the spatial features from neighboring frames. The two-level design allows the FSOFE to extract offset features with a large receptive field in an efficient manner.

The temporal aggregation stage combines the 2n + 1 spatially aligned features F^A_t to generate a spatio-temporal feature. To effectively aggregate useful information, a channel attention layer and RCAB [22] are employed. The channel attention adaptively rescales channels within a residual structure, while RCAB [22] extracts representative features for reconstruction. In addition, a convolutional layer is placed between the channel attention layer and RCAB [22] to reduce the number of channels, resulting in lower inference latency. The optimal architecture of the temporal aggregation stage is provided in Table 1. Motion among these frames provides valuable cues for reconstructing the center frame. The fast temporal information aggregation module thus generates a spatio-temporal feature that contains information from all input LR frames, which is then upscaled to produce the SR result.
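The channel attention used in the temporal aggregation stage follows the squeeze-and-rescale design of [22]: global average pooling produces one descriptor per channel, a small bottleneck predicts a per-channel weight, and the input is rescaled. A minimal sketch (layer widths are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """RCAN-style channel attention sketch: pool -> bottleneck -> sigmoid gate,
    then rescale the input channels by the predicted weights."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                          # (B, C, 1, 1) descriptor
            nn.Conv2d(channels, channels // reduction, 1),    # squeeze
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),    # excite
            nn.Sigmoid(),                                     # per-channel weight in (0, 1)
        )

    def forward(self, x):
        return x * self.body(x)   # broadcast rescaling over H and W
```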

Redundancy-Aware Inference
To minimize the computational redundancy that arises during model inference, the redundancy-aware inference (RAI) algorithm is introduced, based on the fact that, once trained, the model parameters remain fixed. As stated in Equation (2), the spatial feature extraction module has to be performed on all neighboring LR frames. However, when inferring consecutive frames of a video, these repeated computations are redundant. This redundancy presents an opportunity to improve efficiency and reduce inference latency.
In the standard inference process, which is consistent with the training phase, the spatial feature extraction module is executed 2n + 1 times to process each input frame separately. Thus, the latency for inferring a single frame can be expressed as follows:

L = (2n + 1) × L_SFE + L_FTIA + L_U,   (9)

where L_SFE, L_FTIA, and L_U are the inference latencies of the spatial feature extraction module, fast temporal information aggregation module, and upsampler module, respectively. However, this is redundant, as the operations and parameters are identical each time. Hence, some intermediate features, such as F^S_t, remain consistent when generating adjacent SR frames. The RAI removes this redundancy by caching and reusing these intermediate features. For subsequent frames, the cached features from previous timestamps are reused instead of being recomputed; only the features of the newly arrived input frame need to be extracted. As a result, the latency for inferring a frame after the first n and before the last n frames can be improved to:

L_RAI = L_SFE + L_FTIA + L_U,   (10)

leading to a reduction in latency of 2n × L_SFE. Similarly, the output of the FSOFE, as indicated in Equation (8), can be stored for further reuse. Algorithm 1 provides the details of the RAI.
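The caching idea behind the RAI can be illustrated with a small stdlib-only sketch. Here `extract_spatial` is a hypothetical stand-in for the expensive spatial feature extraction module, and boundary frames are skipped for simplicity, as in the paper.

```python
from collections import OrderedDict

calls = {"count": 0}

def extract_spatial(frame):
    """Hypothetical stand-in for the spatial feature extraction module;
    the counter tracks how often the expensive computation actually runs."""
    calls["count"] += 1
    return f"feat({frame})"

def infer_video(frames, n=2):
    """Redundancy-aware inference sketch: spatial features are cached by frame
    index and reused across overlapping (2n+1)-frame windows, so each frame is
    processed by the extractor exactly once instead of up to 2n+1 times."""
    cache = OrderedDict()
    outputs = []
    for center in range(n, len(frames) - n):
        window = []
        for t in range(center - n, center + n + 1):
            if t not in cache:                 # reuse cached feature when possible
                cache[t] = extract_spatial(frames[t])
            window.append(cache[t])
        while len(cache) > 2 * n + 1:          # evict features that left the window
            cache.popitem(last=False)
        outputs.append(window)                 # would feed aggregation + upsampler
    return outputs

out = infer_video(list(range(10)))
```

For a 10-frame clip with n = 2, the extractor runs 10 times instead of 6 × 5 = 30 times, matching the 2n × L_SFE saving per frame.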
It is important to note that, for simplicity, the processing of the first n and last n frames is omitted. Due to the inconsistency of the processing at both ends, there is a performance degradation on these frames. With the proposed RAI, however, the spatial feature extraction module and the FSOFE are executed once instead of 2n + 1 times, which enables real-time inference without modifying the proposed model.

In the experiments, Vimeo90K [7] is utilized for training. This dataset contains 64,612 video sequences for training, each composed of seven frames. The Vimeo90K dataset has been widely acknowledged and used in various video-related tasks, such as video SR and video interpolation. To evaluate the performance of the proposed model, two well-known benchmarks are employed: Vid4 [33] and SPMCs-30 [14]. The Vid4 benchmark consists of 4 videos with a total of 171 frames and a minimum resolution of 720 × 480. In addition to Vid4, the proposed method is evaluated on the SPMCs-30 benchmark, which consists of 30 videos, each including 31 frames at a resolution of 960 × 540.

Implementation Details
To generate LR frames, bicubic degradation was applied via the Matlab function imresize, with the downsampling scale factor set to four. During the training phase, the patch size of the ground truth (GT) and the mini-batch size were empirically set to 256 and 16, respectively. To capture temporal information, the number of neighboring frames was empirically set to two, so the model takes five LR frames as input. Additionally, data augmentation techniques, such as random flipping and rotation, were applied to the training data. The Adam optimizer [39] was utilized to optimize the proposed method, with parameters β1 = 0.9 and β2 = 0.99. The learning rate was initialized to 1 × 10−4 and gradually decayed to 1 × 10−7. The training process lasted 300,000 iterations. The channel number of the proposed model was empirically set to 64, except for the cases shown in Table 1. All experiments were conducted on a server with Python 3.8, PyTorch 1.12, an Intel CPU, and an Nvidia 2080Ti GPU.
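The ×4 LR generation step can be approximated in PyTorch as below. Note that `F.interpolate`'s bicubic kernel does not exactly match Matlab's `imresize`, so this is an illustration of the degradation pipeline rather than the exact preprocessing.

```python
import torch
import torch.nn.functional as F

# Approximate bicubic x4 degradation of a 256 x 256 GT patch, as used in training.
scale = 4
hr = torch.rand(1, 3, 256, 256)   # ground-truth patch in [0, 1]
lr = F.interpolate(hr, scale_factor=1 / scale, mode="bicubic", align_corners=False)
lr = lr.clamp(0.0, 1.0)           # bicubic can slightly over/undershoot
```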
To initialize the weights of the proposed method, the spatial feature extraction module and upsampler module load the weights of the pre-trained foundational framework, IMDN; the other parameters are initialized by PyTorch's defaults. No parameters are frozen when training the proposed method. The training of IMDN is consistent with [25]: the training set is DIV2K [40], bicubic degradation is adopted to generate LR images, the channel number is set to 64, and the batch size is 16.
The performance of the reconstructed frames is assessed by two widely adopted metrics: peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [41]. The PSNR of one SR frame is defined as:

PSNR = 10 × log10(255^2 / MSE),   (11)

and the mean squared error (MSE) is defined as:

MSE = (1/P) Σ_p (I^SR(p) − I^HR(p))^2,   (12)

where P represents the total number of pixels in a frame, p indexes the pixels, and I^SR and I^HR denote the SR frame result and HR frame reference, respectively. Further, SSIM is defined as:

SSIM = (2 u_{I^SR} u_{I^HR} + c1)(2 σ_{I^SR I^HR} + c2) / ((u_{I^SR}^2 + u_{I^HR}^2 + c1)(σ_{I^SR}^2 + σ_{I^HR}^2 + c2)),  with c1 = (k1 × 255)^2 and c2 = (k2 × 255)^2,   (13)

where u_{I^SR} and u_{I^HR} are the mean values of the SR and HR frames, respectively, σ_{I^SR} and σ_{I^HR} are their standard deviations, and σ_{I^SR I^HR} is their covariance. k1 and k2 are used to stabilize the calculation and are set to 0.01 and 0.03, respectively. Following previous studies [7,19,20,33], these metrics are calculated on the luminance channel (the Y channel of the YCbCr color space), while cropping eight pixels near the boundary. Note that all frames were considered for performance evaluation.
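The PSNR definition above can be computed directly; a minimal single-channel sketch over flat pixel lists:

```python
import math

def psnr(sr, hr, max_val=255.0):
    """PSNR between two equally sized frames given as flat pixel lists,
    following PSNR = 10 * log10(max_val^2 / MSE)."""
    mse = sum((s - h) ** 2 for s, h in zip(sr, hr)) / len(sr)
    if mse == 0:
        return float("inf")   # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)

# Two 4-pixel frames differing by 16 everywhere: MSE = 256, PSNR ~ 24.05 dB.
value = psnr([16, 16, 16, 16], [0, 0, 0, 0])
```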

Comparisons
To examine the performance of our model, comparisons with one image SR method (IMDN [25]) and six video SR methods (SWRN [19], 3DSRnet [31], TOF [7], EGVSR [20], SOFVSR [30], and RISTN [42]) are conducted. IMDN [25] is a lightweight image SR model and is employed as the foundational framework of the proposed method. SWRN [19] is a novel lightweight video SR method. 3DSRnet [31] is a video SR method that exploits spatio-temporal information via 3D convolution. TOF [7] focuses on estimating task-specific optical flow in videos. EGVSR [20] is a generative adversarial network-based model, and SOFVSR [30] predicts HR optical flow to enhance video SR results. RISTN [42] leverages temporal features in a recurrent scheme.
First, the proposed method is evaluated on the Vid4 benchmark. The quantitative results are presented in Table 2 and Figure 6a; in each cell, the first row is the PSNR value and the second row is the SSIM value. The quantitative results on the Vid4 benchmark demonstrate that our method outperforms the others in terms of overall performance. Compared with the foundational IMDN [25], the proposed method improves PSNR by 1.06 dB and SSIM by 0.057. The proposed method also surpasses the lightweight VSR method SWRN [19], leading by 1.34 dB in PSNR. In addition, the proposed method is superior to TOF [7] and SOFVSR [30], which are VSR methods based on optical flow, and the performance of the recurrent-based RISTN [42] is lower than that of the proposed approach. When compared with the GAN-based EGVSR [20], the proposed method underperforms EGVSR on the Calendar and City videos but outperforms it on the Foliage and Walk videos; on average, the PSNR value of the proposed method is 0.44 dB higher than EGVSR [20], while the SSIM value is 0.005 lower. Overall, the proposed method demonstrates better performance due to its utilization of an image SR model, which excels at exploiting spatial information, and its fast temporal information aggregation module, which effectively leverages information from neighboring frames.
Importantly, the inclusion of the proposed RAI did not noticeably impact performance, with only a slight degradation of 0.0093 dB in PSNR and 0.0007 in SSIM.
For a qualitative comparison, the proposed method is compared with IMDN [25], SWRN [19], TOF [7], and SOFVSR [30]. As shown in Figure 7, frames from each video are presented, arranged from the top row to the bottom as follows: Calendar, City, Foliage, and Walk. The first column shows the whole frame, the second column, labeled GT, is the reference for the compared patch, and the third through seventh columns are the results of the different methods, each marked with its PSNR. Notably, the proposed model delivers superior performance in enhancing text clarity in Calendar and sharpening the car's boundaries in Foliage. This can be attributed to our model's utilization of an image SR model as its foundational framework, which provides the capacity to effectively extract and utilize spatial information. Additionally, the proposed method performs well at reconstructing clear building textures in City, and in Walk, the rope on the clothes is significantly more recognizable. In both of these scenarios, the aggregation of temporal information plays an important role in achieving the improved results.

In addition to the Vid4 benchmark, comparisons on the SPMCs-30 [14] benchmark are conducted. The quantitative results are presented in Table 3 and Figure 6b. On the SPMCs-30 benchmark, the proposed method surpasses all others in terms of average PSNR and SSIM. Specifically, our method exhibits a remarkable improvement of 1.5 dB in average PSNR and 4.3% in average SSIM over SWRN [19]. Compared with the optical flow-based methods TOF [7] and SOFVSR [30], the proposed method leads by a margin of 0.8 dB in PSNR. Further, the recurrent-based RISTN [42] underperforms the proposed method by 0.58 dB in PSNR and 0.012 in SSIM. Thus, the proposed method makes better use of neighboring information than the recurrent scheme in RISTN [42].

The qualitative comparison is shown in Figure 8, where frames from six videos have been selected for analysis. Arranged from the top row to the bottom, the videos are: AMVTG_004, hdclub_001, hdclub_003, hitachi_isee5, jvc_004, and LDVTG_009. The GT column is the high-resolution reference. In the case of AMVTG_004, it is evident that all compared models struggle to accurately reproduce the texture of the wall; moreover, some methods produce undesired artifacts. Similarly, in hdclub_001, only the proposed method and SWRN succeed in recovering the correct structure by effectively leveraging temporal information from neighboring frames. All compared methods exhibit poor performance on hdclub_003; however, the proposed method reconstructs a clear and well-defined structure for both the building in hdclub_003 and the flower in hitachi_isee5. The results on jvc_004 show the ability of the proposed method to recover more details. Lastly, the SR frames of LDVTG_009 illustrate how the proposed method effectively utilizes the ability of the image SR model, leading to improved results. These qualitative comparisons serve as compelling evidence of the superior performance and effectiveness of the proposed method.

The temporal consistency of the proposed model is evaluated following the methodology of a prior study [33]. The temporal profiles of different methods are shown in Figure 9, with each profile generated at the location marked in red in the first column. The reference temporal profile of the high-resolution video frames is shown in the GT column. As one can see, the proposed model generates smooth and clearly defined temporal profiles, particularly in Calendar and City. While artifacts are present in the temporal profile of Walk for all methods, the proposed approach exhibits the fewest such artifacts, indicating its ability to effectively preserve temporal consistency. These findings serve as robust evidence of the enhanced temporal performance of our method.

Efficiency
Efficiency is analyzed from four aspects: number of parameters, number of computational operations, inference latency, and quality of SR results. The floating point operations (FLOPs) and latency of each model are evaluated by producing 100 SR frames with a resolution of 1280 × 720; all models are run on an Nvidia 2080Ti GPU. The efficiency of the proposed method and the compared models is presented in Table 4 and Figure 10. As shown in Table 4, four models are capable of real-time inference. The parameter counts of IMDN [25] and SWRN [19] are relatively small, and their small computational complexity enables real-time inference; however, their PSNR performance is slightly lower than that of the other methods. TOF [7] and SOFVSR [30] need extra time for optical flow estimation, so they cannot achieve real-time inference. EGVSR [20] has more parameters than the proposed method. The proposed method performs well in terms of parameter count and PSNR, but without the RAI it cannot achieve real-time inference due to redundant computation. With the integration of the RAI, both latency and FLOPs drop significantly, allowing the proposed method to produce 720p SR frames in real time while still achieving competitive performance. These results indicate that the RAI is a simple yet effective strategy to optimize the inference process by avoiding unnecessary computation, achieving a balance between effectiveness and efficiency. Further, its modular design allows it to be integrated into other video models that require spatio-temporal feature extraction.
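The latency protocol used here (repeated SR frame generation, then averaging) can be sketched with the standard library as below. For GPU models one would additionally call `torch.cuda.synchronize()` around the timed region, since CUDA kernels launch asynchronously; the function name and run counts are illustrative assumptions.

```python
import time

def measure_latency_ms(fn, *args, warmup=3, runs=10):
    """Average wall-clock latency of fn(*args) in milliseconds: a few warmup
    calls to stabilize caches, then an averaged batch of timed runs."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs * 1000.0

# Stand-in workload; a real benchmark would call the SR model's forward pass.
ms = measure_latency_ms(lambda: sum(range(1000)))
```

A model meets the 24 fps real-time bar when the measured per-frame value stays below roughly 41.7 ms.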

Ablation Analysis
In this section, ablation studies are presented to examine the impact of the key components. IMDN, which takes a single LR frame as input, establishes the baseline for comparison. Subsequently, the spatial aggregation and temporal aggregation, the key stages of the fast temporal information aggregation module, are evaluated. To measure the performance of the model with spatial aggregation only, the spatially aggregated features are fused using concatenation followed by a 3 × 3 convolutional layer. Table 5 reports the ablation studies of the proposed model, with the second and third columns highlighting the respective variations.
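The spatial-aggregation-only fusion described above can be sketched in a few lines: the aligned per-frame feature maps are concatenated along the channel axis and mixed back to the original channel count with a single 3 × 3 convolution. This is a minimal numpy illustration under our own assumptions (function names and the zero-padding choice are ours), not the trained model's code.

```python
import numpy as np

def conv3x3(x, w):
    """3x3 convolution with zero padding. x: (Cin, H, W), w: (Cout, Cin, 3, 3)."""
    cin, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            # each tap adds a shifted, channel-mixed copy of the input
            out += np.tensordot(w[:, :, i, j], xp[:, i:i + h, j:j + wd], axes=1)
    return out

def fuse_spatial_only(aligned_feats, w):
    """Spatial-aggregation-only variant: concatenate the T aligned feature
    maps along channels, then mix with one 3x3 conv. w: (C, T*C, 3, 3)."""
    t, c, h, wd = aligned_feats.shape
    stacked = aligned_feats.reshape(t * c, h, wd)
    return conv3x3(stacked, w)
```

In the full model, this simple fusion is replaced by the proposed temporal aggregation, which is what the second-versus-third ablation rows in Table 5 compare.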
On the Vid4 benchmark, the baseline model without temporal information achieves a PSNR of 25.3254 dB and an SSIM of 72.49%. Incorporating spatial aggregation brings a noticeable improvement of 0.6499 dB in PSNR and 3.96% in SSIM; notably, the temporal aggregation in this variant is a simple 3 × 3 convolution. When the proposed temporal aggregation approach is employed, performance increases further by 0.315 dB in PSNR and 1.63% in SSIM. These results validate the significant contributions of both the spatial and temporal aggregation components of our method. Furthermore, an additional analysis evaluates the impact of well-trained parameters from the image SR model on the video SR task. In Table 5, the fourth column indicates whether the model was initialized with well-trained image SR parameters. Model 4 outperforms Model 1, and Model 5 outperforms Model 3, demonstrating that incorporating well-trained parameters from an image SR model can effectively enhance the overall performance of the video SR task. This analysis further emphasizes the importance of leveraging existing knowledge and expertise from the field of image SR to improve the efficiency and effectiveness of video SR models.
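For reference, the PSNR values reported throughout the ablation follow the standard definition, 10 log10(MAX^2 / MSE), computed between a ground-truth frame and its reconstruction. A minimal sketch (our own helper, assuming 8-bit frames with peak value 255):

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """PSNR in dB between a ground-truth frame and its reconstruction."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```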

Limitation
Although the proposed method can infer 720p video frames in real time, several limitations remain. First, the LR video frames are synthesized by bicubic degradation, which may deviate from the degradation of real low-resolution video. Second, the performance of the proposed method can be further improved: although it achieves the overall best performance in Section 4.3, it performs worse than IMDN on some videos, and some reconstruction results are not very sharp, for example, "Sunday" and "Monday" in Figure 7. Third, the time consumption is close to the boundary of real-time inference, leaving room for further optimization.

Conclusions
In this paper, a novel approach for real-time video super-resolution is presented. The method incorporates a pre-trained image super-resolution model as its foundational framework to effectively exploit spatial information. To further leverage inter-frame dependencies, a fast temporal information aggregation module built on deformable convolution is introduced; this temporal modeling extracts motion cues across frames to enrich spatial details. Additionally, a redundancy-aware inference algorithm is developed to minimize redundant computation by reusing intermediate features, reducing inference latency and enabling real-time 720p video super-resolution with minimal impact on accuracy. Experiments on several benchmarks show that the proposed method produces high-quality SR results both quantitatively and qualitatively. The real-time inference capability makes the proposed method suitable for practical applications requiring live video enhancement. In the future, efficient video super-resolution may be improved along directions including, but not limited to, advanced degradation models for real-world low-resolution video, attention mechanisms for better spatio-temporal feature extraction, and novel techniques for efficient inference.

Figure 1. Overall Architecture of the Proposed Method.

Figure 2. Details of the Spatial Feature Extraction Module and Upsampler Module.

Figure 4. Architecture of the Proposed Fast Temporal Information Aggregation Module.

Figure 10. Latency and PSNR on the Vid4 Benchmark.

Table 1. Details of the Temporal Aggregation Stage.

Table 2. Quantitative Comparison on the Vid4 Benchmark. The best and second-best results are marked in red and blue, respectively.

Table 3. Quantitative Comparison on the SPMCs-30 Benchmark. The best and second-best results are marked in red and blue, respectively.

Table 4. Quantitative Comparison of Efficiency for Producing 720p Frames.

Table 5. Quantitative Performance for the Ablation Study.