Article

Arbitrary Timestep Video Frame Interpolation with Time-Dependent Decoding

Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(2), 303; https://doi.org/10.3390/math12020303
Submission received: 18 November 2023 / Revised: 10 January 2024 / Accepted: 14 January 2024 / Published: 17 January 2024
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

Given an observed low frame rate video, video frame interpolation (VFI) aims to generate a high frame rate video, i.e., one with smoother motion and more frames per second (FPS). Most existing VFI methods focus on generating one frame at a specific timestep, e.g., 0.5, between every two frames, and thus lack the flexibility to increase the video’s FPS by an arbitrary scale, e.g., 3×. To better address this issue, in this paper, we propose an arbitrary timestep video frame interpolation (ATVFI) network with time-dependent decoding. Generally, the proposed ATVFI is an encoder–decoder architecture, where the interpolation timestep is an extra input added to the decoder network; this enables ATVFI to interpolate frames at arbitrary timesteps between input frames and to increase the video’s FPS at any given scale. Moreover, we propose a data augmentation method, i.e., multi-width window sampling, where video frames are split into training samples with multiple window widths, to better leverage training frames for arbitrary timestep interpolation. Extensive experiments demonstrate the superiority of our model over existing baseline models on several testing datasets. Specifically, our model trained on the GoPro training set achieved a PSNR of 32.50 dB on the commonly used Vimeo90k testing set.

1. Introduction

With the development of smart devices and high refresh-rate screen technology, recent mobile phones and computer monitors have enabled consumers to enjoy videos with high frames per second (FPS), which are much smoother than those with low FPS. However, high FPS video sources are scarce due to the high cost of capturing devices, data storage, and transfer bandwidth. Moreover, many existing videos taken in the past have low FPS due to the limits of capturing devices. To enhance the user experience when watching existing low FPS videos, video frame interpolation (VFI) aims to increase the video’s FPS by generating interpolated frames between the frames of a given input video. Recent years have witnessed much research attention and contributions from the community [1,2,3,4,5,6].
Given two consecutive input video frames with timesteps defined as 0 and 1, most existing VFI methods focus on the quality of the interpolated frame and generate one frame at a specific timestep, e.g., 0.5, between every two frames of the input video. As a consequence, these models often only perform 2× interpolation, i.e., doubling the FPS of the input video. Although 4× or 8× interpolation is possible by applying these 2× models iteratively, some problems remain. First, arbitrary interpolation scales, e.g., 3×, cannot be realized due to the inherently 2× nature of these methods. Second, iterative interpolation shares no computation among the multiple interpolated frames, may accumulate interpolation errors, and may lead to inconsistency between the generated frames.
Recently, some researchers have focused on generating interpolated frames for arbitrary scales [7,8,9,10,11,12]. Most of these methods depend on scaling the estimated optical flow linearly according to the desired scale. However, this paradigm assumes that the motion is linear, which is not always the case in real-world motion scenarios. Moreover, these methods rely heavily on the precision of optical flow estimation, whose errors could yield artifacts in the final interpolated frames.
In this paper, we propose a new arbitrary timestep video frame interpolation (ATVFI) neural network model with time-dependent decoding. Generally, our method is built on an encoder–decoder framework [13]. The decoder part of our model takes the interpolation timestep t as an extra input, indicating the relative time coordinate of the desired output frame with regard to the input frames. The proposed model alleviates the above-mentioned problems from the following perspectives. First, since the timestep t can be arbitrarily given, the interpolated frame at any timestep can be obtained, and the FPS of the input video can be arbitrarily scaled. Second, the computation of each output frame only directly depends on (the features of) the input frames, without dependency on any other output frames, and thus ATVFI is free from interpolation error accumulation. Compared to VFI methods that consider arbitrary interpolation scales [7,8,9,11,12], our ATVFI does not assume linear motion; instead, its motion estimation is learned from training samples.
Moreover, existing VFI methods leverage training videos by splitting them into samples of fixed-width windows [14]. To further leverage a high FPS video training set in a way that matches the arbitrary-scale interpolation fashion of our model, we propose a new data augmentation method, multi-width window sampling (MWWS), which splits the videos into training samples with multiple window widths. This approach increases the variety of motion magnitudes between input frames, as well as the range of possible timestep t values seen during training. MWWS enables our model to adapt to different motion magnitudes and learn a continuous correspondence from t to output frames, thus significantly boosting the performance of our proposed ATVFI.
To summarize, our contributions in this paper are listed as follows:
  • A new arbitrary timestep video frame interpolation framework based on an encoder–decoder is proposed, where any interpolation timestep can be fed into the decoder part.
  • We propose a new data augmentation method, i.e., multi-width window sampling, to better leverage the training set for interpolating videos at any interpolation scale.
  • Extensive experiments are conducted to compare our method with state-of-the-art VFI methods, showing the effectiveness of the proposed framework and data augmentation.

2. Related Work

Existing mainstream methods of video frame interpolation can be generally categorized into flow-based methods [7,8,9,11,12,15,16,17,18,19,20,21,22,23,24,25,26], kernel-based methods [1,5,10,27,28,29,30], phase-based methods [31], and hallucination-based methods [2,3,14,32].
Kernel-based methods explicitly or implicitly estimate a convolution kernel for each target pixel, and then synthesize the interpolated frame by convolving over input frames. The concrete convolution operator may be adaptive convolution [27], separable convolution [1], AdaCoF [28,30], deformable convolution [10], or generalized deformable convolution [29]. Phase-based methods synthesize the interpolated frame from its estimated phase decomposition [31]. Hallucination-based methods directly synthesize the interpolated frame from blended features generated from PixelShuffle [2], 3D convolution [14], or 2D convolution with generative adversarial networks [32,33]. Flow-based methods first estimate the optical flow between input frames, then infer the optical flow between the intermediate frame and input frames, and finally synthesize the output frame based on the optical flow by warping the input frames or their features. Super SloMo [7] is a pioneering work under this setup. DAIN [8] incorporates a depth estimation module as extra information for handling occlusion. SoftSplat [16] leverages the Softmax operator to mix pixels and features during the forward warping operation. Heo and Jeong [26] combine max–min warping with forward warping. ABME [18] further estimates an asymmetric bilateral optical flow based on symmetric optical flow. QVI [19] and EQVI [20] leverage two extra frames and refine the optical flow based on a quadratic motion assumption. AllAtOnce [21] further extends quadratic motion to cubic motion. IFRNet [11] gradually estimates bilateral optical flow together with intermediate features in a coarse-to-fine manner for mutual promotion of the two components. VFIformer [23] refines the warped frames with a Transformer [34] structure. UPR-Net [24] uses a unified pyramid recurrent network to estimate optical flow and synthesize frames. AMT [25] leverages bidirectional correlation volumes for all pairs of pixels in flow estimation and feature update.
Some existing methods are capable of arbitrary-scale interpolation and generate an interpolated frame at any given timestep t between input frames. For most flow-based methods, the bilateral optical flows between the intermediate frame and input frames can be simply obtained by scaling the optical flow between input frames by t [7,8,9,12]. IFRNet [11] takes t as input to the network to estimate optical flow. EDSC [10] is a kernel-based method using heterogeneous convolution [35] and deformable convolution [36] with separable filters.

3. Proposed Method

In this section, we describe our proposed ATVFI model. In the following, we first present the overall framework of the arbitrary-timestep video frame interpolation in Section 3.1, then the detailed network model architecture in Section 3.2, and finally our proposed data augmentation method named multi-width window sampling in Section 3.3. Please refer to the Supplementary Materials for the source code of our method.

3.1. Arbitrary Timestep VFI Framework

The goal of video frame interpolation is to generate intermediate frames for the moments between given video frames. Specifically, given two consecutive input frames, $I_0$ and $I_1$, the goal is to generate the frame $I_t$ at a given intermediate timestep $t$. Our ATVFI aims to model the continuous relationship between $I_t$ and any timestep $t \in [0, 1]$, rather than the fixed $t = 0.5$ used in most existing VFI methods.
We model this relationship as a two-step process in an encoder–decoder framework. First, given the input frames $I_0$ and $I_1$, an encoder network $E_\phi$ encodes them into a latent representation $z$:
$z = E_\phi(I_0, I_1).$
Second, given $z$ and the desired timestep $t$, a decoder network $F_\theta$ decodes them into the target output frame $\hat{I}_t$:
$\hat{I}_t = F_\theta(z, t).$
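To make the arbitrary-scale use of this formulation concrete, the following minimal Python sketch reuses a single encoding pass to decode frames at several timesteps; the names interpolate_scale, encoder, and decoder are illustrative placeholders standing in for $E_\phi$ and $F_\theta$, not the actual implementation.

```python
def interpolate_scale(encoder, decoder, I0, I1, scale):
    """Illustrative sketch of N-times frame rate upsampling with a time-dependent decoder.

    encoder and decoder stand in for E_phi and F_theta above; the latent z is
    computed once and shared by every interpolated frame between I0 and I1.
    """
    z = encoder(I0, I1)                                      # z = E_phi(I_0, I_1)
    return [decoder(z, m / scale) for m in range(1, scale)]  # I_t = F_theta(z, t), t = m / scale
```

Because every output frame is decoded directly from the same latent $z$, no interpolated frame depends on another, which is why this formulation avoids the error accumulation of iterative 2× interpolation.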

3.2. Network Architecture

Our model follows a typical encoder–decoder framework of a U-Net [37] structure with skip connections, as shown in Figure 1.

3.2.1. Encoder of ATVFI

The encoder of ATVFI is a four-level pyramid network. The input frames are first concatenated along the channel axis and then fed into the first level. Each level of the encoder, named EncBlock, contains three convolution layers and one average pooling layer [38]. The latent feature $z$ consists of the output feature pyramid of all encoder levels, as well as the input frames $I_0$ and $I_1$ themselves.
Figure 1. Overall architecture of our ATVFI, which is built on encoder–decoder architecture. The details of warping parameter estimation modules (WPEMs) can be found in Figure 2.
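As a rough illustration of one encoder level, the PyTorch sketch below stacks three convolutions followed by average pooling, as described above; the channel widths, kernel size, and ReLU activations are assumptions made for illustration rather than the exact ATVFI configuration.

```python
import torch.nn as nn

class EncBlock(nn.Module):
    """Sketch of one encoder level: three convolution layers followed by average pooling.

    Channel widths, kernel size, and activation are illustrative assumptions.
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=2),  # halves the spatial resolution for the next level
        )

    def forward(self, x):
        return self.body(x)
```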

3.2.2. Decoder of ATVFI

The decoder of ATVFI is composed of a pyramid network, sub-networks for warping the input frames, and a fusion mechanism for the warped frames. The details are explained below.
Each level of the pyramid network, named DecBlock, has three convolution layers, followed by an upsampling layer and a convolution layer. The input of the first level is the concatenated tensor of two parts: the output feature of the last level of the encoder network, and a one-channel feature map with all values set to t. The input of each of the following levels is the concatenated tensor of three parts: the output feature of the corresponding encoder level via a skip connection, the output feature of the previous level of the decoder network, and the one-channel t-tensor, as mentioned above. In this way, the timestep is taken into account in the generation process of interpolated frames. In addition, the vanishing gradient problem [39] of the multi-layer network is addressed by the skip connections [40].
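A minimal sketch of how the scalar timestep can be injected at a decoder level is shown below: $t$ is broadcast to a one-channel map and concatenated with the incoming features. The helper name append_timestep is hypothetical and only illustrates the concatenation described above.

```python
import torch

def append_timestep(features, t):
    """Concatenate a one-channel map filled with the scalar timestep t to an (N, C, H, W) tensor."""
    n, _, h, w = features.shape
    t_map = features.new_full((n, 1, h, w), float(t))  # one-channel map, every value equal to t
    return torch.cat([features, t_map], dim=1)         # result has shape (N, C + 1, H, W)
```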
Then, two warping parameter estimation modules (WPEMs, Figure 2), named WPEM$_0$ for $I_0$ and WPEM$_1$ for $I_1$, take $t$ and the output feature of the pyramid network as input. Each WPEM has three sub-networks, one for each parameter of the warping operator: the warping kernel $W$ and the warping offsets $\alpha$, $\beta$. The basic architecture of these sub-networks is the same as DecBlock. The network for $W$ also has a Softmax layer appended. We employ the AdaCoF operator [28] as the warping operator, for its flexibility and simplicity, to warp both input frames $I_0$, $I_1$ into $\hat{I}_0$, $\hat{I}_1$. Specifically, given the kernel width $K$ and the dilation factor $d$ as hyperparameters, the AdaCoF operator takes the properly padded input frame $I(i,j)$ of original size $H \times W$, the per-pixel convolution kernel $W(i,j,k,l)$, the per-pixel vertical offsets $\alpha(i,j,k,l)$, and the horizontal offsets $\beta(i,j,k,l)$ as input. The output image $\hat{I}(i,j)$ of the AdaCoF operator is calculated as follows:
$\hat{I}(i,j) = \sum_{k=0}^{K-1} \sum_{l=0}^{K-1} W(i,j,k,l) \cdot I\big(i + dk + \alpha(i,j,k,l),\; j + dl + \beta(i,j,k,l)\big)$
where $i, j$ are the coordinates of image pixels, and $k, l$ are the coordinates within the convolution kernel. In practice, the values in $\alpha$ and $\beta$ may not be integers, so the offset vectors can point to positions other than grid points. We use bilinear interpolation to obtain pixel values at those off-grid positions, which also makes the warping operator differentiable; thus, the whole model can be trained end-to-end.
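For clarity, the NumPy sketch below evaluates the AdaCoF equation above for a single-channel frame, using bilinear sampling at the off-grid positions. It is a naive reference under simplified border handling (coordinates are clamped to the image instead of reading from a padded frame) and is not the CUDA AdaCoF operator of [28] used in our model.

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Bilinearly sample img (H x W) at real-valued coordinates (y, x), clamped to the border."""
    H, W = img.shape
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    wy = np.clip(y, 0, H - 1) - y0
    wx = np.clip(x, 0, W - 1) - x0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x0 + 1]
            + wy * (1 - wx) * img[y0 + 1, x0] + wy * wx * img[y0 + 1, x0 + 1])

def adacof_warp(frame, kernel, alpha, beta, d=1):
    """Naive reference of the AdaCoF warping equation for one channel.

    frame        : (H, W) input frame
    kernel       : (H, W, K, K) per-pixel kernel weights (softmax-normalized)
    alpha, beta  : (H, W, K, K) per-pixel vertical / horizontal offsets
    d            : dilation factor
    """
    H, W = frame.shape
    K = kernel.shape[2]
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    out = np.zeros((H, W))
    for k in range(K):
        for l in range(K):
            y = ii + d * k + alpha[:, :, k, l]  # sampling row:    i + d*k + alpha(i, j, k, l)
            x = jj + d * l + beta[:, :, k, l]   # sampling column: j + d*l + beta(i, j, k, l)
            out += kernel[:, :, k, l] * bilinear_sample(frame, y, x)
    return out
```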
Figure 2. Architecture of the warping parameter estimation module (WPEM).
Finally, with $t$ and the output feature of the pyramid network as input, a DecBlock followed by a sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is employed to calculate the per-pixel fusion factor $V$. Then, $\hat{I}_0$ and $\hat{I}_1$ are fused element-wise into the final image $\hat{I}_t$, modulated by $V$, as a way to model the occlusion relationship between pixels. The modulation operation is calculated as follows:
$\hat{I}_t(i,j) = V(i,j) \cdot \hat{I}_0(i,j) + \big(1 - V(i,j)\big) \cdot \hat{I}_1(i,j).$

3.2.3. Algorithmic Explanation

To provide a comprehensive understanding of our model, we present pseudocode for ATVFI in Algorithm 1, which illustrates the step-by-step execution of the model and highlights the role of each component.
Algorithm 1: Pseudocode representation of ATVFI
 1: function ATVFI(I_0, I_1, t)
 2:     F_e1 ← EncBlock_1(I_0, I_1)
 3:     F_e2 ← EncBlock_2(F_e1)
 4:     F_e3 ← EncBlock_3(F_e2)
 5:     F_e4 ← EncBlock_4(F_e3)
 6:     F_e5 ← EncBlock_5(F_e4)
 7:     F_d5 ← DecBlock_5(F_e5, t)
 8:     F_d4 ← DecBlock_4(F_d5, F_e5, t)
 9:     F_d3 ← DecBlock_3(F_d4, F_e4, t)
10:     F_d2 ← DecBlock_2(F_d3, F_e3, t)
11:     α_0, β_0, W_0 ← WPEM_0(F_d2, F_e2, t)
12:     Î_0 ← AdaCoF(I_0, α_0, β_0, W_0)
13:     α_1, β_1, W_1 ← WPEM_1(F_d2, F_e2, t)
14:     Î_1 ← AdaCoF(I_1, α_1, β_1, W_1)
15:     V ← σ(DecBlock_V(F_d2, F_e2, t))
16:     Î_t ← V · Î_0 + (1 − V) · Î_1
17:     return Î_t
18: end function

3.3. Multi-Width Window Sampling

During training, existing multi-frame interpolation methods sample from high FPS videos using only a fixed-width window [10,14,21]. To be precise, given the window width $W$, the frames of a training video $I_0, I_1, \ldots, I_{n-1}$, and a valid sampling position index $i$, the input frames of a training sample are $I_i$ and $I_{i+W}$ (for methods that require four frames as input, $I_{i-W}$ and $I_{i+2W}$ are also needed), and the ground-truth frames are $I_{i+1}, I_{i+2}, \ldots, I_{i+W-1}$. However, simply following such a sampling process when training our model would lead to limited motion magnitudes. Moreover, the possible values of $t$ would be limited to $1/W, 2/W, \ldots, (W-1)/W$, which are at least $1/W$ apart from each other; this is not ideal for our model to learn a continuous mapping between $t$ and the output frame.
To enrich the variety of motion magnitudes between input frames, and to spread the possible values of $t$ more continuously over $[0, 1]$, we propose multi-width window sampling (MWWS). First, for each window width $W$ ranging from 2 to a maximum width $W_{\max}$, we perform fixed-window sampling of width $W$ over the videos and obtain a set of training samples, $\mathcal{D}_W$. Second, we combine all sets to obtain the final training set: $\mathcal{D}_{\mathrm{MWWS}} = \bigcup_{W=2}^{W_{\max}} \mathcal{D}_W$. In this way, the training samples in $\mathcal{D}_{\mathrm{MWWS}}$ gain not only a larger range of motion magnitudes between input frames but also greater diversity and continuity of values of $t$, as illustrated in Figure 3.
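The sampling procedure can be summarized by the short sketch below, which enumerates, for a single training video, all window positions for every width from 2 to $W_{\max}$, together with the resulting ground-truth indices and timesteps. Function and variable names are illustrative, not our training pipeline.

```python
def mwws_samples(num_frames, w_max):
    """Enumerate multi-width window samples for one training video.

    Each sample is (i, i + W, targets), where targets lists the ground-truth
    frame indices and their timesteps t = m / W inside the window.
    """
    samples = []
    for W in range(2, w_max + 1):            # window widths 2, 3, ..., W_max
        for i in range(num_frames - W):      # every valid window position
            targets = [(i + m, m / W) for m in range(1, W)]
            samples.append((i, i + W, targets))
    return samples

# For example, with w_max = 4 the windows have widths 2, 3, and 4,
# so t takes values in {1/2, 1/3, 2/3, 1/4, 2/4, 3/4}.
```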

4. Experiments

In this section, we perform a series of experiments on our proposed method. We first describe the basic experimental settings in Section 4.1, then we compare our proposed method to the related state-of-the-art frame interpolation methods, and analyze it quantitatively and qualitatively in Section 4.2; finally, we perform an ablation study on crucial components in our method in Section 4.3.

4.1. Basic Settings

Following common practice [8,10,28], our model is trained toward minimizing the $\ell_1$ norm of the distance between the output frame $\hat{I}_t$ and the ground-truth frame $I_t$, by optimizing the following loss function:
$\mathcal{L} = \sum_{i,j} \rho\big(\hat{I}_t(i,j) - I_t(i,j)\big)$
where $\rho(x) = \sqrt{x^2 + \epsilon^2}$ is the Charbonnier function, with the constant $\epsilon = 0.001$. Since our model requires supervision at varied timesteps, which the commonly used Vimeo90k [41] triplet dataset lacks, we train our model on the training split of the GoPro [42] dataset, a collection of 240-fps 720p videos. MWWS with $W_{\max} = 8$ is applied to our model during training by default. For a fair comparison with other methods, we retrain them on the GoPro training set. For each testing set, we evaluate every method and report its peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [43], calculated as follows:
$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \big(\hat{I}_t(i,j) - I_t(i,j)\big)^2}$
$\mathrm{SSIM} = \frac{(2\mu_O \mu_T + \epsilon_1)(2\sigma_{OT} + \epsilon_2)}{(\mu_O^2 + \mu_T^2 + \epsilon_1)(\sigma_O^2 + \sigma_T^2 + \epsilon_2)}$
where $\mu_{\{O,T\}}$ and $\sigma_{\{O,T\}}$ denote the mean and standard deviation of the output image $\hat{I}_t$ and the target image $I_t$, respectively, $\sigma_{OT}$ denotes the covariance between the two images, and $\epsilon_1$, $\epsilon_2$ are small constants that prevent the denominators from being zero.
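As a minimal sketch of the training objective and the PSNR metric defined above (assuming PyTorch tensors with pixel values in [0, 255]; SSIM is omitted for brevity):

```python
import torch

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier penalty rho(x) = sqrt(x^2 + eps^2), summed over all pixels."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).sum()

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```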

4.2. Comparison with State-of-the-Art

In this section, we evaluate and compare our proposed method with state-of-the-art methods. We first compare them under the setting of the most common 2 × interpolation in Section 4.2.1 and the arbitrary-scale interpolation in Section 4.2.2. Then, we compare our method with an iterative version of the state-of-the-art 2 × interpolation methods in Section 4.2.3. Finally, we compare the methods qualitatively in Section 4.2.4.

4.2.1. 2 × Interpolation

For 2 × interpolation, we evaluate the competing methods on the Vimeo90k triplet [41], UCF [44], and Middlebury [45] testing sets. We group the baseline methods by their capability for arbitrary-scale interpolation. The results are presented in Table 1. It is shown that our model achieves the best results on PSNR and SSIM metrics on the UCF and Middlebury testing sets, and secures the second position on the Vimeo90k testing set. We also measure the inference speed and GPU memory consumption by running each method with frames from the Vimeo90k dataset on a workstation with a Core i7-8700K CPU and one Nvidia RTX2080Ti GPU. We measure the average running time and occupied GPU memory size over 100 runs.

4.2.2. Arbitrary-Scale Multi-Frame Interpolation

For arbitrary-scale multi-frame interpolation, we evaluate methods on GoPro [42] and Adobe240 [46] testing sets. Similar to the sampling process in MWWS, we sample from testing videos with different values of W and obtain testing sets of different interpolation scales from 3 × to 7 × . We then report the performance of each one. Table 2 and Table 3 list the results of the GoPro and Adobe240 testing sets, respectively. Our model achieves the best results on all scale settings on the Adobe240 testing set and on most scale settings on the GoPro testing set.

4.2.3. Comparison with Iterative Interpolation

Iterative interpolation enables existing single-frame interpolation methods to generate more frames between input frames but suffers from the deficiencies elaborated in Section 1. To compare our method with them under multi-frame interpolation settings, similar to Section 4.2.2, we sample from test videos with a fixed window size $W = 8$ and obtain testing sets for 8× interpolation. The baseline single-frame interpolation models are applied iteratively, three times, to obtain seven interpolated frames. The results are presented in Table 4. Our model performs better than the iterative interpolation of any of the single-frame interpolation methods.

4.2.4. Qualitative Evaluation

We present several examples from the Vimeo90k testing set and compare the interpolated results visually, as shown in Figure 4. In the first example, our method generates the fingertip with clear edges and the strap loop with the correct shape and position. As for faces, our method generates background texture near the human face with the least blur and distortion.

4.3. Ablation Study

In this section, we conduct an ablation study on our proposed method. First, we test various designs of adopting the timestep in the decoder network in Section 4.3.1. Then, we analyze the effect of MWWS on our model in Section 4.3.2.

4.3.1. Method for Adopting the Timestep as Input

To explore more ways for our decoder network to adopt t as input, we compare the performance of our model (denoted as “Ours-original”) with the following decoder variants:
  • Variant “Ours-lastonly”: Only the sub-networks in the warping modules adopt t as input, while the pyramid part of the decoder network only takes encoder features and previous-level decoder features as input.
  • Variant “Ours-allconv”: Apart from the first convolution layer in each level of the decoder network, all other convolution layers also adopt t as extra input in the same way.
The results are presented in Table 5. The differences between the variants are minor, indicating that our model is robust to the choice of where $t$ is adopted as input.

4.3.2. Effect of Multi-Width Window Sampling

To show the effectiveness of multi-width window sampling (MWWS) in augmenting the dataset, we compare our model with one trained on the GoPro dataset sampled by a fixed window width of 8 (denoted as “Ours (w/o MWWS)”). The results are presented in Table 6. The large performance margin shows that the great variety of timestep values produced by MWWS is a crucial part of training our model.

5. Discussion and Conclusions

In this work, we proposed a new arbitrary timestep video frame interpolation (ATVFI) network model. The proposed ATVFI is built on an encoder–decoder architecture, where the interpolation timestep is fed as an extra input to the decoder, enabling ATVFI to interpolate video at any scale. For high-scale interpolation, this paradigm avoids the error accumulation of iterative interpolation. Also, in contrast to flow-based methods that explicitly model motion as flows and assume linear motion, ATVFI models complex motion implicitly with the power of neural networks. To provide more supervision of motion magnitudes and of the correspondence between an output frame and its timestep, MWWS is proposed to split training videos into samples with multiple window widths, which better leverages training videos for arbitrary timestep video interpolation. Our ATVFI has been thoroughly evaluated on several testing datasets and achieved results comparable to state-of-the-art methods, notably a PSNR of 32.50 dB on the Vimeo90k testing set.
In the future, our ATVFI could be further improved in terms of performance and efficiency. Currently, the warping parameters generated by the WPEMs are guided solely by the gradient from the differentiable AdaCoF operator during training. Although this works well, the warping parameters could benefit from extra supervision, e.g., from optical flow models pre-trained on large datasets. In addition, MWWS is unable to produce timestep t values that are very close to 0 or 1, as shown in Figure 3. How to produce training samples with such t values is left as future work. Furthermore, our ATVFI has the potential to be extended to more complicated tasks, e.g., joint video deblurring and arbitrary timestep video frame interpolation.

Supplementary Materials

The source code for this article is accessible at https://github.com/madmusician/time_dependent_decoding.

Author Contributions

Conceptualization, H.Z. and W.Z.; methodology, H.Z. and W.Z.; software, H.Z.; validation, H.Z.; formal analysis, H.Z., D.R. and W.Z.; investigation, H.Z.; resources, H.Z.; data curation, H.Z.; writing—original draft preparation, H.Z.; writing—review and editing, D.R., Z.Y. and W.Z.; visualization, H.Z.; supervision, D.R., Z.Y. and W.Z.; project administration, W.Z.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key Research and Development Program of China under grant no. 2022YFA1004103 and the National Natural Science Foundation of China (NSFC) under grant no. U19A2073.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Vimeo90k [41] is available at http://toflow.csail.mit.edu/. UCF-101 [44] is available at https://www.crcv.ucf.edu/data/UCF101.php. Middlebury [45] is available at http://vision.middlebury.edu/flow/. GoPro [42] is available at https://seungjunnah.github.io/Datasets/gopro. Adobe240 [46] is available at https://www.cs.ubc.ca/labs/imager/tr/2017/DeepVideoDeblurring/. Accessed on 9 January 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VFI    video frame interpolation
FPS    frames per second
ATVFI  arbitrary timestep video frame interpolation
MWWS   multi-width window sampling
WPEM   warping parameter estimation module

References

  1. Niklaus, S.; Mai, L.; Liu, F. Video Frame Interpolation via Adaptive Separable Convolution. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 261–270. [Google Scholar] [CrossRef]
  2. Niklaus, S.; Liu, F. Context-Aware Synthesis for Video Frame Interpolation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1701–1710. [Google Scholar] [CrossRef]
  3. Gui, S.; Wang, C.; Chen, Q.; Tao, D. FeatureFlow: Robust Video Interpolation via Structure-to-Texture Generation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 14001–14010. [Google Scholar] [CrossRef]
  4. Reda, F.A.; Kontkanen, J.; Tabellion, E.; Sun, D.; Pantofaru, C.; Curless, B. FILM: Frame Interpolation for Large Motion. arXiv 2022, arXiv:2202.04901. [Google Scholar]
  5. Peleg, T.; Szekely, P.; Sabo, D.; Sendik, O. IM-Net for High Resolution Video Frame Interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 2398–2407. [Google Scholar] [CrossRef]
  6. Bao, W.; Lai, W.; Zhang, X.; Gao, Z.; Yang, M. MEMC-Net: Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 933–948. [Google Scholar] [CrossRef] [PubMed]
  7. Jiang, H.; Sun, D.; Jampani, V.; Yang, M.; Learned-Miller, E.G.; Kautz, J. Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9000–9008. [Google Scholar] [CrossRef]
  8. Bao, W.; Lai, W.; Ma, C.; Zhang, X.; Gao, Z.; Yang, M. Depth-Aware Video Frame Interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 3703–3712. [Google Scholar] [CrossRef]
  9. Sim, H.; Oh, J.; Kim, M. XVFI: eXtreme Video Frame Interpolation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 14469–14478. [Google Scholar] [CrossRef]
  10. Cheng, X.; Chen, Z. Multiple Video Frame Interpolation via Enhanced Deformable Separable Convolution. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7029–7045. [Google Scholar] [CrossRef] [PubMed]
  11. Kong, L.; Jiang, B.; Luo, D.; Chu, W.; Huang, X.; Tai, Y.; Wang, C.; Yang, J. IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation. arXiv 2022, arXiv:2205.14620. [Google Scholar] [CrossRef]
  12. Huang, Z.; Zhang, T.; Heng, W.; Shi, B.; Zhou, S. Real-Time Intermediate Flow Estimation for Video Frame Interpolation. In Proceedings of the 17th European Conference on Computer Vision, ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13674, pp. 624–642. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Sung, Y. Traffic Accident Detection Using Background Subtraction and CNN Encoder–Transformer Decoder in Video Frames. Mathematics 2023, 11, 2884. [Google Scholar] [CrossRef]
  14. Kalluri, T.; Pathak, D.; Chandraker, M.; Tran, D. FLAVR: Flow-Agnostic Video Representations for Fast Frame Interpolation. arXiv 2020, arXiv:2012.08512. [Google Scholar]
  15. Liu, Z.; Yeh, R.A.; Tang, X.; Liu, Y.; Agarwala, A. Video Frame Synthesis Using Deep Voxel Flow. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 4473–4481. [Google Scholar] [CrossRef]
  16. Niklaus, S.; Liu, F. Softmax Splatting for Video Frame Interpolation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 5436–5445. [Google Scholar] [CrossRef]
  17. Park, J.; Ko, K.; Lee, C.; Kim, C. BMBC: Bilateral Motion Estimation with Bilateral Cost Volume for Video Interpolation. In Proceedings of the 16th European Conference on Computer Vision, ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12359, pp. 109–125. [Google Scholar] [CrossRef]
  18. Park, J.; Lee, C.; Kim, C. Asymmetric Bilateral Motion Estimation for Video Frame Interpolation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 14519–14528. [Google Scholar] [CrossRef]
  19. Xu, X.; Si-Yao, L.; Sun, W.; Yin, Q.; Yang, M. Quadratic Video Interpolation. In Proceedings of the Annual Conference on Neural Information Processing Systems 2019—Advances in Neural Information Processing Systems 32, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R., Eds.; pp. 1645–1654. [Google Scholar]
  20. Liu, Y.; Xie, L.; Li, S.; Sun, W.; Qiao, Y.; Dong, C. Enhanced Quadratic Video Interpolation. In Proceedings of the 2020 Workshops on Computer Vision, Glasgow, UK, 23–28 August 2020; Bartoli, A., Fusiello, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12538, pp. 41–56. [Google Scholar] [CrossRef]
  21. Chi, Z.; Nasiri, R.M.; Liu, Z.; Lu, J.; Tang, J.; Plataniotis, K.N. All at Once: Temporally Adaptive Multi-frame Interpolation with Advanced Motion Modeling. In Proceedings of the 16th European Conference on Computer Vision, ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12372, pp. 107–123. [Google Scholar] [CrossRef]
  22. Chen, Z.; Wang, R.; Liu, H.; Wang, Y. PDWN: Pyramid Deformable Warping Network for Video Interpolation. arXiv 2021, arXiv:2104.01517. [Google Scholar] [CrossRef]
  23. Lu, L.; Wu, R.; Lin, H.; Lu, J.; Jia, J. Video Frame Interpolation with Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 3522–3532. [Google Scholar] [CrossRef]
  24. Jin, X.; Wu, L.; Chen, J.; Chen, Y.; Koo, J.; Hahm, C.H. A Unified Pyramid Recurrent Network for Video Frame Interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  25. Li, Z.; Zhu, Z.L.; Han, L.H.; Hou, Q.; Guo, C.L.; Cheng, M.M. AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  26. Heo, J.; Jeong, J. Forward Warping-Based Video Frame Interpolation Using a Motion Selective Network. Electronics 2022, 11, 2553. [Google Scholar] [CrossRef]
  27. Niklaus, S.; Mai, L.; Liu, F. Video Frame Interpolation via Adaptive Convolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 2270–2279. [Google Scholar] [CrossRef]
  28. Lee, H.; Kim, T.; Chung, T.; Pak, D.; Ban, Y.; Lee, S. AdaCoF: Adaptive Collaboration of Flows for Video Frame Interpolation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 5315–5324. [Google Scholar] [CrossRef]
  29. Shi, Z.; Liu, X.; Shi, K.; Dai, L.; Chen, J. Video Frame Interpolation via Generalized Deformable Convolution. IEEE Trans. Multim. 2022, 24, 426–439. [Google Scholar] [CrossRef]
  30. Ding, T.; Liang, L.; Zhu, Z.; Zharkov, I. CDFI: Compression-Driven Network Design for Frame Interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, 19–25 June 2021; pp. 8001–8011. [Google Scholar]
  31. Meyer, S.; Djelouah, A.; McWilliams, B.; Sorkine-Hornung, A.; Gross, M.H.; Schroers, C. PhaseNet for Video Frame Interpolation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 498–507. [Google Scholar] [CrossRef]
  32. Tran, Q.N.; Yang, S.H. Efficient Video Frame Interpolation Using Generative Adversarial Networks. Appl. Sci. 2020, 10, 6245. [Google Scholar] [CrossRef]
  33. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; pp. 5998–6008. [Google Scholar]
  35. Singh, P.; Verma, V.K.; Rai, P.; Namboodiri, V.P. HetConv: Heterogeneous Kernel-Based Convolutions for Deep CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 4835–4844. [Google Scholar] [CrossRef]
  36. Liu, B.; Chen, K.; Peng, S.L.; Zhao, M. Depth Map Super-Resolution Based on Semi-Couple Deformable Convolution Networks. Mathematics 2023, 11, 4556. [Google Scholar] [CrossRef]
  37. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the 8th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., III, W.M.W., Frangi, A.F., Eds.; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9351, pp. 234–241. [Google Scholar] [CrossRef]
  38. Barbu, T. CNN-Based Temporal Video Segmentation Using a Nonlinear Hyperbolic PDE-Based Multi-Scale Analysis. Mathematics 2023, 11, 245. [Google Scholar] [CrossRef]
  39. Abuqaddom, I.; Mahafzah, B.A.; Faris, H. Oriented stochastic loss descent algorithm to train very deep multi-layer neural networks without vanishing gradients. Knowl.-Based Syst. 2021, 230, 107391. [Google Scholar] [CrossRef]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  41. Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video Enhancement with Task-Oriented Flow. Int. J. Comput. Vis. 2019, 127, 1106–1125. [Google Scholar] [CrossRef]
  42. Nah, S.; Kim, T.H.; Lee, K.M. Deep Multi-scale Convolutional Neural Network for Dynamic Scene Deblurring. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 257–265. [Google Scholar] [CrossRef]
  43. Batchuluun, G.; Koo, J.H.; Kim, Y.H.; Park, K.R. Image Region Prediction from Thermal Videos Based on Image Prediction Generative Adversarial Network. Mathematics 2021, 9, 1053. [Google Scholar] [CrossRef]
  44. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
  45. Baker, S.; Scharstein, D.; Lewis, J.P.; Roth, S.; Black, M.J.; Szeliski, R. A Database and Evaluation Methodology for Optical Flow. Int. J. Comput. Vis. 2011, 92, 1–31. [Google Scholar] [CrossRef]
  46. Su, S.; Delbracio, M.; Wang, J.; Sapiro, G.; Heidrich, W.; Wang, O. Deep Video Deblurring for Hand-Held Cameras. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 237–246. [Google Scholar] [CrossRef]
Figure 3. Illustrative comparison of the possible values of timestep t on the time axis, either from fixed-width window sampling with W = 5 (the axis on the top) or from multi-width window sampling (MWWS) with W max = 5 (the axis on the bottom), which shows that MWWS increases the diversity and continuity of values of t. The colors of the points on the axis indicate whether the point is a third, a quarter, or a fifth of the time interval [ 0 , 1 ] .
Figure 4. Visual comparison for results from AdaCoF [28], DAIN [8], EDSC [10], IFRNet [11], RIFE [12], VFIformer [23] and our model on Vimeo90k dataset.
Table 1. Comparison of the 2× interpolation performance (PSNR/SSIM on each testing set).
Method | Vimeo90k | Middlebury | UCF | Time (s) | Memory (GB)
SepConv [1] | 29.76/0.874 | 29.60/0.848 | 29.18/0.899 | 0.011 | 0.76
AdaCoF [28] | 31.98/0.920 | 32.40/0.906 | 31.56/0.935 | 0.011 | 0.76
CAIN [2] | 30.32/0.885 | 29.12/0.830 | 28.99/0.892 | 0.019 | 0.97
VFIformer [23] | 33.08/0.944 | 33.00/0.933 | 31.38/0.940 | 0.344 | 2.88
Super SloMo [7] | 31.48/0.920 | 31.35/0.905 | 29.32/0.910 | 0.017 | 0.78
DAIN [8] | 31.73/0.930 | 32.04/0.913 | 30.18/0.926 | 0.272 | 2.17
XVFI [9] | 30.12/0.899 | 30.77/0.889 | 29.54/0.916 | 0.029 | 1.12
EDSC [10] | 31.57/0.914 | 31.90/0.896 | 31.03/0.927 | 0.015 | 0.82
IFRNet [11] | 32.23/0.927 | 32.89/0.925 | 31.52/0.935 | 0.007 | 0.79
RIFE [12] | 30.41/0.885 | 31.66/0.897 | 31.48/0.935 | 0.006 | 0.78
UPR-Net [24] | 31.73/0.920 | 31.95/0.911 | 30.98/0.932 | 0.021 | 0.85
AMT [25] | 32.24/0.929 | 32.81/0.917 | 31.28/0.934 | 0.015 | 0.81
Ours | 32.50/0.931 | 33.65/0.933 | 32.01/0.941 | 0.011 | 0.95
Table 2. Comparison of arbitrary-scale multi-frame interpolation performance results on GoPro (PSNR/SSIM).
Method | 3× | 4× | 5× | 6× | 7×
Super SloMo [7] | 35.39/0.969 | 34.10/0.957 | 32.80/0.941 | 31.64/0.923 | 30.50/0.902
DAIN [8] | 30.70/0.922 | 29.35/0.893 | 27.67/0.862 | 26.65/0.837 | 25.65/0.811
EDSC [10] | 35.52/0.967 | 34.17/0.955 | 32.84/0.940 | 31.69/0.923 | 30.57/0.902
XVFI [9] | 31.21/0.928 | 29.69/0.901 | 27.90/0.872 | 26.80/0.847 | 25.80/0.825
IFRNet [11] | 35.61/0.967 | 34.37/0.957 | 32.97/0.942 | 31.86/0.927 | 30.84/0.909
RIFE [12] | 31.06/0.919 | 29.00/0.880 | 27.47/0.846 | 26.31/0.817 | 25.36/0.791
UPR-Net [24] | 35.14/0.9633 | 34.06/0.955 | 32.92/0.942 | 31.98/0.928 | 30.94/0.910
AMT [25] | 35.03/0.963 | 34.33/0.957 | 32.82/0.943 | 32.01/0.930 | 30.94/0.913
Ours | 36.43/0.972 | 34.78/0.960 | 33.24/0.943 | 31.94/0.925 | 30.70/0.903
Table 3. Comparison of arbitrary-scale multi-frame interpolation performance results on Adobe240 (PSNR/SSIM).
Method | 3× | 4× | 5× | 6× | 7×
Super SloMo [7] | 33.92/0.962 | 33.43/0.957 | 32.93/0.952 | 32.42/0.946 | 31.87/0.938
DAIN [8] | 30.76/0.921 | 29.87/0.904 | 28.90/0.885 | 28.24/0.870 | 27.56/0.855
EDSC [10] | 34.65/0.961 | 34.19/0.957 | 33.62/0.951 | 33.05/0.945 | 32.46/0.937
XVFI [9] | 31.75/0.937 | 30.63/0.920 | 29.49/0.904 | 28.71/0.891 | 27.96/0.878
IFRNet [11] | 35.32/0.966 | 34.69/0.962 | 33.93/0.954 | 33.21/0.948 | 32.59/0.940
RIFE [12] | 31.79/0.931 | 30.44/0.910 | 29.40/0.891 | 28.56/0.874 | 27.84/0.858
UPR-Net [24] | 34.72/0.964 | 34.28/0.960 | 33.71/0.954 | 33.17/0.948 | 32.53/0.939
AMT [25] | 34.12/0.961 | 33.99/0.960 | 33.36/0.954 | 32.76/0.948 | 32.28/0.941
Ours | 37.00/0.973 | 35.98/0.968 | 35.05/0.961 | 34.20/0.953 | 33.37/0.944
Table 4. Comparison between our method and iterative interpolation (PSNR/SSIM).
Method | GoPro (8×) | Adobe240 (8×)
SepConv [1] (iterative) | 27.87/0.843 | 28.10/0.857
AdaCoF [28] (iterative) | 29.34/0.877 | 31.27/0.918
CAIN [2] (iterative) | 27.88/0.834 | 27.75/0.855
VFIformer [23] (iterative) | 28.70/0.864 | 29.16/0.889
Ours | 29.62/0.881 | 32.60/0.913
Table 5. Comparison between ATVFI variants (PSNR/SSIM).
Variant | Vimeo90k (2×) | GoPro (8×) | Adobe240 (8×)
Ours-original | 32.50/0.931 | 29.62/0.881 | 32.60/0.935
Ours-lastonly | 32.52/0.932 | 29.63/0.881 | 32.70/0.938
Ours-allconv | 32.60/0.933 | 29.67/0.881 | 32.72/0.937
Table 6. Results of the ablation study on MWWS (PSNR/SSIM).
Method | Vimeo90k (2×) | GoPro (8×) | Adobe240 (8×)
Ours | 32.50/0.931 | 29.62/0.881 | 32.60/0.935
Ours (w/o MWWS) | 31.46/0.909 | 29.39/0.877 | 31.83/0.924
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
