Article

Video Frame Interpolation for Extreme Motion Scenes Based on Dual Alignment and Region-Adaptive Interaction

1 Shanghai Film Academy, Shanghai University, Shanghai 200072, China
2 Data and Target Engineering, Information Engineering University, Zhengzhou 450000, China
3 College of Mechanical Engineering, Taiyuan University of Technology, Taiyuan 030024, China
4 Shanghai Engineering Research Center of Motion Picture Special Effects, Shanghai 200072, China
* Authors to whom correspondence should be addressed.
Symmetry 2025, 17(12), 2097; https://doi.org/10.3390/sym17122097
Submission received: 1 November 2025 / Revised: 28 November 2025 / Accepted: 3 December 2025 / Published: 6 December 2025
(This article belongs to the Special Issue Symmetry in Artificial Intelligence and Applications)

Abstract

Video frame interpolation in ultra-high-definition extreme motion scenes remains highly challenging due to large displacements, nonlinear motion, and occlusions that disrupt spatiotemporal symmetry. To address this issue, this study proposes a frame interpolation method for extreme motion scenes based on dual alignment and region-adaptive interaction, approached from the perspectives of cross-frame localization and adaptive reconstruction. Specifically, we design a two-stage motion information alignment strategy that obtains two types of motion information via optical flow estimation and offset estimation and uses them to progressively guide reference pixels toward accurate long-range cross-frame localization, mitigating the structural misalignment caused by limited receptive fields while alleviating the spatiotemporal asymmetry caused by inconsistent inter-frame motion speed and direction. Building on this, we introduce a region-adaptive interaction module that automatically adapts motion representations for different regions through cross-frame interaction and leverages distinct attention pathways to accurately capture both the global context and local high-frequency motion details, achieving dynamic feature fusion tailored to regional characteristics and significantly enhancing the model’s ability to perceive the overall structure and texture details in extreme motion scenarios. In addition, a motion compensation module explicitly captures pixel motion relationships by constructing a global correlation matrix, compensating for the positioning errors of the dual alignment module in extreme motion or occlusion areas. The experimental results demonstrate that the proposed method achieves excellent overall performance in ultra-high-definition extreme motion scenes, with a PSNR improvement of 0.05 dB over state-of-the-art methods. In multi-frame interpolation tasks, it achieves an average PSNR gain of 0.31 dB, demonstrating strong cross-scene interpolation capability.

1. Introduction

Video frame interpolation aims to increase the frame rate of a video by synthesizing new frames within consecutive video sequences. This can effectively mitigate common issues in low-quality videos, such as jitter and blur, and it is therefore widely used to balance the video coding rate and quality [1], support view synthesis [2], enhance video quality [3], and allow video deblurring [4,5].
The massive generation of ultra-high-definition (UHD) video (4K or 8K) data poses a significant challenge to current deep-learning-based video frame interpolation methods. In [6], Sim et al. first introduced the concept of extreme motion in the context of video frame interpolation to describe fast, large-magnitude pixel movements present in UHD videos. Extreme motion refers to high-speed dynamic changes in which pixel displacements far exceed the typical distribution found in conventional video datasets. Such motions often extend beyond the receptive field of deep neural networks, making it difficult for models to capture long-range pixel correspondences. Meanwhile, these scenarios are frequently accompanied by complex occlusions and nonlinear lighting variations, which disrupt the temporal symmetry and motion symmetry that should exist between the input frames I 0 and I 1 and the intermediate frame I 0.5 , thereby imposing stricter demands on the spatiotemporal modeling capabilities of existing video frame interpolation methods.
Recently, some works [6,7,8] have proposed methods for frame interpolation in 4K videos and extreme motion scenes. However, these approaches [9,10] typically rely on a single optical flow estimation to infer pixel trajectories. When motion magnitudes exceed the network’s receptive field, the models struggle to maintain spatiotemporal symmetry between frames. In particular, under large-scale motion and severe occlusions, errors in optical flow estimation tend to accumulate, resulting in structural misalignment and texture degradation in the reconstructed intermediate frames. As shown in Figure 1, noticeable artifacts and blurring (red arrows) appear in the results of current SOTA methods, along with information loss at occlusion boundaries (yellow arrows).
To address the limitations of existing methods in handling extreme motion scenes in UHD videos, this paper proposes a video interpolation method based on dual alignment and region-adaptive interaction. The framework consists of two stages. In the first stage, we design a dual alignment module (DAM), which simultaneously estimates optical flow and offsets to obtain two types of motion information. These are applied progressively in a stepwise manner to achieve precise alignment of reference pixels between adjacent frames, effectively reducing the difficulty of motion fitting. Unlike mainstream flow-based methods [9,10,11], the DAM aligns the input frames primarily along the temporal dimension, alleviating spatiotemporal asymmetry caused by nonlinear motion speed variations and inconsistent inter-frame motion directions.
In the second stage, inspired by the frequency-aware strategy, we propose a region-adaptive interaction module (RAIM) to jointly model global semantics and local motion features. As shown in Figure 2, we apply the discrete Fourier transform (DFT) to 4K video frames and compute the RGB frequency-domain difference map by subtracting the spectra of consecutive frames. It can be observed that large-motion regions correspond to local high-frequency components, while the overall semantic structure is mainly represented by low-frequency components. This frequency-domain distinction reveals the distribution pattern of motion and structural features in videos. Based on this observation, the RAIM first employs a window partitioning strategy, dividing adjacent frames into regions of different scales to separate local motion from global structural information. Then, a cross-frame interaction strategy establishes adaptive correlations among motion regions at different scales. Next, a region-separated attention mechanism (RSAM) is introduced to learn from these correlated regions, simultaneously capturing global context and local high-frequency motion information. Unlike existing methods [9,12,13,14,15] that rely solely on spatial feature stacking or a single attention mechanism, the RAIM effectively restores the spatiotemporal symmetry disrupted in extreme motion scenarios through frequency-inspired, region-adaptive interactions, significantly enhancing the model’s holistic perception of motion–structure relationships. Additionally, to compensate for errors in pixel localization from the DAM, we design a motion compensation module (MCM), which explicitly computes the displacement vector field between pixels based on a global correlation matrix, providing additional motion information to further improve interpolation accuracy.
Our contributions can be summarized as follows:
  • We propose a novel video frame interpolation method for extreme motion scenes in UHD videos.
  • We design a DAM to reduce pixel localization difficulty through dual motion information, prevent motion error accumulation, and guide spatiotemporal information aggregation across frames.
  • We propose an RAIM that learns different timestamp information through adaptive interaction between motion regions of varying scales in neighboring frames, effectively enhancing the perception of the overall structure and texture details.
  • Through experiments, our method achieves superior interpolation performance in high-resolution, extreme motion scenes and exhibits stronger robustness and stability in cross-scene generalization.

2. Related Work

Existing video frame interpolation methods are primarily categorized into flow-based [16,17,18] and kernel-based methods [12,13].

2.1. Flow-Based Methods

Flow-based methods estimate bidirectional optical flow between input frames to fit the motion trajectories of pixels in intermediate frames. A representative method is SuperSloMo [16], which employs a predefined linear motion assumption to resample bidirectional optical flow between input frames, thereby obtaining optical flow at intermediate time points. To better model complex real-world motion, several studies improve inter-frame motion trajectories. For example, QVI [17] performs interpolation using quadratic trajectories constructed from acceleration information between video frames. Other works introduce coarse-to-fine optical flow estimation frameworks to adaptively handle motions of varying magnitudes [18,19,20]. Jin et al. [19] propose EBME, which employs enhanced bidirectional motion estimation to achieve a flexible and compact network design with only 3.9 M parameters; building on this, they further propose UPR-Net [20], which unifies motion estimation and frame synthesis within a single pyramid recurrent network, realizing a coarse-to-fine frame synthesis process. To improve computational efficiency, Huang et al. [11] propose RIFE, which uses IFNet, a network based on convolution and deconvolution operations, to estimate intermediate flow in an end-to-end and real-time manner. EMA-VFI [9] employs a hybrid CNN-Transformer architecture and designs an inter-frame attention mechanism to simultaneously capture motion and appearance information, providing an important reference for subsequent work [21]. For large motion scenes, Liu et al. [10] propose SGM-VFI, which detects defective regions in intermediate optical flow through sparse global matching and demonstrates excellent performance on large motion benchmark datasets.
Although these methods achieve significant progress, in extreme motion scenes, the large motion magnitude causes single optical flow estimation to easily accumulate errors, making it difficult to accurately model pixel trajectories and, thereby, adversely affecting the frame rendering quality.

2.2. Kernel-Based Methods

The pioneering work in this category is AdaConv, an adaptive convolution kernel proposed by Niklaus et al. [22] in 2017. Its core concept involves performing convolution operations on local pixels of two input frames through an end-to-end DNN model, without requiring additional optical flow input. To reduce model parameters, Niklaus et al. [12] further decompose the 2D convolution kernels into pairs of 1D kernels, resulting in SepConv. However, both AdaConv and SepConv rely on fixed-shape kernels, which limits their ability to handle motions that exceed the predefined kernel size. To address this challenge, Lee et al. [13] propose AdaCoF, a general and flexible transformation module based on Deformable Convolution (DConv) [23], which adaptively estimates multiple offsets for each target pixel. Inspired by DConv, Cheng et al. introduce DSepConv [24] and EDSC [25]. To further expand the sampling range on the input frames, Shi et al. [26] propose GDConvNet based on generalized deformable convolution. To address the issue of large parameter counts in AdaCoF, Ding et al. [27] propose the compression-driven optimization scheme CDFI. In addition, some works introduce a Vision Transformer (ViT) [28] into video frame interpolation [29,30], effectively addressing the issue of limited receptive fields in CNNs.
Overall, while kernel-based methods can prevent the accumulation of optical flow errors, they still exhibit significant limitations in capturing the global dependencies of inter-frame pixels in extreme motion scenes.

3. UHD4K120FPS-N Dataset

To address the task of frame interpolation for extreme motion scenes in UHD video, we construct a 4K@120fps dataset named UHD4K120FPS-N. Specifically, we capture a total of 300 HDR raw video clips using a SONY Alpha1 camera (Tokyo, Japan), with each clip lasting 5 to 20 s, covering diverse real-world extreme motion scenarios. To ensure data quality, low-quality videos exhibiting blur or defocus are first manually filtered out. Then, scene segmentation is performed on the raw videos using shot detection and background similarity removal methods [31], splitting them into coherent frame sequences. Each sequence is sampled in groups of nine consecutive frames (frame1.png to frame9.png), with each group treated as an independent scene.
Due to GPU memory limitations, the raw 4K videos are temporarily downsampled to a resolution of 1920 × 1080 for optical flow computation using the RAFT-large algorithm [32]. This preprocessing step is solely for motion analysis and scene filtering, without affecting the fidelity of the original 4K data. After calculating the optical flow magnitude between the first and last frames (equivalent to the average optical flow magnitude between input frames at 30 fps), sequences with an average flow magnitude below 3 pixels are removed, resulting in 145 valid samples. Fifteen sequences are randomly selected as test samples based on the percentile distribution, with the remaining sequences being used for training. Training samples are randomly cropped multiple times at a size of 128 × 128 at the same spatial position across 9 consecutive frames. Each sample is required to maintain an average optical flow magnitude of no less than 3 pixels between neighboring frames, with outliers removed. This yields 50,400 training samples, denoted as UHD-N-Train, with the average optical flow magnitude between neighboring frames ranging from 3 to 164.9 pixels. The test samples, denoted as UHD-N-Test, exhibit an average optical flow magnitude between neighboring frames ranging from 8.5 to 244.6 pixels.
Figure 3 shows some samples from the UHD4K120FPS-N dataset and their corresponding optical flow magnitudes. Table 1 compares the optical flow magnitude and high-frequency feature ratio between the benchmark datasets and our dataset. The high-frequency feature ratio measures the frequency-domain information of regions with large pixel displacements. It is quantified by calculating the proportion of energy in the high-frequency regions of the image power spectrum, where regions with a distance from the spectrum center exceeding 0.2 are defined as high-frequency regions. Higher values indicate that the image contains more details and textures. The results show that the UHD4K120FPS-N dataset exhibits higher optical flow magnitudes and high-frequency feature ratios, demonstrating its clear advantage for training models on extreme motion scenes.
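To make the metric concrete, the following is a minimal sketch of how such a high-frequency feature ratio could be computed with NumPy. It assumes the 0.2 threshold is a distance from the spectrum center normalized by the half-diagonal of the centered spectrum and that the ratio is evaluated on a grayscale frame; the function name and these choices are our assumptions, not the released code.

```python
import numpy as np

def high_frequency_ratio(gray_frame: np.ndarray, radius_frac: float = 0.2) -> float:
    """Fraction of power-spectrum energy outside a central low-frequency disc.

    `gray_frame` is a 2-D grayscale image; `radius_frac` is the cutoff radius
    relative to the half-diagonal of the centered spectrum (0.2 in the paper).
    """
    spectrum = np.fft.fftshift(np.fft.fft2(gray_frame))
    power = np.abs(spectrum) ** 2

    h, w = gray_frame.shape
    cy, cx = h / 2.0, w / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalized distance of each frequency bin from the spectrum center.
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2) / np.sqrt(cy ** 2 + cx ** 2)

    high_mask = dist > radius_frac
    return float(power[high_mask].sum() / power.sum())
```

A higher returned value corresponds to images with more fine detail and texture, which is what Table 1 reports for the benchmark comparison.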

4. Methods

This task aims to synthesize the intermediate frame $I_t$ ($t = 0.5$) from two given RGB frames $I_0, I_1 \in \mathbb{R}^{H \times W \times 3}$. As illustrated in Figure 4, the overall framework consists of three main components: low-level feature extraction (LFE), inter-frame feature interaction perception (IFIP), and predicted frame reconstruction (PFR). The proposed method adopts a hybrid structure of a CNN and Transformer, consisting of three convolutional layers and two Transformer layers, and the entire process is divided into four stages. In the first stage, the CNN extracts low-level features $F_l^i$. In the second stage, $F_l^i$ is simultaneously fed into the optical flow estimation network (OFENet) and the DAM. OFENet performs bidirectional optical flow estimation to obtain $f_{0\to1}^i$ and $f_{1\to0}^i$, while the DAM utilizes flow and offset information to achieve precise alignment of long-range reference pixels across frames. In the third stage, the aligned features $F_i^{align}$ ($i = 1, 2$) are passed into the interaction perception Transformer block (IPTBlock), which is built on the Transformer architecture and integrates the RAIM and the MCM to enhance cross-frame feature fusion and supplement additional motion information. In the fourth stage, $F_{st}^i$, $F_m^i$, and $F_l^i$ are fused and input into PFR, where the intermediate frame $I_{0.5}$ is generated through motion estimation and a refinement network.

4.1. Dual Alignment Module

To mitigate the spatiotemporal asymmetry caused by large-magnitude nonlinear motion, the DAM adopts a coarse-to-fine dual alignment strategy, and its overall structure is illustrated in Figure 5a. In the first stage, the optical flow alignment module (OFAM) warps cross-frame features using bidirectional optical flow, providing a temporally consistent coarse alignment that partially alleviates frame misalignment caused by long-range nonlinear displacements. In the second stage, the offset alignment module (OAM) estimates spatial offsets based on local pixel cosine similarity, further refining pixel positions and correcting local inaccuracies introduced by optical-flow warping to achieve fine-grained alignment. This hierarchical design allows the OFAM to handle coarse temporal alignment while the OAM mitigates residual spatial deviations. The dual-alignment mechanism ultimately provides a stable feature alignment foundation for subsequent attention computations, effectively alleviating spatiotemporal asymmetry caused by nonlinear motion and enhancing structural consistency in extreme motion scenes.

4.1.1. Optical Flow Alignment Module

For clarity, this section takes the alignment from input frame $I_0$ to $I_1$ as an example. Given two input features $F_0^l, F_1^l \in \mathbb{R}^{C \times H \times W}$, we first construct an empty sampling grid $g(i,j)$ according to the tensor shape of $F_0^l$. Adding $g(i,j)$ to the optical flow field $f_{0\to1}$ yields the target sampling coordinates. For a pixel $p(i,j)$ on the feature map, its corresponding position in $F_1^l$ is $p(i + \Delta x_{i,j},\, j + \Delta y_{i,j})$, where $(\Delta x, \Delta y)$ denotes the optical flow magnitude at position $(i,j)$ in $F_0^l$. The corresponding sampling position in $F_1^l$ for any pixel on the entire feature map is:
$\tilde{g}(i,j) = g(i,j) + f_{0\to1}(i,j) = \big(i + \Delta x_{i,j},\; j + \Delta y_{i,j}\big)$
After obtaining the sampling locations, $\tilde{g}(i,j)$ is normalized. Finally, the aligned feature $F_{flow_{0\to1}}^{align}$ is obtained from $F_1^l$ through the sampling mapping function $\mathcal{S}(\cdot)$, thereby achieving reference pixel alignment from $I_0$ to $I_1$. The computation is as follows:
$F_{flow_{0\to1}}^{align} = \mathcal{S}\big(F_1^l,\, \tilde{g}(i,j)\big)$
where $\mathcal{S}(\cdot)$ denotes a backward-warping sampling operator that extracts feature values from the source feature map according to flow-guided coordinates using nearest-neighbor interpolation.
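As an illustration of this flow-guided sampling step, the following PyTorch sketch backward-warps a frame-1 feature map with the flow $f_{0\to1}$ by building the coordinate grid, adding the flow, normalizing to the range expected by `grid_sample`, and sampling with nearest-neighbor interpolation. The tensor layout (flow stored in x, y channel order) and the function name are our assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def flow_align(feat_1: torch.Tensor, flow_0to1: torch.Tensor) -> torch.Tensor:
    """Backward-warp F_1^l toward frame 0 using the flow f_{0->1}.

    feat_1:    (B, C, H, W) features of frame I_1
    flow_0to1: (B, 2, H, W) flow from I_0 to I_1, in pixels (x, y order assumed)
    """
    b, _, h, w = feat_1.shape
    # Base sampling grid g(i, j) holding each pixel's own coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=feat_1.device),
                            torch.arange(w, device=feat_1.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(b, -1, -1, -1)

    # Target coordinates g~(i, j) = g(i, j) + f_{0->1}(i, j).
    coords = grid + flow_0to1
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)

    return F.grid_sample(feat_1, sample_grid, mode="nearest",
                         padding_mode="border", align_corners=True)
```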

4.1.2. Offset Alignment Module

To further align pixels in boundary regions and reduce blurring in inconsistent areas within categories, the OAM achieves fine intra-frame pixel alignment by calculating similarity between pixel neighborhoods in the spatial dimension, as shown in Figure 5a. First, the cosine similarity is calculated between pixel $p(i,j)$ at a certain position in $F_{flow_{0\to1}}^{align}$ and the neighboring pixels within the $k \times k$ convolution window. Excluding the central pixel, there are $k^2 - 1$ pixels in total. The calculation process is as follows:
$S_{i,j}(p,q) = \dfrac{\sum_{c=1}^{C} F_{c,i,j}^{align}\, F_{c,i+p,j+q}^{align}}{\sqrt{\sum_{c=1}^{C} \big(F_{c,i,j}^{align}\big)^2} \cdot \sqrt{\sum_{c=1}^{C} \big(F_{c,i+p,j+q}^{align}\big)^2}}$
where $S \in \mathbb{R}^{k^2-1}$, the channel index is $c = 1, \ldots, C$, and the neighborhood offset $(p,q) \in \{-\frac{k-1}{2}, \ldots, \frac{k-1}{2}\}$ with $(p,q) \neq (0,0)$. We set $k = 3$, indicating the calculation of similarity between this pixel and the 8 pixels in its neighborhood.
To avoid interference from smoothed pixels in low-frequency regions and ensure that the offset estimation module focuses on local high-frequency motion areas, we generate a high-frequency map from $F_{flow_{0\to1}}^{align}$ using the high-frequency mapping function $F_{hf}(\cdot)$, which is similar to an occlusion mask:
$H = F_{hf}\big(F_{flow_{0\to1}}^{align}\big)$
where $H \in \mathbb{R}^{1 \times H \times W}$. During secondary sampling, the high-frequency map $H$ guides offset sampling toward pixels with high intra-category similarity by focusing on regions of significant change within the image.
Through the convolutional layer, the tensor $V$ controlling the sampling direction of the offset is obtained. The tensor $M$ controlling the sampling magnitude of the offset is derived via the Sigmoid function. Finally, multiplying $V$ and $M$ yields the predicted offset $\Delta O \in \mathbb{R}^{2 \times H \times W}$. Subsequently, any pixel in $F_{flow_{0\to1}}^{align}$ is sampled at its new coordinate position to obtain the final aligned output feature $F_i^{align} \in \mathbb{R}^{C \times H \times W}$. The overall computation process is as follows:
$V = \mathrm{Conv}_{3\times3}\big(\mathrm{Concat}(F_{flow_{0\to1}}^{align},\, S)\big)$
$M = \mathrm{Sigmoid}\Big(\mathrm{Conv}_{3\times3}\big(\mathrm{Concat}(F_{flow_{0\to1}}^{align},\, S,\, H)\big)\Big)$
$\Delta O = M \cdot V$
$F_i^{align} = \mathcal{S}\big(F_{flow_{0\to1}}^{align},\, \Delta O\big)$
The OAM achieves more reliable processing of localized high-frequency motion regions through dynamic high-frequency response and low-frequency suppression strategies. It drives the model to locate rapidly changing areas and samples pixels with high intra-category similarity within the frame to achieve secondary fine alignment.
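For illustration, the following PyTorch sketch computes the neighborhood cosine-similarity tensor $S$ used above (the $k^2 - 1$ similarities per pixel for $k = 3$). The unfold-based implementation and the function name are our assumptions; it only covers the similarity step, not the subsequent offset prediction.

```python
import torch
import torch.nn.functional as F

def neighborhood_cosine_similarity(feat: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Cosine similarity between each pixel and its k*k-1 neighbours.

    feat: (B, C, H, W) flow-aligned feature map; returns (B, k*k-1, H, W).
    """
    b, c, h, w = feat.shape
    pad = k // 2
    # Gather the k*k neighbourhood of every pixel: (B, C, k*k, H, W).
    neigh = F.unfold(feat, kernel_size=k, padding=pad).view(b, c, k * k, h, w)

    center = F.normalize(feat, dim=1).unsqueeze(2)   # (B, C, 1, H, W)
    neigh = F.normalize(neigh, dim=1)                # (B, C, k*k, H, W)
    sim = (center * neigh).sum(dim=1)                # (B, k*k, H, W)

    # Drop the central position (the pixel compared with itself).
    keep = [idx for idx in range(k * k) if idx != (k * k) // 2]
    return sim[:, keep]
```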

4.2. Region-Adaptive Interaction Module

The overall framework of the RAIM is illustrated in Figure 5b. It first employs cross-frame interaction perception to adaptively associate different regions within the neighborhood, thereby enhancing feature expression across distinct areas. Subsequently, it incorporates both global attention (G-Att) and local attention (L-Att) in the RSAM to capture global dependencies and local high-frequency motion information, respectively. Finally, based on the characteristics of different regions, global and local features are synergistically fused to achieve spatiotemporal information aggregation. The specific implementation process comprises the following four steps.

4.2.1. Assigning the Attention Head

We reshape $F_0, F_1 \in \mathbb{R}^{C \times H \times W}$ along the channel dimension into a sequence $F \in \mathbb{R}^{C \times N}$ of length $N = H \times W$ as input. The RAIM divides the attention heads into two groups based on different regional characteristics, with the following computation process:
$h_{LA} + h_{GA} = h$
$d_{LA} = h_{LA} \cdot d_{head}, \quad d_{GA} = h_{GA} \cdot d_{head}$
where $h_{LA}$ and $h_{GA}$ represent the numbers of local attention heads and global attention heads, respectively, and $h$ represents the total number of heads in this layer, with $h_{LA} = h_{GA} = h/2$. $d_{head} = C/h$ is the dimension of each attention head, and $d_{LA}$ and $d_{GA}$ represent the total dimensions of the attention heads in L-Att and G-Att, respectively.

4.2.2. L-Att Branch

L-Att employs non-overlapping local window attention to focus on capturing local high-frequency motion information. Within the L-Att branch, cross-frame adaptive interaction perception is achieved through a window partitioning strategy. Specifically, the window is $w \in \mathbb{R}^{r \times r}$, and the number of windows is $N_w = \frac{H \times W}{r^2}$, so the window partitioning formula is as follows:
$\{F_{0,g}\}_{g=1}^{N_w} = W_r(F_0), \quad \{F_{1,g}\}_{g=1}^{N_w} = W_r(F_1)$
where $F_{0,g}, F_{1,g} \in \mathbb{R}^{r^2 \times C}$ denote the token sequences in the $g$-th window of the two frames, and $W_r(\cdot)$ is the window partitioning function. Subsequently, using $F_0$ as the Query and the local features of $F_1$ as the Key and Value, their projection matrices are computed via linear mapping. The formula is as follows:
$Q_g^{(m)} = F_{0,g} W_Q^{(m)}, \quad K_g^{(m)} = F_{1,g} W_K^{(m)}, \quad V_g^{(m)} = F_{1,g} W_V^{(m)}$
where $W_Q^{(m)}, W_K^{(m)}, W_V^{(m)} \in \mathbb{R}^{C \times d}$ are the learnable weight matrices of the $m$-th attention head within the window, $m \in \{1, \ldots, h_{LA}\}$, and $d = d_{LA}/h_{LA}$ denotes the dimensionality of a single attention head in the L-Att branch. The above process enables adaptive cross-frame interaction perception from $I_0$ to $I_1$ at the local region level, allowing $I_0$ to adaptively associate with local regions of $I_1$. Subsequently, attention is computed within each window to achieve inter-frame local high-frequency feature aggregation as follows:
$Att_g^{(m)} = \mathrm{Softmax}\!\left(\dfrac{Q_g^{(m)} \big(K_g^{(m)}\big)^{\top}}{\sqrt{d}}\right) V_g^{(m)}$
Finally, the attention weights from all heads within each window are concatenated to compute the multi-head attention for that window, where $W_O^{LA} \in \mathbb{R}^{d_{LA} \times d_{LA}}$ is a learnable weight matrix, and the output $Att^{LA} \in \mathbb{R}^{N \times d_{LA}}$ of the L-Att branch is obtained through linear mapping, as shown in the following formula:
$Att_g^{LA} = \mathrm{Concat}\big(Att_g^{(1)}, Att_g^{(2)}, \ldots, Att_g^{(h_{LA})}\big) W_O^{LA}$
$Att^{LA} = \mathrm{Reshape}\big(Att_1^{LA}, Att_2^{LA}, \ldots, Att_{N_w}^{LA}\big)$
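A compact sketch of the L-Att branch is given below: frame-0 tokens query frame-1 tokens inside matching non-overlapping $r \times r$ windows, followed by multi-head attention and an output projection as in the formulas above. The module layout, the shared Q/KV projections, and all names are illustrative assumptions under the stated window partition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LocalCrossFrameAttention(nn.Module):
    """Window-based cross-frame attention (L-Att branch), a sketch.

    Queries come from frame-0 tokens; keys and values come from frame-1 tokens
    in the same non-overlapping r x r window.
    """

    def __init__(self, channels: int, heads: int, window: int = 2):
        super().__init__()
        self.h, self.r = heads, window
        self.d = channels // heads
        self.to_q = nn.Linear(channels, channels, bias=False)
        self.to_kv = nn.Linear(channels, 2 * channels, bias=False)
        self.proj = nn.Linear(channels, channels)

    def forward(self, f0: torch.Tensor, f1: torch.Tensor) -> torch.Tensor:
        b, c, hh, ww = f0.shape
        r = self.r
        assert hh % r == 0 and ww % r == 0, "feature map must be divisible by the window size"

        def windows(x):
            # (B, C, H, W) -> (B*Nw, r*r, C): token sequence per window.
            x = x.view(b, c, hh // r, r, ww // r, r)
            return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, r * r, c)

        q = self.to_q(windows(f0))
        k, v = self.to_kv(windows(f1)).chunk(2, dim=-1)

        def split_heads(x):
            return x.view(x.shape[0], -1, self.h, self.d).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5
        out = attn.softmax(dim=-1) @ v                   # (B*Nw, h, r*r, d)
        out = out.transpose(1, 2).reshape(-1, r * r, c)
        out = self.proj(out)

        # Reverse the window partition back to (B, C, H, W).
        out = out.view(b, hh // r, ww // r, r, r, c)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, hh, ww)
```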

4.2.3. G-Att Branch

L-Att is fundamentally implemented based on window attention, perceiving only local high-frequency regions. To extract global semantic information, the cross-frame adaptive interaction perception method in the G-Att branch is achieved through average pooling. Specifically, $F_1$ is downsampled to $\tilde{F}_1 \in \mathbb{R}^{N_s \times C}$ to reduce the scale and expand the receptive field range, with the downsampling factor $s$ set to 2, yielding $N_s = \frac{N}{s^2}$. To establish cross-frame attention, the Query is derived from $F_0$, while the Key and Value are obtained from the downsampled feature $\tilde{F}_1$. All three components are generated through linear projection and are computed as follows:
$Q = F_0 W_Q^{GA}, \quad K = \tilde{F}_1 W_K^{GA}, \quad V = \tilde{F}_1 W_V^{GA}$
where $Q \in \mathbb{R}^{N \times h_{GA} \times d}$ and $K, V \in \mathbb{R}^{N_s \times h_{GA} \times d}$. $W_Q^{GA} \in \mathbb{R}^{C \times d_{GA}}$ is the learnable weight matrix for $Q$, while $W_K^{GA}, W_V^{GA} \in \mathbb{R}^{C \times d_{GA}}$ are the learnable weight matrices for $K$ and $V$, respectively. $d = d_{GA}/h_{GA}$ is the dimension of a single attention head in the G-Att branch.
The above process achieves cross-frame adaptive interaction perception across the global regions from $I_0$ to $I_1$, so that $I_0$ is adaptively associated with the global context of the lower-scale $I_1$. Next, the attention for each attention head $m$ is computed using the following formula:
$Att^{(m)} = \mathrm{Softmax}\!\left(\dfrac{Q^{(m)} \big(K^{(m)}\big)^{\top}}{\sqrt{d}}\right) V^{(m)}$
where $Q^{(m)} \in \mathbb{R}^{N \times d}$, $K^{(m)}, V^{(m)} \in \mathbb{R}^{N_s \times d}$, and $Att^{(m)} \in \mathbb{R}^{N \times d}$. Based on the above formula, G-Att can compute attention between $Q$, $K$, and $V$ at different scales. Since the global region changes slowly overall, this approach not only captures cross-scale dependencies but also comprehensively focuses on global semantic information. Finally, all attention heads are concatenated and passed through a linear mapping with the learnable weight matrix $W_O^{GA} \in \mathbb{R}^{d_{GA} \times d_{GA}}$ to obtain the G-Att output $Att^{GA} \in \mathbb{R}^{N \times d_{GA}}$, as shown in the following formula:
$Att^{GA} = \mathrm{Concat}\big(Att^{(1)}, Att^{(2)}, \ldots, Att^{(h_{GA})}\big) W_O^{GA}$
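The G-Att branch can be sketched in a few lines: frame-1 tokens are average-pooled by the factor $s$ before serving as keys and values for the frame-0 queries, so every token attends to a coarse global view of the other frame. The sketch below uses PyTorch's built-in multi-head attention as a stand-in for the per-head projections in the formulas above; it is an approximation of the described mechanism, not the authors' code.

```python
import torch
import torch.nn as nn

class GlobalCrossFrameAttention(nn.Module):
    """Pooled cross-frame attention (G-Att branch), a sketch."""

    def __init__(self, channels: int, heads: int, pool: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.pool = nn.AvgPool2d(pool)   # downsampling factor s

    def forward(self, f0: torch.Tensor, f1: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f0.shape
        q = f0.flatten(2).transpose(1, 2)               # (B, N, C) queries
        kv = self.pool(f1).flatten(2).transpose(1, 2)   # (B, N/s^2, C) keys/values
        out, _ = self.attn(q, kv, kv)                   # cross-frame, cross-scale attention
        return out.transpose(1, 2).view(b, c, h, w)
```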

4.3. Motion Compensation Module

According to [11], for optical flow computation in UHD video sequences, a coarse flow field can first be estimated on low-resolution images and then upsampled to the original resolution. This approach reduces computational complexity while providing an approximate flow field, with the low-resolution flow serving as a motion prior that supplements the motion information in high-resolution sequences.
To compensate for potential alignment errors from the DAM, the MCM leverages global correlations to provide additional motion information, as illustrated in Figure 5c. This module explicitly models the inter-frame motion field on low-resolution feature maps by computing a global correlation matrix, providing supplementary motion information for the subsequent predicted frame reconstruction stage and, thereby, mitigating alignment errors that could result from depending entirely on the DAM. Specifically, the two feature tensors obtained from the RAIM are first reshaped into token sequences $X_0, X_1 \in \mathbb{R}^{C \times N}$, $N = H \times W$. Then, their inner product is computed to derive the global correlation matrix. Finally, the probability matching distribution is obtained via the Softmax function as follows:
$Corr = \frac{1}{C}\big(X_0^{\top} X_1\big)$
$P = \mathrm{Softmax}(Corr)$
where each element of $Corr \in \mathbb{R}^{N \times N}$ represents the correlation between the $i$-th position in $X_0$ and the $j$-th position in $X_1$, and $P \in \mathbb{R}^{N \times N}$ is the probability distribution of the match. The initial coordinate grid $G \in \mathbb{R}^{N \times 2}$ is embedded into a coordinate grid $\tilde{G} \in \mathbb{R}^{N \times D}$ through linear mapping, where $D$ is the motion embedding dimension. Subsequently, $P$ and the embedded positions of the target image are weighted to obtain the coordinate distribution $M$ for each pixel match. Finally, the difference between the new position coordinates of each pixel and its initial embedded coordinates is calculated to derive the motion feature $F_m \in \mathbb{R}^{2 \times H \times W}$.
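The correlation-and-matching computation above can be sketched as follows. For brevity, the sketch weights the raw pixel coordinate grid directly rather than a learned $D$-dimensional coordinate embedding, and the function name is our assumption; it illustrates the soft-matching idea rather than reproducing the module exactly.

```python
import torch

def global_correlation_motion(x0: torch.Tensor, x1: torch.Tensor,
                              grid: torch.Tensor) -> torch.Tensor:
    """Motion features from a global correlation matrix, a simplified sketch.

    x0, x1: (B, C, N) token sequences from the RAIM; grid: (B, N, 2) pixel
    coordinates. Returns a (B, 2, N) displacement field (reshape to 2 x H x W).
    """
    b, c, n = x0.shape
    corr = torch.einsum("bcn,bcm->bnm", x0, x1) / c      # Corr = (1/C) X0^T X1, (B, N, N)
    prob = corr.softmax(dim=-1)                          # soft matching distribution P

    # Expected matched coordinate of every source pixel, then the displacement.
    matched = torch.einsum("bnm,bmd->bnd", prob, grid)   # (B, N, 2)
    motion = matched - grid
    return motion.transpose(1, 2)                        # (B, 2, N)
```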

4.4. Optical Flow Estimation Network

To enable the DAM to achieve efficient and accurate long-range inter-frame pixel alignment, we design a lightweight CNN-based optical flow estimation subnetwork to provide motion information to the DAM. OFENet adopts a classical stacked convolutional architecture, where layer-by-layer feature extraction and nonlinear transformations map the feature maps $F_{l,0}^f, F_{l,1}^f$ of the two input frames into pixel-wise optical flow fields $f_{0\to1}, f_{1\to0} \in \mathbb{R}^{2 \times H \times W}$. The network consists of five convolutional layers: the first four layers have 64 output channels with a stride of 1 and padding of 1, while the final layer outputs 2 channels. The overall architecture is illustrated in Figure 5d.
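A minimal sketch of such a five-layer flow head is shown below. The ReLU activations and the input channel count are assumptions, since only the convolutional layout (four 3 × 3 layers with 64 output channels, stride 1, padding 1, plus a final 2-channel layer) is specified above.

```python
import torch.nn as nn

def build_ofenet(in_ch: int = 64) -> nn.Sequential:
    """Five stacked 3x3 convolutions mapping frame features to a 2-channel flow field."""
    layers = []
    ch = in_ch
    for _ in range(4):
        layers += [nn.Conv2d(ch, 64, kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]   # nonlinearity assumed, not specified in the text
        ch = 64
    layers += [nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1)]
    return nn.Sequential(*layers)
```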

5. Experiment

5.1. Datasets and Evaluation Metrics

5.1.1. Datasets

We train our models on the UHD-N-Train and Vimeo90K [31] training sets. UHD-N-Train contains 50,400 triplets, while Vimeo90K [31] consists of 51,312 triplets with a spatial resolution of 448 × 256 . During training, the samples are cropped to 256 × 256 . Both datasets are augmented with random flipping, temporal reversal, and rotation.
To fully evaluate the performance of our method in UHD extreme motion scenes and other motion scenes, we conducted evaluations on our self-built dataset and multiple benchmark datasets, as detailed below:
  • UHD4K120FPS-N. This dataset is a 4K dataset covering extreme motion scenes, containing triplets from 15 different scenes, where I 0 and I 1 are the input frames, and I 0.5 is the ground truth.
  • Vimeo90K [31]. This contains 3782 triplets with a resolution of 448 × 256 .
  • UCF101 [33]. This contains 379 video sequences with a resolution of 256 × 256.
  • Middlebury [34]. We evaluate on the Middlebury OTHER, which contains 12 sequences from different scenes with a resolution of approximately 640 × 480 .
  • X4K1000FPS [6]. This is commonly used to evaluate 4K video frame interpolation tasks. XTest contains 15 consecutive 33-frame 4K video sequences. Following the settings in [10], we select the 0th and 32nd frames from each sequence as input and evaluate the quality of the generated 16th frame. This new test set is denoted as XTest-L.
  • Xiph [35]. This dataset contains 8 4K scene sequences, each with 100 consecutive frames. Following the same setup as XTest-L in [6], 192 test instances are obtained and denoted as Xiph-L.

5.1.2. Evaluation Metrics

This paper employs three evaluation metrics: the peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [36], and average interpolation error (IE). The PSNR and SSIM are used for evaluation on Vimeo90K [31], UCF101 [33], X4K1000FPS [6], and Xiph [35]. To ensure fair comparison and reproducibility, the PSNR and SSIM are first computed for each sequence and then averaged across all sequences. The IE is used for evaluation on Middlebury [34]. Additionally, we compare the model parameters and runtime, where runtime is measured as the average time over 100 iterations for all models on the same hardware, an NVIDIA GeForce RTX 2080 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). The test samples are random data with a resolution of 640 × 480.
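The sequence-first averaging protocol described above can be summarized in a short helper, shown here only to make the aggregation order explicit (per-frame scores are averaged within each sequence before averaging across sequences); the function name is ours.

```python
import numpy as np

def average_metric(per_frame_scores_by_sequence: dict) -> float:
    """Mean of per-sequence means, so long sequences do not dominate the score."""
    seq_means = [float(np.mean(scores)) for scores in per_frame_scores_by_sequence.values()]
    return float(np.mean(seq_means))
```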

5.2. Implementation Details

5.2.1. Training Details

The batch size is set to 32, and training is conducted for 300 epochs. The optimizer is AdamW [37] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a weight decay of $1 \times 10^{-4}$. At the start of training, the learning rate is warmed up for 2000 steps to $2 \times 10^{-4}$ and then gradually decayed to $2 \times 10^{-5}$ using cosine annealing. Following [9], the loss function consists of a warp loss and a Laplacian loss. Except for the runtime evaluation, all experiments are performed on an NVIDIA A100 Tensor Core GPU.
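For reference, the optimizer and learning-rate schedule described above can be set up as in the following sketch, assuming a step-wise LambdaLR schedule that linearly warms up for 2000 steps and then follows cosine annealing from $2 \times 10^{-4}$ down to $2 \times 10^{-5}$; the helper name and the total-step bookkeeping are our assumptions.

```python
import math
import torch

def make_optimizer_and_schedule(model: torch.nn.Module, total_steps: int,
                                warmup_steps: int = 2000,
                                lr_max: float = 2e-4, lr_min: float = 2e-5):
    """AdamW with betas (0.9, 0.999), weight decay 1e-4, warm-up then cosine decay."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr_max,
                                  betas=(0.9, 0.999), weight_decay=1e-4)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(warmup_steps, 1)               # linear warm-up to lr_max
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
        return (lr_min + (lr_max - lr_min) * cosine) / lr_max

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```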

5.2.2. Network Architecture

Our model is based on a five-layer architecture with an initial channel width of 32 and a downsampling rate of 2. The first three layers are convolutional layers with a depth of 2, while the last two layers are Transformer layers with a depth of 4. The channels are 32, 64, 128, 256, and 512, respectively. The RAIM has a total of eight attention heads, and the local branch window size for RSAM is set to 2.

5.3. Quantitative Comparison with Previous Methods

5.3.1. Comparison in UHD Extreme Motion Scenes

Table 2 reports the quantitative evaluation results of our method against 19 SOTA methods on the UHD-N-Test dataset, and all methods are compared for the single-frame (2×) interpolation task. These methods include ToFlow [31], SepConv [12], AdaCoF [13], CDFI [27], BMBC [38], XVFIv [6], CAIN [14], DAIN [15], UPR-base [20], UPR-large [20], EMA-small [9], RIFEm [11], EBME-H [19], EBME [19], FILM [39], M2M-PWC [40], FLAVR [41], SGM-small-1/2-points [10], and SGM-local-branch [10]. Higher PSNR and SSIM values indicate better interpolation quality. The PSNR measures the pixel-wise consistency between the generated and ground-truth frames, and a high PSNR indicates that the model can effectively learn motion information and reconstruct high-quality frames in extreme motion scenes. The results show that our method achieves the best performance in terms of both the PSNR and SSIM, surpassing the second-best result by 0.05 dB in PSNR and 0.0006 in SSIM. Compared with other mainstream methods designed for handling large motion, our method also demonstrates significant advantages, outperforming FILM [39] by 0.57 dB and SGM-local-branch [10] by 1.78 dB in PSNR. These results clearly validate the effectiveness of the proposed method in UHD extreme motion scenes.
Under the same hardware conditions, our method has a runtime of 42 ms. Compared with other 4K video frame interpolation methods, it is faster than XVFIv [6] and FILM [39], but it is still slower than the real-time interpolation method RIFEm [11]. This indicates that the applicability of our method in practical real-time scenes is limited. To further reduce runtime, the most direct approach is to use higher-performance hardware. Additionally, from a model optimization perspective, the RSAM can be further optimized to reduce its computational complexity, which would help lower the overall runtime and improve real-time performance (see Section 5.5.4 for discussion).
In addition, we perform quantitative comparisons with four mainstream methods on the public UHD datasets XTest-L [6] and Xiph-L [35] to further verify the generalization capability of our method, as shown in Table 3. The results show that the proposed method exhibits excellent overall performance in handling extreme motion, and it is not inferior to advanced methods specifically designed for UHD scenarios. It achieved the best results on the XTest-L dataset and demonstrated superior overall performance on the Xiph-L dataset. Compared with EMA-VFI-small [9] and SGM-VFI-local-branch [10], our method achieved improvements of 1.1 dB and 0.45 dB, respectively, on the XTest-L-4K dataset, further validating its adaptability and generalization capability in extreme motion scenes.
We also conduct multi-frame (8×) interpolation experiments on XTest [6] datasets of different resolutions, following the experimental settings of M2M [40] and EMA-VFI [9], as shown in Table 4. Our method achieves the best performance at both resolutions, indicating smaller pixel-level reconstruction errors and interpolation results closer to the ground truth. Combined with inference time analysis, this demonstrates that our method provides efficient and stable multi-frame interpolation in practical applications.

5.3.2. Comparison on Low-Resolution Benchmarks

Table 5 presents the quantitative comparison of our method with 20 SOTA methods on low-resolution datasets, including Vimeo90K [31], UCF101 [33], and Middlebury [34], to evaluate its performance in easy motion scenes. The results show that on the Vimeo90K dataset, our method achieves the best performance in both evaluation metrics, surpassing the second-best method FLAVR [41] by 0.09 dB in PSNR and 0.0004 in SSIM. On the UCF101 dataset, our method also achieves a strong second-best PSNR.
Similarly to our approach, FGDCN-S [43] and VFIFT-Conv [44] also introduce optical flow, but they rely on only a single type of motion information. On the Vimeo90K dataset, our method significantly outperforms VFIFT-Conv [44] and FGDCN-S [43], with PSNR improvements of 0.37 dB and 0.15 dB, respectively. This clearly demonstrates the effectiveness of our approach in leveraging two types of motion information. In summary, our method also shows outstanding performance in low-resolution scenarios with easy motion.
Table 5. Quantitative comparison with 20 SOTA methods on the Vimeo90K [31], UCF101 [33], and Middlebury [34] datasets. The best and second-best results are indicated with bold and underlined text, respectively. “↑” indicates that higher is better; “↓” indicates that lower is better.
| Method | Vimeo90K PSNR ↑ | Vimeo90K SSIM ↑ | UCF101 PSNR ↑ | UCF101 SSIM ↑ | Middlebury IE ↓ |
|---|---|---|---|---|---|
| Other | | | | | |
| CAIN [14] | 34.65 | 0.9730 | 34.91 | 0.9690 | 2.28 |
| FLAVR [41] | 36.30 | 0.9750 | 33.33 | 0.9710 | - |
| Kernel-based | | | | | |
| SepConv [12] | 33.79 | 0.9702 | 34.78 | 0.9669 | 2.27 |
| AdaCoF [13] | 34.47 | 0.9730 | 34.90 | 0.9680 | 2.24 |
| CDFI [27] | 35.17 | 0.9770 | 35.21 | 0.9690 | 1.98 |
| EDSC-$L_C$ [25] | 34.84 | 0.9750 | 35.13 | 0.9680 | 2.02 |
| Flow-based | | | | | |
| SoftSplat [45] | 36.10 | 0.9700 | 35.39 | 0.9520 | 1.81 |
| ToFlow [31] | 33.73 | 0.9682 | 34.58 | 0.9667 | 2.15 |
| BMBC [38] | 35.01 | 0.9764 | 35.15 | 0.9689 | 2.04 |
| XVFIv [6] | 35.07 | 0.9681 | 35.18 | 0.9519 | - |
| DAIN [15] | 34.71 | 0.9756 | 34.99 | 0.9683 | 2.04 |
| ABME [46] | 36.18 | 0.9805 | 35.38 | 0.9698 | 2.01 |
| IFRNet [42] | 35.80 | 0.9794 | 35.29 | 0.9693 | 1.95 |
| UPR-base [20] | 36.03 | 0.9801 | 35.41 | 0.9698 | - |
| EMA-small [9] | 36.07 | 0.9797 | 35.34 | 0.9696 | 1.94 |
| RIFE [11] | 35.61 | 0.9779 | 35.28 | 0.9690 | 1.96 |
| EBME-H* [19] | 36.19 | 0.9807 | 35.41 | 0.9697 | - |
| FGDCN-S [43] | 36.24 | 0.9806 | 35.42 | 0.9698 | 1.94 |
| VFIFT-Conv [44] | 36.02 | 0.9798 | 35.65 | 0.9793 | - |
| M2M-PWC [40] | 35.47 | 0.9778 | 35.28 | 0.9694 | 2.09 |
| Ours | 36.39 | 0.9811 | 35.46 | 0.9700 | 2.01 |

5.4. Qualitative Comparison with Previous Methods

5.4.1. Visual Comparison in UHD Extreme Motion Scenes

On the UHD-N-Test dataset, we visually compare the interpolation performance of our method with eight SOTA approaches, all trained under the same settings on UHD-N-Train. As shown in Figure 6, our method produces more reasonable results in extreme motion scenes. In particular, in Example 5, our method reconstructs the volleyball contours clearly compared with other methods. Example 4 shows that, except for our method, EBME [19], and SGM-local-branch [10], the interpolation results of other methods exhibit blur or artifacts, indicating their inability to handle local texture regions effectively. In Example 3 (occluded regions), all methods exhibit some degree of blur along occlusion boundaries, caused by the lack of reference information, which makes accurate pixel displacement estimation difficult. For our method, although the boundaries remain slightly blurred, the RAIM can leverage information from neighboring visible pixels to infer structure and texture more reasonably. This slight blurring is not a flaw of the method but an unavoidable challenge in extreme motion scenes, and our design has substantially mitigated this issue. Moreover, for nonlinear extreme motion (rotational motion in Example 1), all methods fail to generate clear water flow, whereas our interpolation results appear visually more reasonable. In summary, the visualization results of UHD-N-Test fully demonstrate the effectiveness of our method in handling extreme motion scenes while also exhibiting excellent interpolation performance in other complex scenes.
To further validate the performance of the proposed method on UHD datasets, we conduct qualitative comparisons on the XTest-L-4K [6] dataset. We focus on evaluating extreme motion scenes and do not adopt the experimental setup of XVFI [6] on the XTest [6] dataset; instead, we select inputs with the largest temporal intervals for evaluation. As shown in Figure 7, our method achieves interpolation results in subjective visual quality that are comparable to existing SOTA methods, effectively confirming its generalization ability in UHD video frame interpolation tasks.

5.4.2. Visual Comparison on Low-Resolution Benchmarks

Figure 8 presents a visual comparison of our method against six mainstream approaches on the Vimeo90K [31] dataset. This dataset mainly contains low-resolution scenes with easy motion, so we focus on the consistency between the interpolated results and the ground truth (GT). For example, in Example 2, the GT shows the overall contour of the athlete’s legs, but the details are blurry, and there are significant occluded regions. Both our method and EMA-VFI [9] are able to recover these severely occluded areas, producing results with more reasonable overall structures. Across all examples, other mainstream methods perform less favorably. For instance, EBME [19] generates results with noticeable noise, while M2M [40] produces significant artifacts. In contrast, our method demonstrates clear advantages in interpolation quality.

5.5. Ablation Study

5.5.1. Ablation Study of Dual Alignment Module

To validate the effectiveness of the DAM, we conduct ablation experiments on the optical flow alignment module (OFAM) and the offset alignment module (OAM) separately on the UHD-N-Test and Vimeo90K [31] datasets, as shown in Table 6. First, we replace OFENet with several mainstream pre-trained optical flow estimation methods, including SPyNet [47], RAFT [32], and GMFlow [48]. The results show that OFENet achieves the best performance, demonstrating its adaptability and effectiveness for video frame interpolation. Additionally, to analyze the impact of the OAM on overall performance, we remove it and perform alignment using only the OFAM. The results indicate that removing the OAM decreases performance by 0.29 dB, but it still outperforms some pre-trained optical flow methods. These findings confirm the strong alignment capability of the DAM in extreme motion scenes.

5.5.2. Ablation Study of Region-Adaptive Interaction Module

To accelerate the research process, all methods are quantitatively evaluated after 100 training epochs. To validate the effectiveness of the proposed cross-frame adaptive interaction perception strategy, we replace the RAIM with the classical cross-attention mechanism. Unlike the RAIM, this attention mechanism does not associate regions at different scales through cross-frame adaptive interaction perception nor aggregate global and local features separately, instead calculating attention only between neighboring frames. As shown in Table 7, on the UHD-N-Test dataset, the model with the RAIM achieves a PSNR 0.29 dB higher than that with cross-attention. This demonstrates the RAIM’s superior performance across diverse motion scenes and the effectiveness of the cross-frame adaptive interaction strategy.
Furthermore, to further analyze the performance of the RAIM and the effectiveness of the RSAM, we compare different window sizes in the cross-frame adaptive interaction, as shown in Table 7. Specifically, the window sizes are set to 3, 2, and 1. The results indicate that a window size of 2 (our model) achieves the best performance. When the window size is 3, since the feature maps in the last two layers of the model ( 32 × 32 and 16 × 16 ) are already small, a larger window increases the computational complexity of attention (as shown in Table 8). When the window size is 1, the model performance drops significantly. In this case, the RSAM degenerates to using only global attention (G-Att), which is equivalent to not introducing the cross-frame adaptive interaction strategy in the RAIM, preventing the differentiation of distinct motion regions and, thereby, limiting the modeling and understanding of extreme motion.

5.5.3. Ablation Study of Motion Compensation Module

We validate the effectiveness of the MCM by removing it from the model. As shown in Table 7, compared with the complete model, the model without the MCM exhibits a performance drop of 0.11 dB. This result indicates that the motion information provided by the MCM enhances model performance, indirectly verifying its ability to effectively suppress alignment errors in the DAM.

5.5.4. Computational Complexity Analysis of the Region-Separated Attention Mechanism

Table 8 presents the performance analysis of the RSAM, G-Att, and L-Att. Specifically, in the last two layers of the model, the feature map sizes are 32 × 32 and 16 × 16, with a depth of 4 per layer. For each layer, we report the total number of parameters and computational complexity, along with the percentage of computation contributed by G-Att and L-Att. The results in the table reveal that, theoretically, the computational complexity of L-Att scales quadratically with the window size, whereas the complexity of G-Att is inversely proportional to the square of the downsampling rate. In practical applications, L-Att accounts for a larger proportion of computation in the last layer, while G-Att dominates in the penultimate layer.

5.5.5. Visual Analysis of the Region-Adaptive Interaction Module

To fully validate the cross-frame adaptive interaction perception capability of the RAIM and its ability to model both global context and local high-frequency motion information, Figure 9 presents the attention weight visualization heatmap results for the final layer of the RAIM within the Transformer. Red and blue regions indicate areas with higher and lower attention weights, respectively. The results show that in the RAIM heatmap, the attention mechanisms focus on areas that highly align with the locally high-frequency motion regions in the original image. For example, the basketball movement area in the first row and the rapid arm movement area in the second row both exhibit high attention weights. In the L-Att heatmap, red indicates high-frequency feature regions, with results closely matching the high-weight areas in the RAIM heatmap, demonstrating that L-Att focuses on learning local high-frequency motion information. In contrast, in the G-Att heatmap, blue indicates low-frequency feature regions, revealing that G-Att prioritizes learning global contextual information. Overall, these visualizations intuitively demonstrate the effectiveness of the proposed RAIM and its superior representation capabilities for distinct spatial features.

6. Conclusions

This paper addresses the challenge of accurately fitting extreme motion in UHD scenes by proposing a high-performance video interpolation method based on dual alignment and region-adaptive interaction. First, we design the dual alignment module, which achieves long-range pixel alignment in stages using both optical flow and offset motion information, effectively reducing the accumulation of motion fitting errors and alleviating spatiotemporal asymmetry between frames. Then, we introduce the region-adaptive interaction module, which employs a cross-frame adaptive interaction strategy to dynamically establish correlations and fuse information across different motion regions in neighboring frames, effectively mitigating blurring at occlusion boundaries. Furthermore, to compensate for alignment errors, we introduce the motion compensation module, which provides additional motion information through explicit motion estimation. Extensive quantitative and qualitative experiments demonstrate that our method exhibits outstanding interpolation performance, strong robustness, and effective restoration of spatiotemporal symmetry in extreme motion scenes on both our self-built UHD dataset and multiple benchmark datasets, validating its applicability and reliability.

Author Contributions

Conceptualization, X.N.; methodology, X.N. and Y.D.; software, X.N.; validation, X.N., J.Q., J.D. and K.Y.; data curation, X.N., J.Q., J.D. and K.Y.; investigation, X.N., J.Q., J.D. and K.Y.; resources, Y.D.; writing—original draft preparation, X.N.; writing—review and editing, J.Q., J.D., K.Y. and Y.D.; visualization, X.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Nos. 61303093 and 61402278) and the Shanghai Natural Science Foundation (No. 19ZR1419100).

Data Availability Statement

The original data presented in the study are openly available in a GitHub repository at https://github.com/nxkobe/ExtremeMotion-VFI, accessed on 28 November 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jiang, W.; Li, J.; Zhang, K.; Zhang, L. Ecvc: Exploiting non-local correlations in multiple frames for contextual video compression. In Proceedings of the Computer Vision and Pattern Recognition Conference, Jammu, India, 16–18 July 2025; pp. 7331–7341. [Google Scholar]
  2. Liao, G.; Li, Q.; Bao, Z.; Qiu, G.; Liu, K. Spc-gs: Gaussian splatting with semantic-prompt consistency for indoor open-world free-view synthesis from sparse inputs. In Proceedings of the Computer Vision and Pattern Recognition Conference, Jammu, India, 16–18 July 2025; pp. 11264–11274. [Google Scholar]
  3. Kim, E.; Kim, H.; Jin, K.H.; Yoo, J. BF-STVSR: B-Splines and Fourier—Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, Jammu, India, 16–18 July 2025; pp. 28009–28018. [Google Scholar]
  4. Shen, W.; Bao, W.; Zhai, G.; Chen, L.; Min, X.; Gao, Z. Blurry video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 13–19 June 2020; pp. 5114–5123. [Google Scholar]
  5. Yan, Z.; Lei, P.; Wang, T.; Fang, F.; Zhang, J.; Huang, Y.; Song, H. Explicit Depth-Aware Blurry Video Frame Interpolation Guided by Differential Curves. In Proceedings of the Computer Vision and Pattern Recognition Conference, Jammu, India, 16–18 July 2025; pp. 1994–2004. [Google Scholar]
  6. Sim, H.; Oh, J.; Kim, M. Xvfi: Extreme video frame interpolation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14489–14498. [Google Scholar]
  7. Ahn, H.E.; Jeong, J.; Kim, J.W. A fast 4k video frame interpolation using a hybrid task-based convolutional neural network. Symmetry 2019, 11, 619. [Google Scholar] [CrossRef]
  8. Park, J.; Kim, J.; Kim, C.S. Biformer: Learning bilateral motion estimation via bilateral transformer for 4k video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1568–1577. [Google Scholar]
  9. Zhang, G.; Zhu, Y.; Wang, H.; Chen, Y.; Wu, G.; Wang, L. Extracting motion and appearance via inter-frame attention for efficient video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5682–5692. [Google Scholar]
  10. Liu, C.; Zhang, G.; Zhao, R.; Wang, L. Sparse global matching for video frame interpolation with large motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19125–19134. [Google Scholar]
  11. Huang, Z.; Zhang, T.; Heng, W.; Shi, B.; Zhou, S. Real-time intermediate flow estimation for video frame interpolation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 624–642. [Google Scholar]
  12. Niklaus, S.; Mai, L.; Liu, F. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 261–270. [Google Scholar]
  13. Lee, H.; Kim, T.; Chung, T.y.; Pak, D.; Ban, Y.; Lee, S. Adacof: Adaptive collaboration of flows for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 13–19 June 2020; pp. 5316–5325. [Google Scholar]
  14. Choi, M.; Kim, H.; Han, B.; Xu, N.; Lee, K.M. Channel attention is all you need for video frame interpolation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10663–10671. [Google Scholar] [CrossRef]
  15. Bao, W.; Lai, W.S.; Ma, C.; Zhang, X.; Gao, Z.; Yang, M.H. Depth-aware video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3703–3712. [Google Scholar]
  16. Jiang, H.; Sun, D.; Jampani, V.; Yang, M.H.; Learned-Miller, E.; Kautz, J. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9000–9008. [Google Scholar]
  17. Xu, X.; Siyao, L.; Sun, W.; Yin, Q.; Yang, M.H. Quadratic video interpolation. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  18. Xiao, J.; Xu, K.; Hu, M.; Liao, L.; Wang, Z.; Lin, C.W.; Wang, M.; Satoh, S. Progressive motion boosting for video frame interpolation. IEEE Trans. Multimed. 2022, 25, 8076–8090. [Google Scholar] [CrossRef]
  19. Jin, X.; Wu, L.; Shen, G.; Chen, Y.; Chen, J.; Koo, J.; Hahm, C.h. Enhanced bi-directional motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 5049–5057. [Google Scholar]
  20. Jin, X.; Wu, L.; Chen, J.; Chen, Y.; Koo, J.; Hahm, C.h. A unified pyramid recurrent network for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1578–1587. [Google Scholar]
  21. Shen, T.; Li, D.; Gao, Z.; Tian, L.; Barsoum, E. Ladder: An efficient framework for video frame interpolation. arXiv 2024, arXiv:2404.11108. [Google Scholar] [CrossRef]
  22. Niklaus, S.; Mai, L.; Liu, F. Video frame interpolation via adaptive convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 670–679. [Google Scholar]
  23. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  24. Cheng, X.; Chen, Z. Video frame interpolation via deformable separable convolution. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10607–10614. [Google Scholar] [CrossRef]
  25. Cheng, X.; Chen, Z. Multiple video frame interpolation via enhanced deformable separable convolution. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7029–7045. [Google Scholar] [CrossRef] [PubMed]
  26. Shi, Z.; Liu, X.; Shi, K.; Dai, L.; Chen, J. Video frame interpolation via generalized deformable convolution. IEEE Trans. Multimed. 2021, 24, 426–439. [Google Scholar] [CrossRef]
  27. Ding, T.; Liang, L.; Zhu, Z.; Zharkov, I. Cdfi: Compression-driven network design for frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 8001–8011. [Google Scholar]
  28. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  29. Huang, F.; Zhang, X.; Xu, Y.; Wang, X.; Wu, X. Video Frame Interpolation for Polarization via Swin-Transformer. arXiv 2024, arXiv:2406.11371. [Google Scholar] [CrossRef]
  30. Shi, Z.; Xu, X.; Liu, X.; Chen, J.; Yang, M.H. Video frame interpolation transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 17482–17491. [Google Scholar]
  31. Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video enhancement with task-oriented flow. Int. J. Comput. Vis. 2019, 127, 1106–1125. [Google Scholar] [CrossRef]
  32. Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 402–419. [Google Scholar]
  33. Soomro, K.; Zamir, A.R.; Shah, M. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar] [CrossRef]
  34. Baker, S.; Scharstein, D.; Lewis, J.P.; Roth, S.; Black, M.J.; Szeliski, R. A database and evaluation methodology for optical flow. Int. J. Comput. Vis. 2011, 92, 1–31. [Google Scholar] [CrossRef]
35. Montgomery, C.; Lars, H. Xiph.org video test media (derf’s collection). Online 1994, 6, 1. Available online: https://media.xiph.org/video/derf (accessed on 1 April 2025).
36. Feijoo, D.; Benito, J.C.; Garcia, A.; Conde, M.V. Darkir: Robust low-light image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 10879–10889. [Google Scholar]
  37. Loshchilov, I.; Hutter, F. Fixing weight decay regularization in adam. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  38. Park, J.; Ko, K.; Lee, C.; Kim, C.S. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 109–125. [Google Scholar]
  39. Reda, F.; Kontkanen, J.; Tabellion, E.; Sun, D.; Pantofaru, C.; Curless, B. Film: Frame interpolation for large motion. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 250–266. [Google Scholar]
  40. Hu, P.; Niklaus, S.; Sclaroff, S.; Saenko, K. Many-to-many splatting for efficient video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 3553–3562. [Google Scholar]
  41. Kalluri, T.; Pathak, D.; Chandraker, M.; Tran, D. Flavr: Flow-agnostic video representations for fast frame interpolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2071–2082. [Google Scholar]
  42. Kong, L.; Jiang, B.; Luo, D.; Chu, W.; Huang, X.; Tai, Y.; Wang, C.; Yang, J. Ifrnet: Intermediate feature refine network for efficient frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 1969–1978. [Google Scholar]
  43. Lei, P.; Fang, F.; Zeng, T.; Zhang, G. Flow guidance deformable compensation network for video frame interpolation. IEEE Trans. Multimed. 2023, 26, 1801–1812. [Google Scholar] [CrossRef]
  44. Gao, P.; Tian, H.; Qin, J. Video frame interpolation with flow transformer. arXiv 2023, arXiv:2307.16144. [Google Scholar] [CrossRef]
  45. Niklaus, S.; Liu, F. Softmax splatting for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 13–19 June 2020; pp. 5437–5446. [Google Scholar]
  46. Park, J.; Lee, C.; Kim, C.S. Asymmetric bilateral motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14539–14548. [Google Scholar]
  47. Ranjan, A.; Black, M.J. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4161–4170. [Google Scholar]
  48. Xu, H.; Zhang, J.; Cai, J.; Rezatofighi, H.; Tao, D. Gmflow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 8121–8130. [Google Scholar]
Figure 1. Visual comparison with state-of-the-art (SOTA) methods in extreme motion scenes. GT represents the ground truth.
Figure 2. Visualization results of high-frequency and low-frequency component differences.
Figure 3. Example samples from the UHD4K120FPS-N dataset and their corresponding inter-frame optical flow magnitudes.
Figure 4. The overall framework of the model proposed in this paper. The model primarily consists of three components: low-level feature extraction (LFE), inter-frame feature interaction perception (IFIP), and predicted frame reconstruction (PFR). Among these, our main contribution lies in IFIP.
Figure 5. The structure of each module proposed in this paper, including the dual alignment module, region-adaptive interaction module, motion compensation module, and optical flow estimation network.
Figure 6. Qualitative comparison with SOTA methods on the UHD-N-Test dataset.
Figure 7. Qualitative comparison with SOTA methods on the XTest-L-4K [6] dataset.
Figure 8. Qualitative comparison with SOTA methods on the Vimeo90K [31] dataset.
Figure 9. Visualization of heat maps for the RAIM.
Table 1. Percentile distributions of the optical flow magnitude and high-frequency feature ratio for the UHD4K120FPS-N and benchmark datasets. Higher values indicate larger motion magnitudes and a higher high-frequency feature ratio.
Dataset | Optical Flow Magnitudes (25th / 50th / 75th) | High-Frequency Feature Ratios (25th / 50th / 75th)
Vimeo90K-Test [31] | 3.0 / 5.0 / 7.3 | - / - / -
Vimeo90K-Train [31] | 3.4 / 5.6 / 8.0 | - / - / -
SNU-FILM-Hard [14] | 2.7 / 5.6 / 17.1 | - / - / -
SNU-FILM-Extreme [14] | 5.3 / 11.9 / 34.8 | - / - / -
X-Test [6] | 19.2 / 75.6 / 141.0 | 0.0002 / 0.0002 / 0.0004
X-Train [6] | 6.7 / 19.6 / 65.0 | 0.0001 / 0.0002 / 0.0003
UHD-N-Test | 70.5 / 89.7 / 216.6 | 0.0001 / 0.0002 / 0.0004
UHD-N-Train | 23.7 / 30.8 / 44.1 | 0.0002 / 0.0004 / 0.0008
Table 2. Quantitative comparison with 19 SOTA methods on the UHD-N-Test dataset. The best and second-best results are indicated with bold and underlined text, respectively. “↑” indicates that higher is better.
Method | PSNR ↑ | SSIM ↑ | Parameters (Million) | Runtime (ms)
Other:
CAIN [14] | 26.30 | 0.8880 | 42.8 | 37
FLAVR [41] | 27.52 | 0.8901 | 42.4 | 37
Kernel-based:
SepConv [12] | 24.97 | 0.8097 | 21.6 | 200
AdaCoF [13] | 25.89 | 0.8874 | 22.9 | 30
CDFI [27] | 26.04 | 0.8454 | 5.0 | 172
Flow-based:
ToFlow [31] | 24.56 | 0.8042 | 1.1 | 84
BMBC [38] | 26.08 | 0.8432 | 11.0 | 822
XVFIv [6] | 26.21 | 0.8559 | 5.5 | 98
DAIN [15] | 25.45 | 0.8448 | 24.0 | 151
UPR-base [20] | 26.94 | 0.8972 | 1.7 | 42
UPR-large [20] | 23.60 | 0.8331 | 3.7 | 62
RIFEm [11] | 28.41 | 0.9104 | 9.8 | 12
EBME-H [19] | 23.46 | 0.7829 | 3.9 | 40
EBME [19] | 26.76 | 0.8910 | 3.9 | 20
M2M-PWC [40] | 23.62 | 0.8350 | 7.6 | 32
FILM-L_S [39] | 27.89 | 0.8937 | 34.4 | 101
EMA-small [9] | 26.88 | 0.8951 | 14.5 | 30
SGM-local-branch [10] | 26.68 | 0.9057 | 15.4 | 57
SGM-small-1/2-points [10] | 26.86 | 0.8901 | 20.8 | 56
Ours | 28.46 | 0.9110 | 29.8 | 42
Table 3. Quantitative comparison of 2× interpolation on the XTest-L [6] and Xiph-L [35] datasets with the SOTA methods. The best and second-best results are indicated with bold and underlined text, respectively. “↑” indicates that higher is better.
Method | XTest-L-2K (PSNR ↑ / SSIM ↑) | XTest-L-4K (PSNR ↑ / SSIM ↑) | Xiph-L-2K (PSNR ↑ / SSIM ↑) | Xiph-L-4K (PSNR ↑ / SSIM ↑)
XVFI [6] | 29.82 / 0.8951 | 29.02 / 0.8866 | 29.17 / 0.8449 | 28.09 / 0.7889
RIFE [11] | 29.87 / 0.8805 | 28.98 / 0.8756 | 30.18 / 0.8633 | 28.07 / 0.7982
EMA-small [9] | 29.51 / 0.8775 | 28.60 / 0.8733 | 30.54 / 0.8718 | 28.40 / 0.8109
SGM-local-branch [10] | 30.39 / 0.8946 | 29.25 / 0.8861 | 30.89 / 0.8745 | 28.59 / 0.8115
Ours | 30.45 / 0.8961 | 29.70 / 0.8908 | 30.86 / 0.8746 | 28.87 / 0.8134
Table 4. Quantitative comparison of 8× interpolation on the XTest [6] dataset with the SOTA methods. “†” denotes training on the XTrain [6] dataset. The best and second-best results are indicated with bold and underlined text, respectively.
Resolution | DAIN [15] | IFRNet [42] | XVFI † [6] | EMA-Small [9] | M2M [40] | Ours
2K | 29.33 | 31.53 | 30.85 | 31.89 | 32.13 | 32.41
4K | 26.78 | 30.46 | 30.12 | 30.89 | 30.88 | 31.23
Table 6. Ablation experiment on the dual alignment module. “w/o” denotes without. The best results are indicated with bold text.
Method | UHD-N-Test (PSNR ↑ / SSIM ↑) | Vimeo90K (PSNR ↑ / SSIM ↑)
SPyNet [47] | 27.94 / 0.9090 | 36.15 / 0.9801
RAFT [32] | 28.05 / 0.9098 | 36.18 / 0.9806
GMFlow [48] | 28.21 / 0.9104 | 36.21 / 0.9805
w/o OAM | 28.17 / 0.9107 | 36.26 / 0.9805
Ours | 28.46 / 0.9110 | 36.39 / 0.9811
Table 7. Ablation experiment on the RAIM and MCM. The best results are indicated with bold text.
Method | UHD-N-Test (PSNR ↑ / SSIM ↑) | Vimeo90K (PSNR ↑ / SSIM ↑)
3 × 3 Window | 26.42 / 0.8535 | 35.70 / 0.9780
1 × 1 Window (Global) | 26.31 / 0.8600 | 35.66 / 0.9780
w/o MCM | 27.20 / 0.8627 | 35.75 / 0.9788
Cross-Attention | 27.02 / 0.8643 | 35.68 / 0.9779
Ours | 27.31 / 0.8654 | 35.84 / 0.9795
Table 8. Complexity and parameter analysis of the RSAM.
Method | Theoretical FLOPs | Parameters (M): 16 × 16 / 32 × 32 | Actual FLOPs (G): 16 × 16 / 32 × 32
L-Att | O(HW(3C·d_LA + 4·h_LA·r^2·d_head + d_LA^2)) | 1.84 / 0.46 | 0.47 (61%) / 0.47 (47%)
G-Att | O(HW(C·d_GA·(1 + 2s^2) + 4·h_GA·s^2·d_head + d_GA^2)) | 1.84 / 0.46 | 0.30 (39%) / 0.54 (53%)
RSAM | FLOP_LA + FLOP_GA | 3.68 / 0.92 | 0.77 / 1.01
Total | - | 29.87 | 182
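To make the asymptotic expressions in Table 8 easier to read, the short Python sketch below simply evaluates them for hypothetical symbol values (feature-map size H × W, channel width C, branch dimensions d_LA and d_GA, head counts h_LA and h_GA, head dimension d_head, local window size r, and global sampling size s). All values below are illustrative assumptions chosen for demonstration only, not the configuration used in the paper.

# Minimal sketch: evaluating the theoretical FLOP expressions from Table 8.
# The symbol values below are assumptions for illustration only.

def l_att_flops(H, W, C, d_la, h_la, r, d_head):
    # O(HW(3*C*d_LA + 4*h_LA*r^2*d_head + d_LA^2)) from Table 8.
    return H * W * (3 * C * d_la + 4 * h_la * r ** 2 * d_head + d_la ** 2)

def g_att_flops(H, W, C, d_ga, h_ga, s, d_head):
    # O(HW(C*d_GA*(1 + 2*s^2) + 4*h_GA*s^2*d_head + d_GA^2)) from Table 8.
    return H * W * (C * d_ga * (1 + 2 * s ** 2) + 4 * h_ga * s ** 2 * d_head + d_ga ** 2)

if __name__ == "__main__":
    H, W, C = 135, 240, 64  # hypothetical feature-map size and channel width
    local_flops = l_att_flops(H, W, C, d_la=64, h_la=4, r=16, d_head=16)
    global_flops = g_att_flops(H, W, C, d_ga=64, h_ga=4, s=16, d_head=16)
    print(f"L-Att ~ {local_flops / 1e9:.2f} GFLOPs, G-Att ~ {global_flops / 1e9:.2f} GFLOPs")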