RA-BiMENet: Continuous-Time 4D Medical Image Interpolation via Relation-Aware Bi-Directional Motion Estimation

Li, Liangjiang; Lyu, Jun

doi:10.3390/s26103034

Open AccessArticle

RA-BiMENet: Continuous-Time 4D Medical Image Interpolation via Relation-Aware Bi-Directional Motion Estimation

by

Liangjiang Li

and

Jun Lyu

^*

School of Computer and Control Engineering, Yantai University, Yantai 264005, China

^*

Author to whom correspondence should be addressed.

Sensors 2026, 26(10), 3034; https://doi.org/10.3390/s26103034

Submission received: 29 March 2026 / Revised: 6 May 2026 / Accepted: 9 May 2026 / Published: 11 May 2026

(This article belongs to the Special Issue Sensing and Processing for Medical Imaging: Methods and Applications)

Download

Browse Figures

Versions Notes

Abstract

Four-dimensional medical images introduce the temporal dimension to three-dimensional spatial data, enabling the dynamic characterization of organ motion and providing important support for disease diagnosis and functional assessment. However, due to constraints such as low-dose acquisition and prolonged scanning, the challenges faced in obtaining 4D medical images with high temporal resolution include insufficient spatial sampling, severe motion artifacts, and image blurring. Therefore, generating high-quality and temporally continuous intermediate frames while ensuring patient safety remains a critical challenge in 4D medical image interpolation. To address this issue, we propose the Relation-Aware Bi-directional Motion Estimation Network (RA-BiMENet) for 4D medical image interpolation, which enables the accurate prediction of intermediate frames at arbitrary time points. Specifically, RA-BiMENet consists of two key components: a spatiotemporal transform MLP (TS-MLP) module and a hierarchical spatiotemporal fusion (HSTF) module. The TS-MLP module performs bi-directional motion estimation in a pyramid-recursive manner, where a relation-aware multi-scale MLP (RAM-MLP) unit is introduced to model local correlations and multi-scale dependencies for accurate nonlinear motion estimation. Based on the estimated transformations, the HSTF module hierarchically integrates cross-temporal features through forward warping and self-attention, thereby enhancing local detail restoration while preserving global temporal consistency. Experimental results demonstrate that RA-BiMENet outperforms state-of-the-art methods on multiple quantitative evaluation metrics and is capable of generating high-fidelity and temporally coherent interpolated frames under complex deformation scenarios, validating its effectiveness and superiority for continuous-time 4D medical image interpolation.

Keywords:

4D medical imaging; MLP; feature fusion; relationship awareness; spatiotemporal modeling

1. Introduction

Four-dimensional medical imaging introduces a temporal dimension into conventional 3D medical imaging, enabling dynamic characterization of organ motion and providing richer spatiotemporal information for clinical research, diagnosis, and treatment. For example, 4D CT can be used for motion modeling of respiration-related organs [1]; 4D MRI facilitates cardiac function assessment [2]; 4D US has been widely applied for echocardiographic analysis [3]. Owing to its ability to dynamically capture organ motion patterns and biomechanical properties, 4D medical imaging plays an important role in disease diagnosis, functional assessment, and treatment planning.

However, the acquisition of high-quality 4D medical images often requires prolonged scanning. In radiological imaging scenarios in particular, long-duration or repeated scans may increase radiation exposure and introduce a risk of secondary cancer [4]. Therefore, low-dose imaging has become an urgent clinical need for ensuring patient safety. Nevertheless, low-dose acquisition is usually accompanied by insufficient temporal sampling, which limits temporal resolution and leads to a pronounced trade-off between image quality and patient safety [5,6]. In addition, prolonged scanning may introduce motion artifacts and image blurring [7,8], further degrading imaging accuracy. As a result, achieving a proper balance between temporal resolution and spatial resolution while accurately capturing the dynamic anatomical changes of rapidly moving organs, such as the heart, lungs, and abdomen, remains a critical challenge in 4D medical imaging.

In the field of natural image processing, video frame interpolation (VFI), as an effective post-processing technique, can generate high-frame-rate videos from limited frame sequences, thereby alleviating the conflict between temporal and spatial resolution to a certain extent [9,10]. However, directly transferring VFI methods developed for natural images to 4D medical image interpolation remains highly challenging. First, most mainstream VFI methods are data-driven and rely on large-scale training data to learn the statistical characteristics of natural videos, whereas medical imaging datasets are typically limited due to radiation constraints, acquisition cost, and annotation difficulty. Second, natural-image VFI mainly focuses on joint modeling of two-dimensional spatial information and time, while 4D medical image interpolation requires joint modeling of three-dimensional spatial structures and temporal dynamics. The straightforward extension of existing methods would therefore substantially increase the number of model parameters, computational complexity, and training cost.

The goal of 4D medical image interpolation is to generate intermediate frames that are anatomically plausible, morphologically continuous, and temporally smooth among neighboring frames. Its core challenge lies in accurately modeling the complex nonlinear spatiotemporal transformations between adjacent frames. In recent years, video frame interpolation methods based on convolutional neural networks (CNNs) and transformers have been gradually introduced for 4D medical image interpolation tasks [11,12]. Meanwhile, with the growing demand for long-range dependency modeling, multilayer perceptron (MLP)-based approaches have also attracted increasing attention [13]. MLP has shown strong potential in capturing fine-grained long-range dependencies and has achieved promising results in dense prediction tasks for medical imaging [14]. However, the application of MLP to 4D medical image interpolation is still in the early stage. Existing MLP-based methods usually focus on global feature interactions along the spatial or channel dimensions, while explicit modeling of local feature relationships, especially local motion correlations across temporal frames, remains insufficient. This limitation restricts the accuracy of motion estimation and the realism of the interpolated results under complex deformation scenarios.

To address these issues, we propose the Relation-Aware Bi-directional Motion Estimation Network (RA-BiMENet) for 4D medical image interpolation, which enables the high-quality prediction of intermediate frames at arbitrary time points. The proposed method consists of two key components: a spatiotemporal transform MLP (TS-MLP) module and a hierarchical spatiotemporal fusion (HSTF) module. Specifically, the TS-MLP module adopts a pyramid-recursive framework for bi-directional motion estimation. Within TS-MLP, a relation-aware multi-scale MLP (RAM-MLP) unit is introduced to compute local correlations between adjacent-frame feature maps and model multi-level dependencies through multi-scale MLPs, thereby effectively capturing complex nonlinear motion and achieving accurate bi-directional motion estimation. In particular, this module captures both global structural information and local dynamic variations at different pyramid levels, producing more precise nonlinear spatiotemporal transformation fields. After motion estimation, the HSTF module is employed to fuse and refine cross-temporal features for high-quality intermediate-frame reconstruction. HSTF first aligns the input frames and their contextual features to the target intermediate time point through forward warping and maps spatiotemporal features from different levels into a unified semantic space, thereby enabling cross-level feature fusion while reducing model complexity. It then introduces a self-attention mechanism to explicitly model interframe dependencies, allowing effective integration of cross-temporal information, enhanced local detail restoration, and preserved global temporal consistency.

The main contributions of this paper are summarized as follows:

We propose RA-BiMENet, a relation-aware bi-directional motion estimation network for 4D medical image interpolation, enabling accurate intermediate-frame prediction at arbitrary time points.
We developed a spatiotemporal transform MLP (TS-MLP) module for bi-directional motion estimation, in which a relation-aware multi-scale MLP (RAM-MLP) unit was designed to capture complex nonlinear motion through local correlation modeling and multi-scale dependency learning.
We developed a hierarchical spatiotemporal fusion (HSTF) module to fuse cross-frame features via forward warping and self-attention, improving detail reconstruction while preserving temporal consistency.
Extensive experiments on the ACDC and 4D-Lung datasets demonstrate that RA-BiMENet generates high-fidelity and temporally continuous interpolated frames, outperforming existing state-of-the-art methods.

2. Related Works

2.1. Video Frame Interpolation

In the field of natural image processing, video frame interpolation has been extensively studied, showing strong performance in frame rate up-conversion tasks. Existing methods can generally be categorized into two main types: kernel-based methods and optical flow-based methods. Kernel-based methods synthesize the target intermediate frame by learning or predicting spatially adaptive convolution kernels and performing weighted aggregation of neighboring frames. For example, Niklaus et al. [15] proposed adaptive separable convolution kernels to estimate pixel-wise motion, while Gui et al. [16] introduced a two-stage “structure-to-texture” framework that first generates structural information through structure-guided feature flow and then restores details with a texture compensator. In contrast, optical flow-based methods show that kernel-based approaches are limited by their receptive fields and thus struggle with large-displacement motion. These methods usually estimate motion fields between adjacent frames, then warp or align the input frames accordingly before synthesizing the target frame. Liu et al. [17] proposed a deep voxel flow method, which synthesizes new frames by resampling and propagating pixel information from existing frames. Jin et al. [18] proposed a compact interpolation model that leverages motion cues from both input frames through a lightweight pyramid network and reconstructs intermediate frames by fusing forward-warped features. Hu et al. [19] introduced implicit acceleration estimation and knowledge distillation to modulate linear motion into quadratic motion, enabling more accurate modeling of complex curved motion.

However, natural-image VFI methods typically rely on real intermediate-frame supervision and are mainly designed for 2D image sequences. Directly extending them to 3D medical images substantially increases model size, computational cost, and training complexity, which limits their applicability to high-dimensional medical images and complex dynamic scenarios.

2.2. Four-Dimensional Medical Image Interpolation

Four-dimensional medical image interpolation has received increasing attention in recent years. VoxelMorph provides a fast learning-based framework that generates deformation fields with convolutional networks for registration and interpolation tasks [20]. Guo et al. [11] were the first to propose a spatiotemporal volumetric interpolation network for 4D medical images, combining unsupervised spatiotemporal motion estimation with periodic regression modeling to achieve high-accuracy volumetric interpolation. Kim et al. [21] proposed a joint diffusion–deformation modeling method that generates continuous 4D medical image sequences between source and target volumes while preserving topological consistency. Wei et al. [22] introduced a multi-pyramid voxel flow model that efficiently generates 3D volumes at arbitrary time points through multi-scale voxel flow and pyramid fusion. Kim et al. [23] proposed an unsupervised volumetric interpolation framework, UVI-Net, which achieves accurate temporal interpolation without intermediate-frame supervision and remains robust in low-data settings. Zhang et al. [24] introduced a test-time self-supervised training strategy that adapts to distribution shifts on unlabeled test data, thereby improving interpolation accuracy and generalization. Although these methods have advanced interpolation accuracy and spatiotemporal modeling, most still rely on convolutional or deformation-based modules and mainly focus on global or pyramid-level features, making it difficult to fully capture fine-grained local spatiotemporal relationships. In contrast, MLP-based methods remain underexplored in 4D medical image interpolation but show the potential for more flexible feature interaction and improved modeling of complex motion.

2.3. Multilayer Perceptron (MLP)

Motion modeling is central to video frame interpolation. Existing methods are mainly based on convolutional neural networks (CNNs) or transformers. CNNs are effective at extracting local features but are limited in modeling long-range spatiotemporal dependencies due to their restricted receptive fields. Transformers are capable of capturing global relationships, but their computational cost is high, and they are relatively less effective in modeling local details. By contrast, MLP-based models have recently attracted increasing attention in vision tasks because they can model both local and global dependencies with relatively high efficiency, making them suitable for representing complex spatiotemporal features [13]. In natural image processing, implicit neural-representation-based methods have achieved significant progress. Zhang et al. [25] proposed an implicit video compression method that performs motion compensation and residual modeling through implicit networks. Guo et al. [26] introduced a generalizable implicit motion modeling approach that predicts optical flow at arbitrary timestamps using motion latent variables and adaptive coordinate networks, achieving accurate spatiotemporal modeling of real videos and integrating seamlessly with existing flow-based VFI methods. Hu et al. [19] proposed an implicit quadratic video frame interpolation method that more accurately models complex curved motion via implicit acceleration estimation and knowledge distillation.

In comparison, MLP-based studies in 4D medical imaging remain limited. Wang et al. [27] proposed a spatial-attention-based implicit neural representation network that represents MRI images as continuous functions of 3D coordinates for high-accuracy super-resolution reconstruction at arbitrary slice spacing. Wolterink et al. [14] presented an implicit deformable medical image registration method based on multilayer perceptrons, which directly represents the transformation function and achieves accurate 4D thoracic CT registration without training data. These studies indicate that MLPs and implicit representations are promising for continuous medical image modeling, while their application to 4D medical image interpolation remains largely unexplored.

2.4. Multi-Scale Feature Fusion

In computer vision and medical image analysis, multi-scale feature integration has been widely used to enhance the representation of complex structures and fine-grained details. By fusing features from different levels, models can capture both global structural information and local details, thereby improving performance in dense prediction tasks. Among existing methods, U-Net and its variants usually employ skip connections to fuse encoder and decoder features at different levels, achieving basic multi-scale feature integration [28,29]. In 4D medical image interpolation, some methods further include multi-scale fusion strategies to enhance cross-frame modeling. For example, the MPVF model proposed by Wei et al. [22] employs multi-scale voxel flow and pyramid fusion to efficiently generate 3D volumes at arbitrary time points, although its computational costs are high for large-scale 3D inputs. To reduce the overhead of multi-scale fusion, some studies have explored performing multi-scale computation on smaller feature sets. Xie et al. [30] proposed an efficient self-attention module that performs multi-scale computation only on compact feature sets, providing a useful reference for efficient feature integration. Inspired by this idea, our HSTF module integrates spatiotemporal features from different network levels and maps them into a unified semantic space, enabling efficient cross-frame feature fusion.

3. Methodology

3.1. Overview

Four-dimensional medical image interpolation aims to synthesize anatomically plausible and temporally coherent intermediate volumes between two observed time points. Given two 4D medical image volumes

F_{0}, F_{1} \in R^{H \times W \times D}

, corresponding to the starting and ending states of organ motion, our goal is to generate a high-quality interpolated volume

F_{t}

at an arbitrary intermediate time point

t \in (0, 1)

. To this end, we propose the Relation-Aware Bi-directional Motion Estimation Network (RA-BiMENet) for continuous-time 4D medical image interpolation.

As illustrated in Figure 1, RA-BiMENet consists of two key components: a spatiotemporal transform MLP (TS-MLP) module for motion estimation and a hierarchical spatiotemporal fusion (HSTF) module for intermediate-frame reconstruction. The TS-MLP module is formulated as a pyramid–recursive architecture, in which relation-aware multi-scale MLP (RAM-MLP) units are employed at different levels to progressively estimate and refine bi-directional spatiotemporal transformation fields between the input frames. To support interpolation at arbitrary time points, we also introduce temporal positional encoding (TPE) to explicitly inject temporal information into the motion estimation process, thereby generating time-specific transformation fields. Based on the estimated transformations, the HSTF module performs cross-temporal feature fusion through forward warping, self-attention, and cross-level feature aggregation to reconstruct the target intermediate frame with rich details and temporal continuity.

The whole framework is optimized in an end-to-end manner, where TS-MLP and HSTF are jointly trained with a unified objective function, enabling mutual learning between motion estimation and frame reconstruction, thereby allowing RA-BiMENet to effectively capture complex nonlinear spatiotemporal dynamics through relation-aware motion modeling, explicit temporal encoding, and hierarchical spatiotemporal fusion and to generate anatomically consistent and temporally coherent intermediate images.

3.2. Spatiotemporal Transform MLP (TS-MLP)

As shown in Figure 1b, the proposed TS-MLP module consists of a hierarchical feature encoder and a RAM-MLP-based multi-scale decoder for bi-directional motion estimation. The encoder is composed of four consecutive convolutional blocks, with

2 \times 2 \times 2

average pooling applied between adjacent levels for downsampling, so as to extract multi-scale features from the input volumes

F_{0}

and

F_{1}

:

f_{0}^{i} = E_{i} (F_{0}), f_{1}^{i} = E_{i} (F_{1}), i = 1, 2, 3, 4,

(1)

where

E_{i} (\cdot)

denotes the i-th convolutional block together with the corresponding downsampling operation, and

f_{0}^{i}

and

f_{1}^{i}

represent the output features at the i-th level.

Based on these hierarchical features, the decoder progressively estimates the bi-directional spatiotemporal transformation fields in a coarse-to-fine manner. At the coarsest level,

f_{0}^{4}

and

f_{1}^{4}

are fed into the RAM-MLP module to establish initial spatial correspondences and generate coarse transformation fields

Φ_{0 \to 1}^{4}

and

Φ_{1 \to 0}^{4}

. At finer levels, the transformation field estimated from the previous level is propagated to guide feature warping and alignment and is further refined by the current-level RAM-MLP. Through this recursive optimization, the complete bi-directional transformation fields

Φ_{0 \to 1}

and

Φ_{1 \to 0}

are finally obtained.

As shown in Figure 2, the RAM-MLP module is designed to jointly model interframe relations and cross-scale dependencies. For the i-th level, we first compute the local correlation between

f_{0}^{i}

and

f_{1}^{i}

to obtain a correlation map

f_{c}^{i}

, which is then concatenated with the original features to form a relation-aware representation:

\begin{matrix} f_{c}^{i} & = Corr (f_{0}^{i}, f_{1}^{i}), \\ f_{Corr}^{i} & = Concat (f_{0}^{i}, f_{c}^{i}, f_{1}^{i}), \end{matrix}

(2)

where

Corr (\cdot)

denotes the local correlation operation. Next,

f_{Corr}^{i}

is partitioned into non-overlapping windows at multiple scales:

f_{Corr, w}^{i} = Partition (f_{Corr}^{i}, w \times w \times w) w \in {3, 5, 7},

(3)

where

f_{Corr, w}^{i}

denotes the local features partitioned at scale w. The features at each scale are then processed by a WinMLP block:

z_{w}^{i} = WinMLP (f_{Corr, w}^{i}) .

(4)

To adaptively fuse multi-scale information, an MLP is used to generate the fusion weights:

α_{w} = MLP (\sum_{w \in {3, 5, 7}} z_{w}^{i}),

(5)

and the fused representation is obtained by

z^{i} = \sum_{w \in {3, 5, 7}} α_{w} \cdot z_{w}^{i} .

(6)

Finally, the fused feature is further enhanced by a channel attention module to strengthen relation-aware multi-scale modeling. By jointly exploiting local correlations and multi-scale dependencies, TS-MLP is able to effectively capture complex nonlinear motion.

3.3. Temporal Positional Encoding

In 4D medical image interpolation, accurate synthesis at an arbitrary time point requires the network to capture not only the spatial correspondence between the starting frame

F_{0}

and the ending frame

F_{1}

but also the temporal position of the target frame

t \in (0, 1)

. To this end, we introduce temporal positional encoding (TPE) [31] to explicitly encode temporal information into the motion estimation process. Specifically, each time step t is mapped to a d-dimensional embedding vector:

PE (t) = [PE (t, 0), PE (t, 1), \dots, PE (t, d - 1)] .

(7)

The even and odd dimensions are defined as:

\begin{matrix} PE (t, 2 i) & = sin (\frac{t}{{10,000}^{2 i / d}}), \\ PE (t, 2 i + 1) & = cos (\frac{t}{{10,000}^{2 i / d}}), \end{matrix}

(8)

where

i = 0, 1, \dots, d / 2 - 1

. This sinusoidal formulation provides each time point with a unique representation while preserving relative temporal relationships across different positions. The scaling factor

{10,000}^{2 i / d}

follows the standard positional encoding design in transformer-based models, particularly ViT, where 10,000 is empirically chosen to ensure a sufficiently long wavelength spectrum, enabling unique and stable encoding across embedding dimensions and allowing the model to capture temporal information at multiple scales.

During interpolation, the bi-directional transformation fields

Φ_{0 \to 1}

and

Φ_{1 \to 0}

, which represent the estimated voxel-wise motion fields from the first frame to the last frame and from the last frame to the first frame, respectively, are combined with the endpoint temporal encodings

PE (0)

and

PE (1)

, respectively. Meanwhile, the target encoding

PE (t)

is introduced as an additional condition to predict the transformation fields corresponding to the target time point:

Φ_{0 \to t}, Φ_{1 \to t} = {MLP}_{motion} (Φ_{0 \to 1} + PE (0), Φ_{1 \to 0} + PE (1), PE (t)),

(9)

where

{MLP}_{motion}

denotes an independent multilayer perceptron used for motion field prediction. By explicitly incorporating temporal positional encoding, the network can jointly exploit spatial transformation cues and temporal position information, leading to more accurate modeling of continuous motion and improved interpolation consistency.

3.4. Hierarchical Spatiotemporal Fusion (HSTF)

As shown in Figure 1c, after obtaining the target-time bi-directional transformation fields

Φ_{0 \to t}

and

Φ_{1 \to t}

, we introduce a hierarchical spatiotemporal fusion (HSTF) module to reconstruct the intermediate frame. Specifically, HSTF takes the transformation fields, the original input volumes

F_{0}

and

F_{1}

, and the corresponding forward-warped volumes

w_{0 \to t}

and

w_{1 \to t}

as inputs and extracts four levels of pyramid contextual features, denoted as

X_{1}, X_{2}, X_{3},

and

X_{4}

, through an encoder.

Features at different levels exhibit complementary properties: shallow features preserve fine local details, while deep features encode higher-level semantic context. To fully exploit these multi-level representations, HSTF first upsamples

X_{2}

,

X_{3}

, and

X_{4}

to the same spatial resolution as

X_{1}

and concatenates them to form the fused feature

X^{'}

:

X^{'} = Concat (Up (X_{2}), Up (X_{3}), Up (X_{4})),

(10)

where

Up (\cdot)

denotes the upsampling operation to match the spatial resolution of

X_{1}

.

X_{attn} = SelfAttn ({Conv}_{1 \times 1 \times 1} (X^{'})),

(11)

The output contains two channels corresponding to the residual term and the fusion mask. Let

{refine}_{1}

and

{refine}_{2}

denote the first and second channels of

refine

:

{refine}_{res} = σ ({refine}_{1}) \cdot 2 - 1,

(12)

{refine}_{mask} = σ ({refine}_{2}),

(13)

where

σ (\cdot)

denotes the sigmoid function. The fusion mask is then used to adaptively blend the forward-warped volumes:

merged_img = w_{0 \to t} \cdot {refine}_{mask} + w_{1 \to t} \cdot (1 - {refine}_{mask}) .

(14)

Finally, the residual term is added to the fused result, and the output is clipped to the valid intensity range to obtain the interpolated frame:

{\hat{F}}_{t} = clip (merged_img + {refine}_{res}, 0, 1),

(15)

where

clip (\cdot)

constrains the output to

[0, 1]

. By hierarchically integrating multi-level spatiotemporal features, HSTF effectively enhances local detail restoration while preserving global temporal consistency in the interpolated frame.

3.5. Loss Function

The proposed model is optimized using a joint objective that consists of an image similarity term, a pixel-wise reconstruction term, and a smoothness regularization term, namely, the normalized cross-correlation loss

L_{ncc}

, the Charbonnier loss

L_{cha}

, and the regularization term

L_{reg}

.

The normalized cross-correlation loss is defined as

L_{ncc} (I, J) = - \frac{1}{N} \sum_{x = 1}^{N} \frac{{(\sum_{p \in ω (x)} (I (p) - μ_{I} (x)) (J (p) - μ_{J} (x)))}^{2}}{(\sum_{p \in ω (x)} {(I (p) - μ_{I} (x))}^{2}) (\sum_{p \in ω (x)} {(J (p) - μ_{J} (x))}^{2}) + ε},

(16)

where I and J denote the ground truth and predicted images, respectively, x denotes a spatial location,

ω (x)

is the local window centered at x,

μ_{I} (x)

and

μ_{J} (x)

are the local means within the local window, N is the total number of spatial locations, and

ε

is a small constant for numerical stability.

The Charbonnier loss is defined as

L_{cha} (I, J) = \frac{1}{N} \sum_{x = 1}^{N} \sqrt{{(I (x) - J (x))}^{2} + ε^{2}} .

(17)

To encourage smooth spatiotemporal transformation fields, we further impose a diffusion regularization term on

Φ

:

L_{reg} (Φ) = \sum_{x = 1}^{N} ∥ \nabla Φ (x) ∥,

(18)

where ∇ denotes the spatial gradient operator.

For the bi-directional motion estimation module, the estimated transformation fields

Φ_{0 \to 1}

and

Φ_{1 \to 0}

are used to warp

F_{0}

and

F_{1}

, yielding

F_{0 \to 1}

and

F_{1 \to 0}

, respectively. The corresponding loss is formulated as

\begin{matrix} L_{TS} = & L_{ncc} (F_{0}, F_{1 \to 0}) + L_{ncc} (F_{1}, F_{0 \to 1}) \\ + L_{cha} (F_{0}, F_{1 \to 0}) + L_{cha} (F_{1}, F_{0 \to 1}) \\ + L_{reg} (Φ_{0 \to 1}) + L_{reg} (Φ_{1 \to 0}) . \end{matrix}

(19)

For the HSTF module, the loss is defined as

L_{HSTF} = L_{ncc} (F_{t}, {\hat{F}}_{t}) + L_{cha} (F_{t}, {\hat{F}}_{t}) + L_{reg} (Φ_{0 \to t}) + L_{reg} (Φ_{1 \to t}),

(20)

where

{\hat{F}}_{t}

denotes the predicted interpolated image.

The overall training objective is given by

L = L_{TS} + L_{HSTF} .

(21)

4. Experiments

4.1. Dataset

To validate the effectiveness of the proposed method for 4D medical image interpolation, experiments were conducted on two publicly available medical imaging datasets.

The ACDC cardiac dataset [32] contains 4D cardiac MRI sequences from 100 patients. Prior to training and evaluation, all volumes were resampled to a uniform spatial resolution of

128 \times 128 \times 32

. In this dataset, the end diastolic (ED) image was selected as the starting frame, while the end systolic (ES) image was used as the ending frame. For dataset partitioning, 90 subjects were used for training, and the remaining 10 subjects were reserved for testing.

The 4D-Lung dataset [33] consists of 82 thoracic 4D-CT sequences collected from 20 patients. All volumes were resampled to a uniform spatial resolution of

128 \times 128 \times 128

. For this dataset, the end inspiration (EI) image was used as the starting frame, and the end expiration (EE) image was adopted as the ending frame. Following the experimental setup, 68 sequences were used for training, while the remaining 14 sequences were used for testing.

4.2. Experimental Setup

Implementation Details. The proposed method was implemented in PyTorch and trained/tested on an NVIDIA RTX A6000 GPU. In the network architecture, the number of channels in the first encoder stage was set to

C = 8

. During training, the Adam optimizer was adopted with an initial learning rate of

1 \times 10^{- 4}

, a batch size of 1, and a total of 200 epochs.

Evaluation Metrics. To comprehensively evaluate model performance, we used PSNR [34], NCC, SSIM [35], NMSE and LPIPS [36]. Specifically, PSNR measures the voxel-level reconstruction quality, NCC reflects global correlation, SSIM evaluates local structural and textural similarity, NMSE quantifies overall reconstruction error, and LPIPS assesses perceptual similarity. Together, these metrics provide a comprehensive evaluation of the interpolated results in terms of voxel-wise fidelity, structural preservation, error magnitude, and perceptual quality.

4.3. Comparison with Other Methods

To evaluate the performance of RA-BiMENet, we compared it with six representative methods, including two supervised methods, SVIN [11] and MPVF [22], and four unsupervised methods, VM [20], DDM [21], UVI-Net [23], and TTT4MII [37]. These methods perform 4D medical image interpolation via spatiotemporal volumetric interpolation, multi-pyramid voxel flow, scaled deformation fields, diffusion-deformation modeling, unsupervised volumetric interpolation, and test-time training, respectively. Comparisons with these methods provided a comprehensive evaluation of RA-BiMENet in terms of prediction accuracy, structural preservation, and temporal consistency.

Quantitative Evaluation. Table 1 reports the results of the quantitative comparison of the proposed method and existing supervised and unsupervised interpolation methods on the Cardiac and 4D-Lung datasets. It can be observed that RA-BiMENet achieves the best performance on all evaluation metrics on both datasets.

On the Cardiac dataset, RA-BiMENet achieves a PSNR of 33.85 dB and an SSIM of 0.978, while significantly outperforming the competing methods in terms of NMSE and LPIPS. These results indicate that the interpolated volumes generated by our method exhibit higher structural fidelity and perceptual quality. On the 4D-Lung dataset, RA-BiMENet also achieves the best performance, with a PSNR of 35.42 dB and an SSIM of 0.986. The further improvements in NMSE (0.359) and LPIPS (1.185) demonstrate its strong generalization ability and robustness across different organs and imaging modalities.

Compared with recent representative methods such as UVI-Net and TTT4MII, RA-BiMENet is better on all metrics, indicating that it can more accurately model spatiotemporal motion patterns under complex nonlinear deformations. These results confirm the effectiveness of the proposed method in spatiotemporal consistency modeling and dynamic deformation learning, highlighting its superiority for high-quality 4D medical image interpolation.

Qualitative Evaluation. Figure 3 and Figure 4 present the qualitative results on the Cardiac dataset. Figure 3 shows the intermediate frames generated by RA-BiMENet at different time steps (

t = 0.1

to

t = 0.9

) over the temporal sequence from end diastole (ED) to end systole (ES). The results demonstrate that the interpolated frames are temporally smooth and coherent, accurately capturing the progressive myocardial contraction. The changes in ventricular cavity contraction and myocardial wall thickness are naturally reconstructed, showing high consistency with the ground truth (GT) sequence and no obvious motion discontinuities or temporal artifacts. Figure 4 further provides a visual comparison of single-frame interpolation results. It can be observed that SVIN, MPVF, VM, and DDM exhibit noticeable blur or artifacts around the myocardial boundaries and blood pool regions, with more highlighted regions in the corresponding error maps. UVI-Net and TTT4MII still show visible deviations in the apex and boundary regions. In contrast, RA-BiMENet achieves more accurate structural detail recovery and texture reconstruction, with the lowest error response in the regions of interest and a more uniformly low-error distribution overall, demonstrating superior local consistency and interpolation accuracy.

Figure 5 and Figure 6 show the qualitative results on the 4D-Lung dataset. Figure 5 presents the multi-frame interpolation results at different time steps (

t = 0.2, 0.4, 0.6, 0.8

) in both axial and coronal views. The results indicate that RA-BiMENet can accurately reconstruct the expansion and contraction dynamics of lung motion during respiration, while maintaining good continuity of fine structures such as lung contours and vascular textures, which remain highly consistent with the ground truth sequence. Figure 6 compares the single-frame interpolation results. Compared with other methods, RA-BiMENet more accurately recovers the instantaneous morphological variations in lung tissue during respiration, whereas the competing methods still exhibit clear deviations in anatomically complex regions such as the lung boundaries, apex, and base. The corresponding error maps demonstrate that RA-BiMENet is more effective in modeling subtle organ deformations and boundary variations, while better preserving temporal smoothness with almost no visible artifacts.

Model complexity. To evaluate model efficiency, we compare the number of parameters in Table 1. SVIN (8.4 M) and MPVF (26.45 M) are relatively heavy models, while VM (0.3 M) and UVI-Net (0.51 M) are lightweight. In contrast, our method contains 7.52 M parameters, which is significantly fewer than TTT4MII (14.29 M), while achieving the best performance on both datasets. These results demonstrate that our method achieves a good balance between accuracy and model complexity.

Overall, the qualitative results demonstrate that RA-BiMENet not only generates high-quality intermediate frames but also exhibits strong stability and generalization across organs (heart and lung), imaging planes, and motion amplitudes. Whether for fine-grained cardiac motion during contraction or large deformations induced by respiration, the generated interpolated frames remain highly consistent with the ground truth sequences. Meanwhile, the proposed model achieves this performance with a moderate number of parameters (7.52 M), indicating a favorable balance between reconstruction quality and model complexity.

4.4. Ablation Study

To verify the effectiveness of different components in RA-BiMENet, we conducted systematic ablation studies on the Cardiac and 4D-Lung datasets from three aspects:

Architectural design analysis. As shown in Table 2, the baseline model equipped with RAM-MLP significantly outperforms VoxelMorph [20] and Hire-MLP [38] on all evaluation metrics. This improvement mainly stems from the design of RAM-MLP. VoxelMorph relies on convolutional structures and therefore struggles to capture long-range dependencies in high-resolution volumetric data. Although Hire-MLP introduces MLP-based modeling, its single-scale design limits its ability to characterize multi-scale deformations. In contrast, RAM-MLP incorporates a relation-aware multi-scale modeling mechanism, enabling it to capture complex nonlinear motion across spatial levels and achieve a better balance between global consistency and local detail restoration.
Effect of the correlation modeling module. To evaluate the contribution of correlation modeling, we compared the performance of RAM-MLP with and without the correlation computation module. As shown in Table 3, introducing correlation modeling consistently achieves improvements on all metrics on both the Cardiac and 4D-Lung datasets. This demonstrates that local correlation information is beneficial for enhancing cross-frame motion representation, thereby improving the accuracy of spatiotemporal transformation estimation and reducing interpolation errors.
Effect of the multi-branch design. We further analyzed the influence of different branch combinations in RAM-MLP. As shown in Table 4, removing any one of the scale branches ( $3 \times 3 \times 3$ , $5 \times 5 \times 5$ , or $7 \times 7 \times 7$ ) leads to performance degradation on both datasets. The best results are achieved only when all three branches are retained, indicating strong complementarity among multi-scale representations. Specifically, the $3 \times 3 \times 3$ and $5 \times 5 \times 5$ branches are beneficial for capturing fine-grained and medium-scale deformations, while the $7 \times 7 \times 7$ branch is more effective for modeling large displacements. This design enables RAM-MLP to achieve a favorable balance between local detail recovery and global motion consistency.

5. Discussion

This paper presents RA-BiMENet, a relation-aware bi-directional motion estimation framework for continuous-time 4D medical image interpolation. Existing 4D medical image interpolation methods are mainly based on convolutional or deformation-driven architectures. Although effective, these methods often rely on limited receptive fields or coarse multi-scale matching, making it difficult to fully capture complex nonlinear motion and fine-grained spatiotemporal dependencies in dynamic medical images. To address these limitations, RA-BiMENet includes a spatiotemporal transform MLP (TS-MLP) module for bi-directional motion estimation and a hierarchical spatiotemporal fusion (HSTF) module for intermediate-frame reconstruction. TS-MLP adopts a pyramid-recursive architecture, in which relation-aware multi-scale MLP (RAM-MLP) units are used to progressively estimate and refine spatiotemporal transformation fields. By explicitly modeling local correlations and multi-scale dependencies, TS-MLP is able to better capture both global structural context and local motion details. Meanwhile, HSTF hierarchically integrates cross-temporal features through forward warping, cross-level aggregation, and self-attention, leading to improved detail restoration and temporal consistency.

Nevertheless, several limitations remain. The current framework was evaluated on only two datasets, and its generalization to more imaging modalities and anatomical regions still requires further validation. In addition, the multi-scale relation modeling and hierarchical fusion strategy increase the computational complexity of the network. Future work will focus on improving computational efficiency and incorporating stronger anatomical or physiological priors to further enhance interpolation accuracy and robustness.

6. Conclusions

In this paper, we proposed RA-BiMENet, a relation-aware bi-directional motion estimation network for continuous-time 4D medical image interpolation. The proposed framework consists of a spatiotemporal transform MLP (TS-MLP) module for bi-directional motion estimation and a hierarchical spatiotemporal fusion (HSTF) module for intermediate-frame reconstruction. Within TS-MLP, the RAM-MLP unit effectively captures complex nonlinear motion through local correlation modeling and multi-scale dependency learning, while HSTF enhances local detail restoration and preserves global temporal consistency via forward warping and self-attention.

Extensive experiments on multiple datasets demonstrate that RA-BiMENet generates high-fidelity and temporally coherent interpolated frames under complex deformation and large-motion scenarios, consistently outperforming existing state-of-the-art methods. These results indicate that RA-BiMENet provides an effective solution for 4D medical image interpolation and offers a promising framework for future studies on continuous-time medical image synthesis.

Author Contributions

Conceptualization, J.L. and L.L.; methodology, L.L.; validation, L.L.; formal analysis, L.L.; writing—original draft, L.L.; writing—review and editing, J.L.; supervision, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (grant 62371413), the Yantai Basic Research Key Project (2023JCYJ041) and the Youth Innovation Science and Technology Support Program of Shandong Province (grant 2023KJ239).

Data Availability Statement

The datasets used in this work are publicly accessible. The ACDC dataset is available at https://humanheart-project.creatis.insa-lyon.fr/database/#collection/637218c173e9f0047faa00fb (accessed on 20 July 2024), and the 4D-Lung dataset can be accessed at https://www.cancerimagingarchive.net/collection/4d-lung/, (accessed on 15 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Pan, T.; Lee, T.Y.; Rietzel, E.; Chen, G.T. 4D-CT imaging of a volume influenced by respiratory motion on multi-slice CT. Med. Phys. 2004, 31, 333–340. [Google Scholar] [CrossRef] [PubMed]
Sun, X.; Cheng, L.H.; Plein, S.; Garg, P.; van der Geest, R.J. Deep learning based automated left ventricle segmentation and flow quantification in 4D flow cardiac MRI. J. Cardiovasc. Magn. Reson. 2024, 26, 100003. [Google Scholar] [CrossRef] [PubMed]
Zhang, W.; Brady, J.M.; Becher, H.; Noble, J.A. Spatio-temporal (2D+ T) non-rigid registration of real-time 3D echocardiography and cardiovascular MR image sequences. Phys. Med. Biol. 2011, 56, 1341–1360. [Google Scholar] [CrossRef] [PubMed]
Canè, F.; Verhegghe, B.; De Beule, M.; Bertrand, P.B.; Van der Geest, R.J.; Segers, P.; De Santis, G. From 4D medical images (CT, MRI, and ultrasound) to 4D structured mesh models of the left ventricular endocardium for patient-specific simulations. BioMed Res. Int. 2018, 2018, 7030718. [Google Scholar] [CrossRef] [PubMed]
Mory, C.; Janssens, G.; Rit, S. Motion-aware temporal regularization for improved 4D cone-beam computed tomography. Phys. Med. Biol. 2016, 61, 6856–6877. [Google Scholar] [CrossRef] [PubMed]
Zhang, Y.; Huang, X.; Wang, J. Advanced 4-dimensional cone-beam computed tomography reconstruction by combining motion estimation, motion-compensated reconstruction, biomechanical modeling and deep learning. Vis. Comput. Ind. Biomed. Art 2019, 2, 23. [Google Scholar] [CrossRef] [PubMed]
Küstner, T.; Pan, J.; Gilliam, C.; Qi, H.; Cruz, G.; Hammernik, K.; Yang, B.; Blu, T.; Rueckert, D.; Botnar, R.; et al. Deep-learning based motion-corrected image reconstruction in 4D magnetic resonance imaging of the body trunk. In Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC); IEEE: New York, NY, USA, 2020; pp. 976–985. [Google Scholar]
Tripathi, V.R.; Tibdewal, M.N.; Mishra, R. A survey on motion artifact correction in magnetic resonance imaging for improved diagnostics. SN Comput. Sci. 2024, 5, 281. [Google Scholar] [CrossRef]
Werlberger, M.; Pock, T.; Unger, M.; Bischof, H. Optical flow guided TV-L1 video interpolation and restoration. In Proceedings of the International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2011; pp. 273–286. [Google Scholar]
Jeong, S.G.; Lee, C.; Kim, C.S. Motion-compensated frame interpolation based on multihypothesis motion estimation and texture optimization. IEEE Trans. Image Process. 2013, 22, 4497–4509. [Google Scholar] [CrossRef] [PubMed]
Guo, Y.; Bi, L.; Ahn, E.; Feng, D.; Wang, Q.; Kim, J. A spatiotemporal volumetric interpolation network for 4d dynamic medical image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 4726–4735. [Google Scholar]
Chen, J.; Frey, E.C.; He, Y.; Segars, W.P.; Li, Y.; Du, Y. Transmorph: Transformer for unsupervised medical image registration. Med. Image Anal. 2022, 82, 102615. [Google Scholar] [CrossRef] [PubMed]
Liu, R.; Li, Y.; Tao, L.; Liang, D.; Zheng, H.T. Are we ready for a new paradigm shift? A survey on visual deep mlp. Patterns 2022, 3, 100520. [Google Scholar] [CrossRef] [PubMed]
Wolterink, J.M.; Zwienenberg, J.C.; Brune, C. Implicit neural representations for deformable image registration. In Proceedings of the International Conference on Medical Imaging with Deep Learning; PMLR: New York, NY, USA, 2022; pp. 1349–1359. [Google Scholar]
Niklaus, S.; Mai, L.; Liu, F. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 261–270. [Google Scholar]
Gui, S.; Wang, C.; Chen, Q.; Tao, D. Featureflow: Robust video interpolation via structure-to-texture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 14004–14013. [Google Scholar]
Liu, Z.; Yeh, R.A.; Tang, X.; Liu, Y.; Agarwala, A. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 4463–4471. [Google Scholar]
Jin, X.; Wu, L.; Shen, G.; Chen, Y.; Chen, J.; Koo, J.; Hahm, C.h. Enhanced bi-directional motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: New York, NY, USA, 2023; pp. 5049–5057. [Google Scholar]
Hu, M.; Jiang, K.; Zhong, Z.; Wang, Z.; Zheng, Y. Iq-vfi: Implicit quadratic motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 6410–6419. [Google Scholar]
Balakrishnan, G.; Zhao, A.; Sabuncu, M.R.; Guttag, J.; Dalca, A.V. Voxelmorph: A learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 2019, 38, 1788–1800. [Google Scholar] [CrossRef] [PubMed]
Kim, B.; Ye, J.C. Diffusion deformable model for 4D temporal medical image generation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2022; pp. 539–548. [Google Scholar]
Wei, T.T.; Kuo, C.; Tseng, Y.C.; Chen, J.J. Mpvf: 4d medical image inpainting by multi-pyramid voxel flows. IEEE J. Biomed. Health Inform. 2023, 27, 5872–5882. [Google Scholar] [CrossRef] [PubMed]
Kim, J.; Yoon, H.; Park, G.; Kim, K.; Yang, E. Data-efficient unsupervised interpolation without any intermediate frame for 4D medical images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 11353–11364. [Google Scholar]
Kang, J.; Jung, E.C.; Koo, H.J.; Yang, D.H.; Ha, H. Flow-Rate-Constrained Physics-Informed Neural Networks for Flow Field Error Correction in Four-Dimensional Flow Magnetic Resonance Imaging. IEEE Trans. Med. Imaging 2025, 44, 5155–5171. [Google Scholar] [CrossRef] [PubMed]
Zhang, Y.; Van Rozendaal, T.; Brehmer, J.; Nagel, M.; Cohen, T. Implicit neural video compression. arXiv 2021, arXiv:2112.11312. [Google Scholar] [CrossRef]
Guo, Z.; Li, W.; Loy, C.C. Generalizable implicit motion modeling for video frame interpolation. Adv. Neural Inf. Process. Syst. 2024, 37, 63747–63770. [Google Scholar] [CrossRef]
Wang, X.; Wang, S.; Xiong, H.; Xuan, K.; Zhuang, Z.; Liu, M.; Shen, Z.; Zhao, X.; Zhang, L.; Wang, Q. Spatial attention-based implicit neural representation for arbitrary reduction of MRI slice spacing. Med. Image Anal. 2024, 94, 103158. [Google Scholar] [CrossRef] [PubMed]
Hill, J.; Khokher, M.R.; Nguyen, C.; Adcock, M.; Li, R.; Anderson, S.; Morrell, T.; Diprose, T.; Salvado, O.; Wang, D.; et al. Automated instance segmentation and registration of spinal vertebrae from CT-Scans with an improved 3D U-net neural network and corner point registration. Comput. Biol. Med. 2025, 196, 110663. [Google Scholar] [CrossRef] [PubMed]
Ying, J.; Cattell, R.; Zhao, T.; Lei, L.; Jiang, Z.; Hussain, S.M.; Gao, Y.; Chow, H.H.S.; Stopeck, A.T.; Thompson, P.A.; et al. Two fully automated data-driven 3D whole-breast segmentation strategies in MRI for MR-based breast density using image registration and U-Net with a focus on reproducibility. Vis. Comput. Ind. Biomed. Art 2022, 5, 25. [Google Scholar] [CrossRef] [PubMed]
Xie, Y.; Zhang, J.; Shen, C.; Xia, Y. Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2021; pp. 171–180. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.A.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M.A.G.; et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525. [Google Scholar] [CrossRef]
Hugo, G.D.; Weiss, E.; Sleeman, W.C.; Balik, S.; Keall, P.J.; Lu, J.; Williamson, J.F. Data from 4D Lung Imaging of NSCLC Patients. 2016. Available online: https://cir.nii.ac.jp/crid/1880302167830331392 (accessed on 1 January 2016).
Dosselmann, R.; Yang, X.D. Existing and emerging image quality metrics. In Proceedings of the Canadian Conference on Electrical and Computer Engineering; IEEE: New York, NY, USA, 2005; pp. 1906–1913. [Google Scholar]
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 586–595. [Google Scholar]
Zhang, Q.; Lei, Y.; Zheng, Z.; Chen, Z.; Xie, Z. Test time training for 4D medical image interpolation. In Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2025; pp. 1–8. [Google Scholar]
Guo, J.; Tang, Y.; Han, K.; Chen, X.; Wu, H.; Xu, C.; Xu, C.; Wang, Y. Hire-mlp: Vision mlp via hierarchical rearrangement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 826–836. [Google Scholar]

Figure 1. Overall architecture of the proposed RA-BiMENet, including (a) the overall pipeline, (b) the spatiotemporal transform MLP (TS-MLP) module for bi-directional motion estimation, and (c) the hierarchical spatiotemporal fusion (HSTF) module for intermediate-frame reconstruction.

Figure 2. Detailed architecture of proposed RAM-MLP module for relation-aware multi-scale motion modeling, where local correlations are exploited together with parallel WinMLP branches and channel attention to capture complex nonlinear motion.

Figure 3. Visualization of interpolated cardiac MRI frames at arbitrary time points.

Figure 4. Qualitative comparison of different interpolation methods on the ACDC dataset.

Figure 5. Visualization of lung CT interpolation results at different time points.

Figure 6. Qualitative comparison of different interpolation methods on 4D-Lung dataset.

Table 1. Quantitative comparison of different interpolation methods on ACDC (cardiac) and 4D-Lung datasets (↑ indicates higher is better; ↓ indicates lower is better; bold denotes best result; underlined values denote second-best result).

Method	Param (M)	Cardiac					4D-Lung
Method	Param (M)	PSNR ↑	NCC ↑	SSIM ↑	NMSE ↓	LPIPS ↓	PSNR ↑	NCC ↑	SSIM ↑	NMSE ↓	LPIPS ↓
SVIN	8.4	32.51	0.559	0.972	2.930	1.535	30.99	0.312	0.973	0.852	2.182
MPVF	26.45	33.15	0.562	0.971	2.847	1.458	31.18	0.310	0.972	0.761	2.554
VM	0.3	31.02	0.555	0.966	4.254	1.772	32.29	0.316	0.977	0.641	2.063
DDM	10.92	29.71	0.541	0.956	5.007	2.136	30.37	0.308	0.971	0.905	2.283
UVI-Net	0.51	33.52	0.565	0.977	2.433	1.134	34.00	0.320	0.980	0.552	1.489
TTT4MII	14.29	33.55	0.565	0.977	2.414	1.129	34.02	0.320	0.981	0.495	1.414
Ours	7.52	33.85	0.566	0.978	2.257	1.198	35.42	0.328	0.986	0.359	1.185

Table 2. Results of ablation study of architecture design.

Dataset	TS-MLP	PSNR ↑	NCC ↑	SSIM ↑	NMSE ↓	LPIPS ↓
Cardiac	VoxelMorph	33.17	0.562	0.971	2.847	1.458
	Hire-MLP	33.43	0.564	0.974	2.537	1.322
	RAM-MLP	33.85	0.566	0.978	2.257	1.198
4D-Lung	VoxelMorph	33.48	0.318	0.979	0.612	1.754
	Hire-MLP	34.35	0.322	0.983	0.472	1.334
	RAM-MLP	35.42	0.328	0.986	0.359	1.185

Table 3. Ablation study of effect of correlation modeling in RAM-MLP module.

Dataset	Correlation	PSNR ↑	NCC ↑	SSIM ↑	NMSE ↓	LPIPS ↓
Cardiac		33.63	0.565	0.977	2.395	1.276
Cardiac	✓	33.85	0.566	0.978	2.257	1.198
4D-Lung		35.28	0.326	0.985	0.397	1.226
4D-Lung	✓	35.42	0.328	0.986	0.359	1.185

Table 4. Impact of different RAM-MLP branches on interpolation performance.

Dataset	RAM-MLP Branch			PSNR ↑	NCC ↑	SSIM ↑	NMSE ↓	LPIPS ↓
Dataset	$3 \times 3 \times 3$	$5 \times 5 \times 5$	$7 \times 7 \times 7$	PSNR ↑	NCC ↑	SSIM ↑	NMSE ↓	LPIPS ↓
Cardiac	✓	✓		33.57	0.563	0.975	2.445	1.273
	✓		✓	33.49	0.562	0.974	2.458	1.322
		✓	✓	33.53	0.563	0.975	2.451	1.288
	✓	✓	✓	33.85	0.566	0.978	2.257	1.198
4D-Lung	✓	✓		35.38	0.327	0.985	0.394	1.285
	✓		✓	35.34	0.326	0.984	0.435	1.319
		✓	✓	35.36	0.327	0.985	0.409	1.296
	✓	✓	✓	35.42	0.328	0.986	0.359	1.185

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, L.; Lyu, J. RA-BiMENet: Continuous-Time 4D Medical Image Interpolation via Relation-Aware Bi-Directional Motion Estimation. Sensors 2026, 26, 3034. https://doi.org/10.3390/s26103034

AMA Style

Li L, Lyu J. RA-BiMENet: Continuous-Time 4D Medical Image Interpolation via Relation-Aware Bi-Directional Motion Estimation. Sensors. 2026; 26(10):3034. https://doi.org/10.3390/s26103034

Chicago/Turabian Style

Li, Liangjiang, and Jun Lyu. 2026. "RA-BiMENet: Continuous-Time 4D Medical Image Interpolation via Relation-Aware Bi-Directional Motion Estimation" Sensors 26, no. 10: 3034. https://doi.org/10.3390/s26103034

APA Style

Li, L., & Lyu, J. (2026). RA-BiMENet: Continuous-Time 4D Medical Image Interpolation via Relation-Aware Bi-Directional Motion Estimation. Sensors, 26(10), 3034. https://doi.org/10.3390/s26103034

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

RA-BiMENet: Continuous-Time 4D Medical Image Interpolation via Relation-Aware Bi-Directional Motion Estimation

Abstract

1. Introduction

2. Related Works

2.1. Video Frame Interpolation

2.2. Four-Dimensional Medical Image Interpolation

2.3. Multilayer Perceptron (MLP)

2.4. Multi-Scale Feature Fusion

3. Methodology

3.1. Overview

3.2. Spatiotemporal Transform MLP (TS-MLP)

3.3. Temporal Positional Encoding

3.4. Hierarchical Spatiotemporal Fusion (HSTF)

3.5. Loss Function

4. Experiments

4.1. Dataset

4.2. Experimental Setup

4.3. Comparison with Other Methods

4.4. Ablation Study

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI