In this section, we conduct extensive experiments on three benchmark datasets, KTH, UCF-101, and Motion-X, to evaluate the performance of our proposed model. We first introduce the experimental setup and the evaluation metrics used, including MSE, MAE, the Structural Similarity Index Measure (SSIM) [44], Peak Signal-to-Noise Ratio (PSNR) [44,45], and Learned Perceptual Image Patch Similarity (LPIPS) [46]. Then, we compare our model with several existing approaches ordered by publication year, including ConvLSTM, MMVP, PhyDNet, PredRNN-v2, and SimVP-v2. Finally, we present ablation studies to demonstrate the effectiveness of our method.
4.1. Data Collection
We employ three distinct datasets (KTH, Motion-X, and UCF-101) to evaluate our method and compare it with existing models. Detailed descriptions of these benchmarks are provided below.
1. KTH Action Dataset: The KTH dataset [47], collected at the KTH Royal Institute of Technology, serves as a classical benchmark for human action recognition. A total of 25 subjects performed six fundamental actions (walking, jogging, running, boxing, hand waving, and hand clapping) across four distinct recording scenarios: indoor, outdoor, coat-wearing, and varying lighting. The dataset consists of 2391 video pairs in .avi format, with a resolution of 160 × 120 pixels, a frame rate of 30 FPS, and an average duration of 4.0 s. Owing to its minimal background clutter and external interference, KTH is particularly effective for evaluating discriminative motion features, as demonstrated in early action recognition studies.
2. Motion-X Dataset: The Motion-X dataset [48] is a large-scale benchmark designed for 3D human motion modeling, distinguished by its diversity of subjects and scenarios. It contains 81,100 video pairs (.avi format) with frame-level annotations, recorded at a resolution of 320 × 240 pixels and 30 FPS, with an average duration of 6.4 s. In this study, we exclusively utilize the RGB video component, intentionally excluding skeleton and pose annotations, to assess model robustness under multi-subject and multi-scene conditions. Given that Motion-X includes audio data similar to UCF-101, our framework incorporates an audio–visual fusion mechanism. To prepare the data, the audio tracks (typically sampled at 44.1 kHz or 48 kHz) are converted to mono and segmented to align with the duration of each video frame. We then employ Mel-spectrograms to extract acoustic features, generating a 40-dimensional audio vector for each frame. This approach leverages the complementarity of audio and visual modalities to capture complex spatiotemporal dynamics, thereby refining prediction accuracy and stability.
3. UCF-101: The UCF-101 dataset [49] comprises 101 action categories recorded in unconstrained real-world environments. It covers diverse scenarios, including human–object interactions, sports, musical performances, and interpersonal activities. Unlike KTH, UCF-101 introduces significant challenges such as background clutter, illumination variations, and motion blur. Consequently, it serves as a standard benchmark for evaluating generalization. The dataset contains 13,320 video pairs (.avi format) with an average duration of 7.21 s and a frame rate of 25 FPS. It shares the same native resolution (320 × 240) as Motion-X. For the accompanying audio, we preserve the original sampling rates of 44.1 kHz or 48.0 kHz. During preprocessing, audio tracks are converted to mono and segmented to align with each video frame. We then employ Mel-spectrogram analysis to extract acoustic features. This process yields a 40-dimensional audio vector for every frame, enabling the model to leverage audio–visual correlations for enhanced representation.
Regarding data preprocessing, we applied different strategies based on the native characteristics of each dataset. For KTH and Motion-X, all video frames were resized to a uniform resolution to ensure consistent feature extraction across varying aspect ratios and to optimize computational efficiency. In contrast, for UCF-101, we retained the original 320 × 240 pixel resolution to preserve the fine-grained details necessary for recognizing complex real-world actions. A comprehensive comparison of the features and parameters of all three datasets is presented in Table 2.
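Batched frame resizing of this kind reduces to a single interpolation call; the helper below is our own sketch (assuming PyTorch; the 128 × 128 target in the usage example is an arbitrary placeholder, since the paper's exact resized resolution is not reproduced here):

```python
import torch
import torch.nn.functional as F

def resize_frames(frames: torch.Tensor, size: tuple) -> torch.Tensor:
    """Resize a clip of frames.

    frames: (T, C, H, W) float tensor; size: (H_out, W_out).
    Bilinear interpolation treats the T axis as the batch dimension.
    """
    return F.interpolate(frames, size=size, mode="bilinear", align_corners=False)
```

For example, `resize_frames(clip, (128, 128))` maps a KTH clip of shape `(T, 1, 120, 160)` to `(T, 1, 128, 128)`.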
4.3. Quantitative Analysis
We conducted a comprehensive evaluation of the improved SimVP-v2 model on three widely used datasets: UCF-101, Motion-X, and KTH. We use the following performance metrics:
MSE and MAE: These metrics quantify the differences between the model’s predicted values and the ground-truth data, thereby serving as indicators of the prediction accuracy.
PSNR: PSNR is an objective measure of the visual quality of reconstructed frames. Higher PSNR values indicate better fidelity, with values below 20 typically reflecting poor quality, values between 20 and 40 representing moderate quality, and values above 40 approximating near-original quality.
SSIM: SSIM evaluates similarities between images in terms of luminance, contrast, and structure based on the human visual system. Unlike PSNR, SSIM provides a perceptually more consistent measure of image quality, where values closer to 1 indicate greater structural similarity.
LPIPS: LPIPS measures perceptual similarity between images by extracting features through pretrained deep neural networks and comparing them in feature space, thereby offering a fine-grained evaluation of perceptual quality.
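The pixel-level metrics above are straightforward to compute directly; the snippet below is a minimal NumPy sketch of MSE, MAE, RMSE, and PSNR (SSIM and LPIPS require dedicated packages such as scikit-image and lpips and are omitted here):

```python
import numpy as np

def mse(pred, gt):
    """Mean squared error between prediction and ground truth."""
    return float(np.mean((pred - gt) ** 2))

def mae(pred, gt):
    """Mean absolute error."""
    return float(np.mean(np.abs(pred - gt)))

def rmse(pred, gt):
    """Root mean squared error, stricter on large deviations."""
    return float(np.sqrt(mse(pred, gt)))

def psnr(pred, gt, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means better fidelity."""
    m = mse(pred, gt)
    if m == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / m))
```

For 8-bit frames, a unit pixel error everywhere gives MSE = 1 and hence PSNR = 10·log10(255²) ≈ 48.13 dB, which anchors the quality bands quoted above.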
The evaluation metrics of the improved model on each dataset are presented in Table 3, Table 4, and Table 5. For the KTH dataset, the evaluation results are shown in Table 3 and Figure 4.
As shown in Table 3, we introduce the Root Mean Square Error (RMSE) metric to provide a stricter assessment of large prediction errors. Our proposed A3DSimVP consistently outperforms baseline methods across these key metrics. Compared to the strongest competitor, SimVP-v2, our method reduces MSE by 2.66% and MAE by 7.66%, while the RMSE decreases from 6.71 to 6.62. This drop in pixel-level error validates that explicit 3D kernels capture motion dynamics more precisely than implicit recurrent updates.
Beyond error minimization, the advantages of our model are evident in perceptual quality. The method achieves a 13.11% reduction in LPIPS compared to SimVP-v2, delivering sharper reconstructions. The combined gains in PSNR (27.35 dB) and LPIPS confirm that A3DSimVP effectively sustains structural similarity while improving the overall sharpness of the generated frames.
We further investigate the impact of temporal depth by varying the temporal kernel depth from 2 to 4. Increasing the depth yields consistent improvements in predictive accuracy: MSE decreases from 44.69 (depth 2) to 43.82 (depth 4), while PSNR rises from 27.27 dB to 27.35 dB. This trend indicates that a larger temporal depth enhances the model's capacity to represent complex temporal evolution. Although LPIPS shows minor fluctuations, the depth-4 configuration achieves the lowest prediction errors and the highest PSNR, offering the optimal balance between pixel fidelity and structural coherence.
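The explicit temporal modeling discussed above can be sketched as a depthwise 3D convolution, in which each channel is convolved independently across time and space. This is our own minimal PyTorch illustration (the class name is an assumption, and an odd temporal depth of 3 is used in the example so that "same" padding preserves the sequence length), not the authors' implementation:

```python
import torch
import torch.nn as nn

class DepthwiseTemporalConv3d(nn.Module):
    """Depthwise 3D convolution over (T, H, W).

    groups=channels makes the kernel depthwise, keeping the parameter
    count and FLOPs low while still sampling jointly across space and time.
    """
    def __init__(self, channels: int, temporal_depth: int = 3):
        super().__init__()
        self.conv = nn.Conv3d(
            channels, channels,
            kernel_size=(temporal_depth, 3, 3),
            padding=(temporal_depth // 2, 1, 1),
            groups=channels,
            bias=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -> same shape for odd temporal_depth
        return self.conv(x)
```

Because the kernel spans several adjacent frames, inter-frame displacement is encoded directly in the filter weights rather than through recurrent state updates.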
The visualization of the KTH dataset is shown in Figure 4.
In Figure 4, T denotes the video time step. The visualization compares the ground truth (Target) against predictions from various baseline models and our proposed A3DSimVP. As illustrated, our method demonstrates higher accuracy by preserving sharp edges and consistent temporal motion, in contrast to the blurrier baseline predictions. This superiority is particularly evident in the later prediction stages. While the SimVP-v2 prediction suffers from severe motion blur that renders the subject's legs and body contours almost indistinguishable, A3DSimVP maintains a clear definition of the subject's posture throughout the sequence.
Visual comparisons reinforce these findings, utilizing T = 1 to T = 10 frames as input and predicting the subsequent T = 11 to T = 30 frames, where T represents the video time step. Competing approaches, including PredRNN-v2, MMVP, ConvLSTM, and baseline SimVP-v2, frequently suffer from blurriness, detail loss, and poor temporal coherence when applied to dynamic scenes. In contrast, our method captures motion details more precisely and maintains clearer background separation, producing predictions that are both sharper and more temporally consistent.
Quantitatively, our best-performing model, Ours (depth 4), achieved a state-of-the-art PSNR of 27.352, outperforming the best competing baseline, SimVP-v2 (PSNR = 27.041), by approximately 1.15%. Furthermore, in terms of perceptual quality, the Ours (depth 2) model reduced LPIPS to 0.2136 from SimVP-v2's 0.2524, a 15.37% improvement in perceived image fidelity. By combining numerical superiority with enhanced perceptual quality, the proposed model demonstrates robust generalization and practicality, offering a strong benchmark for future research in real-time video prediction.
For the Motion-X dataset, the evaluation results are shown in Table 4.
Table 4 details the quantitative evaluation on the Motion-X dataset. To rigorously assess error distribution, we incorporate the RMSE metric. While the MSE (351.07) and RMSE (18.74) are comparable to those of the competitive SimVP-v2, our method achieves a critical 7.38% reduction in Mean Absolute Error (MAE), dropping from 1823.62 to 1688.96. This improvement is driven by the explicit 3D spatiotemporal modeling, which enables convolutional kernels to jointly sample across space and time, directly encoding inter-frame displacement and motion trajectories. Consequently, A3DSimVP significantly outperforms recurrent architectures such as PredRNN-v2, reducing MAE by 34.24% and offering superior stability.
In terms of perceptual fidelity, the model sets a new standard with a 12.92% reduction in LPIPS (0.1011 vs. 0.1161) and a peak SSIM of 0.8900. This gain is largely attributed to the audio modality fusion, which introduces complementary rhythmic and event-based cues via Channel Attention and lightweight FiLM gating strategies. By enhancing prediction stability at action initiation points, these multimodal signals contribute to the additional reduction in LPIPS. Overall, the Motion-X results provide strong evidence for the ability of our proposed method to simultaneously achieve high prediction accuracy and superior perceptual quality in complex dynamic sequences. Detailed ablation studies analyzing these components are presented in Section 4.4.
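The Channel Attention component mentioned above can be illustrated with a standard squeeze-and-excitation block; the sketch below is our own generic PyTorch version (the reduction ratio of 8 is an assumed hyperparameter), not the paper's exact module:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention for reweighting fused audio-visual features.

    Global average pooling squeezes each channel to a scalar; a small MLP
    then produces per-channel gates in (0, 1) that rescale the feature map.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights
```

Applied after concatenating audio-derived channels with the visual features, such gating lets the network emphasize whichever modality is informative at a given moment.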
For the UCF-101 dataset, the evaluation results are shown in Table 5.
As detailed in Table 5, the A3DSimVP model demonstrates a decisive performance advantage on the UCF-101 dataset. To strictly assess error distribution, we introduce the RMSE metric. Quantitatively, our proposed architecture achieves a substantial reduction in prediction error compared to the SimVP-v2 baseline. Specifically, the MSE is reduced by 9.65% (from 1253.55 to 1132.53), and the RMSE drops from 35.41 to 33.65. Concurrently, the MAE decreases by 11.91% (from 7769.91 to 6844.49). These significant drops in error metrics indicate that the integration of explicit 3D spatiotemporal modeling effectively mitigates the cumulative deviations often observed in rapid motion sequences. By capturing long-range temporal dependencies more accurately than the 2D-based baseline, our model ensures superior pixel-level alignment and stability across extended prediction horizons.
Beyond minimizing pixel-level errors, A3DSimVP excels in maintaining structural integrity and perceptual fidelity. The model attains a PSNR of 26.91 dB and an SSIM of 0.8453. Most critically, the LPIPS score is reduced to 0.1324, a 6.6% improvement over the SimVP-v2 baseline (0.1417). The superior LPIPS performance suggests that the auxiliary audio cues effectively guide the generation process, reducing blurring artifacts and producing perceptually sharper, more realistic video content even in complex, dynamic scenes.
The results of the qualitative assessment corroborate the strong performance indicated by the quantitative metrics. As illustrated in Figure 5, we provide a detailed visual comparison of video prediction outputs from our model alongside various competing models on the UCF-101 dataset. The prediction sequences progress across time steps T = 1, 2, 3, 4, and 5, where T denotes the index of the predicted future frame. The successful maintenance of detail and motion coherence, particularly at the later time step T = 5, serves as a crucial test of the model's ability to sustain predictive quality over challenging long-term motion dependencies and to resist the rapid accumulation of prediction error. Our visual outputs confirm that the explicit 3D modeling translates into clearer boundaries, fewer ghosting artifacts, and superior preservation of fine texture details over extended temporal horizons. Detailed ablation experiments on this dataset are discussed in Section 4.4.
As illustrated in Figure 5, we present a qualitative comparison between A3DSimVP and baseline models, including ConvLSTM and MMVP. The visualization demonstrates that our method achieves superior visual fidelity across the prediction horizon. Specifically, while competing models frequently suffer from significant motion blur and loss of fine texture, A3DSimVP effectively minimizes blurring artifacts and preserves sharp structural details. This qualitative evidence corroborates the quantitative results, highlighting the model's capability to maintain clarity in complex motion scenarios.
In the predicted examples, most comparative methods exhibit clear shortcomings in prediction quality. For instance, the predictions generated by the MMVP and ConvLSTM models are generally plagued by severe defocus blurring and loss of high-frequency details; the contours of moving objects, such as the hands and tools, become noticeably blurred after T = 5. The PhyDNet model often introduces pronounced artifacts and structural distortions. Even our base model, SimVP-v2, an efficient all-convolutional method designed to decouple spatial and temporal dependencies, shows clear degradation in image quality during long-term prediction (T = 3 to T = 5) and struggles to sustain stable sharpness.
In terms of visual perception, our method produces noticeably superior predictions, offering the following advantages:
Exceptional Detail Preservation: Our method precisely retains fine textures and sharp edges that other models fail to maintain. At the most challenging time step, T = 9, the prediction images from other models are often too blurred to discern the geometric shape of the object or subtle background textures, such as leg contours or the edges of the tool. Our model, however, consistently provides a clear rendering of the tool’s geometric structure, minute floor texture details, and crisp object boundaries.
Superior Motion Coherence: Our predicted frames exhibit exceptional temporal coherence, effectively eliminating the flickering and structural inconsistencies commonly observed in competitor sequences, thereby producing a visually smoother and more realistic video flow.
Prediction Stability: As the time step T increases, the drop in clarity for our model’s predictions is significantly less severe than that of any baseline model, confirming the powerful capability of our approach in suppressing error accumulation and maintaining stability during long-range forecasting.
In summary, the performance of our method in this qualitative assessment, characterized by superior sharpness, more accurate texture details, and stable long-term prediction, is consistent with the quantitative improvements, confirming the efficacy of our introduced modules and strategies in complex video prediction tasks.
4.4. Ablation Studies and Component Analysis
To systematically quantify the distinct contributions of the proposed 3D temporal modeling module and the audio modality fusion strategy, we conducted comprehensive ablation studies. The experimental design evaluates three critical configurations: the standard SimVP-v2 baseline, a variant integrating only the 3D depthwise convolution (Only 3D), and the complete dual-modality model (Ours).
We first present the comparative results on the UCF-101 dataset in Table 6. The introduction of the 3D module alone reduces MSE from 1253.55 to 1248.19 compared to the baseline, validating the role of explicit 3D temporal modeling in capturing inter-frame dependencies. Furthermore, the full model ('Ours') achieves the most robust performance, with the lowest MSE of 1243.87 and an optimal LPIPS score of 0.1340. The improvement in LPIPS indicates that the audio signal acts as a strong auxiliary cue, boosting prediction accuracy and enhancing perceptual realism.
Regarding the KTH dataset, it is important to note that it lacks audio modalities. Consequently, experiments on KTH were strictly limited to verifying the structural enhancements of the 3D module (as detailed in the main results). This distinction explains why the multimodal ablation analysis presented here for UCF101 and Motion-X cannot be replicated for KTH.
To further validate our architecture on a large-scale human motion domain, we extended our ablation analysis to the Motion-X dataset. As shown in Table 7, the 'Only 3D' variant demonstrates a strong capability in enhancing structural fidelity, reducing MAE by 6.09% (from 1823.62 to 1712.60) and LPIPS by 12.75% (from 0.1161 to 0.1013) relative to the baseline. Although this variant exhibits a slight fluctuation in MSE (359.38 versus 351.32), the higher SSIM (0.8885) indicates that explicit 3D modeling prioritizes structural coherence. The 'Only Audio' variant provides marginal gains when used in isolation. However, the full model ('Ours') effectively integrates both strengths: it corrects the MSE regression observed in the 3D-only variant, achieving the lowest MSE of 351.07, while simultaneously attaining the best MAE (1688.96) and LPIPS (0.1011) scores. This confirms that the audio context serves as a stabilizing constraint, complementing the 3D spatiotemporal features to generate predictions that are both accurate and perceptually realistic.
Having established the efficacy of the components, we proceeded to determine the optimal method for integrating the audio–visual modalities. We evaluated three distinct fusion mechanisms on the Motion-X dataset: simple additive fusion (concatenation), Squeeze-and-Excitation (SE) fusion, and our proposed FiLM-based fusion. As shown in Table 8, the simple additive method yields the highest error rates (MSE: 631.64), suggesting that direct superimposition fails to capture complex non-linear correlations. The SE-Fusion strategy shows marginal improvement but lags in perceptual quality. In contrast, our FiLM strategy achieves a decisive performance leap, reducing the MSE to 351.07 and the MAE to 1688.96 while attaining the lowest LPIPS (0.1011). These results confirm that the affine transformation mechanism in FiLM allows the audio signal to dynamically modulate visual features, effectively suppressing ambiguity.
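FiLM's affine modulation can be sketched in a few lines. In this illustrative PyTorch module (our own sketch, not the paper's exact implementation; the 40-dimensional audio vector follows Section 4.1, while the residual-style "1 + gamma" scaling is an assumed design choice), the audio embedding predicts per-channel scale and shift parameters applied to the visual feature map:

```python
import torch
import torch.nn as nn

class AudioFiLM(nn.Module):
    """Feature-wise Linear Modulation of visual features by an audio vector.

    A linear layer maps the audio embedding to per-channel gamma (scale)
    and beta (shift); the visual map is modulated as (1 + gamma) * x + beta.
    """
    def __init__(self, audio_dim: int, visual_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(audio_dim, 2 * visual_channels)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W); audio: (B, audio_dim)
        gamma, beta = self.to_gamma_beta(audio).chunk(2, dim=1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return (1.0 + gamma) * visual + beta
```

Unlike additive fusion, this conditioning is multiplicative as well as additive, which is what lets the audio cue gate individual visual channels rather than merely being superimposed on them.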
Finally, acknowledging that efficiency is a core value of the SimVP architecture, we analyzed the computational cost. Although the introduction of 3D convolutions and the FiLM audio module increases the parameter count, the impact on inference latency is effectively managed through a lightweight design. Specifically, our model maintains a competitive inference speed of 20.86 FPS with a computational load of 138 GFLOPs. This strikes an optimal balance between high-fidelity prediction and resource efficiency, ensuring that the significant improvements in perceptual quality do not come at a prohibitive computational expense.
4.5. Computing Resource Consumption Analysis
To evaluate the deployment feasibility of A3DSimVP, we benchmarked model complexity and inference speed against the baseline SimVP-v2 and representative recurrent architectures (ConvLSTM, PredRNN-v2, and MMVP). While we conducted efficiency benchmarks across all datasets, we present the representative analysis on the large-scale Motion-X dataset. The task involved predicting 20 future frames from 10 input frames. Inference throughput was measured on a single NVIDIA GPU with a batch size of 1 and a standardized input resolution.
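Throughput of this kind is typically measured by timing repeated forward passes after a warm-up phase; the helper below is a minimal sketch of such a protocol (the function name and the warm-up and run counts are our assumptions, not details from the paper):

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, input_shape, n_warmup=5, n_runs=20, device="cpu"):
    """Average samples/s for batch-size-1 inference.

    Warm-up iterations exclude one-time costs (kernel compilation, caching);
    on CUDA, explicit synchronization ensures the timer sees finished work.
    """
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(n_warmup):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_runs / (time.perf_counter() - start)
```

Without the synchronization barriers, asynchronous GPU execution would make the measured latency appear artificially low.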
Table 9 presents the detailed resource consumption metrics. Compared to the SimVP-v2 baseline, A3DSimVP incurs a 15.0% increase in FLOPs (0.120T to 0.138T) and requires 44.8% more GPU memory (10,244 MiB to 14,830 MiB). This moderate increase in computational load is primarily due to the introduction of explicit 3D convolutions for spatiotemporal modeling.
However, this resource investment yields a substantial return in processing speed. Inference efficiency rises by 63.6%, from 12.75 samples/s to 20.86 samples/s. This result indicates that our architecture effectively exploits the parallel computing capabilities of GPUs, avoiding the serial bottlenecks common to recurrent units and unoptimized 2D implementations. Furthermore, compared to the complex recurrent model PredRNN-v2, our method requires substantially fewer FLOPs while delivering a clear speedup.
In summary, the quantitative results confirm that A3DSimVP presents a highly favorable trade-off between resource consumption and computational efficiency. By accepting a manageable increase in GPU memory and FLOPs relative to SimVP-v2, our method achieves significantly higher inference throughput. This balance supports the suitability of A3DSimVP for real-time video prediction tasks where low latency is critical.