1. Introduction
1.1. The Emergence of Embodied Intelligence in UAVs
Embodied unmanned aerial vehicles (UAVs) are transitioning from teleoperated platforms to intelligent agents that perceive, decide, and act autonomously within complex, three-dimensional environments. This paradigm shift from passive remote sensing to active, embodied perception is unlocking new frontiers across precision agriculture [
1], disaster response [
2], infrastructure inspection [
3], and environmental monitoring [
4,
5,
6]. Unlike ground robots, which operate on a 2D or 2.5D manifold, UAVs contend with full six-degree-of-freedom (6-DoF) dynamics, making their interaction with the world inherently more complex [
7]. The principles of Embodied AI, which emphasize the tight coupling between an agent’s body, its sensors, and its environment, are central to enabling this new generation of intelligent aerial systems [
8]. True autonomy in these dynamic, partially observable, and often GPS-challenged or GPS-denied settings cannot rely on reactive control alone. Onboard intelligence must build and maintain an internal model of the world to anticipate likely futures [
9]. In practice, this means
predicting future sensory observations to reduce decision latency, hedge against uncertainty, and enable safer, more reliable, and proactive behaviors.
1.2. Video Prediction: The Core of Proactive Autonomy
Video prediction—the task of forecasting future frames from a sequence of past observations—is central to this vision of predictive autonomy. It serves a dual purpose. Beyond its explicit goal of generation, it functions as a powerful mechanism for self-supervised representation learning [
10]. To minimize future prediction error, a model must implicitly encode the underlying regularities of the world: object permanence, scene geometry, appearance constancy, motion dynamics, and even rudimentary causal structures [
11,
12,
13]. For embodied UAVs, which constantly face the challenges of a moving camera, strong parallax effects, and the need to discern small, distant targets, such foresight provides tangible benefits: (i)
Anticipatory Planning: enabling navigation through cluttered spaces by predicting the outcomes of potential control actions in a model-based planning framework [
14,
15]. (ii)
Robust State Estimation: maintaining target tracks through temporary occlusions or sensor dropouts by “imagining” the missing frames [
16]. (iii)
Unsupervised Anomaly Detection: flagging unexpected deviations from normal dynamics—such as cracks in a bridge or unauthorized activity—as high-prediction-error events [
17,
18]. Consequently, prediction is not merely a generative task but a cognitive prerequisite for proactive agency in aerial remote sensing.
1.3. The Gap: Limitations of Existing Taxonomies in a Rapidly Evolving Field
The landscape of video prediction models is heterogeneous and evolving at a breakneck pace. While several excellent surveys exist, they often categorize models along a single axis, such as “RNN-based vs. CNN-based” or “deterministic vs. generative” [
19,
20,
21]. Furthermore, broader reviews on video generation often focus on creative applications rather than robotic control [
22]. Such taxonomies, while historically useful, struggle to properly position the latest hybrid architectures and, more critically, fail to provide a clear framework for understanding the role of emerging
temporal operators. In particular, state-space models (SSMs) defy simple classification: their inference process is recurrent, yet their training can be parallelized like a convolution, and they capture long-range dependencies akin to transformers [
23,
24]. Similarly, the proliferation of attention variants (e.g., sparse, low-rank, local) has blurred the lines, offering different inductive biases and scalability trade-offs compared to traditional convolutional recurrence [
25,
26]. For a practitioner aiming to deploy a model on a resource-constrained UAV, a taxonomy that explicitly disentangles the
backbone topology from the choice of
temporal operator is essential for navigating these design choices and their performance implications. This review aims to fill this gap.
1.4. Design Space for UAV Video Prediction: Constraints and Desiderata
Deploying predictive models on UAVs imposes a unique and stringent set of constraints that sharpen the design desiderata. These are not just theoretical preferences but hard engineering requirements [
27]: (i)
Throughput and Latency: Models must process frames and generate predictions at a rate that supports real-time decision-making (e.g., >15 FPS) at relevant resolutions (e.g., 512 × 512 to 1024 × 1024), all within a strict end-to-end latency budget. (ii)
Compute and Memory Footprint: Onboard processors (e.g., NVIDIA Jetson series, NVIDIA, Santa Clara, CA, USA) have limited peak memory (VRAM) and power budgets (typically 15–40 W), precluding the direct deployment of massive foundation models [
28]. (iii)
Robustness to Ego-Motion: Models must be robust to fast 6-DoF ego-motion, which induces large optical flow, motion blur, and rolling shutter artifacts. (iv)
Small-Object Sensitivity: In many remote sensing tasks, the regions of interest (e.g., a person, a vehicle, a crop disease spot) are very small, demanding high fidelity in specific parts of the frame. (v)
Long-Horizon Consistency: For meaningful planning, predictions must remain coherent and avoid catastrophic error accumulation over long time horizons (e.g., >1 s or >20 frames). These factors create a strong selective pressure for architectures and temporal operators that balance global dependency modeling with linear-time (or near-linear) complexity and stable autoregressive rollouts [
13,
23,
24].
1.5. Contributions, Research Questions, and a New Taxonomy
To address the aforementioned gap, we propose and utilize a unified, multi-dimensional taxonomy built upon three orthogonal axes. This framework forms the backbone of our survey:
Axis A: Operator Architecture: The model’s overall structure (CNN, Transformer, or hybrid) paired with the core mechanism for temporal modeling (convolutional recurrence, attention, SSM, or explicit warping/flow).
Axis B: Generative Nature: The fundamental approach to prediction, ranging from deterministic regression to probabilistic models that capture uncertainty (VAEs, GANs, diffusion models).
Axis C: Training and Inference Regime: The strategies used for training (e.g., teacher-forcing, curricula) and generation (e.g., autoregressive vs. one-shot), including techniques for edge deployment (distillation, quantization).
A conceptual illustration of this taxonomy is shown in
Figure 1. Using this lens, we synthesize advances from 2022–2025, analyze their trade-offs, and assess their utility for core UAV tasks. This review is guided by three primary research questions:
Q1: How should different temporal operators (attention, SSM, conv-recurrence) be selected and configured based on the target resolution, prediction horizon, and specific motion characteristics of UAV applications?
Q2: Under what specific UAV task conditions (e.g., navigating through dense occlusions, tracking erratically moving targets) do probabilistic generators (VAEs, diffusion models) justify their significant computational overhead compared to deterministic models?
Q3: Which training regimes and model compression techniques (e.g., distillation, quantization) are most effective in bridging the gap between high-accuracy models and the stringent deployment constraints of edge platforms?
Our synthesis reveals a clear trajectory towards scalable world models [
29,
30], powered by efficient temporal operators like SSMs and enriched by multimodal fusion (RGB, IMU, event cameras) and action-conditioning.
This review focuses on video prediction under UAV constraints. Downstream tasks (detection/segmentation/tracking) are discussed only insofar as they inform prediction design and evaluation; exhaustive coverage is beyond scope. We include methods that explicitly predict pixels or latent representations over time; we exclude works limited to single-frame enhancement or purely static remote-sensing analysis. For clarity, we classify these approaches based on their fundamental generative mechanism, as summarized in
Table 1. This classification also identifies the underlying temporal operator of each method and separates sequential AR rollouts from parallel one-shot/blockwise prediction.
2. Fundamentals of Video Prediction
2.1. Nomenclature
To facilitate a clear and rigorous discussion,
Table 2 establishes a common terminology for the core concepts, model architectures, and evaluation metrics central to this field.
2.2. Formal Problem Definition
Let $X_{1:T} = \{x_1, \dots, x_T\}$ denote a sequence of $T$ observed context frames, and let $Y_{T+1:T+K} = \{x_{T+1}, \dots, x_{T+K}\}$ represent the ground-truth future frames. Each frame $x_t \in \mathbb{R}^{C \times H \times W}$ is a tensor with $C$ channels and spatial dimensions $H \times W$. The goal of video prediction is to generate a sequence $\hat{Y}_{T+1:T+K}$ that is as close as possible to $Y_{T+1:T+K}$.
Deterministic predictors aim to learn a direct mapping $f_\theta: X_{1:T} \mapsto \hat{Y}_{T+1:T+K}$ by minimizing a reconstruction loss, implicitly modeling the conditional mean of the future distribution:
$$\min_{\theta} \; \mathbb{E}_{(X, Y)}\!\left[ \, \lVert f_\theta(X_{1:T}) - Y_{T+1:T+K} \rVert_p^p \, \right], \quad p \in \{1, 2\}.$$
Probabilistic predictors learn a conditional distribution $p_\theta(Y_{T+1:T+K} \mid X_{1:T})$ and generate diverse futures by sampling from latent variables or via iterative denoising. The objective is often to maximize the log-likelihood:
$$\max_{\theta} \; \mathbb{E}_{(X, Y)}\!\left[ \log p_\theta(Y_{T+1:T+K} \mid X_{1:T}) \right].$$
Action-conditional predictors, crucial for embodied agents like UAVs, extend this to $p_\theta(Y_{T+1:T+K} \mid X_{1:T}, A_{T+1:T+K})$, where $A_{T+1:T+K} = \{a_{T+1}, \dots, a_{T+K}\}$ is a sequence of future actions. For a UAV, an action $a_t$ could be a high-level command (e.g., target velocity and yaw rate, $a_t = [v_x, v_y, v_z, \dot{\psi}]^\top$) or low-level motor inputs. This conditioning, a cornerstone of model-based reinforcement learning [
31], allows the model to “imagine” the consequences of different plans, forming the basis of model-based control [
30].
The task is notoriously difficult due to the following: (i) High Dimensionality: Direct pixel-space generation is computationally expensive. This motivates latent-space prediction, where a compact representation is predicted first and then decoded to pixels, often leading to better efficiency and coherence. (ii) Inherent Stochasticity: The future is uncertain; a single past can lead to multiple plausible futures. Deterministic models often average these possibilities, resulting in blurry predictions. (iii) Complex Spatio-Temporal Dependencies: Models must capture both local motion and long-range interactions across space and time. (iv) Error Accumulation: In autoregressive generation, small errors in early frames can compound catastrophically over long horizons.
2.3. Training Objectives and Loss Functions
The choice of loss function profoundly impacts the quality of generated videos. Modern approaches typically use a weighted combination of several losses to balance fidelity, perceptual quality, and realism.
The most direct objectives are pixel-wise losses, which are computationally cheap and provide a stable training signal.
The L2 loss (Mean Squared Error, MSE) is widely used but tends to produce overly smooth and blurry results by penalizing large errors heavily.
The L1 loss (Mean Absolute Error, MAE) is known to produce sharper images compared to L2 and is often preferred in modern architectures.
To better align with human perception, losses can be computed in a learned feature space rather than pixel space.
Perceptual loss [
32]: Uses features $\phi(\cdot)$ from a pre-trained network (e.g., VGG-19 [
33]) to enforce perceptual similarity. It computes the distance between feature maps of the generated and real frames, capturing texture and structural information that pixel losses miss.
Style loss [
34]: A related loss that encourages the generated frames to match the style (correlations between feature maps) of the real frames, further improving textural realism.
To produce crisp and realistic images, Generative Adversarial Networks (GANs) [
35] introduce a discriminator network
D that is trained to distinguish real frames from generated ones. The generator
G (the prediction model) is then trained to fool the discriminator.
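For reference, the standard minimax objective underlying this game [35], which prediction models instantiate in various conditional and sequence-level variants, is
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{\hat{x} \sim G}\!\left[\log\!\left(1 - D(\hat{x})\right)\right],$$
where $x$ denotes real future frames and $\hat{x}$ the generator’s predictions.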
This adversarial objective pushes the generator to produce samples that lie on the manifold of real data, effectively acting as a learned loss function for realism.
To ensure smooth motion, specific losses can be added to regularize the temporal dimension.
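To make the weighting of these terms concrete, a minimal sketch of a composite objective is given below, assuming PyTorch and a torchvision VGG-19 feature extractor; the chosen layer cut and the weights are illustrative placeholders rather than values from any specific paper.

```python
import torch
import torch.nn.functional as F
from torchvision import models

class CompositeLoss(torch.nn.Module):
    """Weighted sum of a pixel-wise L1 term and a VGG-19 perceptual term.

    Adversarial and temporal-consistency terms would be added analogously;
    the weights below are placeholders, not values from a specific paper.
    """
    def __init__(self, w_pix=1.0, w_perc=0.1):
        super().__init__()
        # Frozen VGG-19 feature extractor (ImageNet normalization of the
        # inputs is omitted here for brevity).
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:16]
        self.vgg = vgg.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.w_pix, self.w_perc = w_pix, w_perc

    def forward(self, pred, target):
        # pred, target: (B, T, 3, H, W) clips with values in [0, 1].
        b, t, c, h, w = pred.shape
        pred_f = pred.reshape(b * t, c, h, w)
        tgt_f = target.reshape(b * t, c, h, w)
        pix = F.l1_loss(pred_f, tgt_f)                       # pixel fidelity
        perc = F.l1_loss(self.vgg(pred_f), self.vgg(tgt_f))  # perceptual similarity
        return self.w_pix * pix + self.w_perc * perc
```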
2.4. Evaluation Metrics: A UAV-Centric View
Evaluating video prediction requires a multi-faceted approach. A robust evaluation protocol should include metrics from each of the following categories.
Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM) [
37] are standard. While fast to compute, they correlate poorly with human perception of quality and often penalize plausible but slightly displaced predictions, a common issue in UAV footage with high parallax.
LPIPS [
38]: Learned Perceptual Image Patch Similarity computes the distance between deep features, offering a more robust measure of perceptual quality. It has become a de facto standard for evaluating generative models.
FVD [
39]: Fréchet Video Distance is the standard for evaluating the distribution of generated videos. It measures the Fréchet distance between Gaussian distributions fitted to features of real and generated video clips extracted from a pre-trained video classifier, capturing both frame-level quality and temporal coherence.
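Concretely, with Gaussians $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ fitted to the real and generated clip features, the distance is
$$\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right).$$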
KVD [
40]: Kernel Video Distance is an alternative to FVD that uses the Maximum Mean Discrepancy (MMD) with a polynomial kernel, proposed as a more stable metric for high-dimensional feature spaces.
Metrics that directly assess motion are critical for UAVs. These include End-Point-Error (EPE) of predicted optical flow, trajectory Mean Absolute Error (MAE) on tracked keypoints, and temporal consistency scores like temporal LPIPS (tLP) which measures perceptual similarity between consecutive frames. A comprehensive review of video generation metrics can be found in [
41].
For UAV applications, metrics must be task-aware. ROI-aware metrics (e.g., ROI-PSNR) compute scores only on masked regions of interest (e.g., small moving targets), which is crucial for tracking applications. Deployability metrics are non-negotiable: latency (ms/frame), throughput (FPS), peak VRAM usage (GB), and energy efficiency (Joules/frame or Frames/Joule) must be reported on specified edge hardware (e.g., NVIDIA Jetson AGX Orin) and precision (e.g., FP16/FP8).
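As an illustration of how such deployability numbers can be gathered, the following sketch times a predictor with CUDA events and records peak memory; the model handle and input shape are hypothetical, and energy measurements would additionally require a power probe (e.g., tegrastats on Jetson).

```python
import torch

@torch.inference_mode()
def profile_predictor(model, input_shape=(1, 8, 3, 512, 512), warmup=10, iters=100):
    """Report latency (ms/call), throughput (FPS), and peak VRAM (GB)."""
    device = "cuda"
    model = model.to(device).eval().half()          # FP16, as typically deployed
    x = torch.randn(*input_shape, device=device, dtype=torch.float16)

    for _ in range(warmup):                          # warm up kernels / autotuning
        model(x)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()

    ms_per_call = start.elapsed_time(end) / iters
    return {
        "latency_ms": ms_per_call,
        "fps": 1000.0 / ms_per_call,
        "peak_vram_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
```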
2.5. Traditional Approaches to Video Prediction
Before the dominance of deep learning, several traditional approaches were widely studied for video prediction. While less powerful in complex UAV scenarios, they remain important for historical context and lightweight baselines.
Optical-flow-based methods estimate motion fields between consecutive frames (e.g., Horn–Schunck, Lucas–Kanade) and extrapolate them to generate future frames. While computationally efficient, flow extrapolation struggles with occlusions, non-rigid motion, and long-horizon stability.
Linear dynamical systems and autoregressive models (e.g., Dynamic Textures, Kalman filtering) capture pixel intensity evolution over time. These methods are suitable for short clips and simple textures but cannot generalize to high-resolution UAV videos with complex dynamics.
Earlier work also exploited handcrafted features such as SIFT, HOG, or sparse coding, combined with regression models to predict motion trajectories. Although interpretable and lightweight, such approaches lack the representation capacity required for high-dimensional UAV scenes.
Compared with modern deep models, traditional approaches offer advantages in efficiency and interpretability but fall short in scalability and fidelity. As such, they are rarely competitive in current UAV prediction tasks, yet they provide useful baselines and motivate hybrid methods that incorporate physics priors.
2.6. Quantitative Summary of Current SOTA in UAV Predictive Autonomy
To help practitioners quickly track the latest capabilities,
Table 3 consolidates representative
accuracy and
efficiency metrics across major model families used for UAV video prediction. Accuracy metrics (PSNR/SSIM/FVD) are drawn from reported results in recent literature, while efficiency (FPS/VRAM) reflects typical ranges observed on edge (Jetson Orin) and workstation (RTX 4090) deployments, consistent with
Section 3.1,
Section 3.2 and
Section 3.3.
3. A Multi-Dimensional Taxonomy of Video Prediction Models
To navigate the complex design space of modern video prediction, we propose a three-axis taxonomy that disentangles architectural choices, generative principles, and training strategies. This framework provides a structured lens to analyze and compare models, especially for deployment on resource-constrained platforms like UAVs.
3.1. Axis A: Operator Architecture
This axis separates the model’s spatial feature extractor (the backbone) from its temporal evolution mechanism (the temporal operator).
3.1.1. Type I: Convolutional Recurrence
This family of models, built upon the Convolutional LSTM (ConvLSTM) [
11], pioneered deep learning for video prediction. ConvLSTM replaces the matrix multiplications in a traditional LSTM [
55] with convolutional operations, allowing it to maintain a 2D spatial structure in its hidden states. This makes it a natural fit for modeling pixel-space dynamics. A summary of key models in this category is presented in
Table 4.
ConvLSTM [
11]: The foundational work that introduced the convolutional recurrent cell for precipitation nowcasting. Its strength lies in its strong spatio-temporal locality bias.
PredRNN [
12] and
PredRNN++ [
42]: This series of works identified key limitations of deep ConvLSTM stacks. PredRNN introduced a spatio-temporal LSTM (ST-LSTM) unit to better model spatial correlations, while PredRNN++ added a Gradient Highway Unit (GHU) to route gradient information through time, alleviating the vanishing gradient problem in very deep recurrent models.
E3D-LSTM [
56]: Proposed an Eidetic 3D LSTM that uses a self-attention mechanism to recall detailed state representations from the past, allowing the model to better handle long-term dependencies with transient details.
MIM (Memory In Memory) [
57]: Addressed the issue of static regions in videos by introducing a mechanism to block stationary memory cells from updating, focusing computation on dynamic areas.
PhyDNet [
43]: Explicitly disentangled physical dynamics (e.g., predictable motion) from residual uncertainties. It uses a ConvLSTM to model the residual part, guided by a physics-based differential equation solver, making it well-suited for scenes with predictable fluid-like motion.
TAU (Temporal Attention Unit) [
58]: A recent work that augments recurrent models with a temporal attention mechanism to achieve state-of-the-art performance in forecasting, showing that hybrid recurrent-attention models remain a competitive research direction.
Strengths: Convolutional recurrence has a strong inductive bias for local motion and translation equivariance, making it data-efficient for simple dynamics. Its recurrent nature allows for rollout to arbitrary lengths.
Weaknesses: The strictly sequential nature of recurrence (one step at a time) leads to high inference latency and makes training parallelization difficult. More importantly, they are prone to error accumulation and often struggle to model complex, non-local interactions (e.g., a UAV turning rapidly, causing the entire scene to rotate).
3.1.2. Type II: Attention and Transformers
Inspired by their success in NLP [
59] and image recognition [
60], Transformers have been adapted for video by treating a video as a sequence of spatio-temporal tokens. Their core component, self-attention, allows every token to directly attend to every other token, making it exceptionally powerful for modeling long-range dependencies.
Table 5 summarizes key developments.
Factorized Transformers: Early works adapted vision transformers by factorizing spatial and temporal attention to manage the quadratic cost.
TimeSformer [
61] explored different space-time attention schemes, while
ViViT [
62] proposed efficient tokenization strategies like “tubelet” embedding.
Local/Windowed Attention: To improve scalability, models like the
Video Swin Transformer [
26] compute attention within local, non-overlapping windows that are shifted across layers, achieving linear complexity with respect to the number of tokens.
Generative Video Transformers: More recent works have focused on large-scale generation.
Phenaki [
63] can generate minute-long videos from a sequence of text prompts by using a bidirectional transformer to compress video into discrete tokens.
MagVIT-v2 [
48] is a state-of-the-art model that uses a multi-stage transformer to perform masked token prediction, achieving high-quality results. The architecture of
Sora [
64] also builds on the transformer paradigm, using a diffusion-transformer to process “spacetime patches” for large-scale, high-fidelity video generation.
Strengths: Unparalleled ability to model global, long-range spatio-temporal dependencies. This is crucial for UAVs performing complex maneuvers where the entire scene transforms globally. Their parallelizable nature is a significant advantage for training.
Weaknesses: The quadratic complexity of full self-attention ($\mathcal{O}(L^2)$ for sequence length $L$) is prohibitive for high-resolution, long-horizon videos. While variants like Swin reduce this, memory usage remains a major bottleneck for edge deployment on UAVs. They also lack a strong temporal inductive bias, often requiring massive datasets to learn motion priors effectively.
3.1.3. Type III: State-Space Models (SSMs)
SSMs are a recent and highly promising class of temporal operators that blend the strengths of RNNs and Transformers. Inspired by classical state-space models from control theory, modern deep learning versions like S4 and Mamba learn these state-space parameters from data. Key SSM-based models are listed in
Table 6.
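At their core, these layers run a discretized linear state-space recurrence per channel. Writing $\bar{A}$ and $\bar{B}$ for the discretized versions of the learned continuous parameters (with step size $\Delta$), the update is
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$
which can be evaluated recurrently with constant memory at inference or, for fixed (non-selective) parameters, unrolled into a long convolution for parallel training; Mamba’s selectivity additionally makes $\bar{A}$, $\bar{B}$, and $C$ functions of the input $x_t$.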
S4 [
23] and
S5 [
65]: The breakthrough works that stabilized SSMs for deep learning. S4 used a specific parameterization (diagonal plus low-rank), while S5 further simplified the formulation, making it highly effective for long-sequence modeling.
Mamba [
24]: Improved upon prior SSMs by making the parameters
input-dependent, allowing the model to selectively remember or forget information based on the content. This “selective” mechanism gives it capabilities rivaling attention but with linear-time complexity.
Vision Mamba (Vim) and VMamba [
50,
66]: Adapted the Mamba architecture for vision tasks by “flattening” image patches into a sequence and applying the Mamba block. These models have shown competitive performance to Vision Transformers with better computational scaling.
SimVPv2 [
45]: A video prediction model that incorporates an “Invariant-Memory State-Space” (IM-SSM) block, demonstrating the direct applicability and benefits of SSMs for spatio-temporal forecasting.
ss-Mamba [
67]: A recent variant of the Mamba family, ss-Mamba integrates semantic-aware embeddings and spline-based temporal encodings within a selective state-space modeling framework to enhance forecasting accuracy, robustness, and interpretability while reducing computational overhead in complex sequence modeling tasks.
Strengths: SSMs offer the “best of both worlds”: linear-time complexity ($\mathcal{O}(L)$) and constant memory during autoregressive inference like an RNN, combined with parallelizable training and the ability to capture very-long-range dependencies like a Transformer. This makes them exceptionally well-suited for the long-horizon, high-resolution, and resource-constrained nature of UAV deployment.
Weaknesses: As a very new architecture class, best practices for implementation and hyperparameter tuning are still emerging. Their inductive bias for visual tasks is also less understood compared to CNNs or Transformers.
3.1.4. Type IV: Explicit Warping and Flow
This approach is founded on a strong physical prior: most changes in a video are due to motion. These models first explicitly estimate motion, typically as a dense optical flow field using networks like
RAFT [
68], and then use this field to warp the previous frame into the future. A separate, often smaller, neural network then predicts a residual to handle disocclusions, new content, and non-rigid changes. A summary is provided in
Table 7.
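As a minimal illustration of the warping step, backward warping a frame with a dense flow field can be implemented with grid sampling; this is a sketch assuming PyTorch, and real systems such as those below add occlusion masks and residual synthesis on top.

```python
import torch
import torch.nn.functional as F

def warp_frame(frame, flow):
    """Backward-warp `frame` (B, C, H, W) with a dense flow field (B, 2, H, W).

    For every target pixel, the flow gives the displacement (in pixels)
    pointing to the source location in `frame`.
    """
    b, _, h, w = frame.shape
    # Build a pixel-coordinate grid (x, y) for the target frame.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W)
    src = grid + flow                                   # sampling locations
    # Normalize to [-1, 1] as required by grid_sample.
    src_x = 2.0 * src[:, 0] / (w - 1) - 1.0
    src_y = 2.0 * src[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((src_x, src_y), dim=-1)     # (B, H, W, 2)
    return F.grid_sample(frame, norm_grid, align_corners=True)
```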
DVF (Deep Voxel Flow) [
36]: Learned a voxel flow representation to warp past frames into the future and then used a GAN to synthesize a sharp result.
SAVP (Spatially-Aware Video Prediction) [
44]: An action-conditional model that disentangles static and dynamic content. It predicts future optical flow and combines it with a learned foreground mask to generate predictions.
Layered Representations: Some models predict that scenes are composed of multiple layers, each with its own motion.
World of Bits [
69] and later works model a scene as a set of sprites or objects on a background, predicting the motion of each independently.
Neural Radiance Fields (NeRF) based prediction: A new frontier involves using NeRF [
71] for prediction. Models like
DyNeRF [
70] learn a dynamic scene representation that can be rendered from any viewpoint at any time, implicitly performing prediction by modeling the underlying 3D scene dynamics.
Strengths: Incorporating a strong geometric prior makes these models highly data-efficient and capable of producing sharp predictions for scenes dominated by camera motion, which is very common for UAVs.
Weaknesses: They are brittle. If the underlying flow estimation fails (e.g., in scenes with low texture, fast motion, or severe occlusions), the entire prediction quality collapses. The two-stage process (flow estimation then residual prediction) can also be less efficient than end-to-end models.
3.1.5. Comparative Analysis of Operator Architectures
While
Section 3.1.1,
Section 3.1.2,
Section 3.1.3 and
Section 3.1.4 described each operator family qualitatively, UAV practitioners often require a side-by-side comparison to evaluate deployment feasibility.
Table 8 summarizes the key quantitative trade-offs across convolutional recurrent networks, Transformers, state-space models, and flow/warping-based methods.
From this comparison, SSMs provide the best balance between long-horizon modeling and efficiency, while Transformers achieve strong accuracy at the cost of memory and latency. ConvLSTM remains competitive on resource-limited UAVs, whereas flow-based methods excel when explicit motion priors dominate but are less flexible in unconstrained scenes.
3.2. Axis B: Generative Nature (Deterministic vs. Probabilistic)
This axis classifies models based on whether they produce a single, best-guess future (deterministic) or a distribution of possible futures (probabilistic). This choice is fundamental as it relates directly to handling the inherent uncertainty of the real world.
3.2.1. Deterministic Models
Deterministic models learn a one-to-one mapping from past frames to future frames, typically by minimizing a pixel-wise reconstruction loss like L1 or L2. They predict the conditional mean of the future distribution.
Table 9 lists key examples.
Many early ConvLSTM-based models fall into this category. A prime modern example is
SimVP [
13], which uses a simple, purely convolutional architecture to achieve state-of-the-art deterministic predictions. Its success demonstrates that a well-designed backbone can implicitly learn complex dynamics without explicit probabilistic modeling.
Strengths: Deterministic models are computationally efficient, straightforward to train, and produce stable, repeatable predictions. This makes them suitable for UAV tasks where a single, plausible forecast is sufficient, such as short-term obstacle avoidance.
Weaknesses: They famously suffer from the “blurry prediction” problem. When faced with multiple possible futures (e.g., a vehicle at an intersection could turn left or right), the model averages these outcomes, resulting in a fuzzy, unrealistic image. This limits their use in long-horizon planning or scenarios requiring risk assessment based on multiple outcomes.
3.2.2. Probabilistic Models
Probabilistic models aim to capture the full distribution of possible futures, $p(Y_{T+1:T+K} \mid X_{1:T})$. This allows them to generate diverse and realistic samples, which is crucial for robust decision-making under uncertainty.
VAEs [
72] learn a compressed latent distribution of the future. By sampling a latent vector $z$ from this distribution and feeding it to a decoder, they can generate a variety of future sequences. This provides a principled and efficient way to model uncertainty. Key VAE-based models are summarized in
Table 10.
VRNN (Variational RNN) [
73]: One of the earliest works to combine RNNs with variational inference, allowing the model to handle high-dimensional sequential data like speech and video.
SVG (Stochastic Video Generation) [
74]: A seminal VAE-based model for video prediction. It uses a learned prior and posterior distribution over a latent variable $z_t$ that is sampled for each predicted frame, enabling diverse outputs.
SLT (Stochastic Latent Transformer) [
75]: This model represents a modern hybrid approach. It first uses a VAE to encode frames into a discrete latent space and then applies a Transformer to model the temporal dynamics of these latent representations. This leverages the probabilistic nature of VAEs and the long-range modeling capabilities of Transformers.
GH-VAE (Generative Hierarchical VAE) [
76]: Employs a hierarchical latent space to model video from a global level down to fine-grained details, improving the quality and coherence of long-term generation.
VAE-SD (Supervised Disentanglement VAE) [
77]: A more recent work that combines disentangled representation learning with VAEs to improve the diversity and quality of video generation.
TD-VAE (Temporally-Disentangled VAE) [
78]: This work focuses on learning more structured representations by explicitly disentangling time-invariant factors (e.g., an object’s identity) from time-varying factors (e.g., its position and pose) within the latent space, which is crucial for better understanding and control.
Dreamer Series as World Models [
30]: While complex systems, the Dreamer models are fundamentally built on VAE principles. They learn a compact, latent world model (often called a Recurrent State-Space Model with a variational component) where an agent can efficiently plan future actions through “imagination.” This represents a highly successful application of VAEs for decision-making and control, directly relevant to predictive autonomy.
Strengths: VAEs provide a principled way to model uncertainty and generate diverse futures. For a UAV, this means it could anticipate multiple possible trajectories for a pedestrian, allowing the planner to choose a path that is safe under all likely scenarios. The trend towards structured, disentangled latent spaces (e.g., TD-VAE) and their successful application in world models (Dreamer) makes them highly compelling for learning controllable and interpretable environment models.
Weaknesses: The diversity often comes at the cost of sharpness, as VAEs can also suffer from a form of blurring or mode-averaging within the decoder, though less severely than deterministic models. The quality of generation is highly dependent on the expressiveness of the latent space.
GANs [
35] use a minimax game between a generator and a discriminator to produce highly realistic outputs. The generator predicts future frames, while the discriminator is trained to distinguish between real and generated sequences. This adversarial pressure forces the generator to produce sharp, perceptually convincing frames that lie on the manifold of natural images. Key developments are summarized in
Table 11.
vid2vid (Video-to-Video Synthesis) [
79]: A foundational work in conditional video generation. While not strictly a prediction model, it demonstrates how GANs can translate a sequence of abstract inputs (like semantic segmentation maps) into photorealistic video. This capability is crucial for generating realistic simulation data from abstract planner outputs.
MoCoGAN [
80]: Decomposes motion and content into separate latent spaces, allowing for better control and structured generation of video content.
TGANv2 (Temporal GAN v2) [
81]: This work directly addresses the challenge of temporal consistency. It pairs a recurrent generator with a temporal discriminator that evaluates sequences of frames, encouraging the model to learn smoother and more plausible long-term motion dynamics.
DVD-GAN (Dual Video Discriminator GAN) [
82]: Utilized two discriminators—a spatial discriminator that judges frame quality and a temporal discriminator that judges motion realism across frames—to generate high-quality, coherent videos.
StyleGAN-V and
StyleVideoGAN [
83,
84]: These state-of-the-art models build upon the powerful StyleGAN architecture. They leverage style-based synthesis and disentangled latent spaces to generate high-resolution, continuous videos with a high degree of control over appearance and motion, demonstrating the continued power of GANs for high-fidelity applications.
Video-GPT [
47]: While primarily using a VQ-VAE and Transformer, it crucially employs an adversarial objective on top of its decoder to enhance the realism and sharpness of the final pixel-space output, showcasing the effectiveness of hybrid approaches.
Strengths: Unmatched ability to produce crisp, high-fidelity images. For applications like sim-to-real, data augmentation, or generating realistic training environments for other perception modules on a UAV, GANs are highly effective. The ability to perform conditional synthesis (vid2vid) is particularly valuable for creating realistic sensor data from simulated planner outputs.
Weaknesses: GANs are notoriously difficult and unstable to train. They can also suffer from “mode collapse,” where the generator learns to produce only a limited variety of outputs, defeating the purpose of probabilistic modeling. Ensuring long-term temporal coherence remains a significant challenge, although models like TGANv2 have made progress.
Diffusion models [
85,
86] are the current state-of-the-art in generative modeling. They work by progressively adding noise to data in a “forward process” and then training a neural network to reverse this process, starting from pure noise and iteratively denoising it to generate a clean sample. Key models are listed in
Table 12.
MCVD (Masked Conditional Video Diffusion) [
17]: A leading diffusion-based model for video prediction. It takes past frames as a condition and uses a masked denoising process to generate future frames, demonstrating remarkable quality and diversity.
LVDM (Latent Video Diffusion Models) [
51]: To manage the immense computational cost of running diffusion in pixel space, LVDM first uses an autoencoder to compress the video into a much lower-dimensional latent space. The diffusion process then occurs entirely in this latent space, followed by a final decoding step. This approach is now standard for large-scale video models.
LFDM (Latent Flow Diffusion Model) [
87]: This work directly tackles the slow sampling speed of diffusion models. By leveraging Rectified Flow, LFDM learns a straighter trajectory between noise and data in the latent space, enabling it to produce high-quality video in as few as 4–8 inference steps—a significant move towards real-time feasibility.
W.A.L.T/Sora [
64,
88]: These models represent the state-of-the-art in generative fidelity by pairing the diffusion framework with a powerful Transformer backbone (often termed a Diffusion Transformer). They process video as a sequence of “spacetime patches,” enabling remarkable temporal consistency and high-resolution output.
MotionCtrl [
89]: This model enhances the controllability of video generation by allowing explicit inputs for camera motion (e.g., pan, tilt, zoom) and object trajectories. This is a crucial step towards creating world models that can be precisely controlled for planning and simulation in robotics.
ExtDM (Extrapolative Diffusion Models) [
90]: To address long-horizon generation, ExtDM introduces an autoregressive mechanism within the latent diffusion process. It predicts a short chunk of future latent codes, appends them to the context, and then predicts the next chunk, enabling the generation of much longer and more coherent video sequences than one-shot models.
Strengths: Produce the highest quality and most diverse samples among all generative models. Their ability to generate highly realistic future scenarios could be transformative for high-stakes UAV mission planning and simulation. Models like MotionCtrl show a clear path toward action-conditional world models with fine-grained control.
Weaknesses: Their primary drawback is the slow inference speed. However, recent advancements like LFDM are rapidly closing this gap by drastically reducing the required number of sampling steps. While still challenging for onboard deployment, these efficiency improvements suggest that real-time diffusion models may become feasible in the near future, especially when combined with hardware acceleration and distillation techniques.
3.2.3. Comparative Analysis of Generative Paradigms
While
Section 3.2.1 and
Section 3.2.2 outlined the strengths of deterministic, variational, adversarial, and diffusion-based models, UAV deployment requires a side-by-side evaluation of their trade-offs.
Table 13 provides a structured summary, including predictive fidelity, diversity, inference efficiency, and lightweight strategies.
From this comparison, diffusion models clearly dominate in fidelity and diversity but suffer from high latency. Recent studies show that reducing diffusion steps from 250 to 20 improves FPS from ~2 to ~15 on edge hardware, with only minor PSNR/SSIM degradation (e.g., 28.2 → 28.0, SSIM 0.91 unchanged) [
91,
92]. Latent compression (e.g., spatial
, temporal 8×) further boosts speed but reduces PSNR by ~5 dB [
91,
93]. Knowledge distillation can shrink diffusion models to 4–8 steps, retaining over 95% of performance while reaching 10+ FPS on mobile GPUs [
91,
94,
95]. For UAV deployment, this suggests that diffusion is viable only with aggressive acceleration, while VAEs and GANs offer better real-time trade-offs.
3.3. Axis C: Training and Inference Regime
This axis describes the strategic choices in how models are trained, how they generate predictions, and how they are optimized for real-world deployment on edge devices. These choices often have as much impact on final performance as the model architecture itself.
3.3.1. Advanced Training Paradigms
Beyond the basic training loop, several advanced paradigms are used to improve model performance, data efficiency, and task alignment, as summarized in
Table 14.
Instead of training on the full complexity of the task from the start, curriculum learning introduces concepts gradually.
Loss Scheduling: A common curriculum is to start by training with a simple L1/L2 loss to establish a stable baseline. Perceptual and adversarial losses are then gradually introduced or their weights increased, guiding the model toward finer details and realism without causing early training instability.
Multi-Stage Architecture Training: This is central to latent space models. First, a convolutional autoencoder (often a VQ-VAE) is trained on a massive image or video dataset to learn a robust and compressed latent space. In the second stage, this encoder/decoder is frozen, and the temporal model (Transformer, SSM, or diffusion model) is trained exclusively in the much lower-dimensional latent space. This approach, used by
Video-GPT [
47] and latent diffusion models, is far more computationally efficient.
Inspired by the success of models like BERT and GPT, pre-training on large, unlabeled datasets followed by fine-tuning on a specific task has become a dominant paradigm.
Masked Autoencoding (MAE) [
96] is a powerful self-supervised pre-training strategy. Models like
MaskViT [
97] are trained to reconstruct heavily masked (e.g., 75%) patches of a video. To succeed, the model must learn rich internal representations of motion and appearance.
Transfer Learning: A model pre-trained on a large-scale general video dataset (e.g., Kinetics-400 for human actions or web-crawled videos) can then be fine-tuned on a smaller, domain-specific UAV dataset. This transfers general knowledge about visual dynamics, significantly improving performance and reducing the amount of specialized data needed. This is a crucial strategy to overcome the data scarcity challenge in the UAV domain.
For control tasks, aligning the predictive model with the ultimate goal (e.g., reaching a waypoint without crashing) is more important than achieving perfect pixel accuracy.
World Models [
29,
30] fully embrace this concept. The predictive model is trained to forecast future latent states. A separate policy network is then trained via RL entirely within the latent space of the world model. The gradients from the RL reward signal can be used to fine-tune the predictive model itself, ensuring that what it predicts is not just visually plausible but also useful for making good decisions. This tight coupling of prediction and control is the essence of advanced predictive autonomy.
3.3.2. Generation and Inference Strategies
This defines the core mechanism by which a model produces its output sequence.
Table 15 compares the two main approaches.
Autoregressive (AR) models generate the future one frame at a time: the prediction for frame $t$, $\hat{x}_t$, is conditioned on the observed context and all previously generated frames $\hat{x}_{T+1:t-1}$. Most ConvLSTM- and SSM-based predictors adopt this scheme due to their recurrent inference path.
A well-known pitfall of AR decoding is
exposure bias. During training, the model typically consumes ground-truth frames (teacher forcing), which is efficient but induces a train–test mismatch: at inference it must condition on its own, potentially imperfect, rollouts. The resulting feedback can degrade quality rapidly over long horizons.
Scheduled Sampling [
98] mitigates this by gradually replacing ground-truth inputs with model predictions during training, thereby acclimating the model to its own errors.
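A minimal sketch of this schedule inside an AR training loop is shown below; the one-step `model` interface, the L1 objective, and the decay of the teacher-forcing probability `p_tf` are illustrative assumptions rather than the formulation of any specific paper.

```python
import random
import torch

def train_step(model, frames, optimizer, p_tf):
    """One scheduled-sampling training step.

    frames: (B, T, C, H, W) ground-truth clip; p_tf is the probability of
    feeding the ground-truth frame (teacher forcing) instead of the model's
    own previous prediction. p_tf is typically annealed from 1.0 toward 0.0.
    """
    b, t = frames.shape[:2]
    loss = 0.0
    inp = frames[:, 0]                      # first frame is always ground truth
    for step in range(1, t):
        pred = model(inp)                   # predict the next frame
        loss = loss + torch.nn.functional.l1_loss(pred, frames[:, step])
        # With probability p_tf use the ground truth as the next input,
        # otherwise feed back the model's own (detached) prediction.
        inp = frames[:, step] if random.random() < p_tf else pred.detach()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item() / (t - 1)
```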
Non-autoregressive (one-shot or blockwise) models generate the entire sequence of future frames (or large blocks of them) in a single forward pass, breaking the sequential dependency. This is typical for Transformer-based models and some diffusion models.
MagViT [
48] is a prime example of a non-autoregressive transformer that predicts all future frames’ tokens in parallel, leading to a significant speed-up. Similarly, latent diffusion models like
LVDM [
51] can generate a sequence of future latents in a single diffusion process, which are then decoded in parallel.
UAV-Centric Trade-off: AR models offer flexibility for variable-length prediction but can be slow and suffer from error accumulation. Non-AR models are much faster for fixed-length prediction but may lack the fine-grained temporal consistency of their AR counterparts. For UAVs, a fast non-AR model might be ideal for short-term reactive planning, while a more deliberate AR model could be used for longer-term mission planning offline.
3.3.3. Deployment and Acceleration Strategies
Bridging the lab-to-field gap requires aggressive optimization to fit powerful models onto resource-constrained UAV hardware. Key techniques are outlined in
Table 16.
Knowledge distillation involves training a small, fast “student” model to mimic a large “teacher.”
Output-Based Distillation: The student is trained to match the final predictions of the teacher. For a diffusion model, this means training a single-step student network to replicate the output of the multi-step teacher, a technique explored in works like [
95].
Feature-Based Distillation: A more powerful technique where the student is also trained to match the intermediate feature maps of the teacher. This provides a richer training signal. A recent framework,
F-VLM [
99], shows how pre-trained vision-language models can be distilled into much smaller, efficient students for deployment. While not specific to prediction, the principles are directly applicable.
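As a concrete illustration of combining the two signals above, a student predictor can be trained against both the teacher’s outputs and its intermediate features; the projection layer and loss weights here are assumptions made for the sketch, not part of any cited framework.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out,
                      student_feat, teacher_feat,
                      proj, alpha=0.5, beta=0.5):
    """Output- plus feature-based distillation for a video predictor.

    proj: a small learned layer mapping student features to the teacher's
    feature dimension (needed when the two backbones differ in width).
    The teacher's outputs and features are detached so only the student
    (and proj) receive gradients.
    """
    out_term = F.l1_loss(student_out, teacher_out.detach())
    feat_term = F.mse_loss(proj(student_feat), teacher_feat.detach())
    return alpha * out_term + beta * feat_term
```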
Quantization and pruning reduce the model’s size and computational demands at the hardware level.
Quantization: A crucial step for deployment on platforms with specialized hardware like NVIDIA’s Tensor Cores or Qualcomm’s AI Engines. Toolsets like NVIDIA’s
PTQ Toolkit provide methods for calibrating and converting models to FP8 with minimal accuracy loss. As demonstrated in works like
ZeroQuant [
100], sophisticated quantization techniques can be applied even to massive transformer models.
Pruning: While classic pruning creates sparse models that are hard to accelerate on GPUs, structured pruning removes entire filters or attention heads. Recent methods like
LLM-Pruner [
101] have developed effective strategies for pruning large transformer models, which can be adapted for vision transformers used in video prediction.
Hardware-aware architecture design is an emerging paradigm in which the network architecture is designed with the target hardware’s capabilities in mind from the very beginning.
Neural Architecture Search (NAS) can be used to automatically discover efficient model architectures. For instance,
Once-for-All (OFA) [
102] trains a single, large “over-parameterized” network from which specialized sub-networks can be quickly extracted to meet diverse hardware constraints without retraining. This allows for deploying tailored models for different UAV platforms (e.g., a small model for a nano drone, a larger one for a heavy-lift hexacopter) from a single trained asset.
3.3.4. Comparative Analysis of Training and Inference Regimes
While
Section 3.3.1,
Section 3.3.2 and
Section 3.3.3 introduced autoregressive, non-autoregressive, and acceleration strategies individually, it is essential to evaluate their relative trade-offs for UAV deployment.
Table 17 summarizes these regimes across multiple dimensions.
In summary, AR regimes excel at modeling long temporal dependencies but are limited by latency and error accumulation. Non-AR approaches deliver high efficiency, making them ideal for real-time UAV tasks, yet their fixed horizon constrains applicability. Hybrid approaches, though more complex, offer a promising compromise by mitigating error accumulation while retaining partial flexibility. This comparative view highlights that different UAV applications (e.g., long-term navigation vs. short-term obstacle avoidance) naturally map to distinct inference regimes.
4. Key Applications and Challenges for Embodied UAVs
The theoretical advancements in video prediction are compelling, but their true value is realized when grounded in real-world applications. For embodied UAVs, predictive models are not merely academic exercises; they are enabling technologies for a new level of autonomy. This section explores three core applications—navigation, tracking, and anomaly detection—analyzing the specific demands each places on model design through the lens of our three-axis taxonomy.
4.1. Predictive Control for Autonomous Navigation in GPS-Denied Environments
In environments like urban canyons, dense forests, or indoor spaces, GPS signals are unreliable or unavailable. Here, vision becomes the primary sense for navigation. Predictive control, particularly Model Predictive Control (MPC) [
103], leverages a forward model of the world to plan actions. Video prediction models serve as powerful, learned forward models.
Given the UAV’s past observations $X_{1:T}$ and a candidate sequence of future control actions $A_{T+1:T+H}$, an action-conditional video prediction model forecasts the resulting future states or observations $\hat{Y}_{T+1:T+H}$. A planner then optimizes over the space of possible action sequences by minimizing a cost function evaluated on the imagined futures. This “imagination-based planning” is the core idea of seminal works like
Dreamer [
15] and its successor
DreamerV3 [
30]. In the context of robotics, this approach is often referred to as Visual MPC, where systems like
MPPI [
104] provide efficient algorithms for searching the action space. The goal is to learn a policy that can navigate complex environments by “thinking ahead.”
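To make the planning loop concrete, a minimal random-shooting sketch is given below; the `world_model.rollout` interface and `cost_fn` are hypothetical placeholders, and practical systems typically use CEM or MPPI [104] rather than pure random search.

```python
import torch

def plan_action_sequence(world_model, obs_context, cost_fn,
                         horizon=12, n_candidates=256, action_dim=4):
    """Pick the action sequence whose imagined rollout has the lowest cost.

    world_model.rollout(obs_context, actions) -> predicted latent states of
    shape (N, horizon, D); both this interface and cost_fn are assumptions.
    """
    # Sample candidate action sequences, e.g., [vx, vy, vz, yaw_rate] per step.
    actions = torch.randn(n_candidates, horizon, action_dim)
    with torch.no_grad():
        imagined = world_model.rollout(obs_context, actions)   # (N, H, D)
        costs = cost_fn(imagined, actions)                     # (N,)
    best = torch.argmin(costs)
    # Execute only the first action, then replan (receding-horizon MPC).
    return actions[best, 0]
```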
Axis A (Operator): Long planning horizons are essential for non-myopic behavior. This makes SSMs (e.g., SimVPv2) and efficient Transformers superior to ConvLSTMs, as they are less prone to vanishing gradients and better at capturing long-range dependencies with favorable scaling properties. Prediction is often performed in a compact latent space rather than pixel space to save computation.
Axis B (Generative Nature): Navigation in dynamic environments is fraught with uncertainty. A probabilistic model (e.g., VAE-based like SVG) is highly desirable as it can generate multiple possible futures for other agents (e.g., pedestrians, vehicles), enabling the planner to find a robust policy that is safe across many outcomes.
Axis C (Training/Inference): The model must be action-conditional. Training data must contain tightly synchronized `(observation, action, next_observation)’ triplets. At inference, the model is used in an autoregressive fashion to roll out long trajectories.
Unlike ground robots, UAVs operate in full 3D space with 6-DoF dynamics. This creates an enormous action space for the planner to search. The visual scene is dominated by ego-motion, and the lack of a ground-plane constraint means that small prediction errors in pitch or roll can lead to catastrophically divergent future trajectories. While early works demonstrated visual MPC on UAVs for tasks like trail following [
105], adapting these models to the extreme dynamics and safety requirements of complex urban environments remains a major challenge. Recent work such as
Unified World Models (UWM) [
106] explores this direction by integrating video diffusion and action diffusion within a single transformer architecture to model dynamics, inverse/forward mappings, and video generation jointly across multiple robotic domains. By training on both robot trajectories and unlabeled video data, UWM aims to generalize across diverse robots and environments with a single model.
4.2. Proactive Target Tracking Through Occlusion and Abrupt Motion
Visual object tracking from a UAV is a core capability for applications ranging from surveillance to search and rescue. A key failure mode for trackers is when the target is temporarily occluded or undergoes abrupt motion. Video prediction offers a robust solution by providing a learned motion prior.
Most modern trackers follow a tracking-by-detection paradigm, which re-localizes the target in each frame. However, when the detector fails, the track is lost. By incorporating a predictive module, the tracker can “coast” through such failures. The system maintains a target state (e.g., a Kalman filter state) which is propagated forward in time using both a classical motion model and a deep predictive model. The deep model can forecast the target’s appearance and location, enabling rapid re-detection after occlusion. This principle is explored in works like
P-Tracker [
107] and has roots in earlier ideas of using generative models to handle occlusion in tracking [
108]. Recent Siamese network-based trackers like
Siam R-CNN [
109] can also be augmented with temporal prediction modules to improve long-term robustness.
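A minimal sketch of this coasting logic is shown below; the box representation, confidence threshold, and function names are illustrative rather than taken from any specific tracker.

```python
def coast_track(state, predicted_box, detected_box, det_conf, conf_thresh=0.5):
    """Propagate a target track through detection failures.

    state: dict holding the current box and a coasting counter;
    predicted_box: the learned predictor's forecast for the next frame;
    detected_box / det_conf: current detector output. Boxes are (x, y, w, h).
    """
    if det_conf >= conf_thresh:
        state["box"] = detected_box       # trust the detector when it is confident
        state["coasting"] = 0
    else:
        state["box"] = predicted_box      # coast on the learned motion prior
        state["coasting"] += 1            # track how long we have been "imagining"
    return state["box"]
```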
Axis A (Operator): Since tracking is a real-time task, low latency is critical. Lightweight CNNs (e.g., SimVP-style) or shallow ConvLSTMs are often preferred over heavy Transformers or SSMs for this application. The prediction can be done on a small, cropped region of interest (ROI) around the target to save computation.
Axis B (Generative Nature): For most tracking scenarios, a fast deterministic prediction is sufficient. The primary goal is to get a good estimate of the target’s next location, not to generate photorealistic crops. However, for highly erratic targets, a probabilistic model could forecast a distribution of possible next locations, guiding a more robust search strategy.
Axis C (Training/Inference): Training with ROI-aware losses is crucial. The loss should be computed only on or near the target’s bounding box to ensure the model focuses on learning the target’s specific dynamics, not the background.
Targets viewed from UAVs are often very small, occupy only a few pixels, and undergo dramatic scale and appearance changes. The VisDrone [
110] and UAVDT [
111] datasets are filled with such challenging scenarios. The high motion of both the camera and the target means that simple motion models (like constant velocity) are often insufficient, making learned, content-aware predictors particularly valuable.
4.3. Unsupervised Anomaly Detection for Infrastructure Inspection
UAVs are increasingly used for the automated inspection of critical infrastructure. Unsupervised anomaly detection aims to identify novel or unexpected phenomena—such as a new crack, corrosion, or vegetation overgrowth—without prior training on examples of these faults.
The “prediction-based” paradigm is standard for this task. A model is trained exclusively on large amounts of data depicting “normal” conditions. At test time, the model predicts the next frame(s), and an anomaly score is computed based on the prediction error (PE) between the prediction $\hat{x}_t$ and the actual frame $x_t$. This idea has been extensively explored in video anomaly detection (VAD), with models like
AnoPred [
112] and methods that combine reconstruction and prediction losses [
18]. The core assumption is that the model, having only seen normal data, will fail to accurately predict unseen anomalous events. This approach is explored in aerial contexts in datasets like
Drone-Anomaly [
113], and a comprehensive survey of VAD techniques can be found in [
114].
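A minimal sketch of this scoring rule is given below, following the common practice of PSNR-based normalization over a clip; the exact normalization differs between papers, so the details here are illustrative.

```python
import torch

def anomaly_scores(preds, frames, eps=1e-8):
    """Per-frame anomaly score in [0, 1] from prediction error.

    preds, frames: (T, C, H, W) predicted and observed frames in [0, 1].
    A higher score means a larger deviation from the learned notion of "normal".
    """
    mse = ((preds - frames) ** 2).flatten(1).mean(dim=1)        # (T,)
    psnr = 10.0 * torch.log10(1.0 / (mse + eps))
    # Normalize PSNR over the clip and invert: low PSNR -> high anomaly score.
    s = (psnr - psnr.min()) / (psnr.max() - psnr.min() + eps)
    return 1.0 - s
```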
Axis A (Operator): High fidelity is key. The model must be powerful enough to accurately reconstruct fine details and textures of normal data. Deep CNN backbones are common. To handle the strong geometric structure, models incorporating explicit warping/flow can be very effective, as they can separate appearance changes from simple viewpoint shifts.
Axis B (Generative Nature): A high-capacity deterministic model is often the best choice, as the goal is to learn a single, precise model of the normal data distribution. Some works explore using the discriminator of a GAN to identify out-of-distribution (anomalous) samples.
Axis C (Training/Inference): A critical preprocessing step is ego-motion compensation. By using IMU data or estimating optical flow, the input frames can be warped to a reference frame, ensuring the model doesn’t flag changes due to UAV movement as anomalies.
The definition of “normal” can be highly variable due to changing illumination, weather, and seasons. A model trained only on sunny-day data may falsely flag shadows on a cloudy day as anomalies. Developing models that are robust to these environmental variations while remaining sensitive to subtle structural faults is a major open research problem. Furthermore, camera jitter and ego-motion can easily be mistaken for anomalies, making stabilization (as discussed in
Section 5.1) a critical prerequisite.
4.4. UAV-Centric Datasets and Their Limitations
The performance of any predictive model is fundamentally tied to the quality and characteristics of the training data. While several UAV-centric datasets exist, many were not designed with video prediction in mind. We supplement the earlier dataset overview with datasets such as
UAV-ARG [
115], which provides high-quality data for gesture recognition, and
Blackbird [
116], a large-scale dataset specifically for aggressive, vision-based drone racing.
A major bottleneck for advancing UAV video prediction, especially for world models, is the lack of large-scale datasets that provide
long, continuous video sequences tightly synchronized with
high-frequency 6-DoF ego-motion data (IMU), and control inputs. Datasets from the autonomous driving domain, like Argoverse 2 [
117] and the Waymo Open Dataset [
118], provide a blueprint for what is needed: diverse scenarios with rich, multi-modal, synchronized sensor data. The remote sensing community needs a similar benchmark to truly push the boundaries of predictive autonomy for UAVs.
4.5. Toward Standardized Evaluation Protocols for UAV Settings
Meaningful progress requires consistent and comprehensive evaluation. We advocate for a standardized protocol that moves beyond simple PSNR/SSIM.
Performance should not be a single number. Results must be stratified by key factors affecting difficulty: scene type (e.g., urban vs. rural), ego-motion dynamics (e.g., hover vs. aggressive flight, measured by average optical flow magnitude), and environmental conditions (day/night, clear/rainy).
Evaluation should be tied to the downstream task. For navigation, as demonstrated in benchmarks like the CARLA challenge [119], metrics should include route completion rates and infractions (collisions). For tracking, the primary metric is tracking accuracy (e.g., AUC, precision) during occlusion, directly measuring the benefit of the predictive module.
A model’s utility for UAVs is meaningless without a clear report of its performance on relevant edge hardware. We propose a standard reporting format: `FPS|VRAM (GB)|Power (W) @ Resolution on [Hardware] with [Precision]`. For instance: `32 FPS|3.1 GB|18 W @ 512 × 512 on Jetson AGX Orin with FP16`. This transparency is essential for reproducibility and fair comparison.
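The throughput and memory entries of this format can be measured with a loop such as the sketch below (PyTorch on a CUDA device). The `model` and input resolution are placeholders, and the power figure would still need to be read from an external source such as tegrastats or a power meter, which is not shown.

```python
import time
import torch

@torch.inference_mode()
def benchmark(model, input_shape=(1, 3, 512, 512), n_warmup=10, n_iters=100,
              device="cuda", dtype=torch.float16):
    """Report FPS and peak VRAM (GB) for a single-input model at a fixed resolution."""
    model = model.to(device=device, dtype=dtype).eval()
    x = torch.randn(*input_shape, device=device, dtype=dtype)

    for _ in range(n_warmup):          # warm up kernels and autotuners
        model(x)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    fps = n_iters / elapsed
    vram_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return fps, vram_gb
```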
5. Engineering Considerations and Deployment
A performant video prediction model in a Jupyter notebook is merely a prototype. Transforming it into a reliable component on an autonomous UAV requires navigating a series of significant engineering challenges, from system architecture to real-time acceleration and safety validation.
5.1. System Architecture on Board a UAV
An effective onboard system is not a monolithic program but a collection of asynchronous, communicating modules. Modern robotics frameworks like ROS 2 (Robot Operating System 2) [120] provide a robust, real-time-capable foundation for such architectures.
A typical pipeline for predictive autonomy can be structured as follows:
Sensor Abstraction Layer: Nodes that publish synchronized sensor data (camera frames, IMU readings, GPS if available) with consistent timestamps.
State Estimation Node: A Visual-Inertial Odometry (VIO) or SLAM module (e.g., VINS-Fusion [121]) that provides a high-frequency estimate of the UAV’s 6-DoF pose. This is the bedrock of the system.
Prediction Node: This core module subscribes to sensor data and pose estimates. It first performs ego-motion stabilization by warping incoming frames to a common reference frame based on the VIO data. This crucial step offloads the modeling of background motion from the neural network, allowing it to focus its capacity on learning object dynamics and scene changes. The stabilized frames are then fed into the video prediction model.
Planning and Decision Node: Subscribes to the predicted futures (e.g., latent states or semantic maps) from the Prediction Node. It performs trajectory optimization or task planning (e.g., using MPC) and publishes control commands.
Control and Actuation Node: A low-level controller (e.g., a PX4 flight controller [122]) that translates high-level commands into motor signals.
This decoupled design allows for modular development, testing, and multi-rate execution, which is essential in a complex real-time system.
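To make the decoupled design concrete, the skeleton below sketches a minimal ROS 2 prediction node in rclpy. The topic names, message types, context-window length, and the `stabilize`/`predict` callables are illustrative assumptions rather than a prescribed interface; in practice the predictor would publish latent states or semantic maps as described above.

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from nav_msgs.msg import Odometry

class PredictionNode(Node):
    """Subscribes to camera frames and VIO poses, stabilizes the input, and publishes predicted futures."""

    def __init__(self, stabilize, predict):
        super().__init__("prediction_node")
        self.stabilize = stabilize      # e.g., homography warp using the latest VIO pose
        self.predict = predict          # the video prediction model (Axis A/B/C choice)
        self.last_pose = None
        self.history = []

        self.create_subscription(Odometry, "/vio/odometry", self.on_odom, 10)
        self.create_subscription(Image, "/camera/image_raw", self.on_image, 10)
        self.pub = self.create_publisher(Image, "/prediction/next_frame", 10)

    def on_odom(self, msg: Odometry):
        self.last_pose = msg.pose.pose  # 6-DoF pose from the state estimation node

    def on_image(self, msg: Image):
        if self.last_pose is None:
            return
        frame = self.stabilize(msg, self.last_pose)   # ego-motion compensation
        self.history = (self.history + [frame])[-8:]  # keep a short context window
        if len(self.history) == 8:
            self.pub.publish(self.predict(self.history))

def main():
    rclpy.init()
    # `stabilize` and `predict` are user-supplied callables (identity placeholders here).
    rclpy.spin(PredictionNode(stabilize=lambda img, pose: img,
                              predict=lambda hist: hist[-1]))
    rclpy.shutdown()
```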
5.2. Compression and Acceleration for Real-Time Inference
To meet the stringent latency and power budgets of a UAV, models must be aggressively optimized.
Simply running a PyTorch v2.8.0 or TensorFlow v2.16.1 model on an edge device is inefficient. Hardware vendors provide specialized inference engines like NVIDIA’s TensorRT [123], which take a trained model and perform several key optimizations:
Graph Optimization: Fusing multiple layers (e.g., Conv -> BatchNorm -> ReLU) into a single computational kernel to reduce overhead.
Precision Calibration: Intelligently converting model weights to lower-precision formats such as FP16 or INT8 (quantization) to leverage specialized hardware units (e.g., Tensor Cores) for massive speedups.
Kernel Auto-Tuning: Selecting the fastest available implementation for each layer from a library of hardware-specific kernels.
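A typical workflow is to export the trained predictor to ONNX and then build a TensorRT engine, as sketched below with the TensorRT 8.x-style Python API. The file paths, input shape, and the stand-in model are placeholders; an equivalent build can be done from the command line with `trtexec --onnx=predictor.onnx --fp16 --saveEngine=predictor.plan`.

```python
import torch
import tensorrt as trt

# Stand-in for the trained video prediction model so the sketch runs end to end.
model = torch.nn.Conv3d(3, 3, kernel_size=3, padding=1).eval()
dummy = torch.randn(1, 3, 8, 128, 128)   # (batch, channels, context frames, H, W)

# 1) Export the PyTorch graph to ONNX.
torch.onnx.export(model, dummy, "predictor.onnx", opset_version=17,
                  input_names=["frames"], output_names=["future"])

# 2) Build an FP16 TensorRT engine from the ONNX graph.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("predictor.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)             # enable FP16 / Tensor Core kernels
engine = builder.build_serialized_network(network, config)
with open("predictor.plan", "wb") as f:
    f.write(engine)
```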
Beyond standard optimization, advanced techniques are needed:
Quantization-Aware Training (QAT): While Post-Training Quantization (PTQ) is fast, it can sometimes lead to accuracy degradation. QAT simulates the effects of quantization during the training process itself, allowing the model to adapt its weights to the lower precision, often resulting in higher accuracy for aggressively quantized (e.g., INT8) models [124]. This is particularly important for sensitive temporal models.
Structured Pruning and Distillation: Instead of just removing individual weights (unstructured pruning), which leads to sparse matrices that are inefficient on GPUs, structured pruning removes entire channels or blocks. This can be combined with knowledge distillation, where a smaller “student” model is trained not only on the data but also to mimic the feature maps or motion field predictions of a larger “teacher” model, preserving performance in a much smaller package.
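The sketch below shows one common way to combine these ideas: a task loss on the ground-truth future plus a feature-mimicking term that distills a larger teacher into a pruned student. The layer pairing and loss weights are illustrative assumptions; both models are assumed to expose intermediate feature maps.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, target_frames,
                      student_feats, teacher_feats, alpha=0.7, beta=0.3):
    """Reconstruction loss plus feature-map mimicking of a frozen teacher.

    student_out / teacher_out: predicted future frames, shape (B, T, C, H, W).
    student_feats / teacher_feats: lists of matched intermediate feature maps.
    """
    # Task loss: the student must still predict the ground-truth future.
    task = F.l1_loss(student_out, target_frames)

    # Distillation loss: match the teacher's intermediate representations.
    distill = sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))
    distill = distill / max(len(student_feats), 1)

    return alpha * task + beta * distill
```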
5.3. Safety, Fault Tolerance, and Robust Testing
An autonomous system is only as good as its worst-case performance. Safety cannot be an afterthought.
The system must be able to recognize when its predictive model is failing or uncertain. This requires the model to not only make a prediction but also to quantify its own confidence. A comprehensive survey of uncertainty quantification techniques in deep learning can be found in [125]. Probabilistic models (from Axis B) are a natural fit here.
By generating multiple future samples or estimating the parameters of a predictive distribution, the system can calculate the variance or divergence. A high divergence (the model is “unsure”) can serve as a trigger for a safety protocol, a core principle of safe learning-based control [126,127]. The UAV could then revert to a conservative behavior, such as hovering or increasing its safety margins.
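A minimal version of this trigger is sketched below: the probabilistic predictor is sampled several times, per-pixel variance is aggregated into a scalar uncertainty, and a fallback behavior is requested when it exceeds a tuned threshold. The `sample_future` interface and the threshold value are assumptions for illustration.

```python
import torch

@torch.inference_mode()
def uncertainty_trigger(sample_future, context, n_samples=8, threshold=0.02):
    """Return (mean prediction, scalar uncertainty, fallback requested).

    sample_future: callable drawing one stochastic future (C, H, W) from the
                   predictive distribution given the context frames (Axis B model).
    threshold: tuned on validation data; exceeding it triggers conservative behavior
               such as hovering or enlarging safety margins.
    """
    samples = torch.stack([sample_future(context) for _ in range(n_samples)])
    mean_pred = samples.mean(dim=0)
    uncertainty = samples.var(dim=0, unbiased=False).mean().item()  # scalar summary
    return mean_pred, uncertainty, uncertainty > threshold
```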
Real-world flight testing is expensive and risky. A rigorous, multi-stage testing protocol is essential, leveraging high-fidelity simulators like AirSim [128] or NVIDIA Isaac Sim [129].
Software-in-the-Loop (SIL): The entire software stack is tested in a purely virtual environment.
Hardware-in-the-Loop (HIL): The onboard computer (e.g., a Jetson) runs the complete flight software, but its sensor inputs come from a simulator, and its control outputs are sent back to the simulator. The importance and methodology of HIL for UAV validation are well documented in works like [130]. This tests the real-time performance and computational load of the actual flight hardware without physical risk.
Real-World Flight Tests: Only after passing extensive HIL testing should the system be deployed on a physical UAV, initially in a controlled environment.
5.4. Datasets, Benchmarks, and Challenges
A significant challenge in UAV video prediction is the profound scarcity of datasets and benchmarks designed specifically for this task. Consequently, the field currently relies heavily on repurposing video data from benchmarks established for other UAV-centric computer vision tasks, such as object tracking, detection, and semantic segmentation. The real-world datasets presented in Table 18, with the exception of the AirSim simulator, are all prominent examples of this practice; none were originally created to evaluate generative video models.
This pragmatic “makeshift” approach, while necessary, also introduces a series of interconnected challenges. A core problem stems from a severe misalignment in evaluation metrics: the official benchmarks for these datasets use metrics (e.g., MOTA for tracking or mIoU for segmentation) that are entirely irrelevant for assessing the quality of predicted frames. This forces researchers to fall back on generic, and often perceptually flawed, image-quality metrics (e.g., PSNR, SSIM, LPIPS). Even these metrics fail to evaluate what truly matters: they primarily assess single-frame perceptual quality but cannot directly measure the temporal coherence of motion or the task-relevant utility of a prediction.
Compounding this issue is the lack of corresponding dense ground truth. Most of these datasets provide only sparse labels (e.g., bounding boxes), which precludes the evaluation of richer, more useful predictions such as future depth, optical flow, or scene geometry. Furthermore, the datasets themselves can contain task-specific biases; for instance, camera motions in tracking datasets are often designed to keep a target centered, which may not reflect the full spectrum of dynamics encountered in autonomous navigation.
This reliance on repurposed data results in several critical gaps for predictive autonomy, including the insufficient temporal continuity of short video clips and a general lack of synchronized multi-modal data, such as IMU readings and control inputs, which are essential for training world models. Collectively, these challenges not only complicate the fair and comprehensive evaluation of predictive models but also hinder progress towards architectures truly optimized for the prediction task itself.
6. Future Directions
While significant progress has been made, the journey towards truly predictive autonomy for UAVs is far from over. We highlight five key directions that will shape the future of this field.
6.1. Beyond Pixels: Long-Horizon and Semantic Forecasting
For high-level decision-making, raw pixel predictions are often inefficient and brittle. The future lies in predicting more abstract, task-relevant representations.
Semantic and Instance Forecasting: Instead of predicting future RGB values, models can predict future semantic segmentation maps, allowing a UAV to reason about interactions between object categories.
Bird’s-Eye-View (BEV) Prediction: Popularized in autonomous driving, BEV prediction projects sensor data onto a top-down grid representation. Predicting future BEV occupancy grids is more compact and directly useful for planning than predicting from a perspective view [131]. Recent works like BEVerse [132] are pushing the state of the art in this area, and adapting these powerful representations to the full 3D world of UAVs is a key research challenge.
6.2. Injecting Priors: Physics-Informed and Differentiable Models
Purely data-driven models can sometimes produce physically implausible futures. Injecting physical priors can regularize the learning process and improve generalization.
Physics-Informed Neural Networks (PINNs): While traditionally used for solving PDEs, the core idea of designing loss functions that penalize violations of physical laws can be integrated into predictive models [133]. For UAVs, this could mean ensuring predictions respect gravity or basic aerodynamic constraints; a toy example of such a regularizer is sketched below.
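As a toy illustration of this idea, a physics-informed term can penalize predicted object trajectories whose acceleration deviates from gravity during unpowered motion. The finite-difference formulation below is a sketch under strong simplifying assumptions (metric 2D coordinates, no thrust or drag); the loss weighting is likewise an assumption.

```python
import torch

def gravity_residual_loss(pred_positions, dt, g=9.81):
    """Penalize predicted trajectories that violate constant gravitational acceleration
    (a toy PINN-style regularizer for ballistic motion).

    pred_positions: (B, T, 2) predicted object centers in metric coordinates (x, z).
    dt: frame interval in seconds.
    """
    vel = (pred_positions[:, 1:] - pred_positions[:, :-1]) / dt   # (B, T-1, 2)
    acc = (vel[:, 1:] - vel[:, :-1]) / dt                         # (B, T-2, 2)
    target = torch.tensor([0.0, -g], device=pred_positions.device)
    return ((acc - target) ** 2).mean()

def total_loss(recon_loss, pred_positions, dt, lam=0.1):
    """Data term plus physics regularizer, in the spirit of PINN-style training."""
    return recon_loss + lam * gravity_residual_loss(pred_positions, dt)
```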
Differentiable Simulation: The rise of differentiable simulators [134] allows gradients to be passed through the simulation process itself. This enables models to be trained via “analysis-by-synthesis,” where a model’s parameters are optimized to produce a simulation that matches reality, a powerful paradigm for learning dynamics.
6.3. The Grand Goal: Video Prediction as General World Modeling
The ultimate goal of video prediction is not just forecasting pixels, but learning a comprehensive, reusable internal model of the world.
Reinforcement Learning in Latent Imagination: As demonstrated by the Dreamer family [30], a world model allows an agent to learn complex behaviors entirely by “dreaming” in its compact latent space. A thorough review of world models can be found in [9].
Language-Conditioned Goals: The next frontier is to combine these world models with the reasoning capabilities of Large Language Models (LLMs). This would allow a user to specify high-level, semantic goals (e.g., “inspect the windows on the north face of that building”). Works like VoxPoser [135] demonstrate how LLMs can parse such commands and, combined with visual models, generate robotic plans, paving the way for more intuitive human–robot interaction with UAVs.
6.4. Fusing Senses: End-to-End Multimodal Systems
Vision is powerful but has limitations. True robustness will come from intelligently fusing multiple sensor modalities.
Event Cameras: Fusing event streams with standard frames can dramatically improve motion estimation and prediction during fast UAV maneuvers [136].
LiDAR, Thermal, and Radar: Fusing sparse but geometrically accurate LiDAR data can anchor visual predictions in metric 3D space [137]. Thermal cameras can provide crucial information for search-and-rescue. Millimeter-wave radar, which is robust to adverse weather, is also becoming a viable sensor for UAVs [138]. The key challenge is designing temporal operators (e.g., multimodal SSMs) that can naturally ingest and align these asynchronous, heterogeneous data streams.
6.5. Building the Future: Better Benchmarks and Responsible Deployment
Progress requires the right tools and a commitment to responsible innovation.
The Need for a “Flying-Argoverse”: As highlighted in Section 4.4, the community urgently needs a large-scale, multi-modal benchmark dataset for UAV predictive autonomy.
Explainable and Trustworthy AI (XAI): For safety-critical systems, black-box models are unacceptable. Future research must focus on making predictive models more transparent and explainable. A survey of XAI for robotics can be found in [139]. We need models that can not only predict the future but also articulate their uncertainty and the reasoning behind their forecasts, building trust with operators and regulators.
7. Conclusions
The paradigm of unmanned aerial remote sensing is shifting from passive data collection to active, predictive autonomy. In this survey, we have charted the rapidly evolving landscape of video prediction, a cornerstone technology for this transition. We addressed the limitations of existing single-axis taxonomies by proposing a novel three-axis framework that disentangles backbone architecture and temporal operator, generative nature, and training/inference regime. This framework provides a structured lens to analyze the complex trade-offs involved in designing models for the unique constraints of UAVs.
Through this lens, we reviewed the evolution from convolutional recurrence to the global modeling capabilities of Transformers and the promising efficiency of state-space models (SSMs). We contrasted deterministic and probabilistic approaches, highlighting the critical role of uncertainty modeling for robust planning. Our analysis of key applications—autonomous navigation, proactive tracking, and anomaly detection—revealed a clear mapping from task requirements to model design choices. Finally, we bridged the gap between theory and practice by outlining crucial engineering considerations for deployment and projecting a roadmap for future research, emphasizing the push towards multimodal world models and responsible, trustworthy AI.
As computational platforms on the edge become more powerful and predictive models more efficient, the vision of UAVs as truly intelligent agents, capable of anticipating and proactively shaping their environment, is moving steadily from science fiction to engineering reality.
Author Contributions
Conceptualization, Z.C., E.Z. and Z.G.; methodology, Z.C. and Z.G.; software, Z.C. and E.Z.; validation, Z.C., E.Z. and Z.G.; formal analysis, Z.C.; investigation, Z.C.; resources, P.Z. and X.L.; data curation, Z.C. and E.Z.; writing—original draft preparation, Z.C.; writing—review and editing, E.Z., Z.G. and P.Z.; visualization, Z.C.; supervision, X.L., L.W. and Y.Z.; project administration, L.W. and Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Key Program of the Chinese Academy of Sciences, grant numbers RCJJ-145-24-13 and KGFZD-145-25-38; and the Science and Disruptive Technology Program, grant number AIRCAS2024-AIRCAS-SDTP-03.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Tsouros, D.; Bibi, S.; Sarigiannidis, P. A review on UAV-based applications for precision agriculture. Information 2019, 10, 349. [Google Scholar] [CrossRef]
- Rabinovich, A.; Geva, A.; Ben-Dor, E.; Tchetchik, A. UAV-based situational awareness for disaster management: A systematic review. Int. J. Disaster Risk Reduct. 2023, 93, 103778. [Google Scholar]
- Espitia, C.; Cardozo, N.; Trujillo, M. Deep learning for bridge inspection: A comprehensive review and analysis of computer vision-based approaches. Eng. Appl. Artif. Intell. 2022, 116, 105452. [Google Scholar]
- Colomina, I.; Molina, P. Unmanned Aerial Systems for Photogrammetry and Remote Sensing: A Review. ISPRS J. Photogramm. Remote Sens. 2014, 92, 79–97. [Google Scholar] [CrossRef]
- Nex, F.; Remondino, F. UAV for 3D Mapping Applications: A Review. Appl. Geomat. 2014, 6, 1–15. [Google Scholar] [CrossRef]
- Arafat, M.; Alam, M.; Moh, S. Vision-Based Navigation Techniques for Unmanned Aerial Vehicles: Review and Challenges. Drones 2023, 7, 89. [Google Scholar] [CrossRef]
- Vidović, A.; Štimac, I.; Mihetec, T.; Patrlj, S. Application of Drones in Urban Areas. Transp. Res. Procedia 2024, 81, 84–97. [Google Scholar] [CrossRef]
- Duan, S.; Chen, Y.; Liu, Y.; Liu, J.; Zhou, W.; Li, H. A survey of embodied AI: From simulators to research tasks. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 7, 135–149. [Google Scholar] [CrossRef]
- Koco, E.; Ben-Nun, T.; Hoefler, T. Deep learning for world model-based reinforcement learning: A survey. Mach. Learn. 2021, 110, 2501–2551. [Google Scholar]
- Finn, C.; Goodfellow, I.; Levine, S. Unsupervised learning for physical interaction through video prediction. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; Woo, W.-C. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of the NIPS, Montréal, Canada, 7–12 December 2015. [Google Scholar]
- Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Gao, Z.; Tan, C.; Shen, L.; Zhang, S.; Li, S.Z. SimVP: Simpler yet Better Video Prediction. arXiv 2022, arXiv:2206.05099. [Google Scholar] [CrossRef]
- Wu, L.; Liu, J.; Gao, Y. Evaluating MEDIRL: A Replication and Ablation Study of Maximum Entropy Deep Inverse Reinforcement Learning for Human Social Navigation. arXiv 2024, arXiv:2406.00968. [Google Scholar] [CrossRef]
- Hafner, D.; Lillicrap, T.; Ba, J.; Norouzi, M. Dreamer: Reinforcement Learning with Latent World Models. In Proceedings of the ICLR, Online, 26 April–1 May 2020. [Google Scholar]
- Ondruska, P.; Posner, I. Deep Tracking in the Wild: End-to-End Tracking Using Recurrent Neural Networks. Int. J. Robot. Res. 2018, 37, 492–511. [Google Scholar]
- Voleti, V.; Jolicoeur-Martineau, A.; Pal, C. MCVD - Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation. In Proceedings of the Advances in Neural Information Processing Systems 36, New Orleans, LA, USA, 28 November–2 December 2022. [Google Scholar]
- Liu, W.; Luo, W.; Lian, D.; Gao, S. Future frame prediction for anomaly detection—A new baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Oprea, S.; Martinez-Gonzalez, P.; Garcia-Garcia, A.; Jover-Alvarez, D.; Castro, J.; Escalera, S.; Oprea, S. A survey on deep learning for video prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5883–5903. [Google Scholar]
- Yildiz, B.; Mettes, P.; Worring, M. A review of deep learning-based approaches for video prediction. IEEE Trans. Artif. Intell. 2023. [Google Scholar]
- Villegas, R.; He, J.; Mori, G. Hierarchical Models for Video Prediction. In Proceedings Part I, Proceedings of the Pattern Recognition, ICPR International Workshops and Challenges: CVIUI, UWS, WvRN, FAB, and AI4B, Beijing, China, 20–24 August 2018; Springer: Cham, Switzerland, 2018; pp. 318–334. [Google Scholar]
- Melnik, A.; Ljubljanac, M.; Lu, C.; Yan, Q.; Ren, W.; Ritter, H. Video Diffusion Models: A Survey. arXiv 2024, arXiv:2405.03150. [Google Scholar] [CrossRef]
- Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. In Proceedings of the ACL, Online, 5–10 July 2020. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Li, J.; Liu, W.; Li, R.; Li, S.Y. A survey on deep learning for edge computing: Research issues and challenges. J. Supercomput. 2021, 77, 3431–3473. [Google Scholar]
- Alipour, K.; Mousavi, V.; Samadzadegan, F. A review of onboard processing for UAVs: From embedded systems to deep learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 4789–4812. [Google Scholar]
- Ha, D.; Schmidhuber, J. World Models. arXiv 2018, arXiv:1803.10122. [Google Scholar]
- Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T. Mastering Diverse Domains through World Models. arXiv 2023, arXiv:2301.04104. [Google Scholar]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA; London, UK, 2018. [Google Scholar]
- Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems 27, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Liu, Z.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Video Frame Synthesis via Deep Voxel Flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Unterthiner, T.; van Steenkiste, S.; Kurach, K.; Marinier, R.; Michalski, M.; Gelly, S. Towards accurate generative models of video: A new metric and challenges. arXiv 2018, arXiv:1812.01717. [Google Scholar]
- Zeng, G.; Mettes, P.; Snoek, C.G. Kernel video distance. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 8355–8366. [Google Scholar]
- Xiang, X.; Li, Z.; Wang, Y.; Liu, Z.; Zhang, W.; Ye, W.; Zhang, J. A Survey of AI-Generated Video Evaluation. arXiv 2024, arXiv:2410.19884. [Google Scholar] [CrossRef]
- Wang, Y.; Gao, Z.; Long, M.; Wang, J.; Yu, P.S. PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning. Proc. Int. Conf. Mach. Learn. 2018, 80, 5123–5132. [Google Scholar]
- Le Guen, V.; Thome, N. Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Lee, A.X.; Zhang, R.; Ebert, F.; Abbeel, P.; Finn, C.; Levine, S. Stochastic Adversarial Video Prediction. arXiv 2018, arXiv:1804.01523. [Google Scholar] [CrossRef]
- Tan, C.; Gao, Z.; Chen, X.; Li, S.Z. SimVPv2: Towards Simple yet Powerful Spatiotemporal Predictive Learning. arXiv 2022, arXiv:2211.12509. [Google Scholar] [CrossRef]
- Denton, E.; Kariyappa, S.; Cheung, B.; Carreira, J.; Teh, Y.W. Stochastic Video Generation with a Learned Prior. Proc. Int. Conf. Mach. Learn. 2018, 80, 1174–1183. [Google Scholar]
- Yan, W.; Zhang, Y.; Abbeel, P.; Srinivas, A. VideoGPT: Video Generation using VQ-VAE and Transformers. arXiv 2021, arXiv:2104.10157. [Google Scholar]
- Yu, L.; Sohn, K.; Gu, J.; Li, C.Y.; Mahajan, D.; Kumar, I.; Fleet, D.J.; Han, X.; Chai, S.; Tulyakov, S.; et al. MAGVIT-v2: Masked Generative Video Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Li, K.; Wang, Y.; He, Y.; Li, Y.; Wang, G.; Zhou, Y.; Qiao, Y. VideoMamba: State Space Model for Efficient Video Understanding. arXiv 2024, arXiv:2403.06981. [Google Scholar] [CrossRef]
- Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Models. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
- He, Y.; Wang, T.; Zhang, Y.; Shan, Y.; Chen, Q. Latent Video Diffusion Models for High-Fidelity Long Video Generation. arXiv 2022, arXiv:2211.13221. [Google Scholar]
- Salimans, T.; Ho, J. Progressive Distillation for Fast Sampling of Diffusion Models. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022. [Google Scholar]
- Villegas, R.; Yang, J.; Zou, Y.; Sohn, S.; Lin, X.; Lee, H. Decomposing Motion and Content for Natural Video Sequence Prediction. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
- Liang, F.; Wu, B.; Wang, J.; Yu, L.; Li, K.; Zhao, Y.; Misra, I.; Huang, J.; Zhang, P.; Vajda, P.; et al. FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Wang, Y.; Jiang, L.; Yang, M.; Li, L.; Long, M.; Fei-Fei, L. Eidetic 3d lstm: A model for video prediction and beyond. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Wang, Y.; Zhang, J.; Zhu, H.; Long, M.; Wang, J.; Yu, P.S. Memory in Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity From Spatiotemporal Dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Tan, C.; Gao, Z.; Wu, L.; Xu, Y.; Xia, J.; Li, S. Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Bertasius, G.; Wang, H.; Torresani, L. Is Space-Time Attention All You Need for Video Understanding? In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021. [Google Scholar]
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lucic, M.; Schmid, C. ViViT: A Video Vision Transformer. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Villegas, R.; Babaeizadeh, M.; Kindermans, P.; Moraldo, H.; Zhang, H.; Saffar, M.T.; Castro, S.; Kunze, J.; Erhan, D. Phenaki: Variable length video generation from open domain textual descriptions. arXiv 2022, arXiv:2210.02399. [Google Scholar] [CrossRef]
- Brooks, T.; Peebles, B.; Holmes, C.; DePue, W.; Yu, W.; Li, L.; Karras, T.; Kaplan, D.; An, D.; Gonzalez, R. Video Generation Models as World Simulators. 2024. Available online: https://openai.com/research/video-generation-models-as-world-simulators (accessed on 15 August 2025).
- Smith, J.T.; Warrington, A.; Linderman, S.W. Simplified state space layers for sequence modeling. arXiv 2022, arXiv:2208.04933. [Google Scholar]
- Liu, Y.; Wang, Y.; Chen, Z.; Yu, L.; Wu, Z.; Chen, C.; Zhao, G.; Sun, W. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
- Ye, Z. ss-Mamba: Semantic-Spline Selective State-Space Model. arXiv 2025, arXiv:2506.14802. [Google Scholar]
- Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Eslami, S.A.; Heess, N.; Weber, T.; Tassa, Y.; Szepesvari, D.; Kavukcuoglu, K.; Hinton, G.E. Attend, infer, repeat: Fast scene understanding with generative models. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
- Li, Z.; Niklaus, S.; Snavely, N.; Wang, O. Neural scene flow fields for space-time view synthesis of dynamic scenes. arXiv 2020, arXiv:2011.13084. [Google Scholar]
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
- Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A.C.; Bengio, Y. A recurrent latent variable model for sequential data. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
- Babaeizadeh, M.; Finn, C.; Erhan, D.; Campbell, R.H.; Levine, S. Stochastic variational video prediction. In Proceedings of the The 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Lai, G.; Li, B.; Zheng, G.; Yang, Y. Stochastic WaveNet: A Generative Latent Variable Model for Sequential Data. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
- Wu, B.; Nair, S.; Martin-Martin, R.; Fei-Fei, L.; Finn, C. Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 20–25 June 2021. [Google Scholar]
- Fotiadis, S.; Valencia, M.; Hu, S.; Cantwell, C.D.; Bharatch, A.A. Disentangled generative models for robust dynamical system prediction. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
- Gopalakrishnan, R.; Greenspan, H.; Madabhusi, A.; Mousavi, P.; Salcudean, S.; Duncan, J.; Mahmood, T.; Taylor, R. Temporally-Disentangled VAE for Surgical Video Prediction. In Proceedings of the Medical Image Computing and Computer Assisted Intervention (MICCAI), Vancouver, BC, Canada, 8–12 October 2023. [Google Scholar]
- Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Liu, G.; Tao, A.; Kautz, J.; Catanzaro, B. Video-to-video synthesis. In Proceedings of the Advances in Neural Information Processing Systems 31, Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
- Tulyakov, S.; Liu, M.Y.; Yang, X.; Kautz, J. MoCoGAN: Decomposing Motion and Content for Video Generation. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Saito, M.; Saito, S. TGANv2: Efficient Training of Large Models for Video Generation with Multiple Subsampling Layers. arXiv 2018, arXiv:1811.09245. [Google Scholar]
- Clark, A.; Donahue, J.; Simonyan, K. Efficient Video Generation on Complex Datasets. arXiv 2019, arXiv:1907.06571. [Google Scholar]
- Skorokhodov, I.; Tulyakov, S.; Elhoseiny, M. StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Gereon, F.; Ayush, T.; Mohamed, E.; Christian, T. StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN. arXiv 2021, arXiv:2107.07224. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems 33, Online, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
- Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
- Ni, H.; Shi, C.; Li, K.; Huang, S.X.; Min, M.R. Conditional Image-to-Video Generation with Latent Flow Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Denver, CO, USA, 3–7 June 2023. [Google Scholar]
- Gupta, A.; Yu, L.; Sohn, K.; Gu, X.; Hahn, M.; Li, F.; Essa, I.; Jiang, L.; Lezama, J. Photorealistic Video Generation with Diffusion Models. arXiv 2023, arXiv:2312.06662. [Google Scholar] [CrossRef]
- Wang, Z.; Yuan, Z.; Wang, X.; Chen, T.; Xia, M.; Luo, P.; Shan, Y. MotionCtrl: A Unified and Flexible Motion Controller for Video Generation. arXiv 2023, arXiv:2312.03641. [Google Scholar] [CrossRef]
- Zhang, Z.; Hu, J.; Cheng, W.; Paudel, D.; Yang, J. ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Hwang, G.; Ko, H.k.; Kim, Y.; Lee, S.; Park, E. Taming Diffusion Transformer for Real-Time Mobile Video Generation. arXiv 2025, arXiv:2507.13343. [Google Scholar] [CrossRef]
- Zhan, Z.; Wu, Y.; Gong, Y.; Meng, Z.; Kong, Z.; Yang, C.; Yuan, G.; Zhao, P.; Niu, W.; Wang, Y. Fast and Memory-Efficient Video Diffusion Using Streamlined Inference. In Proceedings of the Advances in Neural Information Processing Systems 38, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
- Deng, Z.; He, X.; Peng, Y. Efficiency-optimized Video Diffusion Models. In Proceedings of the MM’23: The 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023. [Google Scholar]
- Zhai, Y.; Lin, K.; Yang, Z.; Li, L.; Wang, J.; Lin, C.; Doermann, D.; Yuan, J.; Wang, L. Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation. In Proceedings of the Advances in Neural Information Processing Systems 38, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
- Ding, Z.; Jin, C.; Liu, D.; Zheng, H.; Singh, K.K.; Zhang, Q.; Kang, Y.; Lin, Z.; Liu, Y. DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization. arXiv 2024, arXiv:2412.15689. [Google Scholar] [CrossRef]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Gupta, A.; Tian, S.; Zhang, Y.; Wu, J.; Martín-Martín, R.; Fei-Fei, L. MaskViT: Masked Visual Pre-Training for Video Prediction. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Bengio, S.; Vinyals, O.; Jaitly, N.; Shazeer, N. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In Proceedings of the Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
- Cui, W.; Wang, Y.; Wang, Z.; Dou, Z.; Scherer, S.; Wang, X. Open-Vocabulary Object Detection upon Frozen Vision and Language Models. In Proceedings of the The 20th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Yao, Z.; Tang, R.Y.; Zhang, Z.D.; Gholami, A.; Keutzer, K. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. In Proceedings of the Advances in Neural Information Processing Systems 36, New Orleans, LA, USA, 28 November–2 December 2022. [Google Scholar]
- Ma, X.; Lin, G.; Pan, Y.; Hu, X.; Yang, N.; Ren, D. Llm-pruner: On the structural pruning of large language models. arXiv 2023, arXiv:2305.11627. [Google Scholar] [CrossRef]
- Cai, H.; Gan, C.; Wang, T.; Zhu, Z.; Han, S. Once-for-all: Train one network and specialize it for efficient deployment. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26 April–1 May 2020. [Google Scholar]
- Camacho, E.F.; Bordons, C. Model Predictive Control; Springer: Cham, Switzerland, 2013. [Google Scholar]
- Williams, G.; Wagener, N.; Gold, B.; Heiden, E.; Themner, G.T.; Chiriyath, A.R.; Boots, B. Model predictive path integral control: From theory to parallel computation. J. Guid. Control Dyn. 2017, 40, 344–357. [Google Scholar] [CrossRef]
- Loquercio, A.; Kaufmann, E.; Ranftl, R.; Müller, M.; Koltun, V.; Scaramuzza, D. Deep drone racing: From simulation to reality with domain randomization. IEEE Trans. Robot. 2020, 36, 1–14. [Google Scholar] [CrossRef]
- Zhu, C.; Yu, R.; Feng, S.; Burchfiel, B.; Shah, P.; Gupta, A. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets. arXiv 2025, arXiv:2504.02792. [Google Scholar] [CrossRef]
- Zhang, D.; Zhu, F.; Zhang, H.; Hu, X.; Liu, Z.; You, X. P-tracker: A cross-modality deep tracker for unstructured scenarios. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022. [Google Scholar]
- Li, P.; Li, J.; Liu, D. Generative adversarial networks: A comprehensive review and the way forward. In Proceedings of the 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Suzhou, China, 19–21 October 2019. [Google Scholar]
- Voigtlaender, P.; Luiten, J.; Torr, P.H.; Leibe, B. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Nie, Q.; Cheng, H.; Liu, C.; Liu, X.; et al. VisDrone-DET2018: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Mueller, M.; Smith, N.; Ghanem, B.; Leibe, B.; Matas, J.; Sebe, N.; Welling, M. A Benchmark and Simulator for UAV Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
- Luo, W.; Liu, W.; Gao, S. Remembering history with convolutional lstm for anomaly detection. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017. [Google Scholar]
- Pu, J.; Mou, L.; Xia, G.-S.; Zhu, X.X. Anomaly Detection in Aerial Videos With Transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5628213. [Google Scholar]
- Pang, G.; Shen, C.; Van Den Hengel, A.; Reid, I. Deep learning for anomaly detection: A review. ACM Comput. Surv. (CSUR) 2021, 54, 1–38. [Google Scholar] [CrossRef]
- Mueller, M.; Ghanem, B. The uav-arg dataset: A benchmark for action recognition from aerial videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Online, 5–9 January 2021. [Google Scholar]
- Antonini, A.; Guerra, W.; Murali, V.; Sayre-McCord, T.; Karaman, S. The Blackbird Dataset: A Large-Scale Dataset for UAV Perception in Aggressive Flight. In Proceedings of the 2018 International Symposium on Experimental Robotics, Buenos Aires, Argentina, 5–8 November 2018. [Google Scholar]
- Wilson, B.; Qi, W.; Agarwal, T.; Lambert, J.; Singh, J.; Khandelwal, S.; Pan, B.; Kumar, R.; Hart, A.; Ettinger, S.; et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Proceedings of the Advances in Neural Information Processing Systems 37, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2443–2451. [Google Scholar]
- Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. Carla: An open urban driving simulator. In Proceedings of the Conference on Robot Learning. PMLR, Sydney, Australia, 6–11 August 2017. [Google Scholar]
- Macenski, S.; Foote, T.; Gerkey, B.; Lalancette, C.; Woodall, W. Robot Operating System 2: Design, architecture, and uses in the wild. Am. Assoc. Adv. Sci. 2022, 7, eabm6074. [Google Scholar] [CrossRef]
- Qin, T.; Pan, J.; Cao, S.; Shen, S. A general optimization-based framework for local odometry estimation with multiple sensors. arXiv 2019, arXiv:1901.03638. [Google Scholar] [CrossRef]
- Meier, L.; Honegger, D.; Pollefeys, M. PX4: A node-based multithreaded open source robotics framework for deeply embedded platforms. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 6235–6240. [Google Scholar]
- Vanholder, H. NVIDIA TensorRT Developer Guide, White Paper, NVIDIA. 2024. Available online: https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-825/pdf/TensorRT-Developer-Guide.pdf (accessed on 16 August 2025).
- Jacob, B.; Kursun, S.; Swinkels, M.; Gschwind, M.; Hartwig, M.; Post, U.; Ku, P.O. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U.R.; et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf. Fusion 2021, 76, 243–297. [Google Scholar] [CrossRef]
- Garcia, J.; Fernández, F. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 2015, 16, 1437–1480. [Google Scholar]
- Brunke, L.; Greeff, M.; Hall, A.W.; Yuan, Z.; Zhou, S.; Panerati, J.; Schoellig, A.P. Safe Learning in Robotics: From Learning-Based Control to Safe Reinforcement Learning. Annu. Rev. Control Robot. Auton. Syst. 2022, 5, 49–77. [Google Scholar] [CrossRef]
- Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Proceedings of the Field and Service Robotics; Springer: Cham, Switzerland, 2018; pp. 621–635. [Google Scholar]
- Makoviychuk, V.; Wawrzyniak, L.; Guo, Y.; Lu, M.; Storey, K.; Macklin, M.; Hoeller, D.; Rudin, N.; Allshire, A.; Handa, A.; et al. Isaac gym: High performance gpu-based physics simulation for robot learning. In Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Online, 6–14 December 2021. [Google Scholar]
- Szolc, H.; Kryjak, T. Hardware-in-the-loop simulation of a UAV autonomous landing algorithm implemented in SoC FPGA. In Proceedings of the Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland, 27 June–1 July 2022. [Google Scholar]
- Hu, Y.; Yang, J.; Chen, L. Planning-oriented autonomous driving. arXiv 2022, arXiv:2212.10156. [Google Scholar]
- Zhang, Y.; Zhu, Z.; Zheng, W. BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving. arXiv 2022, arXiv:2205.09743. [Google Scholar]
- Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707. [Google Scholar] [CrossRef]
- de Avila Belbute-Peres, F.; Smith, K.; Allen, K.; Tenenbaum, J.; Kolter, J.Z. End-to-end differentiable physics for learning and control. In Proceedings of the Advances in Neural Information Processing Systems 31, Montreal, QC, Canada, 2–8 December 2018. [Google Scholar]
- Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; Fei-Fei, L. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. arXiv 2023, arXiv:2307.05973. [Google Scholar] [CrossRef]
- Gallego, G.; Delbrück, T.; Orchard, G.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; et al. Event-Based Vision: A Survey. IEEE TPAMI 2022, 44, 154–180. [Google Scholar]
- Huang, K.; Shi, B.; Li, X.; Li, X.; Huang, S.; Li, Y. Multi-modal Sensor Fusion for Auto Driving Perception: A Survey. arXiv 2022, arXiv:2202.02703. [Google Scholar] [CrossRef]
- Elgayar, I. 4D radar-based perception for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2023. [Google Scholar]
- Adebayo, A.; Ajayi, O.; Chukwurah, N. Explainable AI in Robotics: A Critical Review and Implementation Strategies for Transparent Decision-Making. J. Front. Multidiscip. Res. 2024, 5, 26–32. [Google Scholar] [CrossRef]
Figure 1.
A conceptual illustration of our proposed three-axis taxonomy for video prediction models. This framework allows for a nuanced classification of modern architectures by decoupling backbone design, temporal mechanism, generative properties, and deployment strategies. Representative models are positioned within this 3D space based on their core characteristics.
Table 1.
A comprehensive comparison of time-series forecasting mechanisms.
Methods | Definition | Generative Natures | Training and Inference |
---|---|---|---|
temporal operator | The mechanism that couples features over time (e.g., attention, SSM, conv-recurrence, warping/flow) | Foundational and Diverse: Defines how temporal dependencies are processed. Properties vary by operator: Attention: Excels at long-range dependencies; high complexity. Recurrence: Handles variable lengths; poor parallelism. Convolution: Efficient and parallel; limited receptive field. | As the core computational module within the AR or one-shot framework, it encodes and couples temporal information at each step. |
AR | Autoregressive rollout | Sequential and Fine-grained: Flexible output length and detailed modeling. Main Drawback: Prone to error accumulation, where initial errors compound over time, degrading long-term reliability. | In training stage, it uses ground-truth values as input for the next step, which is stable and parallelizable. While in inference stage, it uses the model’s own predictions as input, resulting in slow inference and a train-test discrepancy. |
one-shot/blockwise | A parallel prediction paradigm | Parallel and Holistic: Fundamentally avoids error accumulation and enables fast inference. Main Drawback: Fixed output length (less flexible). Direct prediction is a harder task, potentially requiring a more powerful model for high accuracy. | Trained to learn a direct mapping from the input sequence to the entire output block. Inference is highly parallel, completing all predictions in a single forward pass. |
Table 2.
Nomenclature of key abbreviations and terms.
Abbreviation | Full Name | Description |
---|---|---|
Core Concepts |
UAV | Unmanned Aerial Vehicle | The aerial platform at the core of this survey. |
DOF | Degrees of Freedom | Refers to the number of independent motions available to the UAV (typically 6). |
EAI | Embodied Artificial Intelligence | A field of AI focused on agents acting within an environment. |
Model Architectures & Processes |
AR | Autoregressive | A sequential generation process where each step depends on the previous output. |
SSM | State-Space Model | A class of sequential models with linear time complexity. |
VLM | Vision-Language Model | Models that connect visual and textual data. |
Evaluation Metrics |
FPS | Frames Per Second | A common measure of inference speed. |
PSNR | Peak Signal-to-Noise Ratio | A classic pixel-level image distortion metric. |
SSIM | Structural Similarity Index Measure | A metric for measuring the structural similarity between two images. |
LPIPS | Learned Perceptual Image Patch Similarity | A metric that measures the perceptual quality of generated images. |
FVD | Fréchet Video Distance | A metric for evaluating the quality and temporal coherence of generated videos. |
General Terminology |
SOTA | State-of-the-Art | Denotes the highest-performing methods currently available. |
Table 3.
Unified SOTA summary across the three axes (operator, generative, regime). Accuracy ranges are representative values from prior work; efficiency reflects typical deployment ranges on Jetson Orin (edge) and RTX 4090 (workstation).
Axis A: Operator | Axis B: Generative | Axis C: Regime | Representative Model(s) | Accuracy (PSNR/SSIM/FVD) | FPS (Orin/4090) | VRAM | Dataset (Resolution) |
---|---|---|---|---|---|---|---|
ConvLSTM/RNN | Deterministic | AR | PredRNN/PredRNN++ [12,42] | PSNR 30–32, SSIM 0.90–0.92 | 15–25/120–180 | 6–8 | KITTI (128 × 160), Human3.6M (256 × 256) |
ConvLSTM/RNN | Physics-hybrid | AR | PhyDNet [43] | PSNR 28–30, SSIM 0.88–0.90 | 10–15/80–120 | 6–10 | Human3.6M (256 × 256) |
ConvLSTM/RNN | Adversarial | AR | SAVP [44] | FVD ∼60–80; sharp frames | 8–12/60–100 | 8–12 | BAIR Robot Pushing (64 × 64) |
CNN (Simple) | Deterministic | Non-AR | SimVP/SimVPv2 [13,45] | PSNR 29–31, SSIM 0.89–0.91 | 20–30/150–200 | 6–8 | Human3.6M (256 × 256), WeatherBench |
CNN (Simple) | Variational (VAE) | Non-AR | SVG [46] | PSNR 27–31, SSIM 0.86–0.90 | 12–20/100–150 | 6–10 | Human3.6M (256 × 256) |
Transformer | Deterministic | AR | ViT-based predictors (e.g., Video Swin) | PSNR >30, SSIM ∼0.92–0.94 | 2–5/30–50 | >48 | KITTI (224 × 224), Kinetics (224 × 224) |
Transformer | Tokenized (VQ) | AR (token) | VideoGPT/MAGVIT [47,48] | High perceptual quality | <5/20–40 | >24 | UCF-101 (256 × 256), VQ-custom |
State-Space | Deterministic | AR/Hybrid | VideoMamba/VMamba [49,50] | PSNR 30–31, SSIM ∼0.90 | 20–30/160–200 | 3–5 | Kinetics (224 × 224), SSv2 |
State-Space | Diffusion (latent) | Non-AR | LVDM/DiT variants [51,52] | FVD 28–35; PSNR 26–28 | 2 (∼250 steps) → 15 (∼20 steps) | 12–24 | UCF-101 (256 × 256), Sky Timelapse |
Explicit Warping | Deterministic | AR | CDNA/STP [10] | Sharp motion; domain-specific | 15–25/120–180 | 6–8 | BAIR Robot Pushing (64 × 64), KITTI |
Explicit Warping | Hybrid (flow) | Non-AR | MCnet/FlowVid [53,54] | Improved temporal consistency | 10–20/80–120 | 8–12 | KITTI (128 × 160) |
Table 4.
Summary of key models based on convolutional recurrence.
Model [Ref] | Year | Core Contribution |
---|---|---|
ConvLSTM [11] | 2015 | Foundational work; replaces matrix multiplication in LSTMs with convolutions to process spatio-temporal data. |
PredRNN/++ [12,42] | 2017/18 | Introduces spatio-temporal LSTM (ST-LSTM) and Gradient Highway Unit (GHU) to improve spatial modeling and alleviate vanishing gradients. |
E3D-LSTM [56] | 2019 | Augments ConvLSTM with a self-attention mechanism to recall detailed past states, improving long-term dependency modeling. |
MIM [57] | 2022 | Focuses computation by using a masking mechanism to prevent state updates for static background regions. |
PhyDNet [43] | 2020 | Disentangles physical dynamics from residual information, using a ConvLSTM to model the less predictable components. |
TAU [58] | 2023 | A modern hybrid model that enhances recurrent networks with a temporal attention unit, achieving SOTA performance. |
Table 5.
Summary of key models based on Attention and Transformers.
Model/Concept [Ref] | Year | Core Contribution |
---|---|---|
Factorized Transformers [61,62] | 2021 | Manage computational cost by factorizing attention into separate spatial and temporal steps, or using efficient “tubelet” tokenization. |
Local/Windowed Attention [26] | 2021 | Achieves linear complexity by computing attention within local windows that are shifted across layers, enabling efficient high-resolution processing. |
Generative Transformers [48,63] | 2022/2023 | Focus on large-scale, high-quality video generation, often using masked prediction on video tokens in a non-autoregressive manner. |
Diffusion Transformers [64] | 2024 | Represent the state-of-the-art in generative fidelity by combining the scalability of Transformers with the generative power of diffusion models. |
Table 6.
Summary of key models based on state-space models (SSMs).
Model [Ref] | Year | Core Contribution |
---|---|---|
S4/S5 [23,65] | 2021/22 | Foundational works that stabilized SSMs for deep learning, enabling a dual formulation as a parallel convolution (train) and a fast recurrence (inference). |
Mamba [24] | 2023 | Introduces input-dependent parameters (“selective scan”), allowing the model to dynamically modulate information flow, achieving performance rivaling attention with linear complexity. |
Vim/VMamba [50,66] | 2024 | Adapts the Mamba architecture for vision by treating an image as a sequence of patches, demonstrating competitive scaling and performance against ViTs. |
SimVPv2 [45] | 2023 | A pure-CNN model augmented with a simple SSM block, showing the strong potential of hybrid CNN-SSM architectures for video prediction. |
Table 7.
Summary of approaches based on explicit warping and flow.
Model/Concept [Ref] | Year | Core Contribution |
---|---|---|
DVF/SAVP [36,44] | 2018/19 | Pioneer the two-stage approach: first predict optical flow (or voxel flow) and then use a synthesis network (often a GAN) to generate a sharp residual. |
Layered Representations [69] | 2016 | Model scenes as a composition of independently moving layers or objects (sprites), disentangling foreground and background motion. |
Dynamic NeRFs [70] | 2021 | Learn a continuous, 4D (space + time) representation of a dynamic scene, performing prediction by rendering the learned field at future timesteps. |
Table 8.
Quantitative comparison of operator architectures for UAV video prediction. Estimates assume a 1B-parameter FP16 model processing video (batch size 1). FPS and VRAM are projected on Jetson Orin (edge) and RTX 4090 (workstation).
Operator Family | Complexity | VRAM/GB | FPS (Orin/4090) | Max Frames |
---|---|---|---|---|
ConvLSTM/RNN-based | O(T), sequential recurrence | 6–8 | 20/150 | 10–20 |
Transformer (Self-Attn) | O(T²) in sequence length | >24 | 3/15 | 20–30 |
State-Space Models (SSMs) | O(T), parallelizable | 3–5 | 25/180 | 20–40 |
Flow/Warping-based | O(T), per-frame warping | 4–6 | 18/120 | 10–15 |
Table 9. Summary of key deterministic models.

Model [Ref] | Year | Core Contribution |
---|---|---|
Early ConvLSTMs [11] | 2015+ | Many early recurrent models were deterministic by nature, trained with simple reconstruction losses. |
SimVP [13] | 2022 | A modern, high-performing deterministic baseline using a pure CNN architecture, showing the power of architectural design over complex temporal operators. |
Table 10. Summary of key VAE-based probabilistic models.

Model [Ref] | Year | Core Contribution |
---|---|---|
VRNN [73] | 2015 | A foundational model combining RNNs with variational inference to handle sequential data. |
SVG [74] | 2017 | A seminal model for stochastic video prediction that learns a prior over a per-frame latent variable to generate diverse futures. |
SLT [75] | 2018 | Models dynamics in the VAE latent space using a Transformer, combining the strengths of both architectures. |
GH-VAE [76] | 2021 | Employs a hierarchical latent space to improve the quality and coherence of long-term video generation. |
VAE-SD [77] | 2022 | Combines disentangled representation learning with VAEs for improved generation diversity. |
TD-VAE [78] | 2023 | Focuses on disentangling time-invariant and time-varying factors in a video’s latent representation. |
Dreamer Series [30] | 2023 | Represents a pinnacle application of VAE principles, learning a world model in a latent space for model-based reinforcement learning. |
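
To ground the VAE family in Table 10, the sketch below mirrors the SVG-style recipe: a posterior network infers a per-step latent from the current and next features, a learned prior predicts it from the current features alone, and a recurrent predictor consumes the feature–latent pair; training combines reconstruction with a KL term between posterior and prior. The flat MLP/LSTMCell components, dimensions, and loss weight are simplifying assumptions, not the published architecture.

```python
# Conceptual sketch of an SVG-style stochastic predictor (cf. Table 10).
import torch
import torch.nn as nn
import torch.nn.functional as F

D, Z = 128, 16                                    # feature and latent dimensions
posterior = nn.Linear(2 * D, 2 * Z)               # q(z_t | h_t, h_{t+1})
prior     = nn.Linear(D, 2 * Z)                   # p(z_t | h_t)
predictor = nn.LSTMCell(D + Z, D)                 # recurrent dynamics over (h_t, z_t)
decoder   = nn.Linear(D, D)                       # maps hidden state back to features

def kl_normal(mu_q, lv_q, mu_p, lv_p):
    # KL divergence between two diagonal Gaussians given means and log-variances.
    return 0.5 * (lv_p - lv_q + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp() - 1).sum(-1)

feats = torch.randn(4, 8, D)                      # encoded features of 8 frames (B=4)
hx = cx = torch.zeros(4, D)
loss = 0.0
for t in range(7):
    mu_q, lv_q = posterior(torch.cat([feats[:, t], feats[:, t + 1]], -1)).chunk(2, -1)
    mu_p, lv_p = prior(feats[:, t]).chunk(2, -1)
    z = mu_q + torch.randn_like(mu_q) * (0.5 * lv_q).exp()      # reparameterized sample
    hx, cx = predictor(torch.cat([feats[:, t], z], -1), (hx, cx))
    loss = loss + F.mse_loss(decoder(hx), feats[:, t + 1]) \
                + 1e-3 * kl_normal(mu_q, lv_q, mu_p, lv_p).mean()
print(loss)
```

At test time the posterior is dropped and latents are sampled from the learned prior, which is what produces diverse futures from the same observed past.
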
Table 11. Summary of key GAN-based probabilistic models.

Model [Ref] | Year | Core Contribution |
---|---|---|
vid2vid [79] | 2018 | A seminal work on conditional video-to-video synthesis, enabling translation from semantic maps to photorealistic video. |
MoCoGAN [80] | 2018 | Decomposes motion and content into separate latent spaces for more structured and controllable video generation. |
TGANv2 [81] | 2018 | Improves temporal consistency in long video generation by combining a recurrent generator with a temporal discriminator. |
DVD-GAN [82] | 2019 | Uses dual (spatial and temporal) discriminators to enforce both per-frame realism and temporal coherence. |
StyleGAN-V [83] | 2022 | A state-of-the-art model for high-fidelity continuous video generation based on the powerful StyleGAN architecture. |
StyleVideoGAN [84] | 2022 | Enables high-resolution, controllable video synthesis with disentangled control over motion and appearance. |
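
The dual-discriminator idea behind DVD-GAN in Table 11 can be sketched as two hinge losses: a spatial discriminator scores randomly sampled individual frames for per-frame realism, while a temporal discriminator scores whole clips for coherent motion. The linear discriminators below are trivial stand-ins for deep convolutional networks, and the sampling scheme is a simplification of the published one.

```python
# Sketch of dual (spatial + temporal) discriminators for video GANs (cf. Table 11).
import torch
import torch.nn as nn

B, T, C, H, W = 2, 8, 3, 32, 32
d_spatial  = nn.Sequential(nn.Flatten(), nn.Linear(C * H * W, 1))        # per-frame score
d_temporal = nn.Sequential(nn.Flatten(), nn.Linear(T * C * H * W, 1))    # per-clip score

real, fake = torch.rand(B, T, C, H, W), torch.rand(B, T, C, H, W)        # fake: G output

def hinge_d(d, real_x, fake_x):
    return (torch.relu(1 - d(real_x)) + torch.relu(1 + d(fake_x))).mean()

# The spatial D sees a few randomly sampled frames; the temporal D sees full clips.
idx = torch.randint(0, T, (2,))
loss_d = hinge_d(d_spatial, real[:, idx].flatten(0, 1), fake[:, idx].flatten(0, 1)) \
       + hinge_d(d_temporal, real, fake)
loss_g = -(d_spatial(fake[:, idx].flatten(0, 1)).mean() + d_temporal(fake).mean())
print(loss_d.item(), loss_g.item())
```
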
Table 12. Summary of key diffusion-based probabilistic models.

Model [Ref] | Year | Core Contribution |
---|---|---|
MCVD [17] | 2022 | A leading model specifically for video prediction, using a masked conditional diffusion process to generate future frames. |
LVDM [51] | 2022 | Pioneers the now-standard approach of performing the diffusion process in a compressed latent space for computational efficiency. |
LFDM [87] | 2023 | Improves inference speed by using Rectified Flow in the latent space, enabling high-quality generation in as few as 4 steps. |
W.A.L.T/Sora [64,88] | 2023/24 | Represents the state-of-the-art by combining the diffusion framework with a powerful Transformer backbone (Diffusion Transformer). |
MotionCtrl [89] | 2023 | Allows for explicit control over camera motion and object trajectories, enhancing the controllability of video generation. |
ExtDM [90] | 2024 | Enables long-horizon video generation through an autoregressive extrapolation scheme on latent codes within the diffusion model. |
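
One simple way to realize the masked conditional generation used by models such as MCVD (Table 12) is inpainting-style sampling: at every denoising step the latents of the observed past frames are clamped to their known values, so the reverse process only has to synthesize the future positions. The sketch below uses a toy linear noise schedule and a zero-returning placeholder denoiser; it is meant to show the control flow, not a trained model.

```python
# Conceptual sketch of masked conditional diffusion sampling for prediction.
import torch

T_diff, past, future, D = 50, 4, 4, 64
betas = torch.linspace(1e-4, 0.02, T_diff)         # toy linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def eps_model(z, t):                      # stand-in for the trained noise predictor
    return torch.zeros_like(z)

past_latents = torch.randn(1, past, D)    # encoded observed frames (kept fixed)
z = torch.randn(1, past + future, D)      # future positions start from pure noise

for t in reversed(range(T_diff)):
    z[:, :past] = past_latents            # re-impose the conditioning mask each step
    eps = eps_model(z, t)
    mean = (z - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
    noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
    z = mean + betas[t].sqrt() * noise    # standard DDPM reverse step

predicted_future = z[:, past:]            # decode these latents to future frames
print(predicted_future.shape)             # torch.Size([1, 4, 64])
```

The 250-step versus 20-step FPS gap reported in Table 13 comes directly from the length of this reverse loop, which is why fast samplers and distillation matter so much for edge deployment.
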
Table 13. Comparison of generative paradigms for UAV video prediction. Metrics are representative values drawn from recent literature. FPS measured on Jetson Orin (edge) and RTX 4090 (workstation).

Paradigm | Diversity | Fidelity (PSNR/SSIM/FVD) | FPS (Orin/4090) | Lightweight Strategies |
---|---|---|---|---|
Deterministic (e.g., PredRNN) | None | PSNR 28–32, SSIM 0.88–0.92 | 25/200 | Direct pruning, quantization |
VAE-based (e.g., SVG) | Moderate | PSNR 27–31, SSIM 0.86–0.90 | 15/120 | Latent compression, KL annealing |
GAN-based (e.g., SAVP) | Moderate | Sharp but training-unstable; FVD ~60–80 | 12/100 | Spectral norm, progressive growing |
Diffusion (e.g., LVDM) | High | PSNR 28–35, SSIM 0.88–0.90+, FVD ~28–35 | 2/20 (250 steps); 15/120 (20 steps) | Fast sampling, latent compression, knowledge distillation |
Table 14. Summary of advanced training paradigms.

Paradigm [Ref] | Core Idea |
---|---|
Curriculum/Multi-Stage [47,51] | Starts with an easier task (e.g., simple loss, short sequences) and gradually increases complexity. Often involves pre-training components like an autoencoder. |
Self-Supervised Pre-Training [96,97] | Learns representations from large unlabeled datasets (e.g., by reconstructing masked parts) before fine-tuning on a specific task. |
RL Fine-Tuning (World Models) [29,30] | Uses the predictive model as a dynamics model within an RL loop; fine-tunes the predictor based on task rewards, not just pixel error. |
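
As an illustration of the self-supervised pre-training row in Table 14, the sketch below implements a SimMIM-style masked reconstruction objective: a large fraction of video tokens is replaced by a learnable mask token and the network is penalized only on the hidden positions. The tiny Transformer encoder, 75% mask ratio, and token shapes are assumptions for illustration; encoder-side token dropping (MAE/VideoMAE-style) is a common alternative.

```python
# Minimal sketch of masked self-supervised pre-training on video tokens (cf. Table 14).
import torch
import torch.nn as nn
import torch.nn.functional as F

tokens = torch.randn(2, 512, 128)                       # (B, T*N video tokens, D)
mask = torch.rand(2, 512) < 0.75                        # hide 75% of tokens
mask_token = nn.Parameter(torch.zeros(128))             # learnable placeholder token

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=2)

corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)
recon = encoder(corrupted)
loss = F.mse_loss(recon[mask], tokens[mask])            # loss only on masked positions
loss.backward()                                         # gradients reach the mask token too
```
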
Table 15. Comparison of generation and inference strategies.

Strategy | Core Idea | Key Challenge/Trade-Off |
---|---|---|
Autoregressive (AR) | Generates one frame at a time, conditioning on previously generated frames. | Prone to error accumulation and exposure bias; slower inference but flexible length. |
Non-Autoregressive (Non-AR) | Generates all future frames in a single forward pass. | Much faster inference but can struggle with long-term consistency; fixed prediction horizon. |
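
The trade-off in Table 15 comes down to how predictions are fed back. The sketch below contrasts an autoregressive rollout, which re-ingests its own outputs and can therefore run to any horizon while accumulating error, with a one-shot call that emits a fixed block of frames in a single pass. The `model` callables and tensor shapes are placeholders, not any specific published predictor.

```python
# Sketch of autoregressive vs. one-shot inference (cf. Table 15).
import torch

def rollout_autoregressive(model, past, horizon):
    frames = list(past.unbind(dim=1))               # past: (B, K, C, H, W)
    for _ in range(horizon):                        # arbitrary horizon, errors compound
        nxt = model(torch.stack(frames[-past.size(1):], dim=1))   # (B, C, H, W)
        frames.append(nxt)                          # predictions are fed back as inputs
    return torch.stack(frames[past.size(1):], dim=1)

def rollout_one_shot(model, past):
    return model(past)                              # (B, H_fixed, C, H, W) in one pass

# Toy check with a copy-last-frame stand-in for the learned model.
past = torch.rand(1, 4, 3, 32, 32)
ar_model = lambda x: x[:, -1]
print(rollout_autoregressive(ar_model, past, horizon=6).shape)  # (1, 6, 3, 32, 32)
```
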
Table 16. Summary of deployment and acceleration strategies.

Technique [Ref] | Core Idea |
---|---|
Knowledge Distillation [95,99] | Trains a small “student” model to mimic a large “teacher” model, transferring its knowledge. |
Quantization and Pruning [100,101] | Reduces model size by lowering numerical precision (e.g., FP32 to FP8) or removing redundant weights. |
Hardware–Software Co-Design [102] | Designs the model architecture with the target hardware’s constraints in mind from the beginning (e.g., via NAS). |
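
A minimal sketch of the distillation objective in Table 16: the student is trained to match the teacher's predicted frames, optionally blended with a ground-truth reconstruction term. The loss weighting, stand-in models, and shapes are assumptions; quantization and pruning would typically be applied afterwards with toolchain-specific APIs (e.g., PyTorch's torch.ao.quantization or TensorRT), whose details depend on the target hardware.

```python
# Sketch of a knowledge-distillation training step for video predictors (cf. Table 16).
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, past_frames, future_frames, alpha=0.5):
    with torch.no_grad():
        teacher_pred = teacher(past_frames)              # soft targets from the large model
    student_pred = student(past_frames)
    loss_kd = F.mse_loss(student_pred, teacher_pred)     # mimic the teacher
    loss_gt = F.mse_loss(student_pred, future_frames)    # stay faithful to the data
    return alpha * loss_kd + (1 - alpha) * loss_gt

# Toy usage with trivial stand-ins for real predictor networks.
past   = torch.rand(2, 4, 3, 32, 32)
future = torch.rand(2, 4, 3, 32, 32)
big    = lambda x: x + 0.01
small  = torch.nn.Identity()
print(distillation_step(big, small, past, future).item())
```
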
Table 17. Comparison of training and inference regimes for UAV video prediction.

Regime | Flexibility | Speed (FPS, Orin/4090) | Error Accumulation | Prediction Horizon | Typical UAV Use Case |
---|---|---|---|---|---|
Autoregressive (AR) | Variable-length | 5–20/100+ | High (exposure bias) | 20–40+ frames | Long-horizon planning, offline simulation |
Non-AR (One-shot/Block-wise) | Fixed-length | 20+/200+ | Low | 5–20 frames | Real-time reactive control, obstacle avoidance |
Hybrid (Block-AR, Scheduled Sampling) | Semi-variable | 10–15/120+ | Moderate | 15–30 frames | Balanced planning in semi-structured tasks |
Table 18. Summary of datasets for UAV video prediction research.

Datasets | Content | Resolution | Temporal Continuity | Scope | Acquisition Method |
---|---|---|---|---|---|
VisDrone2019 | Aerial videos from 14 different Chinese cities, featuring scenes with both sparse and dense pedestrians, vehicles, etc. | 960 × 540–2000 × 1500 | Medium (average of 909 frames) | Offers diverse viewpoints, weather, and lighting conditions, ideal for testing model generalization in complex real-world scenarios. | Real-World UAV |
UAVid | Urban aerial video sequences with fine-grained, pixel-level semantic annotations. | 4096 × 2160/3840 × 2160 | Medium (at least 45 s) | Provides an excellent data foundation for semantic video prediction, aiding models in understanding scene structure. | Real-World UAV |
UAVDT | Aerial videos of urban roads and highways, focusing on vehicle detection and tracking. | 1080 × 540 | Medium (average of 800 frames) | Characterized by high object density, small objects, and significant camera motion, while also covering adverse conditions like nighttime and fog. | Real-World UAV |
UAV123 | 123 video sequences from low-altitude UAVs, focusing on tracking a single object. | 720p–4K | Medium (average of 915 frames) | Videos focus on specific target trajectories, making it ideal for evaluating a model’s ability to predict future object motion. | Real-World UAV |
Okutama-Action | 43 minute-long video sequences with 12 action classes, filmed in the Okutama region of Japan. | 3840 × 2160 | Long (minutes) | Presents the unique challenges of dynamic action transitions, significant changes in scale and aspect ratio, abrupt camera movement, and multi-labeled actors. | Real-World Multi-view |
UZH-FPV Drone Racing | First-person view (FPV) videos of a high-speed drone flying in complex environments. | 346 × 260 or 1280 × 960 | Long (average of 6 min) | Involves drastic camera motion, posing extreme challenges for models, especially for testing ego-motion prediction. | Real-World FPV UAV |
AirSim | User-customizable, high-fidelity 3D simulation environments like cities, forests, etc. | 256 × 144–4K | Unlimited (user-defined) | Provides perfect ground-truth sensor data (e.g., depth, optical flow) in controllable scenes. Ideal for algorithm development and ablation studies. | Simulator |