Applied Sciences · Review · Open Access · 24 August 2025

AI and Generative Models in 360-Degree Video Creation: Building the Future of Virtual Realities

1 School of Technology, Yoobee College of Creative Innovation, Auckland 1010, New Zealand
2 Auckland Bioengineering Institute, University of Auckland, Auckland 1010, New Zealand
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Smart Technologies Integrating AI with Virtual and Augmented Reality Applications

Abstract

The generation of 360° video is gaining prominence in immersive media, virtual reality (VR), gaming projects, and the emerging metaverse. Traditional methods for panoramic content creation often rely on specialized hardware and dense video capture, which limits scalability and accessibility. Recent advances in generative artificial intelligence, particularly diffusion models and neural radiance fields (NeRFs), are examined in this research for their potential to generate immersive panoramic video content from minimal input, such as a sparse set of narrow-field-of-view (NFoV) images. To investigate this, a structured literature review of over 70 recent papers in panoramic image and video generation was conducted. We analyze key contributions from models such as 360DVD, Imagine360, and PanoDiff, focusing on their approaches to motion continuity, spatial realism, and conditional control. Our analysis highlights that achieving seamless motion continuity remains the primary challenge, as most current models struggle with temporal consistency when generating long sequences. Based on these findings, a research direction has been proposed that aims to generate 360° video from as few as 8–10 static NFoV inputs, drawing on techniques from image stitching, scene completion, and view bridging. This review also underscores the potential for creating scalable, data-efficient, and near-real-time panoramic video synthesis, while emphasizing the critical need to address temporal consistency for practical deployment.

1. Introduction

In recent years, the rise of immersive technologies, such as virtual reality (VR), augmented reality (AR), and gaming environments, has driven increasing demand for high-quality 360-degree content. Unlike conventional 2D media, 360° panoramas offer a full spherical field of view, allowing users to look in any direction and experience scenes from multiple perspectives. This capability is central to immersive storytelling, interactive education, simulation training, and virtual tourism. In fact, 360° video is now widely adopted across multiple industries. As shown in Figure 1, its applications include professional sports, travel content, live entertainment, and even news broadcasting and film production [,]. Figure 1 indicates that adoption is strongest in visually dynamic domains such as sports and travel, where immersive perspectives enhance audience engagement, while sectors like news and documentaries adopt it more selectively. This distribution underscores the need for generation methods that balance visual quality with adaptability across varied content genres.
Figure 1. Distribution of 360° video applications across industries. The highest usage is observed in professional sports and travel content, highlighting the medium’s strong concentration within the entertainment and film sector. In contrast, the lowest adoption of 360° video is reported in news, documentaries, and TV shows.
Traditionally, 360° content has been captured using specialized multi-camera rigs or panoramic recording setups. While effective, these approaches are often costly, hardware-dependent, and difficult to scale. As a result, there is growing interest in data-driven alternatives that can synthesize panoramic video content using artificial intelligence (AI). Recent advances in neural rendering have enabled promising progress in this area. For example, neural radiance fields (NeRFs) [] have demonstrated impressive results in reconstructing static 3D scenes from multi-view images (frames). However, standard NeRF frameworks depend on consistent lighting and static geometry, which limits their effectiveness in dynamic or uncontrolled environments. NeRF in the Wild (NeRF-W) is an extension of the classic NeRF model, designed to address the limitations of NeRF when dealing with “in-the-wild” photo collections. The classic NeRF model required a static scene and controlled lighting, making it difficult to use with real-world images from the internet []. The NeRF-W model introduces two main innovations to overcome these challenges:
  • Static and Transient Components: To address the static scene limitation in the classic NeRF model, the NeRF-W model decomposes the scene into static and transient components. The static component represents the permanent parts of the scene (e.g., a building), while the transient component models temporary elements like people or cars moving through the scene. This allows the model to reconstruct a clean, static representation of the scene, even with transient occluders present in the input images [].
  • Appearance Embeddings: The model learns a low-dimensional latent space to represent variations in appearance, such as different lighting conditions, weather, or post-processing effects (e.g., filters). By assigning an appearance embedding to each input image, NeRF-W can disentangle these variations from the underlying 3D geometry of the scene [].
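A minimal PyTorch sketch of the two mechanisms described above, a static/transient decomposition plus per-image appearance embeddings, is given below. The module names, layer sizes, and the simplified output heads are our own illustrative assumptions, not the NeRF-W authors' implementation.

```python
# Sketch of a NeRF-W-style field: static and transient branches, plus
# per-image appearance embeddings (illustrative; not the original code).
import torch
import torch.nn as nn

class NeRFWField(nn.Module):
    def __init__(self, pos_dim=63, dir_dim=27, n_images=1000,
                 appearance_dim=48, transient_dim=16, hidden=256):
        super().__init__()
        # One appearance/transient embedding per training image.
        self.appearance = nn.Embedding(n_images, appearance_dim)
        self.transient_emb = nn.Embedding(n_images, transient_dim)
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # Static branch: density from geometry only; color also sees the
        # view direction and the appearance embedding (lighting, filters).
        self.static_sigma = nn.Linear(hidden, 1)
        self.static_rgb = nn.Sequential(
            nn.Linear(hidden + dir_dim + appearance_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())
        # Transient branch: its own color, density, and an uncertainty term
        # for occluders such as people or cars.
        self.transient = nn.Sequential(
            nn.Linear(hidden + transient_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 5))  # rgb(3) + sigma(1) + beta(1)

    def forward(self, x_enc, d_enc, image_ids):
        h = self.trunk(x_enc)
        a = self.appearance(image_ids)
        t = self.transient_emb(image_ids)
        sigma_static = torch.relu(self.static_sigma(h))
        rgb_static = self.static_rgb(torch.cat([h, d_enc, a], dim=-1))
        tr = self.transient(torch.cat([h, t], dim=-1))
        rgb_trans = torch.sigmoid(tr[..., :3])
        sigma_trans = torch.relu(tr[..., 3:4])
        beta = nn.functional.softplus(tr[..., 4:5])  # per-sample uncertainty
        return rgb_static, sigma_static, rgb_trans, sigma_trans, beta
```

At render time, only the static branch is composited to obtain a clean panorama; the transient branch and uncertainty absorb the occluders during training.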
To address another challenge of classic NeRF models, namely the high complexity of outdoor and unbounded environments, the mipmap neural radiance fields 360 (Mip-NeRF 360) model [] introduces hierarchical scene parameterization and online distillation, improving rendering quality and reducing aliasing in wide-baseline, unbounded scenes. Unlike classic NeRF, which employs uniform ray sampling and fixed positional encoding, Mip-NeRF 360 utilizes conical frustum-based integrated positional encoding and a coarse-to-fine proposal multi-layer perceptron to enhance fidelity. While these improvements significantly boost rendering quality, they come at the cost of increased computational resources and training time. Nonetheless, the model remains limited to static imagery and does not support dynamic video generation [].
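The integrated positional encoding (IPE) at the heart of Mip-NeRF and Mip-NeRF 360 encodes a Gaussian fitted to each conical frustum rather than a point sample, which is what suppresses aliasing. A minimal NumPy sketch of the closed-form expectation, under the standard diagonal-covariance assumption, is shown below; the function name and frequency count are illustrative.

```python
# Integrated positional encoding (IPE): encode a Gaussian (mu, diag Sigma)
# instead of a point. The closed form E[sin(2^l x)] = sin(2^l mu) *
# exp(-0.5 * 4^l * sigma^2) attenuates high frequencies for large frustums.
import numpy as np

def integrated_pos_enc(mu, sigma2, num_freqs=16):
    """mu, sigma2: (..., 3) mean and diagonal variance of a sample Gaussian."""
    feats = []
    for l in range(num_freqs):
        scale = 2.0 ** l
        decay = np.exp(-0.5 * (scale ** 2) * sigma2)  # frequency attenuation
        feats.append(np.sin(scale * mu) * decay)
        feats.append(np.cos(scale * mu) * decay)
    return np.concatenate(feats, axis=-1)             # (..., 6 * num_freqs)

# A sample with larger variance (a wider, more distant frustum) retains only
# the low-frequency features, which is the anti-aliasing effect.
mu = np.array([0.3, -1.2, 0.8])
print(integrated_pos_enc(mu, sigma2=np.full(3, 1e-4)).shape)  # (96,)
```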
To address temporal synthesis, Imagine360 proposes a dual-branch architecture that transforms standard perspective videos into 360° panoramic sequences. By incorporating antipodal masking and elevation-aware attention, the model generates plausible motion across panoramic frames []. Similarly, models like LAMP (Learn a Motion Pattern) aim to generalize motion dynamics from few-shot inputs, showing that motion priors can be learned and reused efficiently []. Recently, Wen [] introduced a human–AI co-creation framework that enables intuitive generation of 360° panoramic videos via sketch and text prompts, highlighting the growing trend toward collaborative and creatively controllable generation pipelines.
In parallel, diffusion-based models have emerged as a powerful alternative to GANs and NeRFs for generative image and video tasks. SphereDiff mitigates the distortions introduced by equirectangular projection [] through a spherical latent representation and distortion-aware fusion []. Meanwhile, Diffusion360 enhances panoramic continuity by applying circular blending techniques during generation, producing seamless and immersive results [].
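The distortions mentioned above stem from how the equirectangular projection maps the sphere onto a rectangle: a fixed solid angle occupies far more pixels near the poles than at the equator. The short sketch below shows the standard direction-to-pixel mapping for a W×H equirectangular image; the function name and the y-up convention are our own assumptions, not taken from the cited models.

```python
import numpy as np

def dir_to_equirect(d, width, height):
    """Map unit direction vectors (..., 3), y-up, to equirectangular pixels."""
    x, y, z = d[..., 0], d[..., 1], d[..., 2]
    lon = np.arctan2(x, z)                      # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(y, -1.0, 1.0))      # latitude in [-pi/2, pi/2]
    u = (lon / (2.0 * np.pi) + 0.5) * width
    v = (0.5 - lat / np.pi) * height
    return u, v

# The forward direction lands at the image center; directions near the poles
# spread across almost the full image width, which is the stretching that
# distortion-aware models such as SphereDiff compensate for.
u, v = dir_to_equirect(np.array([0.0, 0.0, 1.0]), width=2048, height=1024)
print(u, v)  # 1024.0 512.0
```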
Conceptually, sparsity [] describes a property where a vector or matrix contains a high percentage of zero or near-zero values. However, in the context of generating 360-degree videos from narrow-field images, sparsity primarily refers to a different, yet equally critical, challenge: the limited number and distribution of input images. This sparsity manifests as minimal overlap between views and insufficient visual coverage of the entire 360-degree field.
Three sparsity bands are used to describe model behavior and output quality in 360° video generation, as follows:
  • Low sparsity band: The number of NFoV frames is high, angular baselines are small, and view overlap is substantial. Generative models typically produce the highest visual fidelity. Fine textures, consistent color gradients, and stable temporal coherence are preserved due to abundant spatial and angular information. Geometry-aware modules can more accurately reconstruct depth, and parallax artifacts are minimal. However, computational demands are higher, and redundant viewpoints can lead to inefficiencies in training and inference [,,].
  • Medium sparsity band: Moderate angular baselines and reduced overlap begin to challenge models’ ability to maintain detail across large scene variations. While global scene structure remains plausible, fine detail and texture consistency may degrade, especially in peripheral regions. Temporal smoothness may also be affected, with occasional flickering in dynamic areas. This regime can serve as a trade-off point, balancing efficiency with acceptable quality for many real-world applications [,].
  • High sparsity band: Viewpoints are widely spaced and overlap is minimal. Models struggle to interpolate unseen regions accurately. This often results in visible artifacts such as ghosting, stretched textures, and structural distortions, particularly in occluded or complex geometry regions. Temporal coherence suffers significantly in motion-rich scenes, and geometry-aware approaches may produce unstable depth estimates. While computational cost is low, the resulting videos typically require strong priors, advanced inpainting, or multimodal constraints to achieve acceptable quality [,,].
Table 1 defines the three sparsity bands for 360° video generation based on spatial and temporal sampling. Low sparsity (dense) uses many views (≥36 NFoV frames) with small angular baselines (≤10°), high overlap (≥70%), fast capture rates (≥60 views/s), and highly uniform coverage (≥80% entropy). Medium sparsity has moderate views (12–36) with 10°–30° baselines, 30–70% overlap, 30–60 views/s capture, and moderately uniform coverage (40–80% entropy). High sparsity (sparse) involves few views (≤12) with wide baselines (>30°), low overlap (≤30%), slow capture (≤30 views/s), and uneven coverage (≤40% entropy).
Table 1. Sparsity bands for 360° video generation.
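As a worked illustration of the thresholds in Table 1, the sketch below classifies a capture configuration into one of the three bands. The dataclass, field names, and the majority-vote decision rule are our own simplifications for illustration, not a method proposed in the reviewed papers.

```python
from dataclasses import dataclass

@dataclass
class Capture:
    n_views: int             # number of NFoV frames
    baseline_deg: float      # mean angular baseline between views
    overlap: float           # mean pairwise view overlap, 0..1
    views_per_s: float       # capture rate (views per second)
    coverage_entropy: float  # coverage uniformity (entropy), 0..1

def sparsity_band(c: Capture) -> str:
    """Classify a capture into the low/medium/high sparsity bands of Table 1.
    Simplification: each of the five criteria votes; the majority decides."""
    def vote(is_low: bool, is_high: bool) -> str:
        return "low" if is_low else ("high" if is_high else "medium")
    votes = [
        vote(c.n_views >= 36, c.n_views <= 12),
        vote(c.baseline_deg <= 10, c.baseline_deg > 30),
        vote(c.overlap >= 0.70, c.overlap <= 0.30),
        vote(c.views_per_s >= 60, c.views_per_s <= 30),
        vote(c.coverage_entropy >= 0.80, c.coverage_entropy <= 0.40),
    ]
    return max(set(votes), key=votes.count)

# Example: 10 widely spaced frames with little overlap fall in the high
# sparsity band, the regime targeted by the research direction in this paper.
print(sparsity_band(Capture(10, 40.0, 0.20, 20.0, 0.35)))  # "high"
```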
These developments collectively establish a strong foundation for this research, which explores how generative models can synthesize immersive 360° video from sparse visual inputs: widely spaced perspective images, typically fewer than ten, that provide incomplete coverage of the panoramic viewing sphere. As illustrated in Figure 2, the proposed pipeline situates NFoV inputs within a broader generative framework, highlighting how diffusion models and radiance-field methods can transform limited perspective data into coherent panoramic sequences. This schematic underscores the role of AI-driven synthesis in replacing hardware-intensive capture workflows, enabling both scalability and creative flexibility. By unifying the strengths of radiance-field modeling, motion-aware learning, and diffusion-based generation, this work aims to address the limitations of traditional hardware-dependent pipelines and pave the way toward scalable, cost-effective, and creatively flexible panoramic video generation.
Figure 2. Schematic view of 360° video generation using generative algorithms. Generative models (e.g., diffusion models, NeRF models) enable the transformation of NFoV images into panoramic videos for immersive applications.
The remainder of this paper is organized as follows. Section 2 outlines the methodology and literature review strategy. Section 3 presents related work in panoramic video generation. Section 4 describes the identified gaps. Section 5 discusses the findings, including a comparison of computational requirements and processing times. Section 6 discusses the limitations of this review. Section 7 highlights challenges and opportunities for innovation. Section 8 concludes this paper with future research directions.

2. Methodology

This research is based on a structured literature review of peer-reviewed and preprint publications from 2021 to 2025. The review focused on 360° video generation, panoramic vision, neural radiance fields (NeRFs), diffusion models [], and generative AI frameworks. Special attention was given to studies addressing sparse-input generation, such as synthesizing panoramic content from a limited number of static images or perspective views.
Relevant papers were sourced through IEEE Xplore, arXiv, and Google Scholar using targeted keyword combinations, including “360 video diffusion model,” “omnidirectional vision,” “panoramic video generation,” “text-to-360 synthesis,” “NeRF 360 video,” “image stitching,” and “bridging the gap.” These search terms were selected to identify works focusing on spatial consistency, stitching logic, and few-shot synthesis in the context of immersive media.
Studies were selected based on technical relevance, originality, full-text accessibility, and their contribution to the evolving field of AI-based panoramic generation. No experimental data, human subjects, or proprietary datasets were used in this research. The reviewed materials include both open-access and publisher-restricted papers, retrieved through academic databases and institutional access where applicable.

2.1. Research Questions

To guide the scope of this review and structure the analysis of current approaches, the following research questions were formulated:
Research Question 1 (RQ1): What generative approaches are currently used for 360° image and video synthesis?
This question investigates the main categories of generative models, such as NeRFs, GANs, and diffusion models, and how they have evolved to handle immersive panoramic content.
Research Question 2 (RQ2): What unique technical contributions do current models offer for panoramic image and video generation?
This question focuses on how different approaches introduce novel architectures, generation strategies, or domain-specific techniques to advance the field of 360° content synthesis.
Research Question 3 (RQ3): What technical challenges and research gaps remain in current 360° video generation systems?
This question emphasizes the need to address issues such as stitching artifacts, generalization across scenes, and efficient generation from limited viewpoints.

2.2. Literature Selection and Review Strategy

The literature selection process for this review followed a structured approach using the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) framework. A total of 43 records were identified through database searches in IEEE Xplore, arXiv, and Google Scholar, with an additional 13 records identified through other sources such as academic recommendations and manual browsing. After removing duplicates, 56 records remained for screening. During the screening stage, four records were excluded based on a review of the titles and abstracts. The remaining 52 full-text articles were assessed for eligibility, from which 6 were excluded due to insufficient methodological detail or lack of direct relevance to 360° panoramic generation. The final review included 46 papers that contribute to the analysis of AI-driven approaches for 360° video and image synthesis. The PRISMA flow diagram in Figure 3 visualizes this process.
Figure 3. PRISMA chart of selected studies. The chart illustrates the systematic process of identifying, screening, and including studies in the review, ensuring methodological transparency. Although it reflects a structured approach, the selection remains limited by the scope of the databases and sources searched.
The literature search aimed to identify recent and high-impact studies related to 360° image and video generation using artificial intelligence. Searches were conducted across IEEE Xplore, arXiv, Google Scholar, and the proceedings of top-tier conferences such as CVPR. A range of search strings was used, both individually and in combination, to ensure comprehensive coverage of the domain. These are summarized in Table 2.
Table 2. Search keywords used in the literature search.
After the initial keyword search, all retrieved papers were screened using predefined inclusion and exclusion criteria to ensure technical relevance, methodological depth, and alignment with the research objectives. Table 3 and Table 4 present these criteria in detail.
Table 3. Inclusion criteria.
Table 4. Exclusion criteria.
Following the selection process, 50 papers were reviewed, from which 12 core works were identified for in-depth analysis due to their significant technical contributions across categories such as diffusion-based generation, sparse-view stitching, neural radiance fields, and latent motion modeling. These core studies were selected based on their methodological depth, relevance to panoramic synthesis under sparse-input conditions, and alignment with the objectives of this thesis. The technical details were extracted from each paper to support the narrative summaries and structured comparisons in the following sections.

2.3. BERTopic for Topic Modeling

To organize and analyze the research papers, BERTopic, a transformer-based topic modeling method, was applied. This technique automatically grouped abstracts into semantically similar clusters, facilitating the identification of recurring themes and research directions within the field.
As shown in Figure 4, BERTopic works [,] by first embedding the text using a pre-trained language model. These embeddings are then reduced in dimensionality and clustered to discover patterns in the data. Each cluster is summarized with key terms using a class-based TF-IDF approach, allowing us to label and interpret the core ideas represented in each topic.
Figure 4. BERTopic modeling process, including embedding, clustering, and topic extraction. The model applies document embeddings and UMAP with HDBSCAN to cluster semantically similar documents, from which topics are extracted and refined using c-TF-IDF and Maximal Marginal Relevance (MMR). This approach offers a robust method for topic discovery but requires careful parameter tuning to yield meaningful results.
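A minimal sketch of the Figure 4 pipeline using the open-source bertopic package is shown below. The embedding model, UMAP/HDBSCAN parameters, and the hypothetical load_abstracts() helper are illustrative choices, not the exact configuration used in this review.

```python
# Sketch of the Figure 4 pipeline: embed abstracts -> reduce with UMAP ->
# cluster with HDBSCAN -> describe clusters with c-TF-IDF, then diversify
# the keywords with Maximal Marginal Relevance (MMR).
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

abstracts = load_abstracts()  # hypothetical loader: the reviewed abstracts as strings

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
    umap_model=UMAP(n_neighbors=15, n_components=5, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=3, metric="euclidean",
                          prediction_data=True),
    representation_model=MaximalMarginalRelevance(diversity=0.3),
)
topics, probs = topic_model.fit_transform(abstracts)
print(topic_model.get_topic_info().head(10))  # top topics with c-TF-IDF terms
```

The class-based TF-IDF step and the MMR keyword refinement are what produce the topic labels visualized in Figure 5.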
After applying this process to the collection of over 40 research papers, we identified 10 distinct topics. These ranged from 360° video generation and scene reconstruction to image outpainting and spherical vision techniques. The top-ranked terms for each topic are presented in Figure 5, providing a clear snapshot of the focus areas that emerged from the modeling []. This visualization highlights the semantic structure of current research, revealing that while some topics cluster around foundational techniques (e.g., NeRF-based depth modeling, diffusion-driven generation), others focus on application-oriented challenges such as VR/AR integration and panoramic video blending. These thematic clusters help contextualize the breadth of methods and objectives in the panoramic video generation literature.
Figure 5. The top 10 topics discovered through BERTopic modeling, visualized with their top five representative keywords and corresponding c-TF-IDF scores. Each horizontal bar chart displays the relative importance of keywords within a topic, enabling a clear and quantitative understanding of the topic’s content.
Each topic was manually interpreted to go beyond automated keyword summaries. This involved reviewing representative documents from each cluster to understand the context behind the extracted terms. Based on these insights, we assigned human-readable labels to the topics (e.g., “Temporal Scene Synthesis,” “Diffusion-Based Rendering,” and “Camera Calibration and Stitching Challenges”) that more accurately reflected their underlying themes. This step helped ensure that the model outputs were not only statistically coherent but also semantically meaningful and aligned with the literature’s actual research focus.
Using BERTopic not only saved time compared to manual categorization but also allowed for a more objective and data-driven structure for the literature review. It helped us understand the landscape of the field more clearly and informed the direction of our research.

4. Identified Gaps

Although significant advances have been made in 360° panoramic image and video generation, the existing body of literature reveals several persistent gaps that limit the scalability, generalizability, and deployability of current approaches.
A major limitation lies in the reliance on dense, overlapping, or structured input data. Many models, whether based on stitching [,], GANs [], or diffusion [,], assume input completeness or paired view correspondences to function effectively. Even state-of-the-art solutions such as Imagine360 [] and PanoDreamer [] require perspective video streams or semantic layout priors, making them less applicable in scenarios involving sparse or non-overlapping NFoV images. While methods like SphereDiff and PanFusion mitigate projection issues, they remain tuned for image-level outputs and often lack generalization under low-data regimes. This limitation reduces accessibility in real-world cases where full scene coverage is unavailable, such as mobile capture, quick scans, or low-cost drone footage.
Temporal modeling in generative video synthesis is also constrained by input assumptions and lacks flexibility under few-shot conditions. Works like 360DVD [], LaMD [], and MagicVideo [] demonstrate strong results in controlled settings but still rely on continuous motion prompts or full video sequences. Models like Latent-Shift [] and MCDiff [] propose efficient mechanisms for motion conditioning, yet few studies have examined how they can be adapted for sparse temporal inputs or motion hallucination from static frames. This gap limits their practical use in applications where continuous video is not available or motion must be inferred from minimal data, such as surveillance, fast prototyping, or archival conversion.
Another critical gap is the underutilization of geometric information such as scene depth, symmetry, and layout. While methods like 360MonoDepth [] and Hara et al. [] highlight the role of depth and symmetry in improving spatial realism, these concepts are not yet embedded in most panoramic generation pipelines. Diffusion-based models like SphereDiff [] operate in latent space without incorporating explicit geometry, and many stitching or blending approaches [,] still treat geometry as a post-processing correction rather than an integral modeling component. As a result, generated content often lacks spatial coherence or realism when deployed in 3D environments like VR, AR, or game engines.
Controllability and cross-modal conditioning remain promising but underdeveloped directions. Text-to-360-degree generation models such as Panogen [], PanoDiT [], and Customizing360 [] have made early strides in prompt-based generation but still rely heavily on training-time conditioning or paired datasets. Generalization to unseen prompts, free-form editing, or hybrid inputs remains limited, particularly in the context of dynamic video generation or immersive interaction. Recent advancements in AR content creation [] demonstrate how vision–language models can generate scene-aware 3D assets based on environmental context and textual intent, highlighting a path forward for panoramic systems to become more adaptive and context-sensitive. This restricts personalization and user interaction in consumer-facing applications like virtual tourism or narrative-based VR experiences.
Lastly, real-time deployment, streaming readiness, and lightweight inference are seldom addressed. While some works explore architectural efficiency [,] or adaptive streaming strategies [], these aspects are rarely integrated with generative quality. This gap limits the real-world applicability of current systems in domains such as VR/AR, mobile media, or simulation training, where performance, latency, and interactivity are as critical as fidelity. This affects scalability in environments with bandwidth, latency, or device constraints, particularly in education, live events, or mobile deployment scenarios.
In summary, while the field has progressed rapidly across several fronts, it continues to face challenges in sparse-input generalization, temporal realism, geometry integration, controllability, and deployment efficiency. These limitations directly motivate this thesis, which proposes a unified framework that synthesizes 360° panoramic video from sparse NFoV inputs while maintaining temporal coherence and visual fidelity. By integrating geometric priors, motion-aware learning, and efficient generative pipelines, this work aims to bridge key gaps in both technical capability and practical usability.

5. Discussion

The findings from this review illustrate a rapidly evolving field of research in panoramic image and video generation, driven by the convergence of generative AI, radiance-field modeling, and immersive media demands. Across the spectrum, from classical image stitching and outpainting to diffusion-based 360° synthesis and multimodal guidance, researchers have made significant strides in generating semantically rich, geometrically coherent panoramic content. However, several critical limitations and emerging trends shape the trajectory of future development.
Across this evolution, three primary families of generative approaches (GAN-based methods, radiance-field models such as NeRF, and diffusion models) define the field’s progression in both capability and design philosophy. GAN-based pipelines formed the first wave of generative panoramic methods, offering rapid image-to-panorama outpainting and computational efficiency. However, they frequently suffered from mode collapse, texture artifacts, and limited geometric reasoning, which constrained their ability to produce consistent results in sparse or unstructured scenarios. Radiance-field approaches marked a significant shift by introducing explicit 3D awareness and depth-consistent view synthesis, achieving strong geometric fidelity but at the cost of high data requirements and slow training or inference. Diffusion models now represent the current frontier, unifying high visual fidelity, semantic controllability, and multimodal integration through iterative denoising in high-dimensional latent spaces. Taken together, these model families illustrate a clear trajectory: from texture-focused generative methods to geometry-aware radiance fields, and ultimately to controllable, scalable, and multimodal pipelines capable of supporting truly immersive 360° media production.
One of the most prominent trends is the shift from deterministic stitching and inpainting to generative architectures that leverage GANs and diffusion models. These models have enabled richer semantic understanding and improved visual fidelity, allowing for more coherent scene completion and text-guided synthesis. Yet a considerable number of these methods rely on dense input data or structured conditioning signals, limiting their utility in real-world scenarios where only sparse or unregistered NFoV images are available. The assumption of data richness represents a barrier to scalability and accessibility, especially in low-resource or mobile capture environments. This concern is echoed in mobile delivery surveys, which highlight the challenge of scaling 360° content under bandwidth constraints and device limitations [].
Temporal modeling has also emerged as a major research focus, particularly in transforming still inputs into temporally coherent video sequences. While models like 360DVD and Imagine360 introduce mechanisms for panoramic video synthesis, they often depend on motion priors, dense video captions, or perspective-based inputs. These conditions make them less applicable to use cases requiring few-shot or one-shot generation from minimal visual cues. Furthermore, generative video models frequently lack robust control over temporal consistency, leading to flickering, drift, or incoherent motion propagation across frames.
Scene geometry and depth awareness remain underutilized in most panoramic generation pipelines. Despite the introduction of 360°-specific solutions like SphereDiff and SphereSR, the majority of models treat the spherical canvas as a projection rather than a spatial structure. This overlooks the importance of geometric cues such as symmetry, parallax, and depth, which are essential for producing immersive, explorable environments. Methods such as 360MonoDepth and spherical symmetry-based generation, while addressing geometric aspects, often remain isolated from the broader generative pipeline and have yet to be integrated into end-to-end frameworks.
Recent advances in radiance-field models illustrate how geometry can be leveraged more effectively. The Mip-NeRF 360 [] model introduces a mipmapping-based, anti-aliased representation that supports level-of-detail rendering and high-fidelity reconstruction of unbounded 360° environments, mitigating aliasing and edge distortions that classic NeRF models cannot handle. Similarly, the NeRF-W model [] incorporates appearance embeddings to manage unstructured input with varying lighting and occlusions, achieving robust synthesis from sparse and inconsistent viewpoints. These innovations highlight how radiance-field methods are evolving toward practical panoramic use, bridging the gap between classical geometry reasoning and modern generative pipelines.
The relevance of geometric realism is further reinforced by recent applications in architecture, engineering, and construction (AEC), where 360° media has been adopted for training, simulation, and virtual site monitoring [].
Another trend is the growing emphasis on interactivity and control. Recent models have begun supporting text-to-360-degree generation, user-guided outpainting, and conditional scene customization. While these capabilities offer greater flexibility, they are still constrained by limitations in layout reasoning, input conditioning complexity, and prompt sensitivity. Cross-view synthesis techniques further demonstrate the feasibility of generating panoramas from abstract or top-down inputs, but these approaches frequently suffer from realism trade-offs and limited generalization across domains.
Finally, practical deployment considerations remain largely absent. Few models account for real-time processing, lightweight inference, or streaming efficiency, critical factors for applications in VR, AR, or edge-device rendering. Although frameworks like TS360 and MagicVideo address deployment and compression indirectly, there is a pressing need for panoramic generation systems that balance quality with performance and accessibility. Recent surveys on bandwidth optimization and edge delivery of 360° video [,] emphasize this point, highlighting how the infrastructure for delivering immersive content still faces limitations, even when generation is technically solved.
In summary, the field is trending toward more powerful and flexible generation tools, yet many models fall short of addressing the core challenges related to sparse-input handling, geometric reasoning, temporal coherence, and real-time usability. The insights from this review underscore the importance of unified, efficient, and generalizable architectures that can scale across use cases from creative applications to immersive media production. This forms the basis for the research direction proposed in this thesis, which aims to bridge these gaps through an integrated approach to sparse-input, generative panoramic video synthesis.

5.1. Computational and Deployment Considerations

While substantial progress has been made in the visual quality and flexibility of 360° image and video generation models, computational performance and deployment feasibility remain underreported in the literature. Among the models reviewed, SphereDiff [], MVSplat360 [], and 360DVD [] provide some information regarding training time, GPU memory usage, inference latency, and deployment scalability (see Section 5.2).
Nevertheless, architectural characteristics suggest that many of these models may pose significant computational demands. For instance, diffusion-based models like SphereDiff are known for their iterative sampling process, which can result in slow inference speeds and high memory consumption. Similarly, 360DVD leverages view-dependent rendering techniques inspired by NeRFs, which are computationally intensive and difficult to optimize for real-time scenarios. Imagine360’s dual-branch temporal design, incorporating attention-based motion synthesis and spatial transformations, likely incurs high GPU memory requirements during training and inference, especially for longer video sequences.
The lack of standardized benchmarks on runtime efficiency, hardware requirements, or deployment constraints still makes it difficult to assess the practical readiness of these models for use in latency-sensitive or resource-constrained environments. This gap is particularly concerning for applications in mobile VR/AR, simulation training, or remote collaboration, where responsiveness and computational efficiency are as critical as visual fidelity.

5.2. Comparison of Hardware and Processing Times

Table 14 presents a comparative overview of recent 360° video generation models, including both NeRF-based and diffusion-based approaches. The table summarizes key attributes such as model parameters, training compute requirements, inference latency or FPS, the resolution tested, and the hardware used for training or inference. NeRF, Mip-NeRF, and Mip-NeRF 360 are MLP-based methods that require per-scene training, with training times ranging from several hours on multiple TPU cores to over a day on a V100 GPU (NVIDIA Corporation, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA), while inference speed is reported only for NeRF. Diffusion-based methods, including 360DVD, VideoPanda, SphereDiff, MVSplat360, and SpotDiffusion, vary widely in computational cost and runtime: some require extensive multi-GPU training, whereas others, like SphereDiff and SpotDiffusion, operate without additional training but may have long inference times or high VRAM requirements; for instance, SphereDiff requires ≤40 GB of VRAM, while the VRAM usage of other models is not reported. This table provides a concise reference for evaluating the trade-offs between model complexity, computational resources, and runtime performance in 360° video generation.
Table 14. Performance comparison of recent 360° video generation models *.

6. Limitations

While this study offers a comprehensive review of AI-driven 360° image and video generation, it is not without limitations. First, the scope of the literature was restricted to publicly available academic and preprint publications from 2021 to 2025. While this range captures recent innovations, it may exclude relevant industrial systems or proprietary models that have not been documented in open-access venues. Consequently, the findings may not fully reflect the most cutting-edge or large-scale implementations used in commercial applications.
Additionally, the review emphasized papers indexed in IEEE, arXiv, CVPR, and similar repositories, which may introduce selection bias. Although an effort was made to include a diverse set of models and perspectives, from stitching-based methods to diffusion transformers, the selection process may have overlooked significant contributions published in less prominent forums or outside the core computer vision community. This could limit the generalizability of the review’s conclusions across subfields like robotics, 3D graphics, or immersive interface design.
Another limitation stems from the inherently subjective nature of evaluating generative models. Many papers rely on qualitative visuals or user preference studies to assess output fidelity, which introduces inconsistency in model comparison. Objective metrics for 360° content, especially for video, remain underdeveloped, making it difficult to benchmark models uniformly. This hinders the ability to draw conclusive insights about model performance across tasks and datasets.
Furthermore, while this study aims to identify architectural and conceptual gaps, it does not experimentally validate new models. As such, its insights remain grounded in secondary analysis and author-reported findings. Future work could complement this review with empirical studies or implementation-based evaluations to measure practical performance and scalability.
Lastly, the scope of this thesis focuses on generating panoramic video content from sparse inputs such as 8–10 NFoV images. While this approach is promising for enhancing accessibility and efficiency, it also introduces constraints related to motion realism, occlusion handling, and depth consistency. These trade-offs may limit the fidelity of generated content in edge cases and complex scene structures. Addressing these limitations will require further refinement of geometry-aware, temporally conditioned generative pipelines.

7. Challenges

Although panoramic video generation has made significant strides, several persistent challenges continue to delay its deployment in real-world applications. Based on literature prevalence and practical impact, the key challenges are summarized below.

7.1. Sparse-Input Handling (Highest Priority)

The most significant challenge lies in handling sparse inputs, as seen in many state-of-the-art models such as 360DVD [] and Diffusion360 [], which assume access to dense, overlapping multi-view inputs. When only a few NFoV images are available, these models struggle to preserve spatial coherence and often hallucinate content []. Partial remedies include techniques like the Splatter360 model [], which leverages point cloud splatting for sparse inputs, but its performance still degrades with very few views.

7.2. Temporal Coherence

Maintaining consistent motion across frames is challenging, especially with sparse or near-static inputs. Approaches such as Imagine360 [] and LAMP [] utilize motion priors or interpolation techniques; however, their effectiveness decreases significantly with reduced frame density []. Flicker and jitter reduce realism and immersion, which are critical for VR and simulation applications.

7.3. Geometry Awareness

Most generative pipelines overlook structural cues such as depth, symmetry, or parallax, focusing on 2D projections or flattened panoramas. Methods like SphereDiff [] and 360MonoDepth [] introduce geometry-aware modules, but these are rarely integrated into full video synthesis workflows, resulting in visually plausible but structurally unrealistic outputs.

7.4. Controllability and Semantic Alignment

The ability to align AI-generated content with user intent remains limited. Systems often require retraining or paired data to support text, sketch, or layout-driven control []. This constrains creative flexibility and interactive authoring in applications like education, storytelling, or co-creation environments.

7.5. Real-Time Performance and Deployment Constraints

High-fidelity models typically demand substantial computation, large memory footprints, and specialized hardware, hindering their deployment on mobile or bandwidth-constrained platforms. Techniques like adaptive streaming and lightweight architectures [] exist, but they are rarely combined with full-quality panoramic synthesis.
By systematically addressing these prioritized challenges, from sparse-input handling to deployment constraints, the next generation of 360° panoramic video models can achieve realistic, controllable, and efficient immersive experiences, bridging current experimental prototypes with practical applications.

8. Conclusions and Future Directions

This research presents a comprehensive review of 360° panoramic image and video generation techniques, tracing the field’s evolution from early stitching-based methods to modern generative frameworks powered by diffusion models, neural radiance fields, and text-based conditioning. By examining over 40 peer-reviewed and preprint studies, the review categorizes key contributions in areas such as sparse-view synthesis, temporal modeling, multimodal generation, geometric awareness, and immersive rendering.
The findings reveal a clear trajectory toward more flexible and high-fidelity systems. Models like 360DVD [], Imagine360 [], and SphereDiff [] exemplify progress in motion continuity, semantic understanding, and spherical representation. However, limitations persist. Many systems depend on dense input data, structured prompts, or computationally intensive inference. Generalization to sparse perspectives, depth-consistent generation, and real-time processing remains an open research challenge.
In light of these observations, this review establishes a strong foundation for addressing the next wave of innovation in panoramic generation. As immersive applications grow across VR, simulation, training, and entertainment, there is an urgent need for scalable, controllable, and geometry-aware pipelines that can function under real-world constraints.
The direction proposed in this thesis contributes to this vision by focusing on panoramic video generation from as few as 8–10 NFoV images, bridging the gap between efficiency and expressiveness. Through this unified, data-efficient approach, this project aims to push the boundaries of what is possible in 360° content synthesis while offering practical benefits for both academic research and industry deployment.
To address the identified gaps, the proposed method integrates several key innovations. By using sparse NFoV inputs, it directly tackles the challenge of data efficiency and removes reliance on dense camera setups. Incorporating temporal attention and latent motion prediction enables the generation of temporally coherent video, even under limited input conditions. Additionally, the inclusion of geometric priors, such as depth cues or symmetry constraints, enhances spatial realism and alignment. Finally, by leveraging prompt-adaptive control mechanisms, the system supports flexible conditioning through sketch, text, or hybrid inputs, reducing dependency on paired datasets and enabling creative control. These design choices position the framework as a practical and adaptable solution for immersive content creation in bandwidth-limited, mobile, or creator-driven environments.

Future Directions

As the demand for immersive and interactive media continues to grow, several promising avenues for future research in 360° panoramic generation emerge directly from the challenges in Section 7. Addressing these gaps requires solutions that tackle multiple constraints: sparse-input handling, temporal coherence, geometry awareness, controllability, and deployment efficiency in an integrated manner.
One-shot and few-shot panoramic video generation remains especially challenging because most current models rely on dense input sequences or strong motion priors. In real-world scenarios, however, creators and developers often have access to only a handful of perspective frames due to hardware, bandwidth, or cost constraints. Future research should focus on entropy-aware input conditioning, few-shot motion interpolation, and conditional diffusion bridges to produce temporally coherent 360° video from minimal inputs while preserving spatial realism. Leveraging latent motion prediction and temporal attention mechanisms will further stabilize motion across long sequences. Achieving this would directly reduce dependency on multi-camera rigs, making panoramic generation feasible for mobile VR content creation, low-cost simulation environments, and independent creator workflows.
Ensuring structural realism across sparse multi-view sequences will require embedding geometry- and scene-aware representations, leveraging depth, symmetry, and parallax cues, into generative pipelines. This integration would address current limitations where models overlook structural cues, resulting in outputs that are visually plausible but geometrically inconsistent.
Another critical direction is improving the controllability and expressiveness of generative models. While text-driven synthesis and layout conditioning are now common, they often require paired data or task-specific fine-tuning, limiting deployment flexibility. Future systems should emphasize training-free or prompt-adaptive frameworks that allow dynamic control over generation using hybrid inputs such as sketches, depth cues, or interactive prompts. This capability would empower creators to iterate rapidly without retraining models, enabling more accessible and scalable production pipelines. Wen [] demonstrates the potential of such systems in human–AI co-creation for 360° video, where real-time bidirectional interaction allows user intent to directly guide generative outcomes.
Future directions also include embedding contextual and semantic awareness into panoramic generation. Behravan et al. [] highlight the value of hybrid vision–language conditioning in AR, suggesting that panoramic synthesis could benefit from scene-aware cues such as spatial layout, object roles, and environmental affordances to produce content that is both coherent and meaningful in context. Additionally, Liu and Yu [] illustrate the potential of emotionally responsive video generation, which can communicate mood, influence perception, or support goal-directed experiences. Integrating these capabilities into 360° video generation could enable emotionally resonant applications in education, virtual tourism, and therapeutic simulations, where engagement depends on both realism and affective impact.
Finally, addressing real-time performance and deployment constraints will require optimizing inference efficiency, employing adaptive streaming strategies, and developing lightweight architectures that balance quality and speed. These approaches will ensure deployability across a wide range of hardware platforms, from high-end VR systems to mobile devices. By aligning these innovations, future systems can deliver panoramic video generation that is not only realistic and controllable but also efficient and accessible, bridging the gap between experimental prototypes and scalable real-world applications.

Author Contributions

Conceptualization, N.A.C., M.N. and J.T.; methodology, N.A.C.; formal analysis, N.A.C. and M.N.; writing—original draft preparation, N.A.C., M.N.; writing—review and editing, M.N. and J.T.; visualization, N.A.C., M.N.; supervision, M.N. and J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The studies supporting this review are derived from publicly available resources cited within the article. No new data were created or analyzed in this study.

Acknowledgments

The authors would like to thank all those who contributed indirectly to this work through inspiring discussions, critical insights, and support within the broader academic environment.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wong, E.S.; Wahab, N.H.A.; Saeed, F.; Alharbi, N. 360-degree video bandwidth reduction: Technique and approaches comprehensive review. Appl. Sci. 2022, 12, 7581. [Google Scholar] [CrossRef]
  2. Shafi, R.; Shuai, W.; Younus, M.U. 360-degree video streaming: A survey of the state of the art. Symmetry 2020, 12, 1491. [Google Scholar] [CrossRef]
  3. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  4. Martin-Brualla, R.; Radwan, N.; Sajjadi, M.S.; Barron, J.T.; Dosovitskiy, A.; Duckworth, D. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7210–7219. [Google Scholar]
  5. Barron, J.T.; Mildenhall, B.; Verbin, D.; Srinivasan, P.P.; Hedman, P. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  6. Barron, J.T.; Mildenhall, B.; Tancik, M.; Hedman, P.; Martin-Brualla, R.; Srinivasan, P.P. Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 5835–5844. [Google Scholar] [CrossRef]
  7. Tan, J.; Yang, S.; Wu, T.; He, J.; Guo, Y.; Liu, Z.; Lin, D. Imagine360: Immersive 360 Video Generation from Perspective Anchor. arXiv 2024, arXiv:2412.03552. [Google Scholar] [CrossRef]
  8. Wu, R.; Chen, L.; Yang, T.; Guo, C.; Li, C.; Zhang, X. Lamp: Learn a motion pattern for few-shot-based video generation. arXiv 2023, arXiv:2310.10769. [Google Scholar]
  9. Wen, Y. “See What I Imagine, Imagine What I See”: Human-AI Co-Creation System for 360° Panoramic Video Generation in VR. arXiv 2025, arXiv:2501.15456. [Google Scholar]
  10. Ray, B.; Jung, J.; Larabi, M.C. A low-complexity video encoder for equirectangular projected 360 video content. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 1723–1727. [Google Scholar]
  11. Park, M.; Kang, T.; Yun, J.; Hwang, S.; Choo, J. SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation. arXiv 2025, arXiv:2504.14396. [Google Scholar]
  12. Feng, M.; Liu, J.; Cui, M.; Xie, X. Diffusion360: Seamless 360 degree panoramic image generation based on diffusion models. arXiv 2023, arXiv:2311.13141. [Google Scholar] [CrossRef]
  13. Wei, L.; Zhong, Z.; Lang, C.; Yi, Z. A survey on image and video stitching. Virtual Real. Intell. Hardw. 2019, 1, 55–83. [Google Scholar] [CrossRef]
  14. Yao, X.; Hu, Q.; Zhou, F.; Liu, T.; Mo, Z.; Zhu, Z.; Zhuge, Z.; Cheng, J. SpiNeRF: Direct-trained spiking neural networks for efficient neural radiance field rendering. Front. Neurosci. 2025, 19, 1593580. [Google Scholar] [CrossRef]
  15. Zhang, Q.; Huang, C.; Zhang, Q.; Li, N.; Feng, W. Learning geometry consistent neural radiance fields from sparse and unposed views. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 8508–8517. [Google Scholar]
  16. Zhang, Y.; Zhang, G.; Li, K.; Zhu, Z.; Wang, P.; Wang, Z.; Fu, C.; Li, X.; Fan, Z.; Zhao, Y. DASNeRF: Depth consistency optimization, adaptive sampling, and hierarchical structural fusion for sparse view neural radiance fields. PLoS ONE 2025, 20, e0321878. [Google Scholar] [CrossRef] [PubMed]
  17. Niemeyer, M.; Barron, J.T.; Mildenhall, B.; Sajjadi, M.S.; Geiger, A.; Radwan, N. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5480–5490. [Google Scholar]
  18. Younis, T.; Cheng, Z. Sparse-View 3D Reconstruction: Recent Advances and Open Challenges. arXiv 2025, arXiv:2507.16406. [Google Scholar] [CrossRef]
  19. Truong, P.; Rakotosaona, M.J.; Manhardt, F.; Tombari, F. Sparf: Neural radiance fields from sparse and noisy poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4190–4200. [Google Scholar]
  20. Luo, C. Understanding diffusion models: A unified perspective. arXiv 2022, arXiv:2208.11970. [Google Scholar] [CrossRef]
  21. Gupta, P.; Ding, B.; Guan, C.; Ding, D. Generative AI: A systematic review using topic modelling techniques. Data Inf. Manag. 2024, 8, 100066. [Google Scholar] [CrossRef]
  22. Cheddak, A.; Ait Baha, T.; Es-Saady, Y.; El Hajji, M.; Baslam, M. BERTopic for enhanced idea management and topic generation in Brainstorming Sessions. Information 2024, 15, 365. [Google Scholar] [CrossRef]
  23. Sumantri, J.S.; Park, I.K. 360 panorama synthesis from a sparse set of images with unknown field of view. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 2386–2395. [Google Scholar]
  24. Akimoto, N.; Matsuo, Y.; Aoki, Y. Diverse plausible 360-degree image outpainting for efficient 3dcg background creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11441–11450. [Google Scholar]
  25. Liao, K.; Xu, X.; Lin, C.; Ren, W.; Wei, Y.; Zhao, Y. Cylin-Painting: Seamless 360 panoramic image outpainting and beyond. IEEE Trans. Image Process. 2023, 33, 382–394. [Google Scholar] [CrossRef]