1. Introduction
With the rapid advancement of network communication and multimedia technologies, short video services have experienced exponential growth in various fields, such as entertainment, industry, agriculture, sports, healthcare, and transportation. These services can be easily accessed through ubiquitous smart devices like mobile phones and desktop computers, regardless of temporal and spatial constraints. While short videos afford users abundant knowledge across diverse domains and rich entertainment experiences, they entail several critical challenges, namely production cost constraints, quality limitations, and creativity deficits.
The emergence of generative artificial intelligence (AI) marks a paradigm shift in human–computer interaction. Deep learning architectures trained on massive datasets enable the generation of diverse content, including text, images, and videos. By integrating advanced capabilities in natural language processing, computer vision, and synthesis reasoning, generative AI models produce semantically coherent outputs that surpass those of traditional rule-based models. Among these advancements, Artificial Intelligence Generated Content (AIGC), a specialized subset of generative AI, focuses explicitly on autonomous content creation and has become a critical research frontier.
AIGC can model intricate data distributions and analyze sophisticated contexts to produce context-aware multimodal sequences, allowing for semantic consistency and content coherence. Built upon advanced neural network architectures that capture long-range dependencies in data and feature a competitive learning structure, AIGC offers superior output quality and diversity in autonomous multimedia content generation. Moreover, it can automatically adapt to consumer preferences by customizing content to different proficiency levels, a mechanism that is pivotal for ensuring content relevance and effectiveness in seamless integration scenarios. By blending art styles from historical and modern design paradigms, AIGC enhances operational efficiency, creativity, and user-friendliness, thereby overcoming limitations such as high costs, labor-intensive manual work, stylistic uniformity, and professional barriers.
To meet the ever-increasing demand for short video services characterized by diversified and personalized content, AIGC has the potential to break through the bottlenecks in short-video creation [
1]. Utilizing deep learning algorithms, AIGC can identify latent patterns within large collections of short videos and synthesize innovative concepts through generative models, thereby significantly enhancing creative output. Additionally, AIGC outperforms traditional production pipelines in real-time personalization and cross-cultural adaptability. By analyzing user behavior data, it can generate location-specific video variants at scale, narrowing the gap between supply and evolving consumer needs. Despite the notable success of AIGC in high-quality video production, state-of-the-art approaches still suffer from two critical drawbacks: intra-clip temporal incoherence and inter-clip semantic jumps, both of which significantly compromise content quality.
On the one hand, diffusion models in AIGC are powerful in spatial generation but often fail to capture temporal dynamics effectively. As a result, single clips can exhibit unnatural motion patterns and blurry motion transitions, both of which disrupt the fluidity of actions and reduce visual realism. On the other hand, when users manually splice multiple generated video clips, the absence of explicit scene-evolution constraints causes a cascade of issues stemming from inconsistent visual style and disconnected character traits. Directly splicing clips often results in disordered narrative rhythm and abrupt audio-visual jumps, which seriously damage content coherence and hinder the audience’s ability to follow the narrative naturally.
This paper proposes a controllable multimodal short-video generation framework grounded in constrained diffusion models, which features several pivotal contributions.
Multimodal attention fusion: The proposed framework leverages a multimodal attention fusion mechanism that dynamically integrates text, image, and audio features to generate short videos with richly textured and layered content. By adaptively assigning different weights to the modalities at each stage of video generation, this mechanism improves the semantic coherence of the generated short videos and enriches the overall user experience, producing outputs that are both engaging and contextually relevant.
Scene graph-driven multi-segment consistency modeling: This approach maintains the integrity of short-video compositions by using two types of scene graphs that work in tandem. The global scene graph sets the overall guidelines, while the local scene graphs handle segment-specific refinements. By imposing constraints derived from both global and local scene graphs, the framework preserves scene consistency and logical coherence during the concatenation of multiple video segments.
Three-tier constraint mechanism: The proposed framework incorporates a three-tier constraint mechanism, which plays a crucial role in ensuring precise and adaptable short-video generation. User intent is parsed to extract key semantic elements, which are then transformed into explicit control signals. These signals are subsequently used for fine-grained control, guiding the entire generation process to align with user expectations.
This hierarchical controllable approach not only enhances the interpretability of user input but also enables real-time adjustments during generation, ensuring that the final short video output meets both semantic and aesthetic requirements. The remainder of this paper is organized as follows. Related work on AIGC-based short video generation is reviewed in
Section 2. The characteristics of short video generation are analyzed and the proposed framework is presented in
Section 3. Experimental results and subjective quality evaluations are reported in
Section 4. Finally, the conclusions of this work are summarized in
Section 5.
2. Related Work
2.1. AIGC Technology
Novel frameworks and cutting-edge methodologies have driven substantial advancements in AIGC, particularly within the specialized domain of foundation model adaptation for multimodal tasks. Xu et al. [
2] pioneered a systematic edge-cloud collaborative AIGC framework that delineates five core characteristics, with its architecture incorporating federated learning and differential privacy to address latency and security challenges, while proposing a holistic lifecycle model spanning data collection, pre-training, fine-tuning, and inference. Feng et al. [
3] put forward an edge-user collaborative inference framework (EUCI) for diffusion models to tackle the challenges of edge deployment; this framework attains a 42% reduction in latency and 58% energy savings through model partitioning and dynamic task offloading. For cross-modal generation, the comprehensive text-to-image architecture established in [
4] reconciles the strengths of GANs, autoregressive Transformers, and diffusion models, marking a pivotal advancement in the field. At its core, this framework emphasizes the dominant role of large multimodal encoders such as CLIP over incremental architectural refinements, a claim validated by robust zero-shot performance on benchmarks like LAION-5B and MS-COCO. Bevacqua et al. [
5] enhanced semantic segmentation through ControlNet-based synthetic image generation, an approach that leverages multimodal control with depth maps, Canny edges, and segmentation maps to achieve a 3.17% increase in mean intersection over union (mIoU) on the Cityscapes dataset. Concurrently, Yin et al. [
6] conducted a systematic analysis of the implications of AIGC in media production and proposed targeted strategies that have demonstrated tangible effects, such as a 23% reduction in non-compliant content achieved by mitigating risks like algorithmic bias. AIGC is currently in a stage of rapid development, with its models continuously enhancing performance to tackle complex problems and generate more creative content. Further progress is also needed to understand human emotions and contextual nuances better to enable broader adoption across more application scenarios.
2.2. AIGC-Driven Video Generation
Driven by advances in deep learning, multimodal understanding, and computational efficiency, research on AIGC-driven video generation has achieved remarkable progress. A core challenge in this field is to achieve high semantic alignment between input prompts and generated visual content, which also serves as a key approach to breaking through the bottlenecks in quality and efficiency of video generation. Zhang et al. [
7] proposed TRIP, a temporal residual learning framework specifically designed for image-to-video generation tasks. By leveraging an innovative dual-path noise prediction mechanism to preserve image prior information, this framework achieved a frame consistency rate of 95.36% on the WebVid-10M dataset, which represents a 3.77% improvement over VideoComposer. Despite the advances in general-purpose video synthesis, short-form videos pose distinct challenges, such as fast pacing, strong narrative appeal, and high production quality within a constrained duration.
To address these issues by enhancing computational efficiency, fine-grained control, and contextual coherence, Wang et al. [
8] proposed the LinGen framework, which achieves linear computational complexity via a novel Mamba-based architecture. This breakthrough significantly reduces the cost of long-context video generation and enables the production of high-resolution, minute-level AIGC clips on a single GPU, a capability that serves as the foundation for practical short-form video creation. Beyond the efficiency and duration constraints addressed above, maintaining internal consistency in short-form videos is also critical. Park et al. [
9] proposed the Context-Aware Lip Synchronization (CALS) framework, which leverages speech context and masked learning to generate accurate, spatiotemporally aligned lip movements from audio signals. This framework effectively enhances contextual relevance and temporal consistency, thereby elevating the perceptual realism of AIGC-driven video characters.
In the context of narrative control and variation within the constrained duration of short-form videos, traditional single-prompt paradigms often struggle to deliver dynamic storylines. To address this, Hong et al. [
10] proposed the DirecT2V approach, which leverages a Large Language Model (LLM) as a frame-level director to decompose abstract user prompts into logically coherent sequences of frame-level descriptions. This design enables zero-shot generation of short-form narratives featuring plot progression, character entrances and exits, and dynamic scene transitions, and it represents a notable step toward controllable, story-driven AIGC short-form video generation. However, DirecT2V depends heavily on the prompt decomposition capability of the upstream LLM. Moreover, it lacks explicit constraint mechanisms throughout the generation process, leading to inadequate control over fine-grained elements (e.g., motion trajectories and scene transitions). Collectively, the methods reviewed above lack a hierarchical control architecture capable of guiding the entire end-to-end generation workflow.
2.3. Technical Paradigms of AIGC-Driven Video Generation
AIGC-driven technical paradigms enable the rapid generation of customized short video content tailored to specific target audiences, while significantly reducing the time and computational resources traditionally required for content production. Among existing solutions, foundational models such as Midjourney and Stable Diffusion have demonstrated exceptional efficacy in text-to-image (T2I) tasks, which serve as a core component of AIGC-driven video generation pipelines.
As recently highlighted in related Electronics studies, the ongoing evolution of latent diffusion models (LDMs) provides a robust technical foundation for AIGC applications, particularly in maintaining high stylistic and visual consistency during complex generative processes [
11]. Specifically, Midjourney is constructed based on this advanced LDM architecture. It employs a bimodal encoder pre-trained on large-scale image-text paired datasets to map natural language prompts into high-dimensional semantic vectors, which in turn drive an iterative denoising process within the low-dimensional latent space. The generative pipeline initiates with a random noise tensor and leverages a U-Net-structured denoising network to progressively reconstruct image content through multi-step inference. The entire generative process is strictly conditioned by text embeddings, ensuring high-fidelity alignment between the generated content and the semantic intent of the input prompts. Furthermore, Midjourney supports the injection of auxiliary conditional information via uploaded reference images, adopting implicit control mechanisms analogous to ControlNet to achieve fine-grained regulation of visual composition, color schemes, and stylistic attributes. The final output is synthesized through a dedicated decoder and post-processing modules, yielding high-resolution and artistically expressive visual content. Additionally, based on the uploaded reference images, Midjourney provides adjustable parameters to maintain stylistic consistency and character continuity throughout the generated content.
Stable Diffusion operates through a process in which Gaussian noise is progressively introduced to degrade training data, after which this noise-infused process is reversed to recover the original data. Once trained, the model acquires the ability to synthesize novel data from random inputs, thereby achieving a form of algorithmic innovation. A core functionality of Stable Diffusion lies in its capacity to generate images that align with natural language descriptions, though its outputs exhibit considerable variability, a characteristic stemming from the stochastic nature inherent to diffusion models. Subsequently, ControlNet [
12], a plugin developed for Stable Diffusion, addresses the need for enhanced precision in guiding image generation. Its key capability lies in exerting finer control over the generative process through additional conditioning inputs such as edge maps and depth maps, enabling more accurate alignment of the outputs with the intended composition.
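The forward noising process described for diffusion models can be illustrated with a minimal sketch. It assumes a generic DDPM-style linear variance schedule; the schedule values, the toy four-element input, and all function names are illustrative assumptions and do not reproduce the actual configuration of Stable Diffusion, which operates on learned latents.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (illustrative values, num_steps >= 2)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] is the product of (1 - beta[s]) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


alpha_bars = cumulative_alphas(linear_betas(1000))
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]                     # toy "clean" sample
x_early = add_noise(x0, 10, alpha_bars, rng)   # still close to x0
x_late = add_noise(x0, 999, alpha_bars, rng)   # almost pure Gaussian noise
```

Because the cumulative signal coefficient shrinks toward zero at the last step, the final sample carries almost no information about the input, which is why a trained reverse process can start from random noise.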
Furthermore, Runway and Pika, as representative tools supporting text-to-video (T2V) and image-to-video (I2V) generation tasks, have achieved significant functional breakthroughs in recent years that redefine the paradigm of visual content creation. Leveraging advanced generative adversarial networks and diffusion-based architectures optimized for temporal coherence, these platforms empower creators to convert abstract text descriptions or static reference images into high-fidelity, smooth video clips within mere minutes. Beyond the core task of video synthesis, they integrate a suite of refined features including dynamic motion interpolation, adaptive detail enhancement for textures and lighting, and real-time preview capabilities that allow for iterative adjustments during the generation process.
To validate the applicability of such T2V/I2V tools in AIGC-driven short video production, we first utilized Midjourney and Stable Diffusion to generate images under identical text prompt conditions, with the results illustrated in
Figure 1. The comparison reveals subtle inconsistencies in stylistic details among the generated images, which may originate from the stochastic nature of diffusion models. Subsequently, we used these Midjourney-generated images together with the same text prompts as inputs to test the I2V capabilities of Runway and Pika separately; selected frames from the resulting video sequences are presented in
Figure 2. It can be observed that the video frames generated by both tools achieve temporal continuity and maintain the core semantic features of the input images. First, the scene transition regions retain recognizable visual information from adjacent frames, without complete loss of content details during the interpolation process. Second, the motion trajectories of key objects match the semantic intent implied by the text prompts, realizing the mapping from static image features to dynamic motion states. In addition, the generated video content presents a clear division between local segments, which lays a foundation for the construction of global logical coherence. Notably, existing schemes often rely on frame interpolation, a fundamental video processing technology that generates intermediate frames between consecutive original frames.
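As a point of reference, the simplest form of frame interpolation is a linear blend between two neighboring frames. The sketch below is a toy illustration on small grayscale grids; the function name and the sample frames are assumptions, and practical tools typically use motion-compensated or learned interpolation rather than plain blending.

```python
def interpolate_frames(frame_a, frame_b, num_intermediate):
    """Return `num_intermediate` linearly blended frames between two frames.

    Frames are nested lists (rows of pixel intensities) of equal shape.
    """
    frames = []
    for k in range(1, num_intermediate + 1):
        w = k / (num_intermediate + 1)  # blend weight of frame_b
        frames.append([
            [(1.0 - w) * a + w * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)
        ])
    return frames


frame_a = [[0.0, 0.0], [0.0, 0.0]]
frame_b = [[1.0, 1.0], [0.0, 1.0]]
middle = interpolate_frames(frame_a, frame_b, 1)[0]
```

When the two frames differ strongly, every blended pixel is a mixture of unrelated content, so the intermediate frames look ghosted rather than showing plausible motion.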
Although AIGC-driven technical paradigms have indeed revolutionized content creation by accelerating production workflows, their practical application in short video production remains constrained. When adjacent frames exhibit significant content differences or belong to distinct scenes, these approaches often result in blurred frames and scene discontinuity. Additionally, although multimodal solutions support text-to-video generation, they are hindered by the lack of fine-grained control mechanisms over motion trajectories and scene transitions, which makes it challenging to generate content-rich short videos. To address the aforementioned issues, in this paper, we propose a controllable multimodal fusion-based generation architecture. This architecture employs a three-tier constraint mechanism to achieve precise parsing and mapping of user intent, utilizes hierarchical multimodal attention fusion to enhance cross-modal semantic alignment and temporal consistency, and introduces a scene graph-driven multi-segment collaborative modeling mechanism to ensure visual and logical coherence from local to global levels, thereby improving generation quality while significantly enhancing controllability.
4. Experiments
4.1. AIGC-Driven Short Video
To demonstrate the effectiveness of the proposed framework, a complete short video generation pipeline is designed to specifically address three core challenges in video synthesis: scene inconsistency, unnatural motion, and blurred transitions. The experimental objective is to generate a coherent short video depicting a sequential scenario: a puppy running across a grassy field, followed by chasing a frisbee on a street. An overview of the multimodal inputs and hierarchical constraints employed in this experiment is presented in
Figure 11.
The three-tier constraint mechanism processes text descriptions, reference images, and audio inputs. The Intent Parsing Layer constructs a global scene graph that encodes logical relationships across segments, such as the puppy’s movement from the field to the street. It also extracts local scene constraints from reference images, including details like grass color and street lighting, while analyzing audio to derive ambiance features such as rhythm and emotional tone. This enables comprehensive multimodal scene understanding.
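To make the scene-graph idea concrete, the following sketch shows one possible in-memory representation and a simple cross-segment consistency check. The class names, the attribute keys, and the example values (such as the coat color) are hypothetical and are not taken from the framework itself.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentGraph:
    """Local scene graph for one segment (hypothetical structure)."""
    name: str
    entities: dict                                   # entity -> attributes
    relations: list = field(default_factory=list)    # (subject, predicate, object)


@dataclass
class GlobalSceneGraph:
    """Global graph linking segments through ordered transitions."""
    segments: list
    transitions: list = field(default_factory=list)  # (from, to, description)

    def inconsistent_attributes(self):
        """List (entity, attribute, values) that differ across segments."""
        seen = {}
        for segment in self.segments:
            for entity, attrs in segment.entities.items():
                for key, value in attrs.items():
                    seen.setdefault((entity, key), set()).add(value)
        return sorted(
            (entity, key, sorted(values))
            for (entity, key), values in seen.items()
            if len(values) > 1
        )


field_seg = SegmentGraph(
    name="field",
    entities={"puppy": {"coat": "golden"}, "grass": {"color": "green"}},
    relations=[("puppy", "runs_on", "grass")],
)
street_seg = SegmentGraph(
    name="street",
    entities={"puppy": {"coat": "golden"}, "street": {"lighting": "dusk"}},
    relations=[("puppy", "chases", "frisbee")],
)
graph = GlobalSceneGraph(
    segments=[field_seg, street_seg],
    transitions=[("field", "street", "camera pan")],
)
conflicts = graph.inconsistent_attributes()  # empty when identity is preserved
```

A check of this kind only flags symbolic mismatches; in a generation pipeline the same relations would have to be turned into conditioning signals for the generator.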
The Hierarchical Multimodal Attention Fusion mechanism dynamically integrates action logic from text with visual style from reference images. The global scene graph ensures logical transitions between segments, such as the naturalness of the puppy’s movement between the field and street scenes. Local scene constraints maintain temporal and stylistic consistency within each segment, such as lighting continuity. User control further allows fine-grained adjustments, such as modifying camera motion or duration, to align with creative intent. The proposed framework ultimately generates a high-quality, coherent short video with layered content.
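The dynamic integration of modalities can be sketched as a softmax-weighted sum of per-modality feature vectors. This is a simplified illustration under stated assumptions: the relevance scores are set by hand here, whereas an attention module would compute them from learned query and key projections, and the function names and toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, relevance_scores):
    """Fuse equal-length feature vectors with softmax-normalized weights.

    Returns the per-modality weights and the fused feature vector.
    """
    names = sorted(features)
    weights = softmax([relevance_scores[name] for name in names])
    dim = len(features[names[0]])
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


features = {"text": [1.0, 0.0], "image": [0.0, 1.0], "audio": [0.0, 0.0]}

# Equal scores reproduce a fixed, equal-weight fusion.
equal_w, equal_fused = fuse_modalities(
    features, {"text": 0.0, "image": 0.0, "audio": 0.0})

# A higher text score shifts the fused feature toward the action description.
action_w, action_fused = fuse_modalities(
    features, {"text": 2.0, "image": 0.5, "audio": 0.0})
```

Changing the scores from stage to stage changes how strongly each modality steers generation, which is the behavior that distinguishes adaptive weighting from a fixed equal-weight baseline.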
Specifically, the framework produces two initial video segments: Segment 1 depicts a puppy running on the grassy field with a duration of 5 s, and Segment 2 shows the puppy chasing a frisbee on the street, also lasting 5 s. Style consistency is maintained in accordance with user inputs, such as grass color and street lighting parameters. Moreover, user-directed transitions (e.g., a 3 s camera pan from the field to the street) are seamlessly integrated between the two segments, which ensures visual continuity and smooth narrative flow. Corresponding key frame sequences for each stage of video generation are presented in
Figure 12.
4.2. Experimental Settings and Runtime Analysis
The experiments were conducted on a unified multimodal short-video dataset constructed from publicly available video-caption data. A total of 1200 clips were selected from the WebVid-10M dataset according to four criteria, namely clear subject motion, the presence of at least one scene transition or viewpoint change, semantically meaningful captions, and valid audio tracks. For each clip, a representative reference image was extracted from the first second of the video, while the original caption and audio track were retained as the text and audio conditions, respectively.
The dataset was divided into 960 training clips, 120 validation clips, and 120 test clips. All videos were normalized to a resolution of 1024 × 576 at 24 fps. To facilitate multi-segment consistency modeling, each training sample was further segmented into 2–3 semantically coherent sub-clips. The same segmentation protocol was adopted for all internal variants in the comparative experiments.
The text encoder was initialized with BERT-base, the image encoder with CLIP ViT-B/32, and the audio encoder with Wav2Vec2-base. The multimodal fusion module used a hidden dimension of 512 with 4 attention heads, while the scene-graph reasoning block was implemented as a 2-layer graph attention network. The training procedure consisted of two stages. In Stage I, the fusion module, the scene-graph module, and the three-tier constraint mechanism were optimized for 20 epochs while the backbone generator remained frozen. In Stage II, the full model was jointly fine-tuned for an additional 10 epochs. AdamW was adopted as the optimizer, with a separate initial learning rate for each stage, a batch size of 8, and a cosine decay schedule with 1000 warm-up steps. The overall training objective was defined as a weighted sum of three terms,

$$\mathcal{L}_{total} = \mathcal{L}_{gen} + \lambda_{1}\,\mathcal{L}_{cons} + \lambda_{2}\,\mathcal{L}_{bio},$$

where $\mathcal{L}_{gen}$ denotes the backbone generation loss, $\mathcal{L}_{cons}$ represents the temporal structural and perceptual coherence objective defined in Equation (2), and $\mathcal{L}_{bio}$ denotes the biomechanical regularization term described in Section 3.2.3. On the validation set, $\lambda_{1}$ was fixed to 0.5, and the two internal coefficients of the consistency term were likewise held fixed. The coefficient $\lambda_{2}$ was selected as 0.1 based on validation-set performance.
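The learning-rate schedule described above can be sketched as follows. This is a minimal illustration assuming linear warm-up, per-step updates, and decay to zero at the end of a stage; the function name is hypothetical, and the base learning rate is left as a free parameter rather than a value taken from the experiments.

```python
import math


def lr_at_step(step, base_lr, total_steps, warmup_steps=1000):
    """Learning rate with linear warm-up followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Step counts implied by 960 training clips and a batch size of 8.
steps_per_epoch = 960 // 8              # optimizer steps per epoch
stage1_steps = 20 * steps_per_epoch     # Stage I: 20 epochs
stage2_steps = 10 * steps_per_epoch     # Stage II: 10 epochs

# Normalized schedule (base_lr = 1.0) sampled across Stage I.
stage1_curve = [lr_at_step(s, 1.0, stage1_steps) for s in range(0, stage1_steps, 400)]
```

Under these assumptions Stage I spans 2400 optimizer steps, so the 1000 warm-up steps occupy a little over 40% of that stage.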
All experiments were conducted on a workstation equipped with two NVIDIA RTX PRO 6000 GPUs (96 GB each). In the proposed framework, the additional computational overhead mainly arises from the scene-graph reasoning process, the dynamic multimodal weighting mechanism, and the process-level constraint regularization, whereas the generation backbone remains the dominant component of the overall computational cost. Under this hardware setting, the framework was able to complete multi-segment short-video generation with acceptable runtime and memory consumption, which supports its applicability to offline short-video content creation.
As summarized in
Table 1, the additional controllable modules introduce a moderate increase in inference time and memory usage, while the overall computational burden remains dominated by the backbone generator. From a computational complexity perspective, the additional cost introduced by the proposed control modules is additive relative to the backbone generator. Let $M$ denote the number of multimodal tokens, $|V|$ the number of graph nodes, $|E|$ the number of graph edges, $H$ the number of attention heads, and $d$ the hidden dimension. The hierarchical multimodal attention module introduces an additional interaction cost on the order of $O(M^{2}d)$, while the two-layer graph attention block contributes approximately $O(|E|Hd + |V|)$. The process-level biomechanical regularization term is computed on sparse pose trajectories and therefore scales approximately linearly with the number of tracked joints and generated frames. Under a fixed token budget and sparse graph structure, the overall additional overhead scales approximately linearly with the number of generated segments, whereas the generation backbone remains the dominant component of the end-to-end computational burden.
4.3. Performance Comparison
To provide an external reference for performance comparison, the proposed framework was compared with two representative commercial tools, Pika 1.0 and Runway Gen-2, as well as the academic method TRIP [
7]. Since the commercial tools are closed-source and do not provide standardized academic benchmark results, their metrics were estimated from a customized set of 100 video clips collected from their official public demonstrations under prompts and reference conditions aligned as closely as possible with those used in this study. For the TRIP method, the reported metrics on the WebVid-10M dataset were adopted as a reference. Therefore, these results are intended as an external reference comparison, since the compared methods were not evaluated under fully identical settings. Unless otherwise noted, all self-computed metrics reported in
Table 2 were evaluated on videos normalized to a resolution of 1024 × 576 at 24 fps.
The evaluation was conducted from five aspects to characterize the overall generation quality. Inter-frame structural consistency was measured using 4-frame SSIM, which computes the mean Structural Similarity Index over a sliding window of four consecutive frames [
18,
19,
20]. Motion plausibility was evaluated by Kinematic Error, defined as the average angular deviation with respect to biomechanical constraints derived from the Human3.6M dataset. Style and semantic consistency were assessed using Fréchet Inception Distance (FID) and Costume
ΔE in the CIELAB color space, respectively. In addition, long-term temporal stability was evaluated using the F-Consistency metric adopted from [
7], which measures temporal coherence across frames. To better evaluate identity consistency and cross-segment coherence, Costume
ΔE and F-Consistency were further adopted as dedicated quantitative indicators.
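Two of these metrics can be sketched in a few lines. The example below makes several simplifying assumptions: SSIM is computed over the whole frame as a single window instead of local Gaussian windows, the 4-frame score is read as the average SSIM of consecutive pairs inside each sliding window, and ΔE uses the CIE76 Euclidean form. The exact variants used in the evaluation are not specified here, and all function names are illustrative.

```python
import math


def global_ssim(x, y, dynamic_range=1.0):
    """Single-window SSIM between two flattened grayscale frames."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def four_frame_ssim(frames, window=4):
    """Mean over sliding windows of the average consecutive-pair SSIM."""
    if len(frames) < window:
        raise ValueError("need at least `window` frames")
    window_scores = []
    for start in range(len(frames) - window + 1):
        clip = frames[start:start + window]
        pairs = [global_ssim(a, b) for a, b in zip(clip, clip[1:])]
        window_scores.append(sum(pairs) / len(pairs))
    return sum(window_scores) / len(window_scores)


def delta_e_cie76(lab_1, lab_2):
    """CIE76 color difference: Euclidean distance in CIELAB space."""
    return math.dist(lab_1, lab_2)


static_clip = [[0.1, 0.5, 0.9, 0.3]] * 5        # five identical toy frames
stability = four_frame_ssim(static_clip)        # close to 1 for a static clip
costume_shift = delta_e_cie76((50.0, 0.0, 0.0), (50.0, 3.0, 4.0))
```

For the costume metric, the two CIELAB triples would be the mean colors of the same clothing region in different segments, so a smaller distance indicates better identity preservation.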
As presented in
Table 2 (where ↑ indicates that higher values are better and ↓ indicates that lower values are better), the proposed framework exhibits competitive performance across multiple metrics related to structural consistency, motion plausibility, style preservation, and temporal stability. Given that the compared methods were not all evaluated under fully identical experimental conditions, these results are interpreted as indicative rather than conclusive. In terms of structural quality, the proposed framework achieved a 4-frame SSIM of 0.92, which is higher than that of Pika (0.73) and Runway (0.68) under the present reference setting. The higher SSIM value is consistent with the observed reduction in motion blur relative to these commercial baselines. Furthermore, the incorporation of biomechanical constraints within the Process Constraint Layer reduces the Kinematic Error to 0.18 rad, indicating improved motion plausibility and a lower degree of physically implausible distortion relative to the compared methods. Benefiting from the Global Constraint Graph, the proposed framework achieves the lowest Costume
ΔE (4.3) and Style FID (18.2) among the compared methods, suggesting stronger identity preservation and style consistency over time. Finally, the F-Consistency reaches 94.82%, which is close to that of TRIP (95.36%), suggesting that the introduction of multimodal control does not lead to an obvious degradation in temporal stability.
4.4. Ablation Study
To quantify the contribution of each major component, ablation experiments were conducted under the same training and evaluation settings described in
Section 4.2. Specifically, we considered four core variants by removing the scene-graph consistency module, disabling the three-tier constraint mechanism, replacing the dynamic weighting strategy with fixed equal weights, and removing the refinement stage. In addition, to further isolate the role of process-level motion regularization, we evaluated an additional variant without biomechanical constraints.
As shown in
Table 3, removing the scene-graph module leads to the most evident degradation in cross-segment consistency-related metrics, including Costume
ΔE and F-Consistency, which confirms its importance in preserving scene continuity and identity stability across multiple segments. Disabling the three-tier constraint mechanism produces a broader performance decline, especially in motion plausibility and subjective quality, indicating that hierarchical control signals play a key role in aligning generation with user intent. Replacing dynamic weighting with fixed equal weights also degrades overall performance, suggesting that adaptive modality balancing is beneficial for multimodal coordination under varying scene conditions. In contrast, removing the refinement stage causes a relatively smaller drop in objective metrics but still leads to a noticeable decline in MOS, implying that the refinement stage mainly contributes to perceptual smoothness and user-perceived quality.
Moreover, the variant without biomechanical constraints exhibits the largest increase in Kinematic Error, which further verifies the effectiveness of process-level motion regularization in suppressing physically implausible distortions. Overall, the ablation results demonstrate that the proposed framework benefits from the joint contribution of scene-graph reasoning, hierarchical control, adaptive multimodal weighting, and process-level refinement. All reported objective results were averaged over three independent runs, and statistical significance was assessed using paired two-sided tests.
4.5. Subjective Quality Evaluation
Additionally, we conducted a double-blind subjective evaluation with 30 professional video creators using a 5-point Mean Opinion Score (MOS) scale (ranging from 1 to 5). From the test split, 40 prompts were randomly selected, and each method generated one corresponding short video for each prompt. All videos were anonymized and presented to the evaluators in randomized order under the same viewing conditions. Each video was assessed independently from three aspects, namely visual quality, user control precision, and multimodal fusion effect, using the same 5-point scoring scale. To reduce subjective bias, written scoring guidelines and several warm-up examples were provided before the formal evaluation, and each video was rated by at least 15 participants. The subjective scores were reported as mean values with 95% confidence intervals.
As statistically summarized in
Table 4, the proposed framework achieved consistently strong subjective performance across the three evaluated dimensions, namely Visual Quality, User Control Precision, and Multimodal Fusion Effect. When averaged over these three dimensions, the proposed method attained an overall MOS of 4.34 ± 0.22 (95% CI), outperforming the commercial baselines, Pika (3.02 ± 0.41) and Runway (3.17 ± 0.38), by approximately 44% and 37%, respectively. The narrower confidence interval of the overall score further suggests the relative stability of the proposed framework in subjective perceptual quality.
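The summary statistics above can be reproduced in form with a short sketch. It assumes a normal-approximation confidence interval (z = 1.96) over the ratings of a single video or method; the rating list is hypothetical, and the interval method actually used in the evaluation is not specified here. The relative gains are computed directly from the reported overall MOS values.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return the mean opinion score and the half-width of its 95% CI."""
    if len(scores) < 2:
        raise ValueError("need at least two ratings")
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return mean(scores), half_width


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Hypothetical ratings from 15 evaluators on the 5-point scale.
ratings = [4, 5, 4, 5, 4, 4, 5, 3, 4, 5, 4, 5, 4, 4, 5]
mos, half_width = mos_with_ci(ratings)

# Gains computed from the reported overall MOS values.
gain_over_pika = relative_gain(4.34, 3.02)
gain_over_runway = relative_gain(4.34, 3.17)
```

With the reported means, the two gains evaluate to roughly 43.7% and 36.9%. With only 15 to 30 raters, a Student t interval would be slightly wider than the normal approximation used in this sketch.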
Specifically, the framework achieved its highest rating in Visual Quality (4.40), which is consistent with the role of the biomechanical constraints in the Process Constraint Layer in reducing visually implausible motion artifacts such as limb distortion and unnatural transitions. It also obtained a high score in User Control Precision (4.38), indicating that the proposed framework can follow relatively complex user instructions more accurately than the commercial baselines.
However, a trade-off was also observed: 23% of evaluators reported initial operational complexity during constraint specification. This suggests that the gain in control precision comes at the cost of a higher cognitive load during interaction. This initial complexity mainly arose from the need to coordinate multiple control dimensions, including identity consistency, scene transition, and motion plausibility, during constraint specification. It also suggests that more intuitive parameter guidance and preset control templates may be needed to reduce the learning burden for users who are less familiar with multi-parameter editing. In terms of Multimodal Fusion Effect, the proposed method achieved a score of 4.25, suggesting that the scene graph-driven consistency modeling and multimodal attention mechanism effectively improve cross-modal coordination and multi-segment coherence.
5. Conclusions
This study presented a controllable multimodal fusion architecture for AIGC-driven short-video generation. To overcome the prevalent issues of intra-clip motion distortion and inter-clip identity drift, we developed a framework that integrates a hierarchical multimodal attention mechanism, scene graph-driven consistency modeling, and a three-tier constraint mechanism with biomechanical regularization. Experimental evaluations confirmed that this approach successfully suppresses physically implausible motion artifacts while maintaining high structural consistency and style preservation across generated segments.
However, the proposed architecture currently exhibits several limitations. The reliance on graph reasoning and constraint optimization inevitably increases the computational cost of the generation pipeline. Additionally, the requirement for users to specify multiple control parameters introduces a noticeable learning curve, which was reflected as initial operational complexity in our subjective evaluations. Furthermore, the framework’s performance has primarily been validated on short-form multi-segment generation, leaving its scalability to extended narrative sequences largely unexplored.
Future research will therefore focus on optimizing the computational efficiency of the graph-based control modules to facilitate faster inference. We also plan to develop more intuitive, template-driven interaction mechanisms to better balance fine-grained controllability with ease of use. Finally, expanding the evaluation to larger-scale unified benchmarks involving long-range narrative video generation will be a critical next step to further validate the framework’s robustness.