AIGC-Driven Short Video Generation Based on the Controllable Multimodal Fusion Architecture

Zhu, Yan; Li, Wei; Fan, Caixia; Yu, Lu

doi:10.3390/electronics15091783



by Yan Zhu, Wei Li *, Caixia Fan and Lu Yu

Department of Information Science, Xi’an University of Technology, Xi’an 710048, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(9), 1783; https://doi.org/10.3390/electronics15091783

Submission received: 8 March 2026 / Revised: 2 April 2026 / Accepted: 17 April 2026 / Published: 22 April 2026

(This article belongs to the Topic Advanced Development and Applications of AI-Generated Content (AIGC))


Abstract

The utilization of Artificial Intelligence-Generated Content (AIGC) has attracted widespread attention in video content creation. To generate high-quality videos, this paper presents a controllable multimodal fusion architecture for AIGC-driven short-video production. This architecture employs hierarchical constraint mechanisms and a multimodal attention fusion mechanism to enhance video content coherence and user controllability. Specifically, a scene coherence scheme is first designed to construct graph-based global and transition-level constraints by integrating text descriptions, reference images, and audio features. By leveraging the extracted style vector data, preliminary video clips are then generated through a combination of the cross-modal fusion unit and the spatio-temporal consistency unit. Finally, a fine-grained adjustment mechanism is implemented to ensure logical consistency and stylistic uniformity in the AIGC-generated videos. Experimental results indicate that the proposed architecture improves generation quality, controllability, and cross-segment coherence under the adopted evaluation settings.

Keywords: AIGC; short video generation; multimodal fusion; user-intention constraints

1. Introduction

With the rapid advancement of network communication and multimedia technologies, short video services have experienced exponential growth in various fields, such as entertainment, industry, agriculture, sports, healthcare, and transportation. These services can be easily accessed through ubiquitous smart devices like mobile phones and desktop computers, regardless of temporal and spatial constraints. While short videos afford users abundant knowledge across diverse domains and rich entertainment experiences, they entail several critical challenges, namely production cost constraints, quality limitations, and creativity deficits.

The emergence of generative artificial intelligence (AI) represents a paradigm shift in human–computer interaction. Enabled by deep learning architectures trained on massive datasets, it supports the generation of text, images, and videos. By integrating advanced capabilities in natural language processing, computer vision, and synthesis reasoning, generative AI models produce semantically coherent outputs that surpass those of traditional rule-based models. Among these advances, Artificial Intelligence-Generated Content (AIGC), a specialized subset of generative AI, focuses explicitly on autonomous content creation and has become a critical research frontier.

AIGC can model intricate data distributions and analyze sophisticated contexts to produce context-aware multimodal sequences, which supports semantic consistency and content coherence. Built upon advanced neural network architectures that capture long-range dependencies in data and feature a competitive learning process, AIGC offers superior output quality and diversity in autonomous multimedia content generation. Moreover, it can automatically adapt to consumer preferences by customizing proficiency levels, a mechanism that is pivotal for ensuring content relevance and effectiveness in seamless integration scenarios. By blending art styles from historical and modern design paradigms, AIGC enhances operational efficiency, creativity, and user-friendliness, consequently overcoming limitations such as high costs, manual labor intensity, stylistic uniformity, and professional barriers.

To meet the ever-increasing demand for short video services with diversified and personalized content, AIGC has the potential to break through the bottlenecks in short-video creation [1]. Using deep learning algorithms, AIGC can identify latent patterns across large collections of short videos and synthesize innovative concepts through generative models, thereby significantly enhancing creative output. Additionally, AIGC outperforms traditional production in real-time personalization and cross-cultural adaptability. By analyzing user behavior data, it can generate location-specific video variants at scale, narrowing the gap between supply and evolving consumer needs. Despite the notable success of AIGC in high-quality video production, state-of-the-art approaches still suffer from two critical drawbacks: intra-clip temporal incoherence and inter-clip semantic jumps, both of which significantly compromise content efficiency.

On the one hand, diffusion models in AIGC, while powerful in spatial generation, often fail to capture temporal dynamics effectively. As a result, single clips can exhibit unnatural motion patterns and blurry motion transitions, both of which disrupt the fluidity of actions and reduce visual realism. On the other hand, when users manually splice multiple generated video clips, the absence of explicit scene-evolution constraints triggers a cascade of issues stemming from inconsistent visual style and disconnected character traits. Directly splicing clips often results in disordered narrative rhythm and abrupt audio-visual jumps, seriously damaging content coherence and hindering the audience’s ability to follow the narrative naturally.

This paper proposes a controllable multimodal short-video generation framework grounded in constrained diffusion models. Its main contributions are as follows.

Multimodal attention fusion: The proposed framework leverages a multimodal attention fusion mechanism that dynamically integrates text, image, and audio features, resulting in short videos with richly textured and layered content. By adaptively assigning different weights to the modalities at each stage of video generation, this mechanism not only improves the semantic coherence of the generated short videos but also enriches the overall user experience, producing outputs that are both engaging and contextually relevant.
Scene graph-driven multi-segment consistency modeling: This approach maintains the integrity of short-video compositions by using two types of scene graphs in tandem. The global scene graph sets the overall guidelines, while the local scene graphs handle segment-specific refinements. By imposing constraints derived from both global and local scene graphs, the framework promotes scene consistency and logical coherence when multiple video segments are concatenated.
Three-tier constraint mechanism: The proposed framework incorporates a three-tier constraint mechanism, which plays a crucial role in ensuring precise and adaptable short-video generation. User intent is thoroughly analyzed and understood for extracting key semantic elements, which are then transformed into explicit control signals. These signals are subsequently employed to achieve fine-grained control, guiding the entire generation process to align seamlessly with user expectations.

This hierarchical controllable approach not only enhances the interpretability of user input but also enables real-time adjustments during generation, ensuring that the final short video output meets both semantic and aesthetic requirements. The remainder of this paper is organized as follows. Section 2 reviews related work on AIGC-based short video generation. Section 3 analyzes the characteristics of generated short videos and presents the proposed framework. Section 4 reports experimental results and subjective quality evaluations. Section 5 concludes the paper.

2. Related Work

2.1. AIGC Technology

Novel frameworks and methodologies have driven substantial advances in AIGC, particularly in adapting foundation models to multimodal tasks. Xu et al. [2] proposed a systematic edge-cloud collaborative AIGC framework that delineates five core characteristics. Its architecture incorporates federated learning and differential privacy to address latency and security challenges, and it proposes a holistic lifecycle model spanning data collection, pre-training, fine-tuning, and inference. Feng et al. [3] put forward an edge-user collaborative inference (EUCI) framework for diffusion models to tackle the challenges of edge deployment; through model partitioning and dynamic task offloading, it attains a 42% reduction in latency and 58% energy savings. For cross-modal generation, the work in [4] established a comprehensive architecture for text-to-image generation that reconciles the strengths of GANs, autoregressive Transformers, and diffusion models. At its core, this framework prioritizes large multimodal encoders such as CLIP over incremental architectural refinements, a claim supported by robust zero-shot performance on benchmarks like LAION-5B and MS-COCO. Bevacqua et al. [5] enhanced semantic segmentation through ControlNet-based synthetic image generation, leveraging multimodal control with depth maps, Canny edges, and segmentation maps to achieve a 3.17% increase in mean intersection over union (mIoU) on the Cityscapes dataset. Concurrently, Yin et al. [6] systematically analyzed the implications of AIGC in media production and proposed targeted strategies with tangible effects, such as a 23% reduction in non-compliant content achieved by mitigating risks like algorithmic bias.

AIGC is currently in a stage of rapid development, with its models continuously improving in performance to tackle complex problems and generate more creative content. Further progress is still needed in understanding human emotions and contextual nuances to enable broader adoption across more application scenarios.

2.2. AIGC-Driven Video Generation

Driven by advances in deep learning, multimodal understanding, and computational efficiency, research on AIGC-driven video generation has achieved remarkable progress. A core challenge in this field is to achieve high semantic alignment between input prompts and generated visual content, which also serves as a key approach to breaking through the bottlenecks in quality and efficiency of video generation. Zhang et al. [7] proposed a temporal residual learning framework specifically designed for image-to-video generation tasks. By leveraging an innovative dual-path noise prediction mechanism to preserve image prior information, this framework achieved a frame consistency rate of 95.36% on the WebVid-10M dataset, which represents a 3.77% improvement over Video Composer. Despite the advances in general-purpose video synthesis, the requirements of short-form videos pose distinct challenges, such as fast pacing, strong narrative appeal, and high production quality within a constrained duration.

To address these issues by enhancing computational efficiency, fine-grained control, and contextual coherence, Wang et al. [8] proposed the LinGen framework, which achieves linear computational complexity via a novel Mamba-based architecture. This breakthrough significantly reduces the cost of long-context video generation and enables the production of high-resolution, minute-level AIGC clips on a single GPU, a capability that serves as the foundation for practical short-form video creation. Beyond the efficiency and duration constraints addressed above, maintaining internal consistency in short-form videos is also critical. Park et al. [9] proposed the Context-Aware Lip Synchronization (CALS) framework, which leverages speech context and masked learning to generate accurate, spatiotemporally aligned lip movements from audio signals. This framework effectively enhances contextual relevance and temporal consistency, thereby elevating the perceptual realism of AIGC-driven video characters.

Within the constrained duration of short-form videos, traditional single-prompt paradigms often struggle to deliver dynamic storylines with narrative control and variation. To address this, Hong et al. [10] proposed the DirecT2V approach, which leverages a large language model (LLM) as a frame-level director to decompose abstract user prompts into logically coherent sequences of frame-level descriptions. This design enables zero-shot generation of short-form narratives featuring plot progression, character entrances and exits, and dynamic scene transitions, and it represents a notable step toward controllable, story-driven AIGC short-form video generation. However, DirecT2V depends heavily on the prompt decomposition capability of the upstream LLM. Moreover, it lacks explicit constraint mechanisms throughout the generation process, leading to inadequate control over fine-grained elements (e.g., motion trajectories and scene transitions). Collectively, these methods lack a hierarchical control architecture capable of guiding the entire end-to-end generation workflow.

2.3. Technical Paradigms of AIGC-Driven Video Generation

AIGC-driven technical paradigms enable the rapid generation of customized short video content tailored to specific target audiences, while significantly reducing the time and computational resources traditionally required for content production. Among existing solutions, foundational models such as Midjourney and Stable Diffusion have demonstrated exceptional efficacy in text-to-image (T2I) tasks, which serve as a core component of AIGC-driven video generation pipelines.

As recently highlighted in related Electronics studies, the ongoing evolution of latent diffusion models (LDMs) provides a robust technical foundation for AIGC applications, particularly in maintaining high stylistic and visual consistency during complex generative processes [11]. Specifically, Midjourney is constructed based on this advanced LDM architecture. It employs a bimodal encoder pre-trained on large-scale image-text paired datasets to map natural language prompts into high-dimensional semantic vectors, which in turn drive an iterative denoising process within the low-dimensional latent space. The generative pipeline initiates with a random noise tensor and leverages a U-Net-structured denoising network to progressively reconstruct image content through multi-step inference. The entire generative process is strictly conditioned by text embeddings, ensuring high-fidelity alignment between the generated content and the semantic intent of the input prompts. Furthermore, Midjourney supports the injection of auxiliary conditional information via uploaded reference images, adopting implicit control mechanisms analogous to ControlNet to achieve fine-grained regulation of visual composition, color schemes, and stylistic attributes. The final output is synthesized through a dedicated decoder and post-processing modules, yielding high-resolution and artistically expressive visual content. Additionally, based on the uploaded reference images, Midjourney provides adjustable parameters to maintain stylistic consistency and character continuity throughout the generated content.

Stable Diffusion operates through a process in which Gaussian noise is progressively introduced to degrade training data, after which this noise-infused process is reversed to recover the original data. Once trained, the model acquires the ability to synthesize novel data from random inputs, thereby achieving a form of algorithmic innovation. A core functionality of Stable Diffusion lies in its capacity to generate images that align with natural language descriptions, though its outputs exhibit considerable variability, a characteristic stemming from the stochastic nature inherent to diffusion models. Subsequently, ControlNet [12], a plugin developed for Stable Diffusion, addresses the need for enhanced precision in guiding image generation. Its key capability lies in exerting finer control over the generative process, enabling more accurate alignment of outcomes.
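The forward (noising) half of this process can be illustrated with a toy sketch. It uses the standard closed form from denoising diffusion models, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ε drawn from a standard Gaussian. The linear noise schedule, the step count, and all function names are illustrative assumptions, not the configuration of Stable Diffusion itself, which additionally operates in a learned latent space and trains a network to reverse the process.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Illustrative linear noise schedule (assumed values, not from the paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """Running product of (1 - beta_t): the share of signal variance kept at step t."""
    kept, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        kept.append(product)
    return kept


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form, element by element."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
alpha_bars = cumulative_alpha(linear_beta_schedule(1000))
x0 = [1.0] * 8                                  # stand-in for clean data
x_early = add_noise(x0, 10, alpha_bars, rng)    # still close to x0
x_late = add_noise(x0, 999, alpha_bars, rng)    # almost pure Gaussian noise
```

Because the retained signal share shrinks monotonically with t, late steps carry almost no information about the input, which is why sampling can start from random noise.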

Furthermore, Runway and Pika, as representative tools supporting text-to-video (T2V) and image-to-video (I2V) generation tasks, have achieved significant functional breakthroughs in recent years that redefine the paradigm of visual content creation. Leveraging advanced generative adversarial networks and diffusion-based architectures optimized for temporal coherence, these platforms empower creators to convert abstract text descriptions or static reference images into high-fidelity, smooth video clips within mere minutes. Beyond the core task of video synthesis, they integrate a suite of refined features including dynamic motion interpolation, adaptive detail enhancement for textures and lighting, and real-time preview capabilities that allow for iterative adjustments during the generation process.

To validate the applicability of such T2V/I2V tools in AIGC-driven short video production, we first used Midjourney and Stable Diffusion to generate images under identical text prompt conditions, with the results illustrated in Figure 1. The generated images reveal subtle inconsistencies in stylistic details, which may originate from the stochastic nature of diffusion models. Subsequently, we used the Midjourney-generated images together with the same text prompts as inputs to test the I2V capabilities of Runway and Pika separately; selected frames from the resulting video sequences are presented in Figure 2. It can be observed that the video frames generated by both tools achieve temporal continuity and maintain the core semantic features of the input images. First, the scene transition regions retain recognizable visual information from adjacent frames, without complete loss of content details during the interpolation process. Second, the motion trajectories of key objects match the semantic intent implied by the text prompts, realizing the mapping from static image features to dynamic motion states. In addition, the generated video content presents a clear division between local segments, which lays a foundation for the construction of global logical coherence. Existing schemes often rely on frame interpolation, a fundamental video processing technique that generates intermediate frames between consecutive original frames.

Although AIGC-driven technical paradigms have indeed revolutionized content creation by accelerating production workflows, their practical application in short video production remains constrained. When adjacent frames exhibit significant content differences or belong to distinct scenes, these approaches often result in blurred frames and scene discontinuity. Additionally, although multimodal solutions support text-to-video generation, they are hindered by the lack of fine-grained control mechanisms over motion trajectories and scene transitions, which makes it challenging to generate content-rich short videos. To address the aforementioned issues, in this paper, we propose a controllable multimodal fusion-based generation architecture. This architecture employs a three-tier constraint mechanism to achieve precise parsing and mapping of user intent, utilizes hierarchical multimodal attention fusion to enhance cross-modal semantic alignment and temporal consistency, and introduces a scene graph-driven multi-segment collaborative modeling mechanism to ensure visual and logical coherence from local to global levels, thereby improving generation quality while significantly enhancing controllability.

3. Method

3.1. Characteristics of Generated Short Videos

AIGC-driven short videos, as a form of digital content, are characterized by brief duration, lightweight production, fragmented consumption, and high transmissibility. Through intuitive visual presentation, concise narrative structure, and diverse format design, such videos efficiently capture user attention, which aligns well with the modern demand for rapid information acquisition amid fast-paced lifestyles. Technically, AIGC empowers automatic visual creation, scene matching, and dynamic effect generation for short videos, thereby expanding creative boundaries and significantly improving content production efficiency. Nevertheless, the AIGC-driven generation of short videos still presents several issues that demand further refinement and optimization of the generation framework. It should be noted that the generative models utilized for the case studies in this section, including the Midjourney model (Midjourney, Inc., San Francisco, CA, USA) for initial image generation, as well as the Runway Gen-2 (Runway AI, Inc., New York, NY, USA) and Pika (Pika Labs, Inc., Palo Alto, CA, USA) models for subsequent video generation, were all accessed and operated through a unified third-party web-based interface (https://mid.mjdraw.cn).

3.1.1. Motions and Transitions

The motions and transitions of the generated video are characterized by the continuity of motion sequences and the smoothness of frame transitions, which are core indicators to evaluate the quality of image-to-video generation results. Specifically, high-quality generated videos should maintain consistent motion logic of objects and natural transition effects between adjacent frames, without abrupt changes or blurred artifacts. However, current AIGC-driven short video approaches face significant challenges in temporal modeling when handling image-to-video tasks. While diffusion models excel at generating high-quality single frames, they struggle to maintain coherent motion sequences over time, a difficulty that results in unnatural movements and blurred transitions.

Generated characters often exhibit abrupt limb motions and discontinuous gait patterns during walking animations, a phenomenon that severely compromises visual fluidity. To perform a detailed case analysis of these typical failures, we generated representative examples using the Runway Gen-2 model (via its official web interface, May 2024). The experiments employed an “image-to-video” mode, with two distinct cases using high-quality static images generated by Midjourney as input. These cases were paired with the text prompts “a man picks up a piece of paper” and “a person climbs stairs,” respectively. For a rigorous evaluation, we applied the same protocol to both generated cases: each video was sampled at 10 frames per second to extract frames. Two independent annotators then manually evaluated every frame against strict pre-defined anatomical plausibility criteria. The inter-annotator agreement, measured via Cohen’s Kappa, reached 0.85, which indicates high consistency. Detailed analysis of the two specific case studies is presented below.
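For reference, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for per-frame labels. The label vectors are invented for illustration (1 marks a frame judged anatomically implausible) and are not the annotations collected in this study.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for 40 frames: the annotators disagree on two frames.
annotator_a = [1] * 9 + [0] * 31
annotator_b = [1] * 8 + [0] + [1] + [0] * 30
kappa = cohens_kappa(annotator_a, annotator_b)
```

Raw agreement alone would overstate consistency here, because most frames are unproblematic and two annotators would agree on many of them by chance.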

One representative case study, which involves a paper-grasping action, is illustrated in Figure 3. For this generated video clip that has a duration of 4 s and a total of 40 frames, the evaluation revealed that approximately 23% of the frames (9 out of 40) exhibited severe finger distortions, including the absence of critical phalanges such as the lack of distal joints in the index finger. Additionally, motion blur artifacts were observed during the phases of hand-paper contact, during which finger trajectories were identified to deviate from established biomechanical norms.

Similarly, Figure 4 depicts another case study focusing on stair-climbing motion, to which the same frame extraction and evaluation protocol was applied for a clip of identical length. In this case, anatomical incoherence, such as abrupt calf truncation, was identified in approximately 18% of the frames (7 out of 40). The knee flexion angles in these frames were measured and found to deviate by an average of 34° ± 5° from the biomechanical benchmarks established in the Human3.6M dataset, a finding that indicates a clear violation of physical constraints. Furthermore, the generated stepping sequence lacked kinematic continuity, which resulted in twisted leg movements.
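The text does not state how the knee flexion angles were measured. One common approach, shown here purely as an assumption, derives the angle from hip, knee, and ankle keypoints and then summarizes the deviation from reference angles as a mean and standard deviation. All coordinates and angle values below are invented for illustration, and the function names are hypothetical.

```python
import math
import statistics


def joint_angle_deg(a, b, c):
    """Included angle at point b formed by the segments b->a and b->c, in degrees."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("coincident keypoints: angle is undefined")
    cos_angle = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def knee_flexion_deg(hip, knee, ankle):
    """Flexion is 0 degrees for a straight leg and grows as the knee bends."""
    return 180.0 - joint_angle_deg(hip, knee, ankle)


def deviation_summary(measured, reference):
    """Mean and sample standard deviation of absolute angle deviations."""
    deviations = [abs(m - r) for m, r in zip(measured, reference)]
    return statistics.mean(deviations), statistics.stdev(deviations)


straight = knee_flexion_deg((0.0, 1.0), (0.0, 0.0), (0.0, -1.0))
bent = knee_flexion_deg((0.0, 1.0), (0.0, 0.0), (1.0, 0.0))
mean_dev, std_dev = deviation_summary([120.0, 128.0, 131.0], [90.0, 92.0, 95.0])
```

The same functions work with 3D keypoints, since the angle is computed from vector dot products rather than from image-plane slopes.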

The case-based analysis reveals fundamental limitations in existing architectures: first, the weak propagation of inter-frame motion features during latent space denoising; second, the insufficient preservation of high-frequency kinematic details during dynamic compression.

3.1.2. Style Consistency and Character Continuity

Style consistency and character continuity are not only pivotal prerequisites for ensuring the visual and narrative coherence of short video content but also recognized as key determinants of quality in AIGC-driven short video generation. Style consistency refers to the uniformity of visual attributes across different video clips, including lighting and spatial layout. Character continuity denotes the stable maintenance of character identity features throughout narrative progression, such as facial characteristics and clothing attributes. With the rapid development of AIGC technologies, short video generation has become increasingly automated and efficient, yet the realization of coherent and immersive viewing experiences relies heavily on the effective maintenance of style consistency and character continuity, thereby rendering them indispensable research focuses in the field.

Style consistency, particularly in scene transitions, constitutes a prominent challenge in AIGC-driven short video generation. This issue primarily stems from the lack of cross-clip scene evolution constraints in mainstream AIGC video generation frameworks. Without such constraints, the generated video clips often exhibit abrupt visual transitions between scenes, including sudden shifts in lighting intensity and color temperature, mismatched spatial layouts, and disjointed color tones. These abrupt changes directly violate the principle of visual continuity, disrupt the viewing experience, and undermine the overall coherence of the video narrative. Unlike traditional video production, where style consistency is manually controlled through unified shooting and post-editing standards, AIGC-driven generation relies on algorithmic modeling of visual features, and the absence of effective cross-clip constraint mechanisms makes style inconsistency a prevalent problem. To investigate the aforementioned visual discontinuity in cross-clip scene consistency of AIGC-generated videos, a test protocol for dynamic scene changes was designed. First, Midjourney was used to generate a high-quality static café image as the unified scene reference, with the prompt: “Sunlight streams through the left window into a quiet modern café interior, with a steaming cup of coffee placed on the table.” Based on this reference, two temporally correlated video clips (Clip A and Clip B) were generated via Runway Gen-2. 
The prompt for Clip A was: “Café interior, sunlight, camera slowly pans toward the coffee on the table, with steam rising gently”; Clip B’s prompt was: “Same café scene, the sky outside darkens, indoor lights turn on automatically, and coffee steam continues to drift.” As shown in Figure 5, test results indicated that even with the same reference image, logically consistent scene descriptions and post-processing steps such as cross-fading to smooth transitions, the generated video sequence still exhibited severe visual discontinuities. Comparative analysis of three key frames (a, b, c) identified two major violations of visual continuity. In terms of spatial structure, a comparison between Frame a and Frame b showed that furniture items such as sofas that were absent in the reference image suddenly appeared at the original wall position, causing abrupt changes in the indoor spatial layout. In terms of scene content, a comparison between Frame a and Frame c revealed that the clear street view outside the window in Frame a was abnormally transformed into an indoor space in Frame c, thus breaking the fundamental logic of the scene. These two types of abrupt transitions highlight the inherent deficiency of mainstream AIGC short video generation frameworks in maintaining cross-clip visual consistency, which validates the core challenge addressed in this study.

Character continuity is another critical challenge plaguing existing AIGC-driven short video generation frameworks. The absence of robust semantic constraints on character features in current methodologies gives rise to cross-clip identity drift, thereby leading to character discontinuity across distinct scenes. While mainstream generation technical paradigms now enable character face stabilization by leveraging a single reference image, this functionality is generally confined to the specific scene depicted in that reference. It thus cannot support the cross-clip scene transitions and narrative progression essential for cohesive video storytelling. To construct time-spanning narratives, creators frequently deploy distinct reference images for different video clips. Without a dedicated mechanism to model semantic identity across multiple reference images, this approach is effective in enhancing scene diversity yet induces uncontrolled character identity drift. Such drift is manifested as inconsistent facial features and clothing attributes during background transitions.

To verify this issue, we leveraged the character reference (cref) functions of Midjourney to generate two images. The generation process was based on a single character reference image paired with two different scene reference images. As shown in Figure 6, the results demonstrate that even with consistent character settings, the generated frames exhibit noticeable facial inconsistencies. This frame-level inconsistency propagates to video clips, resulting in discontinuous character identity transitions that directly reflect the inherent flaw of current models in cross-scene identity preservation. Such identity discontinuity was further quantified through objective metrics. The Fréchet Inception Distance (FID) [13] for facial features increased by 18% relative to the baseline, and the ΔCIELAB [14] color difference in clothing regions exceeded a threshold of 15 units. These discrepancies substantially undermine the visual coherence of generated video sequences. Based on the two aforementioned images, we generated and stitched video clips, then conducted a subjective evaluation of the synthesized video. The results reveal that 63% of the stitched videos were judged by most observers to exhibit significant character discontinuities or identity incoherence. This finding confirms the inadequacy of current methods in maintaining cross-clip character consistency.
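The color-difference check can be sketched as follows. The text does not specify which CIELAB difference formula was used; this sketch assumes the CIE76 definition, ΔE*ab = sqrt(ΔL*² + Δa*² + Δb*²), applied to sRGB colors converted to CIELAB under the D65 white point. The function names are hypothetical, the threshold of 15 is taken from the text, and the comparison of single colors (rather than averaged clothing regions) is a simplification.

```python
import math

D65_WHITE = (0.95047, 1.0, 1.08883)  # reference white in XYZ


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (D65 white point)."""
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = (f(v / n) for v, n in zip((x, y, z), D65_WHITE))
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


def delta_e_cie76(lab1, lab2):
    """Euclidean distance between two CIELAB colors (CIE76 color difference)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))


def exceeds_threshold(rgb1, rgb2, threshold=15.0):
    """True if two sRGB colors differ by more than `threshold` Delta E units."""
    return delta_e_cie76(srgb_to_lab(rgb1), srgb_to_lab(rgb2)) > threshold


white_lab = srgb_to_lab((255, 255, 255))
```

In practice the two inputs would be the mean colors of the segmented clothing region in each frame, so that small pixel-level noise does not dominate the comparison.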

The above challenges stem from two fundamental limitations. (1) The propagation of temporal features is flawed, meaning that the resulting feature mismatches cannot be resolved solely through post-editing. (2) Current tools, such as Pika and Runway, suffer from insufficient global scene modeling capabilities because they adopt a clip-independent processing approach. The lack of a hierarchical scene graph for encoding spatiotemporal dependencies leads to semantic inconsistencies in scene elements across video clips. These inconsistencies affect critical aspects, including lighting conditions, object placement, and spatial structure. This highlights the critical need to combine global scene constraints with adaptive temporal consistency modeling.

3.2. Proposed Controllable Multimodal Fusion Architecture

To address these issues systematically, we propose a controllable multimodal fusion-based generation architecture. This architecture implements end-to-end control from user intent parsing to video generation through a three-tier constraint mechanism; dynamically coordinates textual, visual, and audio features using hierarchical multimodal attention fusion to enhance semantic alignment; and adopts a scene graph-driven multi-clip collaborative modeling mechanism to ensure visual and logical coherence at both local and global levels. Consequently, this approach not only improves generation quality but also significantly enhances the controllability of the short-form video generation process.

3.2.1. Hierarchical Multimodal Attention Fusion

Recent work published in Electronics shows that multimodal fusion algorithms can overcome the limitations of relying on visual information alone in video understanding [15], and that stepwise or hierarchical encoding strategies improve cross-modal semantic alignment while keeping computational complexity manageable [16]. Existing methods that process multimodal features independently commonly suffer from spatio-temporal misalignment, such as desynchronization between action sequences and musical rhythms or deviation of visual styles from textual descriptions. Building on these insights and to address this misalignment, this paper proposes a Hierarchical Multimodal Attention Fusion Mechanism. The complete architectural workflow, illustrated in Figure 7, processes multimodal inputs through four key stages: Feature Projection for alignment, Across Model Attention for interaction modeling, Dynamic Weight Allocation for adaptive balancing, and Constraint Injection for final control signal synthesis.

First, to reduce the semantic gap between heterogeneous data, the Feature Projection layer is utilized. As shown in Figure 7, it maps input features—including 768D Text Features (Action Logic Vector), 512D Image Features (CLS Token), and 768D Audio Features (Spectral Features)—into a unified 512D Hidden Space. This mapping process resolves dimensional inconsistencies that may hinder cross-modal interaction. On this basis, the Across Model Attention module (utilizing a 4-head Attention mechanism) is introduced to model deep inter-modal interactions. Taking the learned representation of the generative task as the query vector, this module computes an interaction score matrix among the three modalities, which explicitly captures cross-modal correlations. These correlations include the causal link between action descriptions and scene elements, as well as the emotional correspondence between musical rhythm and visual style, thereby establishing a solid semantic foundation for subsequent fusion control.

After feature interaction, the Dynamic Weight Allocation module serves as the core control component: it adaptively adjusts the contribution weight of each modality according to the type of video segment being generated and the task requirements of the current generation context. The weight $W_{i}$ is designed as:

$$W_{i} = \frac{\exp(\mathrm{Attention}(Q, K_{i}))}{\sum_{j} \exp(\mathrm{Attention}(Q, K_{j}))} \tag{1}$$

where $Q$ is the query vector representing the contextual information of the current generation task, $K_{i}$ is the key vector of the i-th modality, representing its feature representation, and $\mathrm{Attention}(Q, K_{i})$ is the attention score measuring the relevance between the query vector and the key vector. The normalized modality weights are then used to construct the final conditioning representation for multimodal generation. Specifically, let $f_{t}$, $f_{i}$, and $f_{a}$ denote the projected text, image, and audio features, respectively. The fused representation is computed as $f = W_{t} f_{t} + W_{i} f_{i} + W_{a} f_{a}$, where $W_{t} + W_{i} + W_{a} = 1$. During training, the weight allocation module is optimized jointly with the multimodal fusion layers, and the same weighting mechanism is retained during inference unless the user explicitly overrides the default allocation strategy.
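As a minimal sketch of Equation (1) and the weighted fusion that follows it, the Python fragment below normalizes per-modality attention scores with a softmax and forms the weighted sum of the projected features. The score values and the 4-dimensional toy vectors are illustrative assumptions standing in for the learned attention scores and the 512-dimensional projected features; the function names are hypothetical.

```python
import math
from typing import Dict, List


def modality_weights(scores: Dict[str, float]) -> Dict[str, float]:
    """Softmax over per-modality attention scores, as in Equation (1)."""
    # Subtracting the maximum score improves numerical stability
    # without changing the normalized result.
    peak = max(scores.values())
    exps = {name: math.exp(s - peak) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def fuse(features: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted sum f = W_t f_t + W_i f_i + W_a f_a over same-length vectors."""
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for name, vec in features.items():
        for k, value in enumerate(vec):
            fused[k] += weights[name] * value
    return fused


# Hypothetical attention scores Attention(Q, K_i) for one segment.
scores = {"text": 2.0, "image": 0.9, "audio": 0.9}
weights = modality_weights(scores)

# Toy 4-dimensional stand-ins for the projected 512-dimensional features.
features = {
    "text": [1.0, 0.0, 0.0, 0.0],
    "image": [0.0, 1.0, 0.0, 0.0],
    "audio": [0.0, 0.0, 1.0, 0.0],
}
fused = fuse(features, weights)
```

Because the softmax output is non-negative and sums to one, the fused vector is a convex combination of the three modality features, so raising one modality's score necessarily lowers the relative contribution of the others.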

To validate the effectiveness of the mechanism, we conducted a controlled experiment using the “Puppy Chasing Frisbee” scenario as a case study. Figure 8 visually presents the experimental setup and results. To ensure rigorous comparison, the experiment first established a consistent set of Parsed Features as the Input Conditions: Text Actions were identified as [‘sprint’, ‘leap’, ‘mid-air twist’]; Image Visuals included [‘fluffy white puppy’, ‘vibrant green grass with morning dew’, ‘sunlit asphalt road’]; and Audio Sounds comprised [‘rhythmic paw taps’, ‘high-pitched excited yips’, ‘whirring frisbee’]. Based on these fixed inputs, two distinct Fusion Configurations were tested. In the Text-Dominant Fusion configuration (weights: Text = 0.5, Image = 0.3, Audio = 0.2), the generated content prioritizes the accurate expression of dynamic actions, resulting in the “Dynamic Action Sequence” description. In contrast, under the Image-Dominant Fusion configuration (weights: Text = 0.2, Image = 0.5, Audio = 0.2), the system shifts focus to emphasize visual composition details, producing output characterized by a “soft golden halo” and “cerulean hue”.

Finally, consistent with the specific parameters in Figure 7, the system adopts differentiated weight allocation strategies for different segments. Specifically, in Action Segments, the weights are distributed as Text: 0.6, Image: 0.2, Audio: 0.2 to ensure kinematic logic. For Scene Transitions, the focus shifts to visual continuity with weights adjusted to Text: 0.1, Image: 0.7, Audio: 0.2. In Climax Segments, the audio prominence increases with weights set to Text: 0.3, Image: 0.3, Audio: 0.4. Through Constraint Injection, these weighted fused features are synthesized into a final Constraint signal (manifested as a 128D Control Vector), which is then injected into ControlNet. This injection enables the application of fine-grained multimodal constraints during the generation process, ensuring biomechanical plausibility, spatial consistency, or audio-visual synchronization under different dominant modalities.
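The segment-specific allocations listed above can be sketched as a simple lookup with an optional user override. The weight values are those stated in the text; the function name `resolve_weights` and the policy of rescaling an override so that it sums to one are assumptions made for illustration, not details given by the framework.

```python
from typing import Dict, Optional

# Default weights per segment type, as listed in the text.
SEGMENT_WEIGHTS: Dict[str, Dict[str, float]] = {
    "action": {"text": 0.6, "image": 0.2, "audio": 0.2},
    "transition": {"text": 0.1, "image": 0.7, "audio": 0.2},
    "climax": {"text": 0.3, "image": 0.3, "audio": 0.4},
}


def resolve_weights(segment_type: str,
                    override: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Return the default allocation, or a user override rescaled to sum to 1."""
    if override is None:
        return dict(SEGMENT_WEIGHTS[segment_type])
    if set(override) != {"text", "image", "audio"}:
        raise ValueError("override must give text, image, and audio weights")
    if any(w < 0 for w in override.values()):
        raise ValueError("weights must be non-negative")
    total = sum(override.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Rescaling (a hypothetical policy) keeps W_t + W_i + W_a = 1.
    return {name: w / total for name, w in override.items()}


transition_weights = resolve_weights("transition")
custom_weights = resolve_weights("action", {"text": 2.0, "image": 1.0, "audio": 1.0})
```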

3.2.2. Scene Graph-Driven Multi-Segment Consistency Modeling

Aiming at the problems in multi-segment splicing, namely subject identity drift manifested by a significant increase in facial FID scores and scene element misalignment characterized by an Intersection over Union below 0.4, this paper proposes a scene graph-driven multi-segment consistency modeling framework. The detailed structure is illustrated in Figure 9.

Input Processing and Graph Construction: The framework takes Facial Feature Encoding and Scene Elements, obtained by converting multimodal inputs into structured semantic representations, as dual inputs. These heterogeneous features are distributed into parallel processing paths through the Input Processing and Distribution Module. Specifically, for global consistency, facial features are first passed through the Identity-Preserving Module (e.g., IP-Adapter) before entering the Global Constraint Graph. This graph constructs a topology composed of Subject ID Nodes and Key Environment Nodes to generate Long-term Invariant Templates. These templates bypass intermediate reasoning and are directly injected into the Video Generation Model as rigid anchors, ensuring consistent subject identity and key environmental elements across all segments. In parallel, the Transition Constraint Graph focuses on dynamic evolution by explicitly modeling the temporal progression between Action Node t and Action Node t + 1. By internally deriving transition logic, it forms a spatio-temporal Constraint Model that governs the immediate context of character motion and camera shift within transition regions.

To implement these two graph types, the graph reasoning module is realized as a two-layer Graph Attention Network (GAT). Subject nodes, environment nodes, and action nodes are represented as typed vertices, each initialized with a 512-dimensional feature vector projected from the corresponding multimodal encoder output. Edges are associated with relation-type embeddings encoding identity consistency, spatial association, temporal transition, and action dependency. The first GAT layer employs four attention heads with 128 hidden units per head, followed by feature concatenation, layer normalization, ReLU activation, and dropout (0.1). The second layer performs single-head aggregation to refine the propagated node representations, while residual connections are adopted to stabilize message passing across varying segment lengths. Although the Global Constraint Graph and the Transition Constraint Graph share the same architecture, they maintain separate parameters and are trained end-to-end with the fusion and generation components.

In both graphs, the nodes represent subject entities, scene attributes, action states, and transition cues, while the edges capture the corresponding structural and temporal dependencies. After graph propagation, the node features are mean-pooled to form a 512-dimensional guidance vector for each graph. These graph-level priors are then passed to the Hierarchical Constraints Aggregation block and the Multimodal Attention Fusion Module, where they provide structural guidance for downstream control.
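To make the graph reasoning step more concrete, the following sketch implements a single graph-attention head followed by mean pooling, using only the Python standard library. It is a simplified illustration under several assumptions: node features are 8-dimensional toy vectors rather than 512-dimensional encoder outputs, weights are random rather than trained, and the relation-type edge embeddings, multi-head concatenation, layer normalization, dropout, and residual connections described above are omitted. The node names and graph topology are hypothetical.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def matvec(w: List[Vector], x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def gat_head(h: Dict[str, Vector], neighbors: Dict[str, List[str]],
             w: List[Vector], a: Vector) -> Dict[str, Vector]:
    """One attention head: softmax-weighted aggregation over each neighborhood."""
    z = {node: matvec(w, feat) for node, feat in h.items()}
    out = {}
    for node in h:
        hood = [node] + neighbors[node]  # include a self-loop
        # Attention logit for each neighbor from the concatenated pair [z_i || z_j].
        logits = [leaky_relu(sum(ak * v for ak, v in zip(a, z[node] + z[nb])))
                  for nb in hood]
        peak = max(logits)
        exps = [math.exp(val - peak) for val in logits]
        alphas = [e / sum(exps) for e in exps]
        out[node] = [sum(al * z[nb][k] for al, nb in zip(alphas, hood))
                     for k in range(len(w))]
    return out


def mean_pool(h: Dict[str, Vector]) -> Vector:
    """Average node features into one graph-level guidance vector."""
    dim = len(next(iter(h.values())))
    return [sum(v[k] for v in h.values()) / len(h) for k in range(dim)]


rng = random.Random(0)
in_dim, out_dim = 8, 4  # toy sizes; the text uses 512-dimensional node features
nodes = {name: [rng.uniform(-1, 1) for _ in range(in_dim)]
         for name in ("subject", "environment", "action_t", "action_t1")}
edges = {
    "subject": ["environment", "action_t"],
    "environment": ["subject"],
    "action_t": ["subject", "action_t1"],
    "action_t1": ["action_t"],
}
w = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
a = [rng.uniform(-0.5, 0.5) for _ in range(2 * out_dim)]

updated = gat_head(nodes, edges, w, a)
guidance = mean_pool(updated)
```

Each updated node feature is a convex combination of the transformed features in its neighborhood, which is why identity and environment information can propagate to action nodes before pooling.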

Based on these graph-level priors, the Hierarchical Constraints Aggregation block integrates the encoded multimodal inputs and generates Prior Information (Guidance) for the Multimodal Attention Fusion Module. Through this mechanism, the model adaptively regulates modality importance in accordance with scene continuity, character consistency, and transition coherence, thereby preserving both global consistency and local transition smoothness during multi-segment video generation. The fused control signals then guide the generation process by combining the rigid Long-term Invariant Templates from the Global Constraint Graph with adaptive spatiotemporal priors from the Transition Constraint Graph. Finally, the entire process is refined through a Temporal Consistency Optimization feedback loop to ensure high-fidelity output. In this way, the proposed multi-segment consistency modeling framework captures logical dependencies across clips through the global scene graph, while local scene constraints (e.g., style vectors and temporal constraints) preserve the stylistic and temporal coherence of each segment [17]. Inter-frame consistency is further enhanced by combining Structural Similarity (SSIM) to measure global consistency with Perceptual Loss to capture fine-grained detail differences, thereby improving the coherence of the generated video [18,19,20].

To mathematically enforce the spatiotemporal coherence reasoned by the scene graph and guarantee high-fidelity generation, we introduce a specific consistency optimization objective. The loss function is defined as:

$$L_{\mathrm{consistency}} = \lambda_{1} \cdot \mathrm{SSIM}(I_{t}, I_{t+1}) + \lambda_{2} \cdot L_{\mathrm{perceptual}}(I_{t}, I_{t+1}) \tag{2}$$

where $\mathrm{SSIM}(I_{t}, I_{t+1})$ measures global structural consistency and $L_{\mathrm{perceptual}}(I_{t}, I_{t+1})$ captures local detail discrepancies between adjacent frames. In all experiments, $\lambda_{1}$ and $\lambda_{2}$ were selected on the validation set from $\{0.3, 0.4, 0.5, 0.6, 0.7\}$ under the constraint $\lambda_{1} + \lambda_{2} = 1$. The final setting was $\lambda_{1} = 0.6$ and $\lambda_{2} = 0.4$, which achieved the best balance between temporal smoothness and detail preservation.
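The sketch below evaluates the expression in Equation (2) on two toy grayscale frames. It rests on several assumptions: SSIM is computed over the whole frame as a single window rather than over local patches, and the perceptual term is replaced by a hypothetical stand-in (`toy_features`, neighbor differences) because a pretrained feature network is outside the scope of a short example. The expression is evaluated literally; since SSIM is a similarity that equals 1 for identical frames, a minimized training objective would typically use a dissimilarity form such as 1 − SSIM, and whether that is done here is not stated.

```python
from typing import List

Frame = List[float]  # flattened grayscale frame with values in [0, 1]


def global_ssim(x: Frame, y: Frame, data_range: float = 1.0) -> float:
    """SSIM computed over the whole frame as a single window."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((p - mx) ** 2 for p in x) / n
    vy = sum((q - my) ** 2 for q in y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


def toy_features(frame: Frame) -> List[float]:
    """Hypothetical stand-in for a pretrained feature extractor."""
    return [b - a for a, b in zip(frame, frame[1:])]


def perceptual_distance(x: Frame, y: Frame) -> float:
    """Mean squared distance between the stand-in feature maps."""
    fx, fy = toy_features(x), toy_features(y)
    return sum((p - q) ** 2 for p, q in zip(fx, fy)) / len(fx)


def consistency_value(x: Frame, y: Frame, lam1: float = 0.6, lam2: float = 0.4) -> float:
    """Literal evaluation of Equation (2) with the reported coefficients."""
    return lam1 * global_ssim(x, y) + lam2 * perceptual_distance(x, y)


# Candidate (lambda_1, lambda_2) pairs from the stated grid with lambda_1 + lambda_2 = 1.
GRID = [0.3, 0.4, 0.5, 0.6, 0.7]
candidate_pairs = [(l1, round(1.0 - l1, 1)) for l1 in GRID]

frame_a = [0.1, 0.5, 0.9, 0.3]
frame_b = [0.2, 0.4, 0.8, 0.5]
value = consistency_value(frame_a, frame_b)
```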

3.2.3. Three-Tier Constraint Mechanism

This paper proposes a three-tier constraint mechanism, comprising an Intent Parsing Layer, a Process Constraint Layer, and a User Refinement Layer. Adopting a hierarchical and progressive design paradigm, this mechanism enables the transformation of abstract user intent into precise and executable control signals. The comprehensive structural composition of this mechanism is depicted in Figure 10.

The Intent Parsing Layer is designed to convert Multimodal Input into structured semantic representations capable of capturing underlying intent. This layer integrates three core functional modules. The Text Parsing module constructs a Temporal Action Seq. Semantic Graph based on the BERT-NER framework. The Image Parsing module utilizes CLIP to extract multi-dimensional visual features, specifically Scene Elements, Color, and Facial Encoding. Additionally, the Audio Parsing module employs the Wav2Vec2 model to identify Rhythm Type and Emotional Tone. Collectively, these three parts enable comprehensive parsing of multimodal inputs and the construction of unified Structured Semantic Representations, which serve as High-Quality Inputs for the subsequent layer.

The Process Constraint Layer, which builds on the structured semantic representations, employs a Scene Graph-based Multi-segment Consistency Modeling Framework to regulate the video generation process. Specifically, this framework functions as a Prior Information Provider and is functionally divided into two key sub-graphs. The Global Constraint Graph is primarily responsible for establishing long-term invariant constraints, including Subject ID and Scene Tone, to ensure global coherence. Simultaneously, the Transition Constraint Graph focuses on local temporal evolution and performs fine-grained control over short-term dynamic variations, specifically Camera Motion and Brightness Gradient. Within this framework, implicit biomechanical constraints are also integrated to prevent physical anomalies. The logical relationships defined by these graphs generate Prior Information (Guidance) that feeds into the Multimodal Attention Fusion Module. This guidance enables the fusion module to Dynamically Adjust Weights of the multimodal features, ensuring that the fused feature vectors comply with logical constraints while preserving stylistic consistency.

In particular, the Process Constraint Layer is designed to provide process-level regularization for motion plausibility during video generation. In this layer, biomechanical constraints are introduced as soft regularization signals to suppress physically implausible limb deformation and discontinuous pose transitions. Specifically, these constraints operate on three aspects, namely joint-angle range, temporal motion smoothness, and contact consistency between the subject and the surrounding environment. In this way, the Process Constraint Layer complements the semantic constraints derived from the Intent Parsing Layer and the user-oriented adjustments provided by the User Refinement Layer.

To further formalize the role of the Process Constraint Layer, the biomechanical regularization term is defined as

$$L_{\mathrm{bio}} = \alpha L_{\mathrm{joint}} + \beta L_{\mathrm{smooth}} + \gamma L_{\mathrm{contact}} \tag{3}$$

where $L_{\mathrm{joint}}$, $L_{\mathrm{smooth}}$, and $L_{\mathrm{contact}}$ denote the joint-range penalty, the temporal smoothness penalty, and the contact-consistency penalty, respectively. Here, $L_{\mathrm{joint}}$ constrains anatomically unreasonable joint bending, $L_{\mathrm{smooth}}$ penalizes abrupt frame-to-frame changes in limb orientation and body displacement, and $L_{\mathrm{contact}}$ encourages stable interaction patterns between the moving subject and the local environment. Accordingly, the overall optimization objective can be written as

$$L = L_{\mathrm{gen}} + \lambda_{c} L_{\mathrm{consistency}} + \lambda_{b} L_{\mathrm{bio}} \tag{4}$$

where $L_{\mathrm{gen}}$ is the backbone generation loss, $L_{\mathrm{consistency}}$ is the temporal consistency objective defined in Equation (2), and $L_{\mathrm{bio}}$ provides process-level motion regularization.
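One possible instantiation of Equations (3) and (4) is sketched below. The three penalty definitions (squared hinge on joint limits, squared second difference for smoothness, squared ground distance on contact frames) are assumptions chosen for illustration, since the text names the penalties without giving their functional forms. The coefficients α, β, and γ default to 1.0 as placeholders, while λ_c = 0.5 and λ_b = 0.1 follow the values reported in Section 4.2.

```python
from typing import Sequence, Tuple


def joint_range_penalty(angles: Sequence[float],
                        limits: Sequence[Tuple[float, float]]) -> float:
    """Mean squared violation of per-joint angle limits (radians)."""
    total = 0.0
    for angle, (low, high) in zip(angles, limits):
        excess = max(low - angle, 0.0, angle - high)
        total += excess ** 2
    return total / len(angles)


def smoothness_penalty(trajectory: Sequence[float]) -> float:
    """Mean squared second difference of one joint angle over time."""
    accel = [trajectory[k + 1] - 2 * trajectory[k] + trajectory[k - 1]
             for k in range(1, len(trajectory) - 1)]
    return sum(v ** 2 for v in accel) / len(accel)


def contact_penalty(heights: Sequence[float], in_contact: Sequence[bool]) -> float:
    """Mean squared height above ground on frames flagged as in contact."""
    flagged = [h for h, c in zip(heights, in_contact) if c]
    return sum(h ** 2 for h in flagged) / len(flagged) if flagged else 0.0


def bio_loss(l_joint: float, l_smooth: float, l_contact: float,
             alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Equation (3); alpha, beta, gamma are placeholder values."""
    return alpha * l_joint + beta * l_smooth + gamma * l_contact


def total_loss(l_gen: float, l_consistency: float, l_bio: float,
               lambda_c: float = 0.5, lambda_b: float = 0.1) -> float:
    """Equation (4) with the coefficients reported in Section 4.2."""
    return l_gen + lambda_c * l_consistency + lambda_b * l_bio


l_joint = joint_range_penalty([0.5, 2.0], [(0.0, 1.0), (0.0, 1.5)])
l_smooth = smoothness_penalty([0.0, 1.0, 2.0, 3.0])
l_contact = contact_penalty([0.0, 0.2, 0.0], [True, True, False])
l_bio = bio_loss(l_joint, l_smooth, l_contact)
```

Treating the penalties as soft terms means a pose that slightly exceeds a joint limit is discouraged rather than forbidden, which matches the description of biomechanical constraints as regularization signals.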

Ultimately, the fused signals undergo Signal Injection via the ControlNet architecture. ControlNet can stabilize compositions, define poses, and outline contours, thereby generating refined images from a single line art sketch. Based on the ControlNet architecture, the proposed framework achieves precise propagation of user constraints, formulated as:

$$x_{t+1} = f(x_{t}, c) \tag{5}$$

where $x_{t+1}$ is the generated result at step $t + 1$, $c$ is the user constraint condition vector, and $f$ is the generation function based on diffusion models:

$$f(x_{t}, c) = x_{t} - \eta \nabla_{x_{t}} L(x_{t}, c) \tag{6}$$

where $\eta$ is the learning rate and $L(x_{t}, c)$ is the loss function.
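The update rule in Equations (5) and (6) can be illustrated on a toy problem. The sketch below assumes a quadratic loss L(x, c) = ½‖x − c‖², whose gradient with respect to x is x − c; this loss and the step size are hypothetical choices made so the behavior is easy to verify, and the example is not a diffusion sampler.

```python
from typing import Callable, List

Vector = List[float]


def quadratic_grad(x: Vector, c: Vector) -> Vector:
    """Gradient of the assumed loss L(x, c) = 0.5 * ||x - c||^2 with respect to x."""
    return [xi - ci for xi, ci in zip(x, c)]


def guided_step(x: Vector, c: Vector, eta: float,
                grad_fn: Callable[[Vector, Vector], Vector]) -> Vector:
    """One application of Equation (6): x_{t+1} = x_t - eta * grad L(x_t, c)."""
    grad = grad_fn(x, c)
    return [xi - eta * gi for xi, gi in zip(x, grad)]


x = [1.0, 0.0]
c = [0.0, 0.0]  # toy constraint vector
first = guided_step(x, c, eta=0.2, grad_fn=quadratic_grad)

state = x
for _ in range(50):
    state = guided_step(state, c, eta=0.2, grad_fn=quadratic_grad)
```

With this loss, each step shrinks the distance to the constraint vector by a factor of 1 − η, so the state converges toward c for any step size between 0 and 2.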

The User Refinement Layer operates as an External Loop to facilitate real-time optimization. It begins with User Interaction (Feedback) based on the generation results. This feedback is processed by the Interaction Interface Module for Fine-grained Correction. Subsequently, the Real-time Optimization Module performs Backpropagation and Parameter Adjustment. The optimized parameters are then fed back into the signal injection module to continuously refine the alignment between the generated results and user expectations.

4. Experiments

4.1. AIGC-Driven Short Video

To demonstrate the effectiveness of the proposed framework, a complete short video generation pipeline is designed to specifically address three core challenges in video synthesis: scene inconsistency, unnatural motion, and blurred transitions. The experimental objective is to generate a coherent short video depicting a sequential scenario: a puppy running across a grassy field, followed by chasing a frisbee on a street. An overview of the multimodal inputs and hierarchical constraints employed in this experiment is presented in Figure 11.

The three-tier constraint mechanism processes text descriptions, reference images, and audio inputs. The Intent Parsing Layer constructs a global scene graph that encodes logical relationships across segments, such as the puppy’s movement from the field to the street. It also extracts local scene constraints from reference images, including details like grass color and street lighting, while analyzing audio to derive ambiance features such as rhythm and emotional tone. This enables comprehensive multimodal scene understanding.

The Hierarchical Multimodal Attention Fusion mechanism dynamically integrates action logic from text with visual style from reference images. The global scene graph ensures logical transitions between segments, such as the naturalness of the puppy’s movement between the field and street scenes. Local scene constraints maintain temporal and stylistic consistency within each segment, such as lighting continuity. User control further allows fine-grained adjustments, such as modifying camera motion or duration, to align with creative intent. The proposed framework ultimately generates a high-quality, coherent short video with layered content.

Specifically, the framework produces two initial video segments: Segment 1 depicts a puppy running on the grassy field with a duration of 5 s, and Segment 2 shows the puppy chasing a frisbee on the street, also lasting 5 s. Style consistency is maintained in accordance with user inputs, such as grass color and street lighting parameters. Moreover, user-directed transitions (e.g., a 3 s camera pan from the field to the street) are seamlessly integrated between the two segments, which ensures visual continuity and smooth narrative flow. Corresponding key frame sequences for each stage of video generation are presented in Figure 12.

4.2. Experimental Settings and Runtime Analysis

The experiments were conducted on a unified multimodal short-video dataset constructed from publicly available video-caption data. A total of 1200 clips were selected from the WebVid-10M dataset according to four criteria, namely clear subject motion, the presence of at least one scene transition or viewpoint change, semantically meaningful captions, and valid audio tracks. For each clip, a representative reference image was extracted from the first second of the video, while the original caption and audio track were retained as the text and audio conditions, respectively.

The dataset was divided into 960 training clips, 120 validation clips, and 120 test clips. All videos were normalized to a resolution of 1024 × 576 at 24 fps. To facilitate multi-segment consistency modeling, each training sample was further segmented into 2–3 semantically coherent sub-clips. The same segmentation protocol was adopted for all internal variants in the comparative experiments.

The text encoder was initialized with BERT-base, the image encoder with CLIP ViT-B/32, and the audio encoder with Wav2Vec2-base. The multimodal fusion module used a hidden dimension of 512 with 4 attention heads, while the scene-graph reasoning block was implemented as a 2-layer graph attention network. The training procedure consisted of two stages. In Stage I, the fusion module, the scene-graph module, and the three-tier constraint mechanism were optimized for 20 epochs while the backbone generator remained frozen. In Stage II, the full model was jointly fine-tuned for an additional 10 epochs. AdamW was adopted as the optimizer, with an initial learning rate of $1 \times 10^{-4}$ in Stage I and $5 \times 10^{-5}$ in Stage II, a batch size of 8, and a cosine decay schedule with 1000 warm-up steps. The overall training objective was defined as $L = L_{\mathrm{gen}} + \lambda_{c} L_{\mathrm{consistency}} + \lambda_{b} L_{\mathrm{bio}}$, where $L_{\mathrm{gen}}$ denotes the backbone generation loss, $L_{\mathrm{consistency}}$ represents the temporal structural and perceptual coherence objective defined in Equation (2), and $L_{\mathrm{bio}}$ denotes the biomechanical regularization term described in Section 3.2.3. On the validation set, $\lambda_{c}$ was fixed to 0.5, while the internal coefficients of the consistency term were set to $\lambda_{1} = 0.6$ and $\lambda_{2} = 0.4$. The coefficient $\lambda_{b}$ was selected as 0.1 based on validation-set performance.
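The learning-rate schedule described above (linear warm-up followed by cosine decay) can be sketched as follows. The total step count is an assumption: with 960 training clips, a batch size of 8, and one optimizer step per batch, Stage I would run 120 steps per epoch for 20 epochs, or 2400 steps. Whether warm-up is applied once or per stage, and the final learning-rate floor, are not specified, so the function below treats them as parameters.

```python
import math


def lr_at(step: int, base_lr: float, total_steps: int,
          warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warm-up to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed Stage I length: 960 clips / batch size 8 = 120 steps per epoch, 20 epochs.
STAGE1_STEPS = (960 // 8) * 20
schedule = [lr_at(s, base_lr=1e-4, total_steps=STAGE1_STEPS) for s in range(STAGE1_STEPS)]
```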

All experiments were conducted on a workstation equipped with two NVIDIA RTX PRO 6000 GPUs (96 GB each). In the proposed framework, the additional computational overhead mainly arises from the scene-graph reasoning process, the dynamic multimodal weighting mechanism, and the process-level constraint regularization, whereas the generation backbone remains the dominant component of the overall computational cost. Under this hardware setting, the framework was able to complete multi-segment short-video generation with acceptable runtime and memory consumption, which supports its applicability to offline short-video content creation.

As summarized in Table 1, the additional controllable modules introduce a moderate increase in inference time and memory usage, while the overall computational burden remains dominated by the backbone generator. From a computational complexity perspective, the additional cost introduced by the proposed control modules is additive relative to the backbone generator. Let $M$ denote the number of multimodal tokens, $|V|$ the number of graph nodes, $|E|$ the number of graph edges, $H$ the number of attention heads, and $d$ the hidden dimension. The hierarchical multimodal attention module introduces an additional interaction cost on the order of $O(M^{2} d)$, while the two-layer graph attention block contributes approximately $O(|E| H d + |V| d^{2})$. The process-level biomechanical regularization term is computed on sparse pose trajectories and therefore scales approximately linearly with the number of tracked joints and generated frames. Under a fixed token budget and sparse graph structure, the overall additional overhead scales approximately linearly with the number of generated segments, whereas the generation backbone remains the dominant component of the end-to-end computational burden.
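To give a sense of scale for these complexity terms, the sketch below plugs hypothetical sizes into the two expressions. The values M = 64, |V| = 12, and |E| = 40 are illustrative assumptions; H = 4 and d = 512 follow the configuration reported above. The counts ignore constant factors and are not measured runtimes.

```python
from typing import Dict


def control_overhead(m_tokens: int, n_nodes: int, n_edges: int,
                     heads: int, d: int, n_segments: int = 1) -> Dict[str, int]:
    """Order-of-magnitude operation counts for the added control modules."""
    attention = m_tokens ** 2 * d                   # O(M^2 d)
    graph = n_edges * heads * d + n_nodes * d ** 2  # O(|E| H d + |V| d^2)
    per_segment = attention + graph
    return {
        "attention": attention,
        "graph": graph,
        "per_segment": per_segment,
        "total": per_segment * n_segments,  # linear in the number of segments
    }


one_segment = control_overhead(m_tokens=64, n_nodes=12, n_edges=40, heads=4, d=512)
three_segments = control_overhead(64, 12, 40, 4, 512, n_segments=3)
```

Under these assumed sizes the node-wise term |V| d² dominates the graph cost, which suggests that the hidden dimension, rather than the edge count, is the main lever for reducing graph overhead on sparse scene graphs.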

4.3. Performance Comparison

To provide an external reference for performance comparison, the proposed framework was compared with two representative commercial tools, Pika 1.0 and Runway Gen-2, as well as the academic method TRIP [7]. Since the commercial tools are closed-source and do not provide standardized academic benchmark results, their metrics were estimated from a customized set of 100 video clips collected from their official public demonstrations under prompts and reference conditions aligned as closely as possible with those used in this study. For the TRIP method, the reported metrics on the WebVid-10M dataset were adopted as a reference. Therefore, these results are intended as an external reference comparison, since the compared methods were not evaluated under fully identical settings. Unless otherwise noted, all self-computed metrics reported in Table 2 were evaluated on videos normalized to a resolution of 1024 × 576 at 24 fps.

The evaluation was conducted from five aspects to characterize the overall generation quality. Inter-frame structural consistency was measured using 4-frame SSIM, which computes the mean Structural Similarity Index over a sliding window of four consecutive frames [18,19,20]. Motion plausibility was evaluated by Kinematic Error, defined as the average angular deviation with respect to biomechanical constraints derived from the Human3.6M dataset. Style and semantic consistency were assessed using Fréchet Inception Distance (FID) and Costume ΔE in the CIELAB color space, respectively. In addition, long-term temporal stability was evaluated using the F-Consistency metric adopted from [7], which measures temporal coherence across frames. To better evaluate identity consistency and cross-segment coherence, Costume ΔE and F-Consistency were further adopted as dedicated quantitative indicators.
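Two of these metrics can be sketched in a few lines. The fragment below assumes the CIE76 form of ΔE (Euclidean distance in CIELAB), since the specific color-difference formula is not stated, and it assumes that the 4-frame SSIM averages adjacent-pair scores within each window of four frames before averaging over windows. Both helper names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Lab = Tuple[float, float, float]


def delta_e_cie76(lab1: Lab, lab2: Lab) -> float:
    """CIE76 color difference: Euclidean distance in CIELAB space."""
    return math.dist(lab1, lab2)


def windowed_mean(pair_scores: Sequence[float], window: int = 4) -> float:
    """Mean over sliding windows of `window` frames.

    pair_scores[k] is a score (for example SSIM) between frame k and frame k + 1,
    so a window of four frames spans three adjacent pairs.
    """
    span = window - 1
    windows = [sum(pair_scores[k:k + span]) / span
               for k in range(len(pair_scores) - span + 1)]
    return sum(windows) / len(windows)


costume_shift = delta_e_cie76((50.0, 0.0, 0.0), (50.0, 3.0, 4.0))
four_frame_score = windowed_mean([1.0, 0.8, 0.6, 0.4])
```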

As presented in Table 2 (where ↑ denotes higher values and ↓ denotes lower values), the proposed framework exhibits competitive performance across multiple metrics related to structural consistency, motion plausibility, style preservation, and temporal stability. Given that the compared methods were not all evaluated under fully identical experimental conditions, these results are interpreted as indicative rather than conclusive. In terms of structural quality, the proposed framework achieved a 4-frame SSIM of 0.92, which is higher than that of Pika (0.73) and Runway (0.68) under the present reference setting. The higher SSIM value is consistent with the observed reduction in motion blur relative to these commercial baselines. Furthermore, the incorporation of biomechanical constraints within the Process Constraint Layer reduces the Kinematic Error to 0.18 rad, indicating improved motion plausibility and a lower degree of physically implausible distortion relative to the compared methods. Benefiting from the Global Constraint Graph, the proposed framework achieves the lowest Costume ΔE (4.3) and Style FID (18.2) among the compared methods, suggesting stronger identity preservation and style consistency over time. Finally, the F-Consistency reaches 94.82%, which is close to that of TRIP (95.36%), suggesting that the introduction of multimodal control does not lead to an obvious degradation in temporal stability.

4.4. Ablation Study

To quantify the contribution of each major component, ablation experiments were conducted under the same training and evaluation settings described in Section 4.2. Specifically, we considered four core variants by removing the scene-graph consistency module, disabling the three-tier constraint mechanism, replacing the dynamic weighting strategy with fixed equal weights, and removing the refinement stage. In addition, to further isolate the role of process-level motion regularization, we evaluated an additional variant without biomechanical constraints.

As shown in Table 3, removing the scene-graph module leads to the most evident degradation in cross-segment consistency-related metrics, including Costume ΔE and F-Consistency, which confirms its importance in preserving scene continuity and identity stability across multiple segments. Disabling the three-tier constraint mechanism produces a broader performance decline, especially in motion plausibility and subjective quality, indicating that hierarchical control signals play a key role in aligning generation with user intent. Replacing dynamic weighting with fixed equal weights also degrades overall performance, suggesting that adaptive modality balancing is beneficial for multimodal coordination under varying scene conditions. In contrast, removing the refinement stage causes a relatively smaller drop in objective metrics but still leads to a noticeable decline in MOS, implying that the refinement stage mainly contributes to perceptual smoothness and user-perceived quality.

Moreover, the variant without biomechanical constraints exhibits the largest increase in Kinematic Error, which further verifies the effectiveness of process-level motion regularization in suppressing physically implausible distortions. Overall, the ablation results demonstrate that the proposed framework benefits from the joint contribution of scene-graph reasoning, hierarchical control, adaptive multimodal weighting, and process-level refinement. All reported objective results were averaged over three independent runs, and statistical significance was assessed using paired two-sided tests with $p < 0.05$.

4.5. Subjective Quality Evaluation

Additionally, we conducted a double-blind subjective evaluation with 30 professional video creators using a 5-point Mean Opinion Score (MOS) scale (ranging from 1 to 5). From the test split, 40 prompts were randomly selected, and each method generated one corresponding short video for each prompt. All videos were anonymized and presented to the evaluators in randomized order under the same viewing conditions. Each video was assessed independently from three aspects, namely visual quality, user control precision, and multimodal fusion effect, using the same 5-point scoring scale. To reduce subjective bias, written scoring guidelines and several warm-up examples were provided before the formal evaluation, and each video was rated by at least 15 participants. The subjective scores were reported as mean values with 95% confidence intervals.
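The reporting of mean scores with 95% confidence intervals can be sketched as below. The example assumes a normal approximation (z = 1.96) and uses made-up ratings; with roughly 15 ratings per video a Student t quantile would give a slightly wider interval, and the text does not state which was used.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci(scores: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean, half-width) of a normal-approximation 95% confidence interval."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Hypothetical ratings on the 1-to-5 scale for a single video.
ratings = [4, 5, 4, 5]
mos, half = mos_with_ci(ratings)
```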

As statistically summarized in Table 4, the proposed framework achieved consistently strong subjective performance across the three evaluated dimensions, namely Visual Quality, User Control Precision, and Multimodal Fusion Effect. When averaged over these three dimensions, the proposed method attained an overall MOS of 4.34 ± 0.22 (95% CI), outperforming the commercial baselines, Pika (3.02 ± 0.41) and Runway (3.17 ± 0.38), by approximately 43% and 36%, respectively. The narrower confidence interval of the overall score further suggests the relative stability of the proposed framework in subjective perceptual quality.

Specifically, the framework achieved its highest rating in Visual Quality (4.40), which is consistent with the role of the biomechanical constraints in the Process Constraint Layer in reducing visually implausible motion artifacts such as limb distortion and unnatural transitions. It also obtained a high score in User Control Precision (4.38), indicating that the proposed framework can follow relatively complex user instructions more accurately than the commercial baselines.

However, a trade-off was also observed: 23% of evaluators reported initial operational complexity during constraint specification. This suggests that the gain in control precision comes at the cost of a higher cognitive load during interaction. This initial complexity mainly arose from the need to coordinate multiple control dimensions, including identity consistency, scene transition, and motion plausibility, during constraint specification. It also suggests that more intuitive parameter guidance and preset control templates may be needed to reduce the learning burden for users who are less familiar with multi-parameter editing. In terms of Multimodal Fusion Effect, the proposed method achieved a score of 4.25, suggesting that the scene graph-driven consistency modeling and multimodal attention mechanism effectively improve cross-modal coordination and multi-segment coherence.

5. Conclusions

This study presented a controllable multimodal fusion architecture for AIGC-driven short-video generation. To address the prevalent issues of intra-clip motion distortion and inter-clip identity drift, we developed a framework that integrates a hierarchical multimodal attention mechanism, scene graph-driven consistency modeling, and a three-tier constraint mechanism with biomechanical regularization. Under the adopted evaluation settings, the experiments indicate that this approach suppresses physically implausible motion artifacts while maintaining high structural consistency and style preservation across generated segments.

However, the proposed architecture currently exhibits several limitations. The reliance on graph reasoning and constraint optimization inevitably increases the computational cost of the generation pipeline. Additionally, the requirement for users to specify multiple control parameters introduces a noticeable learning curve, which was reflected as initial operational complexity in our subjective evaluations. Furthermore, the framework’s performance has primarily been validated on short-form multi-segment generation, leaving its scalability to extended narrative sequences largely unexplored.

Future research will therefore focus on optimizing the computational efficiency of the graph-based control modules to facilitate faster inference. We also plan to develop more intuitive, template-driven interaction mechanisms to better balance fine-grained controllability with ease of use. Finally, expanding the evaluation to larger-scale unified benchmarks involving long-range narrative video generation will be a critical next step to further validate the framework’s robustness.

Author Contributions

Conceptualization, Y.Z. and W.L.; Methodology, Y.Z.; Investigation, Y.Z.; Writing—original draft preparation, Y.Z. and W.L.; Writing—review and editing, C.F. and L.Y.; Funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Basic Research Program of Shaanxi (No. 2024JC-YBQN-0727) and the Research Program of School-enterprise cooperation (No. 441223064).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, Q.; Chen, S. New era of short video creation: The empowering value and challenges of AIGC. Film Telev. Prod. 2024, 30, 94–99.
2. Xu, M.; Du, H.; Niyato, D.; Kang, J.; Xiong, Z.; Mao, S.; Han, Z.; Jamalipour, A.; Kim, D.I.; Shen, X.; et al. Unleashing the Power of Edge-Cloud Generative AI in Mobile Networks: A Survey of AIGC Services. IEEE Commun. Surv. Tutor. 2024, 26, 1127–1170.
3. Feng, W.; Zhang, R.; Zhu, Y.; Wang, C.; Sun, C.; Zhu, X.; Li, X.; Taleb, T. Exploring Collaborative Diffusion Model Inferring for AIGC-Enabled Edge Services. IEEE Trans. Cognit. Commun. Netw. 2025, 11, 946–960.
4. Bie, F.; Yang, Y.; Zhou, Z.; Ghanem, A.; Zhang, M.; Yao, Z.; Wu, X.; Holmes, C.; Golnari, P.; Clifton, D.A.; et al. RenAIssance: A Survey Into AI Text-to-Image Generation in the Era of Large Model. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2212–2231.
5. Bevacqua, A.; Singha, T.; Pham, D.-S. Enhancing Semantic Segmentation with Synthetic Image Generation: A Novel Approach Using Stable Diffusion and ControlNet. In Proceedings of the 2024 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Perth, Australia, 2024; IEEE: New York, NY, USA, 2024; pp. 685–692.
6. Yin, R.; Liu, X. Enabling media production with AIGC and its ethical considerations. In Proceedings of the 2024 IEEE 10th International Conference on Edge Computing and Scalable Cloud (EdgeCom), Shanghai, China, 2024; IEEE: New York, NY, USA, 2024; pp. 100–105.
7. Zhang, Z.; Long, F.; Pan, Y.; Qiu, Z.; Yao, T.; Cao, Y.; Mei, T. TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2024; IEEE: New York, NY, USA, 2024; pp. 8671–8681.
8. Wang, H.; Ma, C.-Y.; Liu, Y.-C.; Hou, J.; Xu, T.; Wang, J.; Juefei-Xu, F.; Luo, Y.; Zhang, P.; Hou, T.; et al. LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2025; IEEE: New York, NY, USA, 2025; pp. 2578–2588.
9. Park, S.J.; Kim, M.; Choi, J.; Ro, Y.M. Exploring Phonetic Context-Aware Lip-Sync for Talking Face Generation. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 2024; IEEE: New York, NY, USA, 2024; pp. 4325–4329.
10. Hong, S.; Seo, J.; Shin, H.; Hong, S.; Kim, S. DirecT2V: Large language models are frame-level directors for zero-shot text-to-video generation. arXiv 2023, arXiv:2305.14330.
11. Lu, Y.; Wu, J.; Wang, M.; Fu, J.; Xie, W.; Wang, P.; Zhao, P. Design Transformation Pathways for AI-Generated Images in Chinese Traditional Architecture. Electronics 2025, 14, 282.
12. Wen, Y. Control application analysis of ControlNet plugin in AI-generated images. Film Telev. Prod. 2024, 30, 57–62.
13. Jayasumana, S.; Ramalingam, S.; Veit, A.; Glasner, D.; Chakrabarti, A.; Kumar, S. Rethinking FID: Towards a Better Evaluation Metric for Image Generation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2024; IEEE: New York, NY, USA, 2024; pp. 9307–9315.
14. Wyszecki, G.; Stiles, W.S. Color Science: Concepts and Methods, Quantitative Data and Formulae, 2nd ed.; Wiley: New York, NY, USA, 2000.
15. Li, M.; Zhang, H.; Xu, C.; Yan, C.; Liu, H.; Li, X. MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion. Electronics 2022, 11, 2999.
16. Liu, Z.; Wu, X.; Yu, Y. Multi-Task Video Captioning with a Stepwise Multimodal Encoder. Electronics 2022, 11, 2639.
17. Zhang, C.; Yu, Z.; Wang, X.; Chen, Z.-J.; Deng, C. Temporal-constrained parallel graph neural networks for recognizing motion patterns and gait phases in class-imbalanced scenarios. Eng. Appl. Artif. Intell. 2025, 143, 110106.
18. Martini, M. A Simple Relationship Between SSIM and PSNR for DCT-Based Compressed Images and Video: SSIM as Content-Aware PSNR. In Proceedings of the 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP), Poitiers, France, 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
19. Bakurov, I.; Buzzelli, M.; Schettini, R.; Castelli, M.; Vanneschi, L. Structural similarity index (SSIM) revisited: A data-driven approach. Expert Syst. Appl. 2022, 189, 116087.
20. Wei, P.; Wang, L.; Gan, J.; Shi, X.; Shang, M. Incorporation of Structural Similarity Index and Regularization Term into Neighbor2Neighbor Unsupervised Learning Model for Efficient Ultrasound Image Data Denoising. Appl. Sci. 2024, 14, 7988.

Figure 1. Comparison of Image Generation Results between Midjourney and Stable Diffusion. The text prompt used for generation is: “In a dreamlike forest, a sprite with gradient-colored wings shimmers with warm golden specks, in a fairy-tale style.”

Figure 2. Comparison of Video Frame Sequences Generated by Runway and Pika. The text prompt used for generation is: “A sprite in the dreamlike forest flutters its iridescent wings, spins through the air, and reaches out to touch the fireflies.”

Figure 3. Case Study: Analysis of Finger Distortion and Motion Blur in Paper-Grasping Action.

Figure 4. Case Study: Anatomical Incoherence and Physical Constraint Violations in Stair-Climbing.

Figure 5. Case Study on Visual Discontinuity in Scene Transitions (Café Scene). (a) Initial daytime scene with natural sunlight and shadows; (b) Background transitions to dusk, but foreground lighting remains unchanged; (c) Background transitions to night, yet the persistent “daytime” shadows on the table highlight a clear visual discontinuity.

Figure 6. Experiment on Character Identity Drift in Cross-Scene Generation: (a) Unified character reference image; (b,c) Two distinct scene reference images; (d,e) Corresponding generated results. A comparison between (d) and (e) reveals significant discrepancies in facial features despite using the identical character reference (a). This visual inconsistency highlights the limitation of current methods in maintaining character identity across different contexts (i.e., identity drift).

Figure 7. Framework of Hierarchical Multimodal Attention Fusion Mechanism.

Figure 8. Examples of Multimodal Fusion Results for the “Puppy Chasing Frisbee” scenario under Text-Dominant (W_{txt} = 0.5) and Image-Dominant (W_{img} = 0.5) configurations.

Figure 9. Framework of Scene Graph-driven Multi-segment Consistency Modeling.

Figure 10. Framework of Three-tier Constraint Mechanism.

Figure 11. Schematic Diagram of Multimodal Input Streams and Constraint Configurations for the “Puppy Chasing Frisbee” Generation Case.

Figure 12. Key frame sequences of the “Puppy Chasing Frisbee” generation case: (a) Initial segment: Puppy running on the grass (Biomechanical constraints ensure natural motion); (b) Transition segment: Viewpoint panning from field to street (Camera motion controlled by Transition Constraint Graph); (c) Final segment: Puppy chasing a frisbee on the street (Maintains subject identity consistency). The experiment demonstrates the dynamic evolution from a grassy field to a street scene, validating the proposed architecture’s capability to maintain visual continuity and logical coherence during multi-segment splicing.

Table 1. Runtime and memory comparison under the same hardware setting.

Model	Avg. Inference Time/5 s Clip	End-to-End Time/3-Segment Video	Peak GPU Memory
Backbone generator	9.1 s	27.3 s	12.4 GB
Proposed full framework	11.6 s	35.0 s	16.1 GB
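
The cost of the added control modules can be read off Table 1 as a relative overhead of the full framework over the backbone generator. The sketch below only restates the table values; the key names are illustrative and the percentage definition is an assumption about how one might summarize the table.

```python
# Values copied from Table 1 (same hardware setting).
# Key names are illustrative labels, not identifiers from the paper.
BACKBONE = {"clip_time_s": 9.1, "video_time_s": 27.3, "peak_mem_gb": 12.4}
FULL = {"clip_time_s": 11.6, "video_time_s": 35.0, "peak_mem_gb": 16.1}


def overhead_pct(full, base):
    """Relative increase of the full framework over the backbone, in percent."""
    return 100.0 * (full - base) / base


overheads = {key: overhead_pct(FULL[key], BACKBONE[key]) for key in BACKBONE}

for key, pct in overheads.items():
    print(f"{key}: +{pct:.1f}% over the backbone generator")
```

All three quantities rise by a little under one third. The backbone's 3-segment time equals three times its per-clip time, whereas the full framework's 35.0 s is slightly above 3 × 11.6 s = 34.8 s; the small remainder may reflect cross-segment processing, although the table itself does not break this down.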

Table 2. Performance comparison of the proposed framework with other approaches.

Metric	Pika	Runway	TRIP [7]	Proposed
Inter-frame Structural Consistency (4-frame SSIM ↑)	0.73	0.68	-	0.92
Kinematic error (rad ↓)	0.38	0.42	0.29	0.18
Style FID (↓)	34.7	32.5	28.1	18.2
Costume ΔE (↓)	18.6	16.9	12.4	4.3
Temporal coherence (F-Consistency ↑)	-	-	95.36%	94.82%

Table 3. Ablation study of the proposed framework under unified internal settings.

Model Variant	4-Frame SSIM ↑	Kinematic Error (Rad) ↓	Costume ΔE ↓	F-Consistency ↑	MOS ↑
w/o Scene Graph	0.86	0.24	8.9	91.10%	4.1
w/o Three-tier Constraints	0.84	0.29	7.8	90.45%	3.9
Fixed Modality Weights	0.87	0.23	6.5	92.03%	4.2
w/o Refinement	0.89	0.21	5.7	93.16%	4.3
w/o Biomechanical Constraints	0.88	0.31	5.2	93.01%	4.2
Full model	0.92	0.18	4.3	94.82%	4.6

↑ indicates that higher values are better; ↓ indicates that lower values are better.
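
One way to read Table 3 is to ask, for each metric, which removed component degrades the full model the most. The following sketch does this on the table values alone; the metric labels and helper names are hypothetical, and the comparison ignores statistical uncertainty, which the table does not report.

```python
# Rows copied from Table 3, in the order:
# (4-frame SSIM, kinematic error in rad, costume ΔE, F-Consistency in %, MOS).
ABLATION = {
    "w/o Scene Graph": (0.86, 0.24, 8.9, 91.10, 4.1),
    "w/o Three-tier Constraints": (0.84, 0.29, 7.8, 90.45, 3.9),
    "Fixed Modality Weights": (0.87, 0.23, 6.5, 92.03, 4.2),
    "w/o Refinement": (0.89, 0.21, 5.7, 93.16, 4.3),
    "w/o Biomechanical Constraints": (0.88, 0.31, 5.2, 93.01, 4.2),
    "Full model": (0.92, 0.18, 4.3, 94.82, 4.6),
}
METRICS = ("ssim", "kinematic_error", "costume_delta_e", "f_consistency", "mos")
LOWER_IS_BETTER = {"kinematic_error", "costume_delta_e"}


def degradation(variant):
    """Loss relative to the full model per metric; positive means worse."""
    full = ABLATION["Full model"]
    result = {}
    for name, value, reference in zip(METRICS, ABLATION[variant], full):
        if name in LOWER_IS_BETTER:
            result[name] = value - reference
        else:
            result[name] = reference - value
    return result


def most_harmful(metric):
    """Variant whose removal degrades `metric` the most."""
    variants = [name for name in ABLATION if name != "Full model"]
    return max(variants, key=lambda name: degradation(name)[metric])


for metric in METRICS:
    print(f"{metric}: largest degradation for '{most_harmful(metric)}'")
```

On these numbers, removing the biomechanical constraints hurts the kinematic error most and removing the scene graph hurts costume ΔE most, which matches the roles the paper assigns to those components.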

Table 4. Subjective MOS comparison across methods.

Evaluation Dimension	Runway	Pika	Proposed
Visual Quality	3.50	3.30	4.40
User Control Precision	2.90	2.80	4.38
Multimodal Fusion Effect	3.10	2.95	4.25

The 95% CI (confidence interval) indicates a range within which the true mean is expected to lie with 95% confidence, reflecting the statistical uncertainty of subjective ratings.
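
As an illustration of how such an interval is typically obtained from raw ratings, the sketch below computes a mean opinion score with a normal-approximation 95% CI. The ratings are hypothetical and are not data from this study, and the source does not state which interval formula the authors used; for small panels a Student t quantile would give a somewhat wider interval than the 1.96 factor assumed here.

```python
import math
import statistics


def mean_ci95(ratings):
    """Return (mean, half_width) of a normal-approximation 95% CI for the mean."""
    n = len(ratings)
    mean = statistics.fmean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(n)
    return mean, 1.96 * sem


# Hypothetical 1-5 ratings from ten evaluators; NOT data from the study.
ratings = [4, 5, 4, 4, 5, 4, 3, 5, 4, 5]
mean, half_width = mean_ci95(ratings)
print(f"MOS = {mean:.2f} ± {half_width:.2f} (95% CI)")
```

A narrower half-width results either from more evaluators or from closer agreement among them, which is why the interval width is used above as an indicator of rating stability.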

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
