Review

Toward Intelligent AIoT: A Comprehensive Survey on Digital Twin and Multimodal Generative AI Integration

1 Hebei Key Laboratory of Marine Perception Network and Data Processing, Northeastern University at Qinhuangdao, Qinhuangdao 066004, China
2 Department of Clinical Laboratory, The Affiliated Dongguan Songshan Lake Central Hospital, Guangdong Medical University, Dongguan 523326, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
‡ These authors also contributed equally to this work.
Mathematics 2025, 13(21), 3382; https://doi.org/10.3390/math13213382
Submission received: 10 September 2025 / Revised: 12 October 2025 / Accepted: 13 October 2025 / Published: 23 October 2025

Abstract

The Artificial Intelligence of Things (AIoT) is rapidly evolving from basic connectivity to intelligent perception, reasoning, and decision making across domains such as healthcare, manufacturing, transportation, and smart cities. Multimodal generative AI (GAI) and digital twins (DTs) provide complementary solutions: DTs deliver high-fidelity virtual replicas for real-time monitoring, simulation, and optimization, while GAI enhances cognition, cross-modal understanding, and the generation of synthetic data. This survey presents a comprehensive overview of DT–GAI integration in the AIoT. We review the foundations of DTs and multimodal GAI and highlight their complementary roles. We further introduce the Sense–Map–Generate–Act (SMGA) framework, illustrating their interaction through the SMGA loop. We discuss key enabling technologies, including multimodal data fusion, dynamic DT evolution, and cloud–edge–end collaboration. Representative application scenarios, including smart manufacturing, smart cities, autonomous driving, and healthcare, are examined to demonstrate their practical impact. Finally, we outline open challenges, including efficiency, reliability, privacy, and standardization, and we provide directions for future research toward sustainable, trustworthy, and intelligent AIoT systems.

1. Introduction

The Artificial Intelligence of Things (AIoT) has emerged as a frontier domain that has attracted increasing attention from both academia and industry. The rapid development of the AIoT enables ubiquitous connectivity and intelligence across diverse fields, ranging from healthcare and smart manufacturing to autonomous transportation and urban management. Nevertheless, the heterogeneous and multimodal nature of IoT data, combined with resource limitations at the edge, poses significant challenges for efficient perception, representation, and decision making. Recent advances in multimodal generative AI (GAI) offer promising solutions for fusing heterogeneous data sources, enhancing cross-modal understanding, and generating synthetic data. Meanwhile, digital twins (DTs) provide a dynamic virtual–physical mapping framework that can simulate, predict, and optimize AIoT systems. The convergence of these two paradigms opens opportunities to achieve adaptive, trustworthy, and resource-aware intelligence in AIoT environments. However, despite their potential, current research on DT–GAI integration remains fragmented.

1.1. Background and Motivation

The AIoT has emerged as a frontier for ubiquitous connectivity and intelligence across diverse fields from healthcare to autonomous transportation. However, the heterogeneous and multimodal nature of IoT data, combined with resource limitations at the edge, poses significant challenges for efficient perception, representation, and decision making. Recent advances in multimodal GAI and DTs offer a promising path forward. GAI provides solutions for fusing heterogeneous data sources and enhancing cross-modal understanding, while DTs offer a dynamic virtual–physical mapping framework to simulate, predict, and optimize AIoT systems. The convergence of these two paradigms—GAI’s cognitive generation and the DT’s structural fidelity—opens new opportunities for adaptive and trustworthy intelligence. Despite this potential, research on their synergistic integration remains fragmented, lacking a unified conceptual framework, systematic analysis of cross-domain challenges, and rigorous evaluation metrics. This survey aims to bridge these critical gaps.

1.2. Research Questions

To structure this survey in a systematic and problem-driven way, we frame the discussion around three research questions (RQs):
  • RQ1 (Conceptual Complementarity): How can the distinctive roles of DTs and GAI be conceptualized as complementary pillars of AIoT intelligence, and what theoretical framework can unify their contributions to virtualization and cognition?
  • RQ2 (System-Level Challenges): What domain-independent advantages, limitations, and recurring system-level challenges (e.g., efficiency, reliability, privacy, standardization) can be synthesized from a cross-literature comparison of DT and GAI approaches?
  • RQ3 (Formal Models and Evaluation): In what ways can DT–GAI interaction be formalized through functional or probabilistic models and optimization frameworks, and which mathematical indicators provide rigorous criteria for evaluating robustness, complexity, cross-modal accuracy, and twin–reality consistency?
These RQs serve not only as an organizing device for the survey but also as a means of transforming a descriptive review into a problem-driven inquiry.
Methodologically, this survey adopts a structured yet narrative approach. The RQs defined above serve as the organizing framework. Relevant studies were identified through keyword-based searches in major digital libraries (IEEE Xplore, ACM Digital Library, Web of Science, Scopus) using combinations such as “digital twin,” “generative AI,” “multimodal,” and “AIoT.” We retained peer-reviewed articles and influential preprints that explicitly addressed DTs, GAI, or their integration within IoT/AIoT contexts while excluding purely conceptual works without technical relevance. The selected studies were synthesized and categorized according to the RQs, ensuring transparency and reproducibility while maintaining the flexibility of a narrative survey.

1.3. Contributions

Several recent surveys have explored related areas, including digital twins for IoT applications, multimodal learning for AI, or large language models in general-purpose contexts. However, these studies remain limited in scope. Surveys on digital twins typically emphasize system modeling and industrial deployment while overlooking the role of GAI in enhancing cognition and data augmentation. Conversely, surveys on GAI often focus on architectural advances or vision-language models without considering their integration with cyber–physical systems. To the best of our knowledge, no prior survey has provided a comprehensive synthesis that explicitly frames DTs and GAI as complementary paradigms for the AIoT. Furthermore, mathematical formalization and evaluation metrics have been largely absent in prior works, leaving a gap that this paper directly addresses. Table 1 summarizes representative surveys on DTs and GAI, highlighting their scope and focus.
Framing our contributions around the three research questions defined above, this survey provides the following advances:
  • Addressing RQ1 (Complementarity). We systematically review how DTs and GAI tackle different aspects of AIoT intelligence—DTs providing structured and temporally grounded digital replicas, and GAI contributing semantic reasoning, synthetic data generation, and cross-modal adaptation. We synthesize these insights into a unified framework, which is illustrated through mechanisms such as the Sense–Map–Generate–Act (SMGA) loop.
  • Addressing RQ2 (Comparative Challenges). We conduct a critical review of representative approaches—GANs, diffusion models, and large language models (LLMs)—and highlight their distinct strengths and limitations across domains such as manufacturing, healthcare, and transportation. We identify common challenges including efficiency, reliability, privacy, and standardization.
  • Addressing RQ3 (Mathematical Formalization and Evaluation). We move beyond descriptive accounts by proposing functional and probabilistic models of DT–GAI interaction, formulating optimization problems under edge resource constraints, and consolidating evaluation practices into a set of mathematical indicators for robustness, complexity, accuracy, and twin–reality consistency.

1.4. Literature Search and Screening Process

To improve transparency and reproducibility, we explicitly document the literature search and screening process applied in this survey. The literature was retrieved from major academic databases, including IEEE Xplore, ACM Digital Library, SpringerLink, ScienceDirect, and Web of Science. The search employed the following Boolean query:
("digital twin" OR "DT") AND ("generative AI" OR "GAI" OR "multimodal")
AND ("AIoT" OR "IoT" OR "edge computing")
AND ("pruning" OR "quantization" OR "knowledge distillation")
This query integrates both application-oriented keywords (“digital twin, generative AI, multimodal, AIoT”) and methodological keywords (“pruning, quantization, knowledge distillation”), ensuring comprehensive coverage across conceptual and technical dimensions.
The process began with an initial database search yielding 2346 records. An additional 185 records were identified through other sources, primarily reference snowballing, bringing the total to 2531 identified records. After removing 612 duplicates, 1919 unique articles proceeded to the screening stage. During the title and abstract screening, 1352 records were excluded as they did not meet the predefined inclusion criteria. The remaining 567 articles were then assessed for eligibility through a full-text review. In this final stage, 390 articles were excluded for reasons such as a lack of relevance to our research questions or insufficient methodological detail. This rigorous process resulted in a final corpus of 177 studies included in our synthesis. The entire literature selection process is visually summarized in the PRISMA flow diagram in Figure 1.
To support this process, EndNote was employed for reference management and duplicate detection, while Rayyan was used to assist in title/abstract screening with keyword highlighting, structured tagging, and blind checks by multiple authors. Disagreements were resolved through discussion among the authors. Additionally, backward snowballing (screening the references of included studies) and forward snowballing (checking citations of key papers via Google Scholar and Web of Science) were performed to identify further relevant works not captured by the initial query. This multi-step process ensured rigor, reduced the risk of omission, and enhanced the reproducibility of the review.
The remainder of this paper is organized as follows. Section 2 introduces the foundations of DTs and multimodal GAI in the context of AIoT. Section 3 presents the proposed SMGA framework and details how DTs and GAI interact across the Sense–Map–Generate–Act loop. Section 4 discusses key enabling technologies that support the integration, including multimodal data fusion, dynamic DT evolution, and cloud–edge–end collaboration. Section 5 surveys representative application domains such as smart manufacturing, smart cities, autonomous driving, and healthcare. Section 6 outlines the main challenges and future research directions. Finally, Section 7 concludes the paper.

2. Foundations: Digital Twins and Multimodal Generative AI

2.1. DTs in AIoT

Digital twins were initially developed for industrial manufacturing applications, where they modeled physical systems through virtual entities to enable precise replication and monitoring in a digital space [11]. The core concept involves establishing bidirectional connections between virtual and real environments with real-time feedback and interaction promoting system optimization and evolution. With the rapid development of AI and IoT technologies, the application boundaries of digital twins have continuously expanded. DTs have been gradually introduced into multiple domains including smart cities, intelligent transportation, healthcare, and intelligent manufacturing [12,13,14,15], evolving into a key enabling technology capable of supporting real-time monitoring, predictive analytics, and autonomous decision making [16]. In AIoT scenarios, digital twins act not only as virtual representations of individual devices but also as a unified platform for data integration, state visualization, and simulation prediction, thereby effectively supporting complex interactions and intelligent optimization across cross-domain systems.
DTs are characterized by three core attributes. The first is virtual–physical mapping, which aims to establish digital replicas in virtual space that are highly consistent with physical entities. This typically relies on IoT sensing devices for real-time data collection, utilizing edge computing nodes and cloud platforms to collect and transmit multimodal data from physical objects to virtual models in real time, enabling the construction of high-fidelity digital replicas [17,18,19]. A compact formulation is
$$x_t = g(s_t, h_t, c_t), \qquad y_t = h(x_t),$$
where $g$ maps sensor, historical, and control data into the latent state and $h$ maps the latent state to observable outputs; here, $s_t$ represents real-time sensor inputs, $h_t$ denotes historical states, $c_t$ encodes control or configuration parameters, $x_t$ is the latent state of the twin, and $y_t$ is the observable output of the virtual model.
The second is real-time interaction, where digital twins not only ingest data from physical entities but also return control or optimization instructions to them, achieving dynamic feedback loops [20]. Existing solutions typically rely on bidirectional data flow architectures, event-driven message queues, and low-latency edge computing systems to ensure that virtual model decisions can be quickly and accurately applied to physical systems. This interaction can be abstracted as
$$x_{t+1} = f(x_t, u_t, w_t),$$
where $f$ specifies the state transition under given control inputs and disturbances, $u_t$ denotes control signals generated by the twin and applied to the physical system, and $w_t$ represents environmental disturbances.
Finally, iterative optimization allows a DT to continuously update itself based on historical data, predictive models, and optimization algorithms, driven by AI and big data technologies, enabling the accuracy and adaptability of virtual models to improve over time [21]. This primarily depends on historical data accumulation, predictive model training, and optimization algorithm iteration. Twin models utilize historical operational data and environmental change information, performing pattern learning and future state prediction through machine learning or deep learning models and adjusting virtual model parameters through optimization algorithms, enabling virtual systems to better reflect the evolutionary patterns of physical systems. A typical prediction–correction scheme can be expressed as
$$\hat{x}_{t+1} = f(x_t, \theta) + \epsilon_t,$$
where $\hat{x}_{t+1}$ is the predicted state, $\theta$ are the model parameters, and $\epsilon_t$ is the prediction error. The parameters are iteratively updated by minimizing the loss function:
$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(x_{t+1}, \hat{x}_{t+1}),$$
where $\eta$ is the learning rate and $\mathcal{L}$ measures the discrepancy between predicted and observed states. This mechanism improves the adaptability and accuracy of virtual models over time [21].
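To make this prediction–correction loop concrete, the sketch below implements the gradient update in PyTorch. The small MLP predictor `f_theta`, its dimensions, and the MSE loss are illustrative assumptions rather than components of any surveyed system.

```python
import torch

# Minimal predict-correct loop for a DT state model (illustrative sketch).
# f_theta is a hypothetical one-step predictor standing in for f(x_t, theta).
f_theta = torch.nn.Sequential(
    torch.nn.Linear(8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 8)
)
optimizer = torch.optim.SGD(f_theta.parameters(), lr=1e-3)  # eta

def twin_update(x_t: torch.Tensor, x_next_observed: torch.Tensor) -> torch.Tensor:
    """Predict the next state, then correct theta against the observed state."""
    x_next_pred = f_theta(x_t)                                       # f(x_t, theta)
    loss = torch.nn.functional.mse_loss(x_next_pred, x_next_observed)  # L(x_{t+1}, x_hat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                    # theta <- theta - eta * grad_theta L
    return x_next_pred.detach()
```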
In AIoT scenarios, the primary role of DTs is to serve as a unified platform integrating multi-source heterogeneous data collected by distributed sensors and edge devices, achieving global state visualization and predictive simulation through modeling [22]. For instance, in intelligent transportation systems, DTs can integrate data from vehicles, roadside units, and environmental sensors to provide traffic congestion prediction and collaborative scheduling [23]. In smart healthcare, DTs can combine physiological signals, imaging data, and behavioral patterns to construct personalized health twins that assist in diagnostic and therapeutic decision making [24].
However, when facing highly complex and dynamically changing environments, traditional modeling and optimization methods often fail to fully exploit latent correlations and provide high-quality data support. To address these limitations, DTs are increasingly being enhanced by multimodal GAI, which can generate synthetic data consistent with real scenarios, support model training under few-shot conditions, and enable semantic reasoning and content creation across modalities such as speech, images, text, and time-series signals [25]. Building on these foundations, the next subsection introduces the rise of multimodal GAI and its key role in contextual understanding and content generation.

2.2. Multimodal Generative AI: The Rise of Contextual Understanding and Creation

GAI represents the most transformative technological direction in the current AI field, primarily encompassing three core model architectures: large language models (LLMs), generative adversarial networks (GANs), and diffusion models.
In the field of text generation, the development of LLMs originated from the revolutionary innovation of the Transformer architecture, which completely reshaped the theory and practice of sequence modeling through attention mechanisms [26]. By 2024, the Transformer architecture had achieved substantial breakthroughs in key dimensions including training stability, computational efficiency, and model fine-tuning, particularly achieving milestone progress in long-context large language models [27], effectively overcoming the technical bottlenecks of early models in long-text processing.
In the field of image synthesis, GANs, as typical representatives of first-generation deep generative models, were first proposed by Goodfellow et al. [28], laying the technical foundation for multimodal content generation. They introduced adversarial training formalized as
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],$$
where $G$ is the generator and $D$ is the discriminator. This canonical min–max formulation defines the adversarial learning framework, with extensions such as Conditional GANs (cGANs) and StyleGAN improving controllability and quality. GANs provide a general and powerful technical pathway for generative modeling through adversarial training mechanisms. The introduction of cGANs enabled controllable generation under specific constraints [29], while advanced variants such as StyleGAN further enhanced the quality and controllability of generated images [30]. For medical image synthesis tasks, a comparative analysis between latent denoising diffusion probabilistic models and generative adversarial networks provides important theoretical guidance for selecting appropriate generative models [31]. However, the inherent defects of GANs in training stability and mode collapse have constrained their application depth and breadth in complex multimodal tasks.
As an emerging paradigm, the rise of diffusion models marks a shift in the field of generative modeling. The denoising diffusion probabilistic model (DDPM) proposed by Ho et al. established a solid theoretical foundation for this field [32], with subsequent research validating that diffusion models can achieve sample quality surpassing state-of-the-art generative models in image synthesis tasks [33]. The DDPM objective is
$$\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{x_0, \epsilon, t}\!\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right],$$
where $x_t$ is the noisy sample at step $t$ and $\epsilon_\theta$ predicts the noise. The latent diffusion model proposed by Rombach et al. significantly improved the computational efficiency of high-resolution image synthesis by performing the diffusion process in latent space [34]. As a new generation of powerful generative models, diffusion models can generate high-fidelity samples across domains, but they still need to address core challenges, including accelerating the time-consuming iterative generation process and enhancing the controllability and guidance of generation.
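As an illustration of the DDPM objective, the following sketch computes the noise-prediction loss for one training step. The noise-prediction network `eps_theta` and the precomputed `alphas_cumprod` schedule are hypothetical placeholders, not artifacts of any cited implementation.

```python
import torch

def ddpm_loss(eps_theta, x0: torch.Tensor, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Simplified DDPM objective: predict the noise added at a random timestep.
    eps_theta(x_t, t) is a hypothetical noise-prediction network."""
    batch = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (batch,))          # random timestep t
    a_bar = alphas_cumprod[t].view(batch, *([1] * (x0.dim() - 1)))   # cumulative alpha_bar_t
    eps = torch.randn_like(x0)                                       # true Gaussian noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps               # forward-process sample
    return torch.nn.functional.mse_loss(eps_theta(x_t, t), eps)      # ||eps - eps_theta||^2
```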
Collectively, LLMs, GANs, and diffusion models each provide distinct strengths across text, vision, and multimodal synthesis. Their convergence has fueled the growth of deep multimodal learning, which is a field unifying cross-modal representation and reasoning [35]. Baltrušaitis et al. [36] identified five challenges—representation, translation, alignment, fusion, and co-learning—which were later extended to six dimensions including reasoning and quantification [37]. These frameworks have driven rapid progress in multimodal learning, particularly in tackling missing modality data [38].
At the implementation level, progress manifests in four directions: (1) vision–language models (e.g., CLIP, BLIP, Flamingo, OFA) [39,40,41,42,43]; (2) audio–text alignment models (e.g., AudioPaLM, SALMONN) [44,45]; (3) cross-modal representation learning with contrastive learning as the dominant method [46,47,48]; and (4) multimodal prompt engineering, including multimodal chain-of-thought and visual prompting [49,50,51,52,53,54]. Together, these developments highlight the trajectory of GAI toward more unified, adaptable, and generalizable multimodal intelligence.
The development of vision-language models marks an important breakthrough in multimodal alignment technology. The CLIP model developed by OpenAI learns transferable visual representations through natural language supervision, demonstrating excellent zero-shot performance on multiple image classification benchmarks, even surpassing specialized fine-tuned visual models [39]. Jia et al. [40] further explored effective pathways for scaling visual and vision-language representation learning through noisy text supervision, laying important foundations for large-scale multimodal pre-training technologies. The subsequently launched BLIP model achieved a unified architecture for vision-language understanding and generation through bootstrapped language-image pre-training mechanisms [41], while the Flamingo model fully demonstrated the powerful potential of vision-language models in few-shot learning scenarios [42]. The OFA model successfully unified architectural design, task processing, and modality fusion through a concise sequence-to-sequence learning framework [43], marking milestone progress in multimodal unified architecture design. These vision-language models have achieved significant advances in cross-modal semantic alignment through the deep learning of large-scale vision-language data during the pre-training phase [4], and they are regarded as potential technical pathways toward artificial general intelligence.
In audio–text alignment technology, the AudioPaLM model constructs a unified multimodal architecture by fusing text-based PaLM-2 and speech-based AudioLM language models, fully demonstrating the powerful capabilities of large language models in speech processing [44]. SALMONN, as a Speech Audio Language Music Open Neural Network, achieves the unified processing and understanding of speech, audio, language, and music through an innovative dual-encoder architecture [45], demonstrating the technical potential for the deep integration of audio and text modalities.
Cross-modal representation learning, as the core technological foundation for achieving multimodal alignment and fusion, significantly enhances model performance through the organic integration of diverse data types including text, images, audio, and video [55]. Based on an in-depth survey of over 200 related papers, Wang et al. [46] showed that multimodal alignment and fusion research focuses on fusion mechanisms of visual and language modalities, helping researchers gain a deep understanding of basic concepts, core methodologies, and current technological progress. Contrastive learning has become the mainstream method for constructing shared representation spaces [47], achieving efficient alignment by maximizing the similarity between relevant modalities while minimizing the similarity between irrelevant modalities. Huang et al. [48] conducted an in-depth analysis of the performance and technical challenges faced by multimodal large language models across different tasks, while the latest survey by Gupta et al. [47] further analyzed the alignment mechanisms, benchmarks, evaluation methods, and technical challenges of large vision-language models, comprehensively showcasing the latest development trends in this field.
Multimodal prompt engineering, as an application-oriented emerging paradigm, achieves the rapid adaptation and optimization of models by designing task-specific prompts for large pre-trained models. Early multimodal cognitive architecture research profoundly elucidated the central role of perception in intelligent behavior [49], laying theoretical foundations for subsequent development. Traditional chain-of-thought (CoT) methods focus on single language modalities, while multimodal chain-of-thought innovatively incorporates text and visual information into a two-stage processing framework [50] with the first stage performing reasoning generation based on multimodal information and the second stage utilizing generated reasoning results for final answer decisions. Fine-grained multimodal prompt learning frameworks achieve a precise learning of global difference recognition and subtle discriminative details between specific visual categories through dual-granularity visual prompt schemes [51] while converting random vectors containing category names into category-specific discriminative representations. The emergence of visual prompting techniques provides new technological pathways for more fine-grained and free-form visual instruction processing [52], offering effective approaches for rapid model adaptation and optimization. Systematic research on vision-language foundation model prompt engineering covers multimodal-to-text generation models, image–text matching models, and text-to-image generation models [53] with NVIDIA’s technical guidelines further providing systematic practical guidance for vision-language models in image and video understanding [54].

2.3. GAI as Cognitive Augmentation Layer for ML-Based DT

In contemporary digital twin implementations, the integration of generative AI as a cognitive augmentation layer represents a hierarchical architecture where GAI capabilities enhance rather than replace existing ML-based systems [56]. Within this framework, the “Generate” layer operates as a higher-order cognitive interface that interprets and extends the outputs from the “Map” layer’s traditional machine learning models. Specifically, generative models such as variational autoencoders and generative adversarial networks can synthesize missing data patterns or predict future states based on historical ML predictions, while explanatory large language models provide natural language interpretations of complex ML outputs, making them accessible to non-technical stakeholders [57]. This vertical integration enables a bidirectional information flow: the Map layer provides structured, data-driven predictions grounded in sensor measurements and physical models, while the Generate layer contextualizes these predictions within broader operational narratives, generates what-if scenarios, and facilitates human-in-the-loop decision making through conversational interfaces [58]. For instance, in smart manufacturing scenarios, conventional ML models might detect anomalies in equipment vibration patterns, while GAI layers can automatically generate maintenance reports, explain root causes in natural language, and simulate alternative intervention strategies [59].
The cognitive augmentation paradigm is particularly crucial for reflecting IoT object evolution in pre-existing data-driven DT systems, where physical assets undergo modifications, operational context changes, or deployment in novel environments not captured in original training datasets [60]. Traditional ML-based digital twins often suffer from model drift and reduced accuracy when confronted with such evolutionary changes, requiring costly retraining procedures [22]. GAI addresses this limitation through several mechanisms: firstly, generative models can perform domain adaptation by synthesizing training samples that bridge the gap between historical and evolved operational conditions; secondly, large language models can incorporate unstructured contextual information—such as maintenance logs, operator notes, and design modification documents—to update the DT’s knowledge base without explicit model retraining [61]. This layered architecture proves essential in real-world implementation scenarios characterized by resource constraints and operational continuity requirements. For example, in smart city applications where infrastructure evolves incrementally through upgrades and modifications, the GAI layer can maintain semantic consistency across system versions while the underlying ML models continue to operate on established patterns, thus avoiding service disruptions associated with complete system overhauls [62]. Furthermore, this approach enables gradual migration strategies where organizations can preserve investments in existing ML-based DT infrastructure while progressively enhancing capabilities through modular GAI components, thereby reducing implementation risks and facilitating stakeholder buy-in [21].

2.4. The Symbiosis: Why DT and GAI Are Complementary Partners

Multimodal GAI provides powerful capabilities for addressing the modeling and data limitations of traditional DTs noted above. By integrating information from diverse modalities such as vision, speech, text, and sensor data, it enables richer semantic understanding and more flexible knowledge transfer. Beyond simple perception, multimodal generative AI can create new representations, synthesize missing information, and support cross-domain adaptation. These features allow AIoT systems to move from passive data analysis toward proactive reasoning and content generation, which is essential for achieving higher-level cognition and creativity in complex environments. This interaction can be abstracted as a bidirectional mapping:
$$D_{\text{GAI}} = \phi(D_{\text{DT}}, y), \qquad E_{\text{DT}} = \psi(E_{\text{GAI}}, u),$$
where $D_{\text{DT}}$ denotes multimodal data streams from DTs, $y$ represents latent priors for GAI, $D_{\text{GAI}}$ represents synthetic data generated by GAI, $E_{\text{DT}}$ denotes DT state updates, and $E_{\text{GAI}}$ represents generative policies or predictions.
In addition, digital twins serve as a critical enabler for enhancing AIoT intelligence. By creating virtual replicas of physical entities, processes, and environments, digital twins provide a dynamic platform for real-time monitoring, simulation, and optimization. They allow AIoT systems to test decision strategies in safe virtual environments before deployment, predict system behaviors under different conditions, and continuously improve performance through feedback loops. This capability not only reduces risks and costs but also enhances adaptability and resilience in dynamic and resource-constrained AIoT scenarios.
At the same time, many AIoT domains such as industry and healthcare face significant challenges in acquiring sufficient high-quality fault data or rare scenario data for model training. GAI offers a promising solution by producing realistic synthetic data from limited real samples, thereby enriching and augmenting training sets to improve robustness and generalization. Its cross-modal generation ability, particularly in multimodal settings, enables the integration of diverse information sources—for example, aligning visual content with textual descriptions to support more accurate reasoning and decision making. Building on these foundations, the integration of digital twins and multimodal GAI is increasingly regarded as a key direction for advancing AIoT. Digital twins provide the structural foundation for building high-fidelity virtual representations of physical systems, whereas multimodal GAI offers the cognitive and creative capabilities needed to interpret, generate, and interact with such representations. Together, they open a pathway toward more intelligent, adaptive, and trustworthy AIoT applications.
DTs and GAI exhibit a powerful symbiotic relationship, forming perfect complementarity at the data, environment, and functional levels. IBM research indicates that GAI can construct inputs and comprehensive outputs for digital twins, while digital twins can provide powerful testing and learning environments for GAI [63]. As real-time virtual replicas of physical systems, digital twins mirror the topology, states, and environments of real networks, providing high-quality, multimodal, spatiotemporally correlated training and validation data for GAI models [64]. Conversely, GAI endows digital twins with intelligent enhancement capabilities. Technical insights from Plain Concepts [65] further point out that GAI can enhance the accuracy and operational efficiency of network digital twins, possessing complex reasoning and decision-making capabilities and the ability to automatically generate code and models. Research on EdgeAgentX-DT [66] validates the effectiveness of this integration in providing resilient edge intelligence for tactical networks, while Nielsen Norman Group’s research [67] further explores how digital twins simulate human behavior through GAI, demonstrating the enormous potential of this symbiotic relationship in enhancing system intelligence levels and autonomy.
To contextualize the role of DTs and GAI in AIoT, Table 2 compares existing surveys across research objects, mathematical focus, strengths/limitations, and application domains. As shown, prior surveys either emphasize DT or GAI in isolation, whereas our survey explicitly unifies them under a common problem-driven framework.

3. A Framework for Integration

3.1. The Proposed SMGA Architecture for AIoT

In the previous section, we reviewed the fundamental concepts and core characteristics of DTs and multimodal GAI, as well as their complementary relationships in AIoT scenarios. It can be observed that DTs provide high-fidelity, multimodal environmental support through virtual–real mapping and real-time interaction, while multimodal GAI, with its capabilities in context understanding and generation, endows DTs with the potential for intelligent decision making and content creation. The integration of the two not only facilitates the realization of more intelligent and adaptive AIoT systems but also opens up new avenues for prediction, optimization, and control in complex environments. To further elaborate on their integration mode, this section proposes a holistic architecture based on Sense–Map–Generate–Act (SMGA), aiming to systematically depict the deep coupling and synergy between DTs and multimodal GAI in the AIoT.

3.1.1. Sense Layer

The Sense Layer is located at the bottom of the SMGA architecture, directly facing the physical world. Its core task is to acquire raw data streams through multimodal sensors (e.g., cameras, LiDAR, accelerometers, audio collectors, etc.). These data cover multiple dimensions such as visual, spatial, motion, and environmental aspects, providing multi-source inputs for DT construction and the reasoning of generative AI. Formally, the multimodal observation at time t can be expressed as
$$X(t) = \{ x^{(1)}(t), x^{(2)}(t), \ldots, x^{(M)}(t) \},$$
where $x^{(m)}(t)$ denotes the observation from the $m$-th sensor modality.
Due to sensor heterogeneity and dynamic data distribution, the Sense Layer needs to focus on addressing issues such as data noise, missing data, and time delay. A common formulation considers noise-contaminated signals as
$$\tilde{x}^{(m)}(t) = x^{(m)}(t) + \epsilon^{(m)}(t), \qquad \epsilon^{(m)}(t) \sim \mathcal{N}(0, \sigma_m^2),$$
where $\tilde{x}^{(m)}(t)$ represents the noisy measurement and $\epsilon^{(m)}(t)$ models Gaussian noise with variance $\sigma_m^2$.

3.1.2. Map Layer

The Map Layer serves as a bridge from raw sensing data to the system’s state representation, constructing and maintaining the State Digital Twin (DT_S) of the physical environment. Its main functions include data cleaning, alignment, and synchronization to eliminate temporal deviations and format differences among sensors. Specifically, temporal alignment is achieved by mapping each sensor stream into a unified timeline via interpolation or resampling:
$$\bar{x}^{(m)}(t) = \mathcal{I}_m\!\left[ x^{(m)}(\tau_m(t)) \right], \qquad \tau_m(t) = \alpha_m t + \beta_m - \Delta_m,$$
where $\alpha_m$ and $\beta_m$ denote scale and offset, $\Delta_m$ is the estimated latency, and $\mathcal{I}_m(\cdot)$ is an interpolation operator.
For spatial alignment, sensor measurements are transformed into a common world coordinate system using rigid-body transformations:
$$\tilde{p}^{W} = T_m \, \tilde{p}^{(m)}, \qquad T_m = \begin{bmatrix} R_m & t_m \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \in SE(3),$$
where $R_m$ and $t_m$ denote the rotation and translation for sensor $m$. For camera modalities, projection into the image plane is given by
$$\tilde{u} \simeq K_m [R_m \mid t_m] \, \tilde{p}^{W}, \qquad u = \frac{1}{\tilde{u}_3} \begin{bmatrix} \tilde{u}_1 \\ \tilde{u}_2 \end{bmatrix},$$
where $K_m$ is the intrinsic matrix and the division by $\tilde{u}_3$ realizes the homogeneous normalization $\Pi(\cdot)$.
After alignment, multimodal measurements are fused to form a consistent observation vector. A common strategy is uncertainty-aware weighted averaging:
$$z_t = \sum_{m=1}^{M} w_m(t) \, \bar{x}_W^{(m)}(t), \qquad w_m(t) = \frac{\left( \sigma_m^2(t) + \delta_m \right)^{-1}}{\sum_{j=1}^{M} \left( \sigma_j^2(t) + \delta_j \right)^{-1}},$$
where $\sigma_m^2(t)$ denotes the estimated noise variance and $\delta_m$ prevents degeneracy.
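This uncertainty-aware weighting can be implemented in a few lines. The NumPy sketch below assumes per-modality variance estimates are already available; the example observations and noise levels are illustrative.

```python
import numpy as np

def fuse_modalities(x_bar: np.ndarray, sigma2: np.ndarray,
                    delta: float = 1e-6) -> np.ndarray:
    """Uncertainty-aware weighted fusion of M aligned sensor observations.
    x_bar: (M, D) aligned observations; sigma2: (M,) noise variance estimates.
    delta regularizes near-zero variances (the delta_m term)."""
    inv_var = 1.0 / (sigma2 + delta)     # (sigma_m^2 + delta_m)^{-1}
    weights = inv_var / inv_var.sum()    # normalized fusion weights w_m
    return weights @ x_bar               # z_t = sum_m w_m * x_bar^(m)

# Example: three modalities observing a 2-D state with different noise levels;
# the noisiest modality (variance 0.5) contributes the least to z_t.
obs = np.array([[1.0, 2.0], [1.1, 1.9], [0.7, 2.6]])
z_t = fuse_modalities(obs, sigma2=np.array([0.01, 0.02, 0.5]))
```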
For representation learning, sensor data are embedded into a shared semantic space:
$$e^{(m)}(t) = \phi_m\!\left( \bar{x}_W^{(m)}(t) \right), \qquad e_t = \Psi\!\left( \{ e^{(m)}(t) \} \right),$$
with alignment encouraged by minimizing the discrepancy between embeddings:
$$\mathcal{L}_{\text{align}} = \sum_{m<n} \left\| e^{(m)}(t) - e^{(n)}(t) \right\|_2^2 .$$
Finally, the DT state is updated via a standard state-space formulation:
$$s_t = f(s_{t-1}, u_t) + w_t, \qquad y_t = h(s_t) + v_t,$$
where $f$ and $h$ denote transition and observation functions with process noise $w_t$ and observation noise $v_t$. The posterior distribution is estimated using Bayesian filtering:
$$p(s_t \mid y_{1:t}) \propto p(y_t \mid s_t) \int p(s_t \mid s_{t-1}) \, p(s_{t-1} \mid y_{1:t-1}) \, ds_{t-1} .$$
Alternatively, state estimation can be cast as an optimization problem:
$$s_t^{*} = \arg\min_{s} \; \| h(s) - y_t \|_{R}^{2} + \lambda \, \| s - f(s_{t-1}, u_t) \|_{Q}^{2},$$
balancing observation fidelity and dynamic consistency.
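A minimal sketch of this optimization-based state update is given below, using SciPy’s general-purpose minimizer. The toy linear models `f` and `h`, the identity weighting matrices, and the Nelder–Mead solver are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_state(y_t, s_prev, u_t, f, h, R_inv, Q_inv, lam=1.0):
    """Optimization-based DT state update: trade observation fidelity
    against dynamic consistency (f and h are user-supplied models)."""
    def objective(s):
        r_obs = h(s) - y_t             # observation residual, weighted by R^{-1}
        r_dyn = s - f(s_prev, u_t)     # deviation from predicted dynamics, weighted by Q^{-1}
        return r_obs @ R_inv @ r_obs + lam * (r_dyn @ Q_inv @ r_dyn)
    s0 = f(s_prev, u_t)                # warm start at the dynamics prediction
    return minimize(objective, s0, method="Nelder-Mead").x

# Example with toy linear dynamics/observation models (assumed, not from the survey).
f = lambda s, u: 0.9 * s + u
h = lambda s: s
s_t = estimate_state(np.array([1.2, 0.4]), np.array([1.0, 0.5]),
                     np.array([0.1, 0.0]), f, h, np.eye(2), np.eye(2))
```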
In this process, the Map Layer not only ensures the real-time performance and accuracy of the DT_S state representation $s_t$ but also provides this high-fidelity state information as a reliable input for the upper-layer Generate Layer. It is worth noting that the Map Layer must balance data quality and computational efficiency while supporting multimodal data fusion and standardization, so that the Generate Layer can conduct analysis and deduction based on a consistent and unified world state.

3.1.3. Generate Layer

In the SMGA architecture, the Generate Layer represents the reasoning and prediction core that builds on the high-fidelity DT state provided by the Map Layer. By incorporating generative AI models, this layer is able to analyze the current system context, reason about potential outcomes, and generate multiple candidate strategies.
To formalize this process, different generative models can be employed:
Variational Autoencoders (VAEs). Given an observed state $x$, a latent representation $z$ is inferred via an encoder $q_\phi(z|x)$ and reconstructed through a decoder $p_\theta(x|z)$. The optimization objective is the Evidence Lower Bound (ELBO) [68]:
$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}\!\left[ \log p_\theta(x|z) \right] - D_{\text{KL}}\!\left( q_\phi(z|x) \,\|\, p(z) \right).$$
Generative Adversarial Networks (GANs). A generator $G_\theta(z)$ maps noise $z \sim p(z)$ to candidate strategies, while a discriminator $D_\phi(\cdot)$ distinguishes real from generated data. The objective is the standard GAN min–max problem [28]:
$$\min_{G} \max_{D} \; \mathcal{L}_{\text{GAN}} = \mathbb{E}_{x \sim p_{\text{data}}}[\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D_\phi(G_\theta(z)))].$$
Diffusion Models. These models gradually add Gaussian noise to data through a forward process $q(x_t | x_{t-1})$ and learn a reverse denoising process parameterized by $\theta$ [32]:
$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{t, x, \epsilon}\!\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right],$$
where $\epsilon_\theta$ predicts the added noise at timestep $t$.
Transformers. Given an input sequence of multimodal tokens $\{x_1, \ldots, x_n\}$, the self-attention mechanism is [26]
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V,$$
where $Q = XW^{Q}$, $K = XW^{K}$, and $V = XW^{V}$.
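The self-attention computation can be sketched directly from the formula. The single-head NumPy implementation below, with randomly initialized projection matrices and illustrative dimensions, is for exposition only.

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over token embeddings X of shape (n, d_model).
    The projection matrices stand in for learned parameters W^Q, W^K, W^V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V

# Example: 4 multimodal tokens of dimension 8, projected to d_k = d_v = 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, *(rng.normal(size=(8, 8)) for _ in range(3)))
```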
By combining these generative mechanisms, the Generate Layer can not only analyze the current system context but also simulate diverse futures, reason about potential outcomes, and produce a set of candidate strategies. Through predictive rollouts, it can evaluate trade-offs in performance, safety, energy efficiency, and cost. For instance, in smart manufacturing, the Generate Layer can output alternative production schedules and assess their influence on throughput, energy consumption, and equipment lifetime. In this way, the Generate Layer equips AIoT systems with both tactical optimization and long-term strategic foresight. Ultimately, this layer outputs a set of candidate strategies, denoted as $\pi_1, \pi_2, \ldots, \pi_k$, which are then passed to the Act Layer for rigorous validation in a simulated environment.

3.1.4. Act Layer

The Act Layer functions as the critical bridge between virtual planning and physical execution, ensuring that generated strategies are safe, effective, and optimized before deployment. Its core component is the Predictive Sandbox Twin (DT_P), which is a dedicated simulation environment distinct from the DT_S.
The process within the Act Layer follows a structured validation loop:
  • Simulation: It receives the set of candidate strategies $\pi_i$ from the Generate Layer. Within the risk-free DT_P environment, each strategy is simulated to predict its potential outcomes and consequences on the system.
  • Validation: Following simulation, a Policy Validation Module evaluates each strategy’s performance. This evaluation is not arbitrary; it is performed using a comprehensive validation function, $V(\pi_i)$, which quantifies the strategy’s quality based on the Key Performance Indicators (KPIs) detailed in Section 3.2, such as accuracy (Equation (16)), robustness (Equation (24)), and computational efficiency (Equation (29)).
  • Decision: A strategy is deemed ‘valid’ if its score exceeds a predefined acceptance threshold ($V(\pi_i) > \tau$). The highest-scoring valid strategy is selected as the optimal strategy, $\pi^{*}$, and it is translated into executable instructions for deployment to actuators in the physical world.
  • Feedback and Refinement: Conversely, strategies that fail the validation ($V(\pi_i) \le \tau$) are rejected. Crucially, information about the failure—for instance, which specific KPIs were not met—is compiled into a refinement signal and fed back to the Generate Layer. This intelligent feedback guides the model to produce improved strategies in subsequent iterations, addressing the shortcomings of previous attempts.
Through this closed loop of simulation, validation, and feedback, the Act Layer not only guarantees trustworthy execution but also facilitates continual learning, enabling the AIoT system to evolve toward greater intelligence and adaptability.
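The simulation–validation–decision–feedback cycle can be summarized in compact Python. The validation function, threshold, and scalar candidate strategies below are hypothetical stand-ins for the DT_P simulation and KPI-based scoring.

```python
from typing import Callable, Iterable, Optional

def act_layer(candidates: Iterable,
              validate: Callable[[object], float],
              tau: float) -> tuple[Optional[object], list]:
    """Act Layer validation loop (illustrative): score each candidate strategy
    in the sandbox twin, keep the best one above the threshold, and collect
    refinement feedback for the rejected ones."""
    feedback, best, best_score = [], None, float("-inf")
    for pi in candidates:
        score = validate(pi)                # V(pi_i) from simulated KPIs in DT_P
        if score > tau and score > best_score:
            best, best_score = pi, score    # current optimal strategy pi*
        elif score <= tau:
            feedback.append((pi, score))    # refinement signal for the Generate Layer
    return best, feedback

# Example with a toy scalar validation function (assumed for illustration).
best, feedback = act_layer([0.2, 0.8, 0.5], validate=lambda p: p, tau=0.4)
```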
In summary, the proposed SMGA architecture provides a holistic closed-loop framework that seamlessly integrates sensing, mapping, generation, and action for an intelligent AIoT. At the bottom of the framework, the Sense Layer continuously collects heterogeneous multimodal data from the physical world, while the Map Layer processes these inputs through cleaning, alignment, and synchronization to construct a high-fidelity DT_S representation. Building on this virtual mirror, the Generate Layer employs advanced generative AI models—including Transformers, diffusion models, and large vision-language models (LVLMs)—to reason about system dynamics, predict future states, and produce a set of candidate strategies. These strategies are subsequently verified within DT_P by the Act Layer’s validation module, which translates the optimal plan into executable commands for physical actuators and simultaneously feeds execution results back into the system for refinement. Through this iterative cycle, the SMGA architecture enables AIoT systems to achieve adaptive decision making, robust control, and lifelong learning in complex environments. Specifically, the interplay among the four layers of the SMGA framework and their closed-loop operation are illustrated in Figure 2.

3.2. Quantitative Evaluation Metrics for DT–GAI Integration

3.2.1. Accuracy and Performance Metrics

In the context of DT and GAI integration, evaluating model performance requires a set of well-established, widely used quantitative metrics. Accuracy and performance metrics are fundamental for assessing the predictive capabilities of models across various tasks, including classification, regression, and multi-task scenarios.
  • Classification Metrics
    For classification tasks, common metrics include accuracy, precision, recall, and F1-score. Accuracy measures the proportion of correctly predicted instances over the total number of samples:
    $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
    where $TP$ denotes the number of true positives, $TN$ the number of true negatives, $FP$ the number of false positives, and $FN$ the number of false negatives. Precision quantifies the fraction of correctly predicted positive instances among all predicted positives:
    $$\text{Precision} = \frac{TP}{TP + FP}.$$
    Recall evaluates the fraction of actual positive instances that are correctly identified:
    $$\text{Recall} = \frac{TP}{TP + FN}.$$
    The F1-score, as the harmonic mean of precision and recall, balances the trade-off between false positives and false negatives:
    $$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$
    Threshold-independent metrics such as the ROC-AUC (Receiver Operating Characteristic Area Under the Curve) and the PR-AUC (Precision–Recall Area Under the Curve) are also widely adopted to compare classifier performance under varying decision thresholds.
  • Regression Metrics
    For regression or continuous prediction tasks, commonly used metrics include the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination ($R^2$). These metrics are defined as shown below:
    $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}},$$
    $$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i |,$$
    $$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$
    where $y_i$ and $\hat{y}_i$ denote the ground truth and predicted values, respectively, $\bar{y}$ is the mean of the true values, and $n$ is the total number of samples.
  • Multi-Task and Composite Metrics
    In scenarios where DT–GAI frameworks handle multiple tasks simultaneously, composite metrics that aggregate performance across tasks are often adopted. The weighted averages of F1-scores or task-specific accuracies provide a holistic evaluation, enabling fair comparison across heterogeneous datasets or modalities.
Overall, accuracy and performance metrics offer a fundamental and universally recognized basis for evaluating the predictive capabilities of DT–GAI systems. They are widely applicable, interpretable, and facilitate benchmarking across different models and experimental setups, forming a cornerstone of the quantitative evaluation framework described in this section.
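For reference, the classification and regression metrics defined above translate directly into code. The NumPy sketch below assumes binary 0/1 labels and nonzero denominators; it is illustrative rather than a benchmarking harness.

```python
import numpy as np

def classification_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Accuracy, precision, recall, and F1 from binary predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {"accuracy": (tp + tn) / (tp + tn + fp + fn),
            "precision": precision,
            "recall": recall,
            "f1": 2 * precision * recall / (precision + recall)}

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MSE, RMSE, MAE, and R^2 as defined above."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"mse": mse, "rmse": np.sqrt(mse),
            "mae": np.mean(np.abs(err)),
            "r2": 1 - np.sum(err ** 2) / ss_tot}
```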

3.2.2. Robustness Metrics

Robustness metrics are essential for evaluating the reliability of DT–GAI systems under uncertain or adverse conditions. These metrics quantify how models respond to noise, anomalies, or perturbations, and they assess their stability across different datasets and experimental setups.
  • Sensitivity to Noise or Outliers
    One common approach is to introduce controlled noise or simulate outliers in the input data and observe the model’s performance degradation. Let $M_0$ denote the performance of the model on clean data, and let $M_{\text{noise}}$ denote its performance on noisy data. The degradation rate can be calculated as
    $$\text{Performance Drop} = \frac{M_0 - M_{\text{noise}}}{M_0} \times 100\%.$$
  • Adversarial Robustness
    Adversarial robustness evaluates the model’s resistance to adversarial examples crafted to induce incorrect predictions. Metrics include the adversarial success rate or accuracy on perturbed inputs. Formally, if $\hat{y}_i^{\text{adv}}$ is the prediction on an adversarial input $x_i^{\text{adv}}$, the adversarial accuracy is
    $$\text{Adversarial Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\left( \hat{y}_i^{\text{adv}} = y_i \right),$$
    where $\mathbb{1}(\cdot)$ is the indicator function and $y_i$ is the true label.
  • Model Stability Metrics
    Stability metrics evaluate how consistently a model performs under repeated experiments or cross-validation folds. One widely used measure is the variance of performance across $k$-fold cross-validation:
    $$\text{CV Variance} = \frac{1}{k} \sum_{j=1}^{k} (M_j - \bar{M})^2,$$
    where $M_j$ is the performance metric (e.g., accuracy, F1-score) on the $j$-th fold and $\bar{M}$ is the mean performance across all folds. Another indicator is the difference between training and testing performance, which reflects overfitting or underfitting:
    $$\text{Train–Test Gap} = M_{\text{train}} - M_{\text{test}}.$$
Overall, robustness metrics provide critical insight into the reliability and trustworthiness of DT–GAI systems in real-world environments, particularly when data may be noisy, incomplete, or intentionally manipulated. They complement accuracy and performance metrics by highlighting the model’s resilience to perturbations and variability.
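The performance-drop and cross-validation variance measures translate directly into code, as in the illustrative sketch below; the example scores are invented for demonstration.

```python
import numpy as np

def performance_drop(m_clean: float, m_noise: float) -> float:
    """Relative degradation (in %) between clean-data and noisy-data performance."""
    return (m_clean - m_noise) / m_clean * 100.0

def cv_variance(fold_scores: np.ndarray) -> float:
    """Variance of a performance metric across k cross-validation folds."""
    return float(np.mean((fold_scores - fold_scores.mean()) ** 2))

# Example: a model losing about 4 accuracy points under injected sensor noise.
drop = performance_drop(0.92, 0.88)                          # ~4.35%
var = cv_variance(np.array([0.90, 0.93, 0.91, 0.92, 0.89]))  # fold-to-fold stability
```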

3.2.3. Cross-Modal and Multi-Source Metrics

Cross-modal and multi-source metrics evaluate how effectively DT–GAI systems integrate information from multiple modalities or heterogeneous data sources. They are key for applications where data fusion is essential.
  • Consistency Across Modalities
    Consistency metrics measure agreement between predictions or representations derived from different modalities. For two modalities A and B, a simple consistency score can be calculated as
    $$C_{\text{modal}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\left( \hat{y}_i^{A} = \hat{y}_i^{B} \right),$$
    where $\hat{y}_i^{A}$ and $\hat{y}_i^{B}$ are the predictions from modalities A and B for sample $i$.
  • Domain Adaptation Accuracy
    Cross-domain accuracy evaluates the model’s generalization when applied to a target domain different from the training domain. Let $M_{\text{source}}$ and $M_{\text{target}}$ denote performance on the source and target domains; the adaptation accuracy can be expressed as
    $$\text{Domain Accuracy} = \frac{M_{\text{target}}}{M_{\text{source}}} \times 100\%.$$
  • Fusion Efficiency
    Fusion efficiency assesses how well multi-source data are combined to improve model performance without excessive computational overhead. It can be quantified as the performance gain per unit of computational cost:
    $$\text{Fusion Efficiency} = \frac{M_{\text{fusion}} - \max(M_{\text{single}})}{\text{FLOPs}_{\text{fusion}}},$$
    where $M_{\text{fusion}}$ is the performance of the fused model, $\max(M_{\text{single}})$ is the best single-modality performance, and $\text{FLOPs}_{\text{fusion}}$ represents the computational cost of the fused model.
Cross-modal and multi-source metrics are crucial for evaluating DT–GAI systems in scenarios involving heterogeneous or complementary information sources, ensuring both reliability and efficiency in multimodal fusion.
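A compact sketch of the consistency and fusion-efficiency computations is given below; the accuracy and FLOPs values in the example are illustrative placeholders.

```python
import numpy as np

def modal_consistency(pred_a: np.ndarray, pred_b: np.ndarray) -> float:
    """Fraction of samples on which two modalities agree (C_modal)."""
    return float(np.mean(pred_a == pred_b))

def fusion_efficiency(m_fusion: float, m_singles: list[float],
                      flops_fusion: float) -> float:
    """Performance gain of the fused model over the best single modality,
    normalized by the fused model's computational cost."""
    return (m_fusion - max(m_singles)) / flops_fusion

# Example: fusion adds 3 accuracy points over the best unimodal model
# at a cost of 2 GFLOPs per inference (numbers are illustrative).
eff = fusion_efficiency(0.91, [0.88, 0.85], flops_fusion=2e9)
```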

3.2.4. Computational and Resource Metrics

Computational and resource metrics are critical for evaluating the efficiency and deployability of DT–GAI systems, particularly in edge or resource-constrained environments. These metrics quantify the computational demand, memory footprint, and energy consumption of models.
  • Inference Time
    Inference time measures the duration a model takes to process an input and generate an output. Formally, for a set of $n$ inputs, the average inference time is
    $$T_{\text{inference}} = \frac{1}{n} \sum_{i=1}^{n} t_i,$$
    where $t_i$ is the processing time for the $i$-th input.
  • Model Parameter Count
    The number of trainable parameters provides an indicator of model complexity and storage requirements. Let $L$ denote the total number of layers and $p_l$ the number of parameters in layer $l$; then
    $$P_{\text{total}} = \sum_{l=1}^{L} p_l .$$
  • FLOPs (Floating Point Operations)
    FLOPs quantify the total number of arithmetic operations required for a single forward pass. This is widely used to compare the computational cost between models:
    $$\text{FLOPs}_{\text{total}} = \sum_{l=1}^{L} \text{FLOPs}_l .$$
  • Memory and Energy Consumption
    Memory usage and energy consumption assess the model’s suitability for deployment on edge devices. Let $M_{\text{model}}$ denote memory usage in bytes and $E_{\text{inference}}$ the energy consumed per inference; these metrics can be measured empirically using profiling tools.
These metrics collectively inform trade-offs between model performance and resource efficiency, which is crucial for practical DT–GAI deployment scenarios.
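As a practical illustration, the following sketch measures average inference time and parameter count for a PyTorch model; simple wall-clock timing stands in for more precise profiling tools, and the small model is a hypothetical example.

```python
import time
import torch

def average_inference_time(model: torch.nn.Module,
                           inputs: list[torch.Tensor]) -> float:
    """Empirical T_inference: mean wall-clock time per forward pass."""
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for x in inputs:
            model(x)
        return (time.perf_counter() - start) / len(inputs)

def parameter_count(model: torch.nn.Module) -> int:
    """P_total: sum of trainable parameters across all layers."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example on a small hypothetical model.
net = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 4))
t_inf = average_inference_time(net, [torch.randn(1, 16) for _ in range(100)])
n_params = parameter_count(net)
```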

4. Key Enabling Technologies

4.1. Multimodal Data Fusion and Representation Learning

Multimodal data fusion represents a fundamental approach in AI research that aims to integrate heterogeneous data sources into a unified representation space, thereby enabling effective cross-modal interaction and complementary information exchange. This integration addresses the inherent limitations of unimodal data regarding representational completeness, robustness, and generalization capabilities while establishing a solid foundation for advanced multimodal AI applications [69]. Through cross-modal alignment and joint modeling, multimodal AI systems achieve deep semantic correlations and knowledge transfer. In DT applications, these technologies have demonstrated broad applicability across diverse tasks including image–text alignment and generation, cross-modal retrieval, and video synthesis [40]. Multimodal data typically encompass visual, auditory, textual, and sensor-based modalities along with structured and semi-structured data formats. These data types exhibit significant heterogeneity in temporal dynamics, spatial properties, semantic content, and structural constraints. Such heterogeneity not only complicates fusion strategies but also directly impacts the effectiveness of representation learning and model alignment. This section examines two critical aspects: (i) data processing techniques for both unimodal and multimodal inputs, and (ii) fusion strategies specifically designed for multimodal scenarios.

4.1.1. Unimodal and Multimodal Data Processing Techniques

Unimodal learning approaches model features and perform tasks based on single data sources with the primary objective of efficiently extracting domain-specific information to highlight task-relevant features. Multimodal learning, conversely, jointly models data from multiple sources to establish unified representation spaces that enable feature alignment and cross-modal complementarity. Current research focuses on three main categories: feature representation, multimodal alignment, and multimodal agent technologies.
Feature representation techniques utilize deep neural networks to extract high-dimensional embeddings from unimodal data, establishing the foundation for subsequent cross-modal modeling. Krizhevsky et al. demonstrated the effectiveness of convolutional neural networks (CNNs) in capturing local spatial structures in images while incorporating cross-modal attention mechanisms that facilitate early-stage information interaction across modalities, thereby enhancing representation completeness and robustness [70]. Similarly, Graves et al. employed recurrent neural networks (RNNs) and Transformer architectures, showing strong capabilities in modeling temporal dependencies for speech recognition and sequential processing tasks [71].
Mathematically, unimodal embedding can be represented as
$$h^{(m)} = f_{\theta_m}(x^{(m)}), \qquad x^{(m)} \in \mathbb{R}^{d_m},$$
where $x^{(m)}$ denotes the input of modality $m$, $f_{\theta_m}$ is the feature extractor (e.g., CNN, RNN), and $h^{(m)}$ is the learned embedding.
Unimodal approaches can provide efficient and concise representations when information comes from a single modality. However, when system states depend on multiple information sources, unimodal approaches often fail to capture underlying complexity, resulting in limitations in representational completeness, robustness, and cross-domain generalization.
Multimodal alignment techniques address semantic inconsistencies among different modalities through cross-modal embeddings and contrastive learning, projecting heterogeneous modalities into unified semantic spaces. The CLIP model exemplifies this approach, achieving remarkable results in image–text alignment tasks [72].
The standard cross-modal contrastive loss is
$$\mathcal{L}_{\mathrm{CL}} = -\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(h_v^{(i)}, h_t^{(i)})/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(h_v^{(i)}, h_t^{(j)})/\tau\big)},$$
where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\tau$ is the temperature parameter.
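As a concrete illustration, the following PyTorch sketch computes this loss for a batch of paired image–text embeddings; the batch size, embedding width, and temperature value are illustrative assumptions rather than settings from [72].

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h_v: torch.Tensor, h_t: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Image-to-text contrastive loss over N paired embeddings of shape (N, d)."""
    h_v = F.normalize(h_v, dim=-1)           # cosine similarity becomes a dot product
    h_t = F.normalize(h_t, dim=-1)           # after L2 normalization
    logits = h_v @ h_t.T / tau               # (N, N) similarity matrix scaled by 1/tau
    targets = torch.arange(h_v.size(0))      # the i-th image matches the i-th text
    return F.cross_entropy(logits, targets)  # mean of the -log softmax terms in L_CL

loss = contrastive_loss(torch.randn(64, 512), torch.randn(64, 512))
```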
At present, scholars primarily focus on (i) incorporating knowledge graphs and logical constraints during alignment to enhance semantic consistency, (ii) developing dynamic alignment mechanisms that adaptively adjust fusion strategies based on task requirements, and (iii) advancing weakly supervised and unsupervised alignment methods that leverage generative models and pseudo-labeling for effective alignment under limited paired samples. Adibfar and Costin [13] apply DT fusion in infrastructure monitoring, defining a transformation $\Phi: \mathrm{ITS} \to \mathrm{BrIM}$, where ITS data are mapped to bridge information models. Cavalieri and Gambadoro [73] present semantic mapping from DTDL to OPC UA, which is formalized as $\psi: \mathcal{O}_{\mathrm{DTDL}} \to \mathcal{O}_{\mathrm{OPC}}$. Such mappings demonstrate the potential of algebraic structures for DT interoperability.
The evaluation of multimodal DTs also introduces mathematical indicators. Wu et al. [6] describe DT network fidelity as
$$\Delta = \frac{1}{T} \sum_{t=1}^{T} \left\| x_t - \hat{x}_t \right\|_2,$$
where $\hat{x}_t$ is the DT-predicted state. Fei et al. [3] emphasize the computational complexity of fusion:
$$\Gamma = \sum_{v=1}^{V} O(d_v \cdot T),$$
where $d_v$ represents the modality dimension and $T$ represents the temporal length.
In healthcare applications, Laubenbacher et al. [74] formalize patient-specific DTs as systems of differential equations:
$$\frac{dx}{dt} = g\big(x(t), u(t), \theta\big),$$
with multimodal data used to estimate the parameter vector $\theta$. GAI-based generative priors can fill in missing modalities, strengthening predictive robustness. Similarly, Vallee [75] highlights the optimization-based alignment of imaging and physiological signals.
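To make this formulation concrete, the sketch below numerically integrates such a patient-specific model with SciPy, assuming a simple linear form for $g$ and made-up parameters; none of these choices come from [74].

```python
import numpy as np
from scipy.integrate import solve_ivp

def g(t, x, u, theta):
    """Illustrative linear dynamics g(x, u, theta) = A(theta) x + B u(t)."""
    A = np.array([[-theta[0], theta[1]],
                  [0.0,       -theta[2]]])   # decay and coupling rates (assumed)
    B = np.array([1.0, 0.0])                 # input enters the first state only (assumed)
    return A @ x + B * u(t)

u = lambda t: 1.0 if t < 5.0 else 0.0        # e.g., a bolus-style treatment input
theta = [0.5, 0.2, 0.1]                      # parameters estimated from multimodal data
sol = solve_ivp(g, (0.0, 24.0), [0.0, 0.0], args=(u, theta), dense_output=True)
x_hat = sol.sol(np.linspace(0.0, 24.0, 97))  # DT-predicted state trajectory over 24 h
```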
These mathematical formulations demonstrate that DT–fusion research emphasizes linear algebraic integration, probabilistic inference for missing modalities, and system-level optimization under constraints. Together with DT–GAI modeling, they provide a holistic toolkit for formalizing next-generation AIoT intelligence.
Multimodal agent techniques integrate multimodal perception capabilities into intelligent agents to enhance their modeling and reasoning performance in complex environments. Formally, given the following multimodal inputs
$$X = \{ x^{(v)}, x^{(a)}, x^{(t)}, x^{(s)} \},$$
representing visual, auditory, textual, and sensor data, the agent learns a joint embedding:
$$z = f_\theta\big( x^{(v)}, x^{(a)}, x^{(t)}, x^{(s)} \big),$$
where $f_\theta(\cdot)$ is a multimodal encoder that aligns heterogeneous features into a unified latent space.
Unlike traditional agents relying on unimodal input, multimodal agents combine visual, auditory, textual, and sensor data to construct comprehensive environmental representations. Recent advances focus on the following:
(i) Unified architectures integrating perception and decision making, where decisions follow the formula below:
$$\pi(a \mid s) = \mathrm{softmax}(W z + b),$$
enabling agents to dynamically weight modalities during inference [76];
(ii) Multi-agent collaborative frameworks, in which agents share cross-modal embeddings through
$$z_i' = z_i + \sum_{j \in \mathcal{N}(i)} \alpha_{ij} z_j,$$
with $\alpha_{ij}$ denoting attention-based collaboration weights;
(iii) Integration with DT systems, which are modeled as continuous virtual–real updates:
$$M_{t+1} = M_t + \Delta M(z, E_t),$$
where $M_t$ is the DT state and $E_t$ represents real-world feedback [77]. These developments have expanded multimodal AI applicability in autonomous driving and human–computer interaction while providing theoretical foundations for cross-modal collaborative decision-making research.
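A minimal PyTorch sketch of these ingredients (joint embedding, softmax policy, and attention-weighted peer sharing) is given below; the encoder widths, mean-pooled fusion, and action count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAgent(nn.Module):
    """Joint embedding z = f_theta(x) and policy pi(a|s) = softmax(W z + b)."""
    def __init__(self, dims=None, d_z=256, n_actions=8):
        super().__init__()
        dims = dims or {"v": 512, "a": 128, "t": 768, "s": 32}  # visual/audio/text/sensor
        self.encoders = nn.ModuleDict({m: nn.Linear(d, d_z) for m, d in dims.items()})
        self.head = nn.Linear(d_z, n_actions)                   # parameters W and b

    def forward(self, x: dict):
        # f_theta: project each modality into the shared space, then mean-fuse.
        z = torch.stack([self.encoders[m](x[m]) for m in x]).mean(dim=0)
        return F.softmax(self.head(z), dim=-1), z               # pi(a|s) and embedding z

def collaborate(z_i, peers, alpha):
    """z_i' = z_i + sum_j alpha_ij z_j, with attention weights alpha assumed given."""
    return z_i + sum(a * z_j for a, z_j in zip(alpha, peers))

agent = MultimodalAgent()
x = {m: torch.randn(1, d) for m, d in {"v": 512, "a": 128, "t": 768, "s": 32}.items()}
pi, z = agent(x)
```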
In AIoT and DT systems, unimodal and multimodal approaches offer complementary advantages: unimodal methods provide lightweight, efficient solutions for localized perception and specific tasks, while multimodal approaches excel in handling heterogeneous multi-source information, enabling global modeling and virtual–real interaction. With increasing demand for complex system modeling and high-fidelity simulations, multimodal approaches are increasingly recognized as core drivers for the intelligent evolution of AIoT and DT systems.

4.1.2. Multimodal Fusion Techniques Under Multimodal Conditions

In multimodal data fusion research, federated learning has emerged as an effective mechanism for cross-domain modeling while preserving privacy [78]. Unlike centralized approaches, federated multimodal learning trains local sub-models at each client and exchanges only parameters or gradients, mitigating privacy and compliance risks associated with raw data sharing. However, practical deployment faces several challenges: (i) modality heterogeneity, where clients may possess only unimodal data, resulting in incomplete cross-modal information during global aggregation; (ii) data distribution disparities, as sample sizes and quality vary across organizations, potentially weakening model consistency; and (iii) communication and computational overhead, as large-scale multimodal models and frequent parameter synchronization impose significant bandwidth and energy requirements.
To address these challenges, recent research has proposed approaches that can be categorized into six main types:
(1) Cross-modal shared representation methods construct unified latent semantic spaces to achieve feature alignment and complementarity across modalities, enhancing global model capacity. Che et al. proposed representation flattening with knowledge distillation to mitigate inter-modal distribution gaps and improve multimodal consistency and generalization [79]. For a modality $v$, the input $X^{(v)}$ is projected into a shared space:
$$H^{(v)} = W^{(v)} X^{(v)}, \qquad v = 1, 2, \ldots, V.$$
All modality representations are then flattened into a unified representation:
$$H = \mathrm{Flatten}\big(H^{(1)}, H^{(2)}, \ldots, H^{(V)}\big).$$
Zhou et al. introduced the MDE framework, employing modality-specific encoders for diverse data types combined with attention mechanisms and contrastive learning to achieve semantic alignment [80]. Chen et al. designed EFCOMFF-Net, which is a multi-scale feature fusion network incorporating feature correlation enhancement, aggregation attention, and refinement modules to reduce multi-scale feature discrepancies [81]. Some scholars developed a VAE-based model leveraging latent space sampling to uncover implicit relationships between multi-source data and traffic flow dynamics [68].
(2) Missing modality completion methods address cases where clients possess only unimodal data by using generative models or lightweight feature translators to synthesize missing modalities, improving system robustness and adaptability. Poudel et al. proposed cross-modal prototype regularization with contrastive mechanisms to enhance stability under missing-modality conditions [82]. The contrastive objective is
$$\mathcal{L}_{\mathrm{contrastive}} = -\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(h_i, h_i^{+})/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(h_i, h_j)/\tau\big)},$$
where $\mathrm{sim}(\cdot)$ represents similarity between representations, and $\tau$ is a temperature parameter.
Bao et al. applied bottleneck feature translation for modality synthesis, maintaining high accuracy while reducing computational and communication costs [83]. They applied feature translation networks to reconstruct missing modalities, minimizing the reconstruction loss (a minimal translator of this kind is sketched after this list):
$$\mathcal{L}_{\mathrm{recon}} = \big\| \hat{X}_m - X_m \big\|_2^2.$$
(3) Knowledge distillation and modality-specific aggregation methods emphasize effective multimodal knowledge integration during federated aggregation through distillation or modality-weighting mechanisms, enhancing global model performance under privacy-preserving constraints. The FedMEKT algorithm exemplifies this approach through embedding-based distillation methods that achieve cross-modal knowledge transfer without raw data sharing [84]. They achieve this through the following loss function:
$$\mathcal{L}_{KD}^{(k)} = \sum_{m=1}^{M} \mathrm{KL}\Big( \sigma\big(Z_m^{(k)}/T\big) \,\Big\|\, \sigma\big(Z_m^{(\mathrm{global})}/T\big) \Big).$$
Aggregation across modalities can be weighted (see the aggregation sketch after this list):
$$\theta_{\mathrm{global}} = \sum_{v=1}^{V} \alpha_v \theta_v, \qquad \sum_{v=1}^{V} \alpha_v = 1.$$
(4) Personalized federated multimodal learning methods introduce personalized branches or parameters alongside globally shared representations, balancing global generalization with local adaptation requirements [85]. Park [85] reviews industrial IoT-based DTs, where multimodal inputs $\{X^{(1)}, X^{(2)}, \ldots, X^{(V)}\}$ are integrated through tensor projections:
$$H = \sum_{v=1}^{V} W^{(v)} X^{(v)},$$
producing a unified latent representation $H$ for DT simulation and optimization. Zhang [86] extends this to missing-data scenarios, modeling conditional generation as
$$p(x \mid y, z) = \frac{p(y \mid x)\, p(z \mid x)\, p(x)}{\sum_{x} p(y \mid x)\, p(z \mid x)\, p(x)},$$
allowing DTs to operate reliably despite incomplete multimodal inputs.
(5) Heterogeneous model federation methods accommodate clients employing models with varying architectures and scales through parameter matching, interpolation, or representation fusion. Yu [87] proposed the CreamFL algorithm, integrating contrastive representations for knowledge transfer across heterogeneous clients. The contrastive loss function is
$$\mathcal{L}_{\mathrm{contrastive}} = -\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(h_i, h_i^{+})/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(h_i, h_j)/\tau\big)}.$$
Aggregation with alignment weights is expressed as
$$\theta_{\mathrm{global}} = \sum_{k=1}^{K} a_k \theta_k, \qquad \sum_{k=1}^{K} a_k = 1.$$
(6) Communication and efficiency optimization methods address communication and computational bottlenecks in large-scale multimodal federated learning through client selection, parameter sparsification, and model pruning strategies. The mmFedMC algorithm adopts joint modality-client selection based on Shapley values to reduce redundant communication [86]. Konečný et al. proposed sparsification and quantization schemes to alleviate communication burdens [88], retaining only the top-$p$ fraction of each client update:
$$\tilde{\theta}_k = \mathrm{Top}_p(\theta_k).$$
Wang et al. explored hierarchical aggregation and model compression strategies for enhanced training efficiency in large-scale multimodal tasks [89]. The hierarchical aggregation is
$$\theta_{\mathrm{cluster}} = \frac{1}{|C|} \sum_{k \in C} \theta_k, \qquad \theta_{\mathrm{global}} = \frac{1}{|G|} \sum_{C \in G} \theta_{\mathrm{cluster}},$$
where $C$ is a set of clients in a cluster, and $G$ is the set of clusters. Both steps are illustrated in the sketch following this list.
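As a minimal sketch of the mechanisms in items (2), (3), (5), and (6) above (feature translation for missing modalities, weighted parameter aggregation, top-$p$ sparsification, and hierarchical averaging), the following PyTorch snippet uses illustrative dimensions and weights throughout.

```python
import torch
import torch.nn as nn

class FeatureTranslator(nn.Module):
    """Item (2): bottleneck translator mapping available-modality features to a
    stand-in for a missing modality; widths are assumptions, trained with L_recon."""
    def __init__(self, d_src=256, d_tgt=256, d_bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_src, d_bottleneck), nn.ReLU(),
                                 nn.Linear(d_bottleneck, d_tgt))

    def forward(self, h):
        return self.net(h)

def weighted_aggregate(states, alphas):
    """Items (3) and (5): theta_global = sum_k alpha_k theta_k with sum(alpha) = 1."""
    assert abs(sum(alphas) - 1.0) < 1e-6
    return {key: sum(a * s[key].float() for a, s in zip(alphas, states))
            for key in states[0]}

def top_p_sparsify(update, p=0.1):
    """Item (6): keep the largest-magnitude fraction p of entries, zero the rest."""
    k = max(1, int(p * update.numel()))
    thresh = update.abs().flatten().topk(k).values.min()
    return torch.where(update.abs() >= thresh, update, torch.zeros_like(update))

def hierarchical_aggregate(clusters):
    """Item (6): average within each cluster, then across clusters."""
    cluster_means = [torch.stack(c).mean(dim=0) for c in clusters]  # theta_cluster
    return torch.stack(cluster_means).mean(dim=0)                   # theta_global
```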

4.2. GAI for Dynamic DT Evolution

GAI possesses the capability to learn underlying distributions of physical systems and generate realistic, diverse synthetic data for training dataset augmentation and extreme scenario simulation. However, feature-level data generation faces several bottlenecks, including constraints from cost limitations, environmental factors, and privacy requirements. Additionally, challenges such as insufficient consistency guarantees, limited interpretability, and inadequate capabilities for high-dimensional, multi-domain dynamic simulation in generative digital twins remain critical issues requiring further investigation.
Current research on integrating GAI into DT systems primarily focuses on two key domains: (i) data generation techniques that address data sparsity problems arising from imbalanced datasets or restricted data accessibility due to privacy concerns, and (ii) prediction techniques built upon generated data that enhance model reliability and generalization in complex scenarios.

4.2.1. Data Generation Techniques for DTs Enabled by GAI

DT mapping represents the core process of DT technology, wherein physical entity attributes, states, behaviors, and operational processes are accurately replicated in virtual space to construct highly synchronized and interactive digital replicas. However, practical constraints frequently lead to challenges including missing values, incomplete spatiotemporal feature coverage, and difficulties in multimodal data fusion. These issues often result in reduced prediction accuracy and increased modeling errors in DT systems. Recent research has proposed several effective strategies to address these limitations through joint distribution learning, personalized generation techniques, and minimal-feature hybrid generation methods.
Joint distribution learning addresses data sparsity challenges in complex environments by enabling GAI models to encode multimodal features and integrate cross-domain information, transforming heterogeneous, multidimensional data into unified feature-space representations. Zhang et al. proposed a traffic flow prediction model based on GANs, where the generator performs joint distribution learning by embedding numerical features (meteorological data) and categorical features (temporal information) into traffic time-series modeling, effectively mitigating data sparsity [90]. Similarly, Li et al. introduced FIMI, which is a GAI-enabled federated learning method that functions as a resource-aware data augmentation strategy, ensuring efficient federated model training [91].
Personalized generation techniques address prediction applicability across diverse scenarios by leveraging GAI’s ability to model complex data distributions, providing systematic solutions for building personalized DT models. Roberts et al. quantified individual differences in patient physiological characteristics and employed DT technology to capture these variations, enabling the simulation and evaluation of different treatment options for precision medicine applications [92]. Yang et al. proposed a generative simulation framework for tumor evolution based on diffusion models and temporal modeling, overcoming traditional method limitations in predicting heterogeneous tumor progression and treatment effects, achieving 91% simulation consistency [93].
Minimal-feature hybrid generation methods focus on generating anomalous samples through cross-sample and cross-modal feature mixing, providing solutions for virtual environment simulation and decision making. Swerdlow et al. proposed UniDisc, the first unified multimodal discrete diffusion model that employs discrete diffusion as a universal generative framework, iteratively denoising to produce anomalous samples [94]. This enables the generation of feature data in DT environments that closely aligns with physical entities, supporting high-fidelity simulation and decision making. The model demonstrated superior performance and inference efficiency compared to existing multimodal autoregressive models, enhancing training accuracy and decision reliability through high-quality multimodal data generation in realistic simulation scenarios.

4.2.2. Prediction Techniques for DTs Enabled by GAI

Missing values are prevalent in DT systems due to equipment failures, network interruptions, extreme weather conditions, and maintenance activities. GAI-empowered prediction techniques can effectively enhance preventive control across various operational states through early-warning prediction techniques, enhanced predictive models, and meta-learning training strategies.
Early-warning prediction techniques handle unexpected events by leveraging GAI to generate possible future scenarios within DT platforms, enabling proactive alerts and timely responses. Shao et al. employed GAI to generate synthetic datasets reflecting patient blood glucose decline trends and then trained DT models to support chronic disease early-warning systems, achieving 89.2% prediction accuracy [95]. Yang et al. developed a VAE–GAN collaborative framework where mathematical modeling governed the encoder–decoder process, enabling the generation of high-fidelity virtual waveforms consistent with real-world signals. The use of synthetic data with a privacy budget $\epsilon \leq 1$ significantly shortened experimental cycles [96].
Enhanced predictive models emphasize sampling strategies that highlight key effective data, increasing their relative representation in training to strengthen the models’ ability to learn and recognize crucial historical features. He et al. proposed ZOD-MC, which is a zero-order sampling method that does not assume log-concavity or isoperimetric inequalities of target distributions. Through diffusion-based sampling, they demonstrated that even sparse low-dimensional features remain efficient, improving prediction reliability from limited historical data [97]. Stoian et al. advanced this concept by transforming deep generative models (DGMs) for tabular data into constrained DGMs (C-DGMs), which automatically parse and integrate constraints into DGMs through dedicated layers, ensuring generated samples meet domain requirements while reducing resource consumption in DT prediction tasks [98].
Ren et al. designed a hybrid Transformer–diffusion architecture for feature fusion and traffic forecasting, where multimodal data integration achieved approximately 12–15% accuracy improvement over traditional models relying solely on historical traffic data [99]. Zhang et al. combined diffusion models’ probabilistic strengths in uncertainty quantification with temporal Transformers’ long-range dependency capture capabilities, embedding Graph Neural ODE-based traffic dynamics constraints to enable highly reliable traffic forecasts at 15–60 min granularities [100].
Meta-learning training strategies enable models to acquire generalized knowledge across tasks, facilitating rapid adaptation to new tasks and improving generalization under small-sample and dynamic scenarios. Wang et al. proposed MetaCRL, which is a meta-learning causal representation learner that addresses spurious correlations between task-specific causal factors and labels, mitigating negative knowledge transfer across tasks and enhancing model generalization [101].
Empirical evidence demonstrates that GAI’s integration capacity with multi-source data not only broadens predictive information scope but also enables collaborative modeling across multiple factors, ultimately improving DT model adaptability and reliability in complex environments.
By integrating generative models, digital twins are empowered to autonomously generate high-quality virtual samples even in the presence of incomplete data, thereby enabling dynamic modeling and predictive optimization. Through continuous learning of the physical system's operational patterns, GAI allows the real-time updating and adaptive reconstruction of the virtual model. This establishes a closed-loop mechanism of "perception–generation–feedback–optimization," transforming digital twins from static representations into intelligent systems capable of self-learning and continuous evolution. Consequently, GAI-driven digital twins achieve greater accuracy, adaptability, and autonomy in complex and changing environments.

4.3. Cloud–Edge–End Collaborative Intelligence

In AIoT systems, cloud–edge–end collaborative intelligence integrates the capabilities of the cloud, edge nodes, and terminal devices. The cloud provides powerful computing and global optimization, edge nodes offer low-latency services, and terminal devices contribute real-time sensing and flexible execution. Together, these elements form an efficient and resilient cross-layer collaborative architecture [102], which is crucial for addressing key challenges in the AIoT, including significant spatiotemporal variability, stringent latency and security requirements, and the processing of large-scale data. Through hierarchical collaboration, this architecture mitigates the constraints on computational and storage resources at terminal devices and overcomes network bandwidth bottlenecks, significantly enhancing system reliability and flexibility in complex and dynamic environments. In recent years, research on cloud–edge–end collaborative optimization has primarily focused on two key directions. The first is the lightweight deployment of models, which is aimed at overcoming resource constraints on edge and terminal devices. The second is resource scheduling and task allocation, focusing on efficient cross-layer collaboration [103].

4.3.1. Lightweight Model Deployment

For cloud–edge–end collaborative intelligence, the computational, storage, and energy capacities of edge nodes and terminal devices are limited. The direct deployment of conventional large-scale models on such devices often results in significant resource constraints and increased latency [104]. For multimodal generative models and DT systems, lightweight deployment is crucial for supporting real-time inference and generation tasks in AIoT environments [105]. In recent years, a number of effective strategies have been introduced, including pruning, quantization, and knowledge distillation.
Pruning is a widely used technique for model compression and acceleration [106]. Its core idea is to remove redundant weights or connections within neural networks, reducing computational complexity and storage overhead while maintaining model accuracy. Common approaches include structured pruning and unstructured pruning. The former eliminates structural units, such as convolutional kernels or channels, to improve execution efficiency on edge devices [107]. Channel pruning, as an important form of structured pruning, can quantify its optimization of the computational complexity of convolutional layers through the change in FLOPs formula [108]:
$$\mathrm{FLOPs}_{\mathrm{pruned}} = 2 \cdot H_o \cdot W_o \cdot C_i \cdot C_o \cdot K_h \cdot K_w,$$
where $H_o$ and $W_o$ denote the height and width of the output feature map, $C_i$ and $C_o$ represent the numbers of input and output channels after pruning, and $K_h$ and $K_w$ are the height and width of the convolutional kernel. The coefficient 2 accounts for the multiplication and addition operations in the convolution.
The latter progressively prunes individual parameters according to their importance, allowing more flexible compression, although it imposes higher hardware demands [109]. As a key approach for model lightweighting, pruning achieves an effective balance between performance and resource consumption. Su et al. proposed a dynamic structured pruning approach [110], which identifies neuron importance through group-sparsity regularization and removes redundant neurons during training using dynamic thresholds. Experimental results demonstrated that this approach achieved up to 93% neuron compression and 96% weight compression with only minor performance degradation. Lu et al. introduced a sensitivity-based pruning framework [111]. The framework prunes redundant filters by evaluating the impact of convolutional layers on inference accuracy, thereby reducing both computational and storage overhead. Experiments conducted on VGG-16, ResNet, and multiple datasets validated its effectiveness. The results demonstrated that redundant structures can be identified and removed even in the early stages of training, significantly reducing resource consumption while maintaining model performance.
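As a quick numerical check of the channel-pruning FLOPs formula above, the snippet below compares a convolutional layer before and after pruning; the layer dimensions are arbitrary examples.

```python
def conv_flops(h_out, w_out, c_in, c_out, k_h, k_w):
    """FLOPs of a conv layer; the factor 2 counts one multiply and one add per MAC."""
    return 2 * h_out * w_out * c_in * c_out * k_h * k_w

# Pruning a 3x3 layer on a 56x56 feature map from 64->128 channels to 48->96 channels.
before = conv_flops(56, 56, 64, 128, 3, 3)
after = conv_flops(56, 56, 48, 96, 3, 3)
print(f"FLOPs reduced by {1 - after / before:.1%}")  # ~43.8%
```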
Quantization is a widely adopted technique for model compression and acceleration. It reduces storage requirements, computational load, and energy consumption by mapping high-precision floating-point parameters of neural networks to low-bit representations [112], such as INT8. A typical quantization function can be expressed as [113]
$$q(r; a, b, n) = \left\lfloor \frac{\mathrm{clamp}(r; a, b) - a}{s(a, b, n)} \right\rceil \cdot s(a, b, n) + a,$$
where $\mathrm{clamp}(r; a, b) = \min(\max(r, a), b)$ restricts the value range, $s(a, b, n) = \frac{b - a}{n - 1}$ denotes the quantization step size, $\left\lfloor \cdot \right\rceil$ denotes rounding to the nearest level, and $n$ is the number of quantization levels.
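This quantization function maps values onto a uniform grid and can be implemented directly; the sketch below shows an INT8-style 256-level example.

```python
import numpy as np

def quantize(r, a, b, n):
    """Uniform quantization q(r; a, b, n): clamp, snap to the step grid, shift back."""
    s = (b - a) / (n - 1)              # step size s(a, b, n)
    clamped = np.clip(r, a, b)         # clamp(r; a, b)
    return np.round((clamped - a) / s) * s + a

x = np.array([-1.7, 0.03, 0.5, 2.4])
print(quantize(x, a=-1.0, b=1.0, n=256))  # values snapped to a 256-level grid
```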
Importantly, quantization does not necessitate modifications to the network architecture. Moreover, it can exploit hardware support for low-bit operations to accelerate inference. In recent years, mixed-precision quantization has emerged as a mainstream approach. It preserves high precision in critical layers while assigning low-bit representations to redundant or less sensitive layers [114,115]. This strategy achieves further model compression while maintaining overall performance. Mixed-precision quantization has shown notable advantages in multimodal generative models and in deployments on resource-constrained devices. It enables large-scale models to operate efficiently with reduced power consumption and latency. Peng et al. proposed a crossbar-aware mixed-precision quantization approach designed for RRAM accelerators [116]. By utilizing group-wise quantization and precision search strategies, their approach improves inference accuracy and robustness against noise while achieving substantial resource savings. In comparison with fixed-precision quantization approaches, it exhibits superior performance. Nahsung Kim et al. introduced a mixed-precision quantization technique that allocates additional quantization bits to critical weights during training and adjusts the learning rate [117]. This approach mitigates quantization loss and enables multi-bit representations (two to four bits) for both weights and activations. Experimental results demonstrate that the proposed approach significantly reduces storage and computational overhead while preserving model accuracy. This provides effective support for the lightweight deployment of models on resource-constrained devices.
Knowledge distillation (KD) is a widely recognized technique for model compression and acceleration [118,119]. Its fundamental principle is to transfer the knowledge, feature representations, or output distributions from a teacher model to a lightweight student model. This approach reduces model size and computational cost while maintaining performance to the greatest extent possible. The classic distillation loss can be formulated as shown below [120]:
$$\mathcal{L}_{\mathrm{KD}} = (1 - \alpha)\, \mathcal{L}_{\mathrm{CE}}(y, z_s) + \alpha T^2\, \mathrm{KL}\big( \sigma(z_t / T) \,\big\|\, \sigma(z_s / T) \big),$$
where $z_t$ and $z_s$ denote the logits of the teacher and student models, $\sigma$ is the softmax function, $T$ is the temperature parameter, and $\alpha$ is the weight factor balancing the cross-entropy loss and the distillation loss.
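A compact PyTorch rendering of this loss is given below; the temperature and weighting values are illustrative, and the logits are assumed to come from already-defined teacher and student networks.

```python
import torch
import torch.nn.functional as F

def kd_loss(z_s, z_t, y, T=4.0, alpha=0.7):
    """(1 - alpha) * CE(y, z_s) + alpha * T^2 * KL(sigma(z_t/T) || sigma(z_s/T))."""
    ce = F.cross_entropy(z_s, y)
    kl = F.kl_div(F.log_softmax(z_s / T, dim=-1),   # student in log space
                  F.softmax(z_t / T, dim=-1),       # teacher as target distribution
                  reduction="batchmean")
    return (1 - alpha) * ce + alpha * T * T * kl
```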
Through knowledge distillation, the student model can acquire the capabilities of the teacher model in single-modal tasks. Furthermore, it demonstrates enhanced generalization in scenarios involving multimodal inputs, complex environments, or varying data distributions. This technique enables lightweight models to efficiently perform tasks that would otherwise rely on large-scale models. Tu et al. proposed a dynamic knowledge distillation (DKD) approach that enables more efficient knowledge transfer through interactive learning between teacher and student networks [121]. The approach enhances the performance of the student model while maintaining lightweight characteristics, striking a balance between compression and accuracy that supports deployment in resource-constrained environments. Gou et al. introduced a feedforward–feedback knowledge distillation approach, where bidirectional interaction between the teacher and student models allows the lightweight student model to receive teacher feedback while acquiring knowledge. This process improves the student model's representational capacity and generalization ability. Experimental results demonstrate that the proposed approach effectively improves the performance of lightweight models while preserving accuracy, thereby supporting lightweight deployment in resource-limited environments [122].
In summary, pruning and quantization reduce the computational and storage overhead of neural networks at the structural and numerical representation levels, respectively. Knowledge distillation, by contrast, transfers the knowledge from a teacher model to preserve the representational capacity and generalization ability of lightweight models. The combined effect of these three techniques effectively supports the “cloud training–edge inference” deployment paradigm, enabling generative multimodal models to achieve efficient inference, low-latency responses, and energy-efficient operation in AIoT terminals and DT environments. Moreover, the integration of model lightweighting and knowledge transfer enhances the adaptability of models in complex environments, involving multimodal inputs and operating under resource-constrained conditions. This provides reliable support for edge intelligence applications and real-time generative tasks.

4.3.2. Intelligent Resource Allocation

In cloud–edge–end collaborative intelligence architectures, efficiently managing and scheduling computational, communication, and energy resources is a critical challenge for ensuring overall system performance. Traditional approaches typically include optimization theory, heuristic algorithms, and game-theoretic approaches [123,124], which can improve efficiency to some extent but often fall short in highly dynamic and complex AIoT environments. In recent years, increasing research efforts have been devoted to emerging approaches, including GAN-based scheduling, VAE/diffusion-model-driven scheduling, and reinforcement learning-enabled cross-layer scheduling.
Through adversarial training, GANs are capable of generating realistic and diverse network environment states as well as task demand distributions, thereby enabling more robust and generalizable scheduling strategies for resource allocation [125]. For example, Naeem et al. integrated a GAN with a Deep Distributional Q-Network (GAN-DDQN) to learn robust transmission scheduling policies in highly random and noisy Internet of Vehicles environments, achieving efficient resource allocation and network performance optimization under uncertain conditions [126]. In addition, Gu et al. proposed GANSlicing, a GAN-based software-defined mobile network slicing scheme that predicts service requests and resource demands for IoT applications, which enables dynamic slice allocation. Compared with conventional approaches, GANSlicing significantly improves resource utilization and enhances the Quality of Experience (QoE) for users [127].
VAE and diffusion models capture latent distributions of historical traffic and task patterns, enabling adaptive task and resource allocation. VAE is trained by maximizing the Evidence Lower Bound (ELBO) to map input data x to latent variables z, and its loss function is expressed as [68]:
$$\mathcal{L}_{VAE} = -\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] + D_{\mathrm{KL}}\big( q(z|x) \,\|\, p(z) \big),$$
where $q(z|x)$ is the variational distribution, $p(x|z)$ is the generative model, $p(z)$ is the prior, and $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence. This enables the VAE to learn latent representations for adaptive task and resource allocation, making it suitable for handling task demands with long-term dependencies and high diversity [128,129]. For instance, Li et al. combined a VAE with a DT framework to improve sensor state prediction and resource allocation [130]. Diffusion models generate data through iterative denoising. In the DDPM, the goal is to minimize the noise prediction error. The loss function is [32]
$$\mathcal{L} = \mathbb{E}_{q(x_t, x_{t-1})}\Big[ \big\| \epsilon_\theta(x_t, t) - \epsilon \big\|^2 \Big],$$
where $\epsilon_\theta(x_t, t)$ is the predicted noise, and $\epsilon$ is the actual noise. During generation, the DDPM performs reverse diffusion, starting from Gaussian noise $x_T$ and progressively recovering the data $x_0$. The process is given below [32]:
$$x_{t-1} = \frac{1}{\sqrt{1 - \beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right),$$
where $\beta_t$ is the noise schedule, and $\alpha_t = 1 - \beta_t$. This denoising process enables diffusion models to handle complex resource allocation scenarios. Liu et al. used diffusion models with Multi-Agent Reinforcement Learning (MARL) for task and resource demand prediction in vehicular networks [131]. Zhang et al. proposed a diffusion model-based reinforcement learning method (Meta-DSAC), which combines its generative capabilities with the decision-making advantages of RL to effectively address complex offloading and resource allocation problems in multi-UAV-assisted edge-enabled metaverse systems [132].
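Both objectives translate directly into code. The sketch below implements the negative ELBO (assuming a Gaussian decoder, so the reconstruction term reduces to a squared error) and a single DDPM reverse step; `eps_model` stands in for a trained noise predictor, and `betas` and `alpha_bars` are assumed to be 1-D tensors holding the schedule values.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Negative ELBO: reconstruction term plus closed-form KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

@torch.no_grad()
def ddpm_reverse_step(x_t, t, eps_model, betas, alpha_bars):
    """One reverse step x_t -> x_{t-1}; fresh noise is added for t > 0, as in
    standard DDPM sampling (the final step is deterministic)."""
    beta_t = betas[t]
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bars[t]) * eps_model(x_t, t)) \
           / torch.sqrt(1.0 - beta_t)
    if t == 0:
        return mean
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)
```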
Reinforcement learning can dynamically determine the allocation of resources among cloud, edge, and end devices. By incorporating constraints such as network conditions, task urgency, and service quality, the system can balance latency and energy consumption while improving the overall Quality of Service (QoS) [133,134,135]. To achieve this, reinforcement learning models typically utilize a reward function to evaluate system performance. Specifically, the reward function for resource allocation can be expressed as shown below [136]:
$$R = -\lambda_1 \cdot \mathrm{Latency} - \lambda_2 \cdot \mathrm{Energy} + \lambda_3 \cdot \mathrm{QoS},$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weight coefficients that balance the importance of latency, energy consumption, and the QoS. For instance, Guo et al. employed DT-enhanced federated reinforcement learning to dynamically allocate resources in Device-to-Device (D2D)-aided edge networks [137]. This approach supports cross-layer resource scheduling between end and edge devices, considering network conditions, task demands, and privacy preservation. Similarly, Peng et al. investigated resource management in vehicular networks assisted by MEC and UAV edge nodes using Multi-Agent Reinforcement Learning (MARL). By employing MADDPG, they achieved vehicle association and task resource allocation, reducing latency and improving QoS satisfaction [138].
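A reward of this form is simple to evaluate inside an RL environment step; in the sketch below, the sign convention (penalizing latency and energy while rewarding QoS) and the weight values are illustrative assumptions.

```python
def scheduling_reward(latency_ms, energy_j, qos, weights=(0.5, 0.3, 0.2)):
    """R = -lambda1 * Latency - lambda2 * Energy + lambda3 * QoS."""
    l1, l2, l3 = weights
    return -l1 * latency_ms - l2 * energy_j + l3 * qos

# Comparing two candidate offloading decisions for the same task:
print(scheduling_reward(latency_ms=12.0, energy_j=3.5, qos=0.9))  # offload to edge
print(scheduling_reward(latency_ms=45.0, energy_j=1.2, qos=0.7))  # run locally
```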
In summary, approaches such as GANs, VAE/diffusion models, and reinforcement learning provide new paradigms for resource allocation in cloud–edge–end collaborative intelligence. These approaches can adaptively schedule computational and communication resources in dynamic and complex environments, thus balancing latency, energy efficiency, and service quality, and significantly enhancing the overall system performance.

5. Application Scenarios

DT and GAI integration has shown transformative potential across multiple industrial sectors. To ground the SMGA framework in practice, this section examines real-world case studies and domain-specific insights, highlighting both applied outcomes and sectoral maturity trends.

5.1. Smart Manufacturing: Generative Design and Autonomous Optimization

Smart manufacturing is rapidly evolving through the integration of generative design and autonomous optimization, leveraging technologies such as generative AI, digital twins, and multi-agent systems. Generative AI models—including GANs, VAEs, and Transformers—are being used for tasks ranging from anomaly detection and process optimization to the accelerated design of advanced materials, significantly reducing development time and cost while improving product performance [139]. Autonomous intelligent manufacturing systems (AIMS) utilize data-driven approaches, knowledge graphs, and digital twins to enable multi-level perception, cross-domain cognition, and collaborative decision making, as demonstrated in intelligent factories for networked, collaborative manufacturing [140].
A concrete example can be found in the Programmable Manufacturing Advisor (PMA)-based Smart Production System, which autonomously diagnoses system health and designs optimal continuous improvement projects [141]. In an automotive underbody assembly line case, the PMA identified bottlenecks and proposed reducing the mean-time-to-repair (MTTR) of critical stations, leading to a validated 10% throughput increase when implemented. Similarly, in a hot-dip galvanization plant, the PMA optimized raw material release policies to balance throughput and lead time, demonstrating how digital twins coupled with automated decision making can achieve both productivity gains and quality control. These case studies show how generative AI can further enhance such systems by simulating alternative designs, proposing adaptive policies, and supporting human-in-the-loop decision making. Blockchain-enabled multi-agent systems further enhance resilience and adaptability by enabling decentralized, peer-to-peer negotiation and dynamic process control, which is crucial for individualized and frequently disrupted manufacturing environments [142]. The integration of these technologies within immersive industrial metaverse environments—combining IoT, real-time analytics, and extended reality—enables real-time production logistics, collaborative robotics, and advanced simulation, driving economic performance and innovation in smart factories.

5.2. Smart Cities: City-Scale Simulation and Emergency Response

Digital twins are increasingly central to smart city development, offering real-time simulation, monitoring, and predictive capabilities that enhance urban planning, emergency response, and risk management. These virtual replicas of city systems enable city-scale simulations for scenarios such as traffic regulation, epidemic control, and disaster management, allowing for more informed and adaptive decision making in emergencies [143]. Digital twin frameworks, like DUET and METACITIES, integrate diverse data sources and models to support what-if analyses, optimize traffic flow, and improve emergency management with ongoing projects demonstrating their potential in European cities.
Research highlights the benefits of digital twins in providing up-to-date information, supporting early warning systems, and enabling bidirectional data flow between physical and digital environments, which is especially valuable for climate resilience and disaster response. However, challenges remain in terms of data integration, real-time modeling, and the practical implementation of these systems at scale with much of the research still in the conceptual or pilot phase [144]. Despite these challenges, case studies and systematic reviews indicate that digital twins can significantly enhance situation assessment, coordination, and resource allocation during emergencies, and ongoing innovation is expected to address current research gaps.

5.3. Autonomous Driving: Lifelong Learning and Simulation Testing

Recent research highlights the growing importance of digital twins and advanced simulation in autonomous driving, particularly for lifelong learning, testing, and validation. Digital twins create high-fidelity virtual replicas of vehicles and environments, enabling safe, cost-effective, and comprehensive testing across the entire vehicle lifecycle, including complex scenarios that are difficult or unsafe to reproduce in the real world [145]. These virtual environments support reinforcement learning and transfer learning, helping to bridge the “sim2real” gap—the challenge of transferring knowledge from simulation to real-world driving—by integrating real and simulated data, domain randomization, and adaptation techniques.
Simulation-based testing, especially when enhanced with digital twins or even “digital siblings” (multi-simulator ensembles), improves the reliability and predictive value of autonomous driving software validation, outperforming single-simulator approaches in predicting real-world failures. Digital twins also facilitate hardware-in-the-loop and vehicle-in-the-loop testing, accelerating the development and deployment of robust control policies and reducing the need for extensive physical testing. However, challenges remain in ensuring the realism and predictive accuracy of digital twins as well as in standardizing their development and integration into industry practices [146]. Overall, digital twins are emerging as a practical and effective paradigm for lifelong learning, simulation testing, and validation in autonomous driving research and development.

5.4. Healthcare: Personalized Medicine and Surgical Planning

Digital twin technology is rapidly transforming healthcare by enabling highly personalized medicine and advanced surgical planning. Digital twins are virtual replicas of patients, integrating real-time data, multi-omics information, and predictive modeling to simulate disease progression, optimize treatment plans, and select the most effective therapies for individuals [147]. In surgical planning, digital twins provide accurate 3D representations of patient anatomy, allowing clinicians to rehearse procedures, enhance precision, and reduce risks using simulation and augmented reality tools. These technologies also support early diagnosis, tailored drug development, and remote patient monitoring, while improving clinical operations and resource allocation.
Such patient-specific DTs not only support therapy optimization but also enable surgical planning. For instance, a cardiac digital twin can generate a 3D replica of a patient’s heart and simulate surgical procedures—such as valve replacement or arrhythmia ablation—allowing surgeons to rehearse the procedure and assess risks before entering the operating room. In practice, these DTs can be linked with extended reality (XR) environments so that clinicians can visualize the patient’s anatomy in real time and adapt the surgical plan dynamically.
Furthermore, after clinical deployment, the DT is not static: remote monitoring feeds new data back into the twin, enabling longitudinal updates and adaptive care. For example, if a patient’s wearable devices detect abnormal heart rhythms or blood pressure fluctuations, the DT can re-run simulations to predict potential complications and trigger early interventions. In the context of digital clinical trials, DTs have been shown to serve as computational testbeds, where thousands of drug–patient interactions are evaluated in silico to prioritize the most promising candidates for actual trials, as emphasized by Laubenbacher et al. [74].
Despite their promise, digital twins face significant challenges, including data security, interoperability, ethical concerns, and the complexity of modeling the human body. Addressing these issues requires interdisciplinary collaboration, robust regulatory frameworks, and advances in artificial intelligence and data integration. Overall, digital twins are poised to revolutionize healthcare by delivering more precise, efficient, and patient-centered care, but widespread adoption will depend on overcoming technical, ethical, and organizational barriers [75].

5.5. Comparative Summary and Observations

Table 3 summarizes sectoral validation maturity. Manufacturing and healthcare lead adoption (empirical case studies, simulation validation), while smart city and autonomous driving remain under rapid development. Future work should emphasize benchmarking and cross-domain evaluation, ensuring that SMGA is validated not only conceptually but also through measurable performance metrics (e.g., latency, accuracy, adaptability).

6. Challenges and Future Research Directions

As the integration of AIoT and DT technologies accelerates, both academia and industry express high expectations for their transformative potential. Yet realizing this vision requires addressing a series of systemic challenges. On the technical side, the multimodal perception and generative capabilities of AIoT demand unprecedented levels of computation, storage, and energy efficiency while raising pressing concerns over reliability, privacy, and interoperability. On the research front, advancing toward verifiable, compressible, sustainable, continually adaptive, and ethically governed AIoT systems will be decisive for ensuring their long-term viability. This section reviews these challenges and outlines promising future research directions across two key dimensions: technical limitations and prospective solutions.

6.1. Technical Challenges

6.1.1. The Tension Between Computational Demand and Efficiency

The convergence of AIoT and digital twins has amplified the tension between computational demand and resource availability. Modern generative models require billions of parameters and vast memory bandwidth, which stands in stark contrast to the limitations of edge devices. A comprehensive survey on TinyML shows that ultra-low-power microcontrollers are capable of inference for extended periods—even up to a year on coin-cell batteries [148]. While these achievements highlight the promise of extreme energy efficiency, they are inherently insufficient to support large multimodal generative models. Similarly, analyses of IoT programming platforms emphasize that fragmentation in development tools and runtime environments exacerbates deployment inefficiency [149]. This dual challenge of model scale and ecosystem fragmentation not only inflates latency but also aggravates energy costs, making sustainable deployment a pressing issue. Moving forward, achieving balance will require joint advances in model compression, specialized hardware accelerators, and algorithm–architecture co-design.

6.1.2. Reliability and Hallucination Risks

Efficiency alone cannot guarantee a trustworthy AIoT. The reliability of generative models has emerged as an equally critical concern. Large language and vision models frequently produce hallucinations—outputs that are fluent yet factually incorrect. A systematic survey highlights that such hallucinations are pervasive across tasks, severely undermining trust in generative AI systems [150]. In sensitive domains such as healthcare, frameworks for evaluating medical summarization reveal that hallucinations remain even with carefully designed prompts, underscoring the inadequacy of ad hoc fixes [151]. Complementary approaches, such as semantic entropy-based uncertainty estimators, provide a way to flag questionable outputs before they reach users, but practical deployment in real-time systems remains difficult [152]. This persistence of hallucinations shows that reliability cannot be treated as a post hoc correction; rather, it must be embedded in the very design of AIoT systems. For digital twins and multimodal AI, integrating formal verification, symbolic constraints, and human-in-the-loop mechanisms may be necessary to ensure outputs align with both physical laws and application safety.

6.1.3. Data Security and Privacy

Data security and privacy represent fundamental challenges for the adoption of DT–GAI in real-world scenarios. In domains such as healthcare, autonomous driving, and industrial IoT, the collection and modeling of large-scale multimodal data inevitably involve highly sensitive information, ranging from personal health records and medical imaging to vehicle trajectories and enterprise operational data. The leakage or misuse of such information not only undermines user trust but may also result in severe regulatory and ethical consequences [153,154,155,156]. Therefore, ensuring confidentiality and privacy protection has become a prerequisite for the sustainable deployment of DT–GAI systems. Specific privacy-preserving mechanisms that address these challenges will be further discussed in the following subsection.

6.1.4. Privacy-Preserving Mechanisms for DT–GAI

Privacy protection is one of the key challenges in the integration of DT–GAI. In healthcare, autonomous driving, and the industrial IoT, large-scale multimodal data often include personal health records, driving trajectories, and enterprise operational data, which are highly sensitive. Such information not only concerns individual privacy but also directly affects regulatory compliance and public acceptance. Achieving data security without sacrificing intelligence has therefore become a prerequisite for DT–GAI deployment.
Among existing approaches, differential privacy (DP) is one of the most widely adopted techniques. By injecting noise into data or gradients, DP reduces the risk of re-identifying individual users and provides formal privacy guarantees. However, this protection usually comes at the cost of lower model accuracy, leading to a persistent privacy–utility trade-off. Balancing privacy budgets with performance has been identified as a key bottleneck [157]. To mitigate this challenge, adaptive clipping strategies have been developed to reduce accuracy loss [158], while new mechanisms at the gradient level improve robustness [159]. In addition, combining DP with synthetic data generation has been shown to preserve privacy while still enabling model training [160].
Complementary to DP, federated learning (FL) makes it possible to train models without centralizing raw data, relying instead on the exchange of local parameters or gradients. This design reduces privacy risks across institutions and devices but also introduces challenges such as modality heterogeneity, non-IID data distributions, and high communication overhead. To address these issues, FL frameworks tailored for healthcare data protection have been proposed [153], while strategies to defend against data reconstruction attacks during large language model fine tuning have been explored [154]. In IIoT scenarios, semi-supervised FL combined with Bayesian estimation has been applied to enhance both privacy and generalization under limited annotations [161], and personalized FL integrated with DP has been introduced to balance individual adaptation with global model sharing [162].
In healthcare, combining DP and FL allows hospitals to jointly train diagnostic models without exposing raw imaging or genomic data. In autonomous driving and IoT, performance-enhanced FL with DP improves privacy-preserving learning at the edge and facilitates the development of IIoT digital twins [163]. In public health, societal digital twins are considered effective only if they are privacy-aware [155], while privacy-preserving clinical digital twins are regarded as essential for compliance in critical care workflows [156].
Taken together, DP and FL offer complementary perspectives on privacy preservation: DP provides statistical guarantees through controlled noise, while FL reduces data movement by design. In practice, the two are often combined—for example, by integrating DP into FL updates to enhance robustness and defense. Looking ahead, promising directions include dynamic privacy budget allocation, the joint consideration of privacy with fairness and robustness, and integration with secure multi-party computation and homomorphic encryption. These advances will be vital for ensuring both security and compliance, paving the way for the trustworthy deployment of DT–GAI in high-stakes domains such as healthcare, transportation, and industry.

6.1.5. Lack of Standardization and Interoperability

Challenges in privacy and security are further compounded by fragmentation in standards. Surveys on IoT interoperability emphasize that security and interoperability must be co-designed; otherwise, neglecting one aspect creates systemic vulnerabilities. Studies on digital twins reveal that most current platforms only enable data-level exchange, lacking semantic consistency and collaborative capabilities [164]. In the industrial context, building high-fidelity digital twins is shown to require the integration of AI, blockchain, and distributed technologies, which further complicates standardization efforts [165]. Without unified protocols, interfaces, and semantic models, AIoT ecosystems risk remaining siloed, preventing the realization of their full potential.

6.2. Future Research Directions

6.2.1. Neuro-Symbolic Verified Generation

A promising research direction lies in combining the generative power of neural networks with the logical rigor of symbolic methods. Formal verification frameworks demonstrate that neuro-symbolic systems can map outputs into temporal logic or automata, enabling guarantees that generated content aligns with physical rules and safety requirements. For example, evaluations of text-to-video models using such methods show that outputs can be systematically checked against predefined constraints [166]. At the same time, consistency models provide computationally efficient alternatives to diffusion processes, achieving high-quality synthesis with fewer steps [167]. The integration of these paradigms suggests that neuro-symbolic generation could simultaneously mitigate hallucinations and enhance verifiability in safety-critical AIoT applications.

6.2.2. GAI for DT Compression

To complement verification, another research frontier focuses on resource efficiency. High-fidelity DTs demand heavy computation, limiting their deployment in edge environments. GAI offers promising solutions by enabling lightweight yet accurate DT models. One-step diffusion distillation reduces the number of iterative steps while maintaining quality, significantly cutting computation costs [168]. Similarly, deep equilibrium approaches enable the distillation of large diffusion models into efficient generators [169]. Singular value scaling further refines pruned models, accelerating fine tuning while preserving fidelity [170]. Beyond these techniques, a comprehensive survey emphasizes that GAI can serve not only as a compression tool but also as a means to generate synthetic training data and modular DT components [171]. These developments suggest that lightweight DTs empowered by GAI could soon become viable for real-time AIoT applications.

6.2.3. Toward Sustainable and Green AIoT

The sustainability of an intelligent AIoT is increasingly recognized as a priority. The exponential growth of deep learning has raised serious environmental concerns. Analyses of computational trends reveal that the carbon footprint of training large models has grown dramatically over the past decade, calling for new evaluation metrics where energy efficiency is valued alongside accuracy [172]. Complementary reviews of green AI research highlight multi-level solutions, ranging from algorithmic efficiency to eco-aware hardware and scheduling [173]. For the AIoT, this means not only developing low-power architectures but also embedding sustainability into the design of training and inference pipelines. By prioritizing energy-aware optimization, intelligent AIoT systems can evolve in ways that reduce emissions while maintaining high utility.

6.2.4. Lifelong and Continual Learning in Dynamic Environments

The pursuit of sustainability also aligns with the need for adaptability. AIoT systems often operate in dynamic environments, requiring continuous adaptation without catastrophic forgetting. Surveys on continual learning emphasize mechanisms such as rehearsal, regularization, and dynamic architectures as effective strategies for classification tasks [174]. Expanding on this, incremental learning frameworks demonstrate how retrievable skill libraries allow systems to adapt to novel tasks with minimal retraining [175]. In robotic reinforcement learning, approaches that preserve and combine prior knowledge with new experience enable continual evolution across tasks, highlighting pathways for long-term adaptability [176]. Generative replay further contributes to stability with recent work showing its effectiveness in security domains such as malware classification [177]. Together, these advances suggest that future AIoT systems can achieve resilience through continual adaptation, maintaining performance in ever-changing physical environments.

6.2.5. Ethical and Governance Frameworks for Intelligent AIoT

Finally, technological advances must be grounded in ethical and governance principles. Reviews of responsible AI emphasize that governance must cover fairness, explainability, and accountability across the full lifecycle of system design and deployment [178]. Studies in computing ethics indicate that as digital technologies become increasingly embedded in everyday activities, ethical considerations must evolve from theoretical discourse to practical, implementable frameworks. For the AIoT, this implies the need for context-specific governance frameworks that reconcile innovation with societal trust. Building such frameworks will ensure that the AIoT is not only technically feasible but also aligned with human values.

7. Conclusions

This survey established a symbiotic integration of multimodal GAI and DTs for AIoT systems. GAI acts as the cognitive core, enabling reasoning, contextual interpretation, and adaptive generation, while DTs provide a physics-grounded execution environment that ensures operational fidelity. Their complementarity is synthesized into a closed-loop Sense–Map–Generate–Act (SMGA) architecture (RQ1). We further highlighted domain-specific strengths and common limitations of representative approaches such as GANs, diffusion models, and LLMs, and we discussed recurring challenges in efficiency, reliability, privacy, and interoperability across application domains (RQ2). For RQ3, we consolidated evaluation practices into functional/probabilistic models and mathematical indicators, including robustness, complexity, accuracy, and twin–reality consistency. Despite persistent challenges in data integrity, scalability, and trustworthiness, this integration offers a foundational blueprint for a trustworthy AIoT with future work focusing on privacy-preserving computation and sustainability-aware optimization.

Author Contributions

Conceptualization, X.L.; Data curation, X.L. and A.W.; Investigation, A.W. and X.Z.; Writing—original draft, A.W., X.Z., K.H., L.C., S.W. and Y.C.; Writing—review and editing, X.L., X.Z., K.H., L.C., S.W. and Y.C.; Resources, Y.C.; Supervision, X.L. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62371116) and the Innovation Support Project for Postgraduates in Hebei Province (CXZZSS2025166).

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Xu, P.; Zhu, X.; Clifton, D.A. Multimodal Learning with Transformers: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12113–12132. [Google Scholar] [CrossRef]
  2. Zhu, Y.; Wu, Y.; Sebe, N.; Yan, Y. Vision + X: A Survey on Multimodal Learning in the Light of Data. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9102–9122. [Google Scholar] [CrossRef]
  3. Zhao, F.; Zhang, C.; Geng, B. Deep Multimodal Data Fusion. ACM Comput. Surv. 2024, 56, 216. [Google Scholar] [CrossRef]
  4. Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; Chen, E. A survey on multimodal large language models. Natl. Sci. Rev. 2024, 11, nwae403. [Google Scholar] [CrossRef]
  5. Cao, Y.; Li, S.; Liu, Y.; Yan, Z.; Dai, Y.; Yu, P.; Sun, L. A Survey of AI-Generated Content (AIGC). ACM Comput. Surv. 2025, 57, 125. [Google Scholar] [CrossRef]
  6. Wu, Y.; Zhang, K.; Zhang, Y. Digital Twin Networks: A Survey. IEEE Internet Things J. 2021, 8, 13789–13804. [Google Scholar] [CrossRef]
  7. Liu, X.; Jiang, D.; Tao, B.; Xiang, F.; Jiang, G.; Sun, Y.; Kong, J.; Li, G. A Systematic Review of Digital Twin about Physical Entities, Virtual Models, Twin Data, and Applications. Adv. Eng. Inform. 2023, 55, 101876. [Google Scholar] [CrossRef]
  8. Bibri, S.E.; Huang, J.; Jagatheesaperumal, S.K.; Krogstie, J. The Synergistic Interplay of Artificial Intelligence and Digital Twin in Environmentally Planning Sustainable Smart Cities: A Comprehensive Systematic Review. Environ. Sci. Ecotechnol. 2024, 20, 100433. [Google Scholar] [CrossRef]
  9. Qin, B.; Pan, H.; Dai, Y.; Si, X.; Huang, X.; Yuen, C.; Zhang, Y. Machine and Deep Learning for Digital Twin Networks: A Survey. IEEE Internet Things J. 2024, 11, 34694–34716. [Google Scholar] [CrossRef]
  10. Pan, Y.; Lei, L.; Shen, G.; Zhang, X.; Cao, P. A Survey on Digital Twin Networks: Architecture, Technologies, Applications, and Open Issues. IEEE Internet Things J. 2025, 12, 19119–19143. [Google Scholar] [CrossRef]
  11. Glaessgen, E.; Stargel, D. The digital twin paradigm for future NASA and US Air Force vehicles. In Proceedings of the 53rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference, Honolulu, HI, USA, 23–26 April 2012; p. 1818. [Google Scholar] [CrossRef]
  12. Ariyachandra, M.R.M.F.; Wedawatta, G. Digital Twin Smart Cities for Disaster Risk Management: A Review of Evolving Concepts. Sustainability 2023, 15, 11910. [Google Scholar] [CrossRef]
  13. Adibfar, A.; Costin, A.M. Creation of a Mock-up Bridge Digital Twin by Fusing Intelligent Transportation Systems (ITS) Data into Bridge Information Model (BrIM). J. Constr. Eng. Manag. 2022, 148, 04022094. [Google Scholar] [CrossRef]
  14. Mulder, S.T.; Omidvari, A.H.; Rueten-Budde, A.J.; Huang, P.H.; Kim, K.H.; Bais, B.; Rousian, M.; Hai, R.; Akgun, C.; van Lennep, J.R. Dynamic Digital Twin: Diagnosis, Treatment, Prediction, and Prevention of Disease During the Life Course. J. Med. Internet Res. 2022, 24, e35675. [Google Scholar] [CrossRef] [PubMed]
  15. Li, L.; Qu, T.; Liu, Y.; Zhong, R.Y.; Xu, G.; Sun, H.; Gao, Y.; Lei, B.; Mao, C.; Pan, Y. Sustainability Assessment of Intelligent Manufacturing Supported by Digital Twin. IEEE Access 2020, 8, 174988–175008. [Google Scholar] [CrossRef]
  16. Tao, F.; Zhang, H.; Liu, A.; Nee, A.Y.C. Digital twin in industry: State-of-the-art. IEEE Trans. Ind. Inform. 2019, 15, 2405–2415. [Google Scholar] [CrossRef]
  17. Zhao, Y.; Liu, Y.; Mu, E. A review of intelligent subway tunnels based on digital twin technology. Buildings 2024, 14, 2452. [Google Scholar] [CrossRef]
  18. Guzina, L.; Ferko, E.; Bucaioni, A. Investigating digital twin: A systematic mapping study. In Proceedings of the 10th Swedish Production Symposium (SPS2022), Västerås, Sweden, 7–10 September 2022; pp. 449–460. [Google Scholar] [CrossRef]
  19. Ding, G.; Guo, S.; Wu, X. Dynamic scheduling optimization of production workshops based on digital twin. Appl. Sci. 2022, 12, 10451. [Google Scholar] [CrossRef]
  20. Nascimento, F.H.N.; Cardoso, S.A.; Lima, A.M.N.; Santos, D.F.S. Synchronizing a collaborative arm’s digital twin in real-time. In Proceedings of the 2023 Latin American Robotics Symposium (LARS), 2023 Brazilian Symposium on Robotics (SBR), and 2023 Workshop on Robotics in Education (WRE), Salvador, Brazil, 9–11 October 2023; pp. 230–235. [Google Scholar] [CrossRef]
  21. Kritzinger, W.; Karner, M.; Traar, G.; Henjes, J.; Sihn, W. Digital Twin in Manufacturing: A Categorical Literature Review and Classification. IFAC-PapersOnLine 2018, 51, 1016–1022. [Google Scholar] [CrossRef]
  22. Fuller, A.; Fan, Z.; Day, C.; Barlow, C. Digital Twin: Enabling Technologies, Challenges and Open Research. IEEE Access 2020, 8, 108952–108971. [Google Scholar] [CrossRef]
  23. Qi, Q.; Tao, F. Digital Twin and Big Data Towards Smart Manufacturing and Industry 4.0: 360 Degree Comparison. IEEE Access 2018, 6, 3585–3593. [Google Scholar] [CrossRef]
  24. Okegbile, S.D.; Cai, J.; Niyato, D.; Yi, C. Human digital twin for personalized healthcare: Vision, architecture and future directions. IEEE Netw. 2022, 37, 262–269. [Google Scholar] [CrossRef]
  25. Uhlemann, T.H.-J.; Lehmann, C.; Steinhilper, R. The digital twin: Realizing the cyber-physical production system for industry 4.0. Procedia CIRP 2017, 61, 335–340. [Google Scholar] [CrossRef]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008. [Google Scholar]
  27. Huang, Y.; Xu, J.; Lai, J.; Jiang, Z.; Chen, T.; Li, Z.; Yao, Y.; Ma, X.; Yang, L.; Chen, H.; et al. Advancing transformer architecture in long-context large language models: A comprehensive survey. arXiv 2023, arXiv:2311.12351. [Google Scholar] [CrossRef]
  28. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NIPS 2014); Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2672–2680. [Google Scholar]
  29. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
  30. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar] [CrossRef]
  31. Müller-Franzes, G.; Niehues, J.M.; Khader, F.; Arasteh, S.T.; Haarburger, C.; Kuhl, C.; Wang, T.; Han, T.; Nolte, T.; Nebelung, S.; et al. A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis. Sci. Rep. 2023, 13, 12098. [Google Scholar] [CrossRef] [PubMed]
  32. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. arXiv 2020, arXiv:2006.11239. [Google Scholar] [CrossRef]
  33. Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021); Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 8780–8794. [Google Scholar]
  34. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
  35. Ramachandram, D.; Taylor, G.W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Process. Mag. 2017, 34, 96–108. [Google Scholar] [CrossRef]
  36. Baltrušaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef]
  37. Liang, P.P.; Zadeh, A.; Morency, L.-P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput. Surv. 2024, 56, 1–42. [Google Scholar] [CrossRef]
  38. Wu, R.; Wang, H.; Chen, H.-T.; Carneiro, G. Deep multimodal learning with missing modality: A survey. arXiv 2024, arXiv:2409.07825. [Google Scholar] [CrossRef]
  39. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual Event, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Brookline, MA, USA, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  40. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual Event, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Brookline, MA, USA, 2021; Volume 139, pp. 4904–4916. [Google Scholar]
  41. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; PMLR: Brookline, MA, USA, 2022; Volume 162, pp. 12888–12900. [Google Scholar]
  42. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022); Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 23716–23736. [Google Scholar]
  43. Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; PMLR: Brookline, MA, USA, 2022; Volume 162, pp. 23318–23340. [Google Scholar]
  44. Rubenstein, P.K.; Asawaroengchai, C.; Nguyen, D.D.; Bapna, A.; Borsos, Z.; de Chaumont Quitry, F.; Chen, P.; El Badawy, D.; Han, W.; Kharitonov, E.; et al. AudioPaLM: A large language model that can speak and listen. arXiv 2023, arXiv:2306.12925. [Google Scholar] [CrossRef]
  45. Tang, C.; Yu, W.; Sun, G.; Chen, X.; Tan, T.; Li, W.; Lu, L.; Ma, Z.; Zhang, C. SALMONN: Towards generic hearing abilities for large language models. arXiv 2023, arXiv:2310.13289. [Google Scholar] [CrossRef]
  46. Li, S.; Tang, H. Multimodal alignment and fusion: A survey. arXiv 2024, arXiv:2411.17040. [Google Scholar] [CrossRef]
  47. Li, Z.; Wu, X.; Du, H.; Liu, F.; Nghiem, H.; Shi, G. A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges. arXiv 2025, arXiv:2501.02189. [Google Scholar] [CrossRef]
  48. Wang, J.; Jiang, H.; Liu, Y.; Ma, C.; Zhang, X.; Pan, Y.; Liu, M.; Gu, P.; Xia, S.; Li, W. A comprehensive review of multimodal large language models: Performance and challenges across different tasks. arXiv 2024, arXiv:2408.01319. [Google Scholar] [CrossRef]
  49. Sloman, A. Multimodal cognitive architecture: Making perception more central to intelligent behavior. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), Boston, MA, USA, 16–20 July 2006; AAAI Press: Menlo Park, CA, USA, 2006; Volume 2, pp. 1488–1493. [Google Scholar]
  50. Zhang, Z.; Zhang, A.; Li, M.; Zhao, H.; Karypis, G.; Smola, A. Multimodal chain-of-thought reasoning in language models. arXiv 2023, arXiv:2302.00923. [Google Scholar] [CrossRef]
  51. Liu, Y.; Deng, Y.; Liu, A.; Liu, Y.; Li, S. Fine-grained multi-modal prompt learning for vision–language models. Neurocomputing 2025, 636, 130028. [Google Scholar] [CrossRef]
  52. Wu, J.; Zhang, Z.; Xia, Y.; Li, X.; Xia, Z.; Chang, A.; Yu, T.; Kim, S.; Rossi, R.A.; Zhang, R.; et al. Visual prompting in multimodal large language models: A survey. arXiv 2024, arXiv:2409.15310. [Google Scholar] [CrossRef]
  53. Gu, J.; Han, Z.; Chen, S.; Ma, Y.; Torr, P.; Tresp, V. A systematic survey of prompt engineering on vision-language foundation models. arXiv 2023, arXiv:2307.12980. [Google Scholar] [CrossRef]
  54. NVIDIA Corporation. Vision Language Model Prompt Engineering Guide for Image and Video Understanding; Technical Report; NVIDIA Developer Blog: Santa Clara, CA, USA, 2025; Available online: https://developer.nvidia.com/blog/vision-language-model-prompt-engineering-guide-for-image-and-video-understanding/ (accessed on 6 September 2025).
  55. Jiao, T.; Guo, C.; Feng, X.; Chen, Y.; Song, J. A comprehensive survey on deep learning multi-modal fusion: Methods, technologies and applications. Comput. Mater. Contin. 2024, 80, 1–35. [Google Scholar] [CrossRef]
  56. Kaur, M.J.; Mishra, V.P.; Maheshwari, P. The Convergence of Digital Twin, IoT, and Machine Learning: Transforming Data into Action. In Digital Twin Technologies and Smart Cities; Farsi, M., Daneshkhah, A., Hosseinian-Far, A., Jahankhani, H., Eds.; Springer: Cham, Switzerland, 2020. [Google Scholar] [CrossRef]
  57. Sahal, R.; Alsamhi, S.H.; Brown, K.N.; O’Shea, D.; McCarthy, C.; Guizani, M. Blockchain-Empowered Digital Twins Collaboration: Smart Transportation Use Case. Machines 2021, 9, 193. [Google Scholar] [CrossRef]
  58. Semeraro, C.; Lezoche, M.; Panetto, H.; Dassisti, M. Digital Twin Paradigm: A Systematic Literature Review. Comput. Ind. 2021, 130, 103469. [Google Scholar] [CrossRef]
  59. Minerva, R.; Lee, G.M.; Crespi, N. Digital Twin in the IoT Context: A Survey on Technical Features, Scenarios, and Architectural Models. Proc. IEEE 2020, 108, 1785–1824. [Google Scholar] [CrossRef]
  60. Liu, M.; Fang, S.; Dong, H.; Xu, C. Review of Digital Twin about Concepts, Technologies, and Industrial Applications. J. Manuf. Syst. 2021, 58, 346–361. [Google Scholar] [CrossRef]
  61. Barricelli, B.R.; Casiraghi, E.; Fogli, D. A Survey on Digital Twin: Definitions, Characteristics, Applications, and Design Implications. IEEE Access 2019, 7, 167653–167671. [Google Scholar] [CrossRef]
  62. Rasheed, A.; San, O.; Kvamsdal, T. Digital Twin: Values, Challenges and Enablers from a Modeling Perspective. IEEE Access 2020, 8, 21980–22012. [Google Scholar] [CrossRef]
  63. IBM Corporation. Will Generative AI Make the Digital Twin Promise Real in the Energy and Utilities Industry? Technical Report; IBM Blog: Armonk, NY, USA, 2023; Available online: https://www.ibm.com/blog/will-generative-ai-make-the-digital-twin-promise-real-in-the-energy-and-utilities-industry/ (accessed on 6 September 2025).
  64. Muhammad, K.; David, T.; Nassisid, G.; Kumar, A.; Singh, R. Integrating generative AI with network digital twins for enhanced network operations. arXiv 2024, arXiv:2406.17112. [Google Scholar] [CrossRef]
  65. Plain Concepts. Digital Twins and Generative AI: The Perfect Duo; Technical Insights Report; Plain Concepts: Madrid, Spain, 2025; Available online: https://www.plainconcepts.com/digital-twins-generative-ai/ (accessed on 6 September 2025).
  66. Ray, A. EdgeAgentX-DT: Integrating digital twins and generative AI for resilient edge intelligence in tactical networks. arXiv 2025, arXiv:2507.21196. [Google Scholar] [CrossRef]
  67. Nielsen Norman Group. Digital Twins: Simulating Humans with Generative AI; UX Research Report; Nielsen Norman Group: Fremont, CA, USA, 2025; Available online: https://www.nngroup.com/articles/digital-twins/ (accessed on 6 September 2025).
  68. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar] [CrossRef]
  69. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, 28 June–2 July 2011; Getoor, L., Scheffer, T., Eds.; Omnipress: Madison, WI, USA, 2011; pp. 689–696. [Google Scholar]
  70. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012); Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105. [Google Scholar]
  71. Graves, A.; Mohamed, A.-R.; Hinton, G.E. Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 6645–6649. [Google Scholar] [CrossRef]
  72. Tan, H.; Bansal, M. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 5099–5109. [Google Scholar] [CrossRef]
  73. Cavalieri, S.; Gambadoro, S. Proposal of mapping digital twins definition language to open platform communications unified architecture. Sensors 2023, 23, 2349. [Google Scholar] [CrossRef]
  74. Laubenbacher, R.C.; Mehrad, B.; Shmulevich, I.; Trayanova, N.A. Digital twins in medicine. Nat. Comput. Sci. 2024, 4, 184–191. [Google Scholar] [CrossRef]
  75. Vallée, A. Envisioning the future of personalized medicine: Role and realities of digital twins. J. Med. Internet Res. 2024, 26, e50204. [Google Scholar] [CrossRef]
  76. Russwinkel, N. A cognitive digital twin for intention anticipation in human-aware AI. In Intelligent Autonomous Systems 18; Strand, M., Dillmann, R., Menegatti, E., Lorenzo, S., Eds.; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2024; Volume 795, pp. 637–646. [Google Scholar] [CrossRef]
  77. Jaegle, A.; Borgeaud, S.; Alayrac, J.-B.; Doersch, C.; Ionescu, C.; Ding, D.; Koppula, S.; Zoran, D.; Brock, A.; Shelhamer, E.; et al. Perceiver IO: A general architecture for structured inputs & outputs. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual Event, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Brookline, MA, USA, 2021; Volume 139, pp. 5339–5350. [Google Scholar]
  78. Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–19. [Google Scholar] [CrossRef]
  79. Che, L.; Wang, J.; Zhou, Y.; Ma, F. Multimodal Federated Learning: A Survey. Sensors 2023, 23, 6986. [Google Scholar] [CrossRef]
  80. Chen, J.; Yi, J.; Chen, A.; Jin, Z. EFCOMFF-Net: A multiscale feature fusion architecture with enhanced feature correlation for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604917. [Google Scholar] [CrossRef]
  81. Zhou, H.; Wang, Y.; Zhan, H. MDE: Modality discrimination enhancement for multi-modal recommendation. arXiv 2025, arXiv:2502.18481. [Google Scholar] [CrossRef]
  82. Poudel, P.; Chhetri, A.; Gyawali, P.; Leontidis, G.; Bhattarai, B. Multimodal Federated Learning with Missing Modalities through Feature Imputation Network. arXiv 2025, arXiv:2505.20232. [Google Scholar] [CrossRef]
  83. Bao, G.; Zhang, Q.; Miao, D.; Gong, Z.; Hu, L.; Liu, Y.; Shi, C. Multimodal Federated Learning with Missing Modality via Prototype Mask and Contrast. arXiv 2023, arXiv:2312.13508. [Google Scholar] [CrossRef]
  84. Huy, Q.L.; Nguyen, M.N.H.; Thwal, C.M.; Qiao, Y.; Zhang, C.; Hong, C.S. FedMEKT: Distillation-based Embedding Knowledge Transfer for Multimodal Federated Learning. Neural Netw. 2025, 183, 107017. [Google Scholar] [CrossRef]
  85. Park, S.; Kim, C.; Youm, S. Establishment of an IoT-based smart factory and data analysis model for the quality management of SMEs die-casting companies in Korea. Int. J. Distrib. Sens. Netw. 2019, 15, 1550147719879378. [Google Scholar] [CrossRef]
  86. Zhang, Y.; Du, Q.; Lv, J. FedEPA: Enhancing personalization and modality alignment in multimodal federated learning. In Intelligent Computing: Proceedings of the 2025 Computing Conference, London, UK, 19–20 June 2025; Arai, K., Ed.; Lecture Notes in Networks and Systems; Springer: Singapore, 2025; Volume 1017, pp. 115–124. [Google Scholar] [CrossRef]
  87. Yu, Q.; Liu, Y.; Wang, Y.; Xu, K.; Liu, J. Multimodal Federated Learning via Contrastive Representation Ensemble (CreamFL). arXiv 2023, arXiv:2302.08888. [Google Scholar] [CrossRef]
  88. Konečný, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv 2016, arXiv:1610.05492. [Google Scholar] [CrossRef]
  89. Wang, H.; Kaplan, Z.; Niu, D.; Li, B. Optimizing Federated Learning on Non-IID Data with Reinforcement Learning. In Proceedings of the IEEE INFOCOM 2020—IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1698–1707. [Google Scholar] [CrossRef]
  90. Zhang, L.; Wu, J.; Shen, J.; Chen, M.; Wang, R.; Zhou, X.; Wu, Q. SATP-GAN: Self-attention based generative adversarial network for traffic flow prediction. Transp. B Transp. Dyn. 2021, 9, 552–568. [Google Scholar] [CrossRef]
  91. Li, P.; Zhang, H.; Wu, Y.; Qian, L.; Yu, R.; Niyato, D.; Shen, X. Filling the Missing: Exploring Generative AI for Enhanced Federated Learning Over Heterogeneous Mobile Edge Devices. IEEE Trans. Mob. Comput. 2024, 23, 10001–10015. [Google Scholar] [CrossRef]
  92. Roberts, M.C.; Holt, K.E.; Del Fiol, G.; Kohlmann, W.; Shirts, B.H. Precision public health in the era of genomics and big data. Nat. Med. 2024, 30, 1865–1873. [Google Scholar] [CrossRef]
  93. Yang, Y.; Wang, Z.-Y.; Liu, Q.; Sun, S.; Wang, K.; Chellappa, R.; Zhou, Z.; Yuille, A.; Zhu, L.; Zhang, Y.-D. Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning. arXiv 2025, arXiv:2506.02327. [Google Scholar] [CrossRef]
  94. Swerdlow, A.; Prabhudesai, M.; Gandhi, S.; Pathak, D.; Fragkiadaki, K. Unified Multimodal Discrete Diffusion. arXiv 2025, arXiv:2503.20853. [Google Scholar] [CrossRef]
  95. Shao, J.; Pan, Y.; Kou, W.B.; Feng, H.; Zhao, Y.; Zhou, K.; Zhong, S. Generalization of a deep learning model for continuous glucose monitoring-based hypoglycemia prediction: Algorithm development and validation study. JMIR Med. Inform. 2024, 12, e56909. [Google Scholar] [CrossRef]
  96. Yang, Y.; Lan, T.; Wang, Y.; Li, F.; Liu, L.; Huang, X.; Gao, F.; Jiang, S.; Zhang, Z.; Chen, X. Data imbalance in cardiac health diagnostics using CECG-GAN. Sci. Rep. 2024, 14, 14767. [Google Scholar] [CrossRef]
  97. He, Y.; Rojas, K.; Tao, M. Zeroth-order sampling methods for non-log-concave distributions: Alleviating metastability by denoising diffusion. arXiv 2024, arXiv:2402.17886. [Google Scholar] [CrossRef]
  98. Stoian, M.C.; Dyrmishi, S.; Cordy, M.; Lukasiewicz, T.; Giunchiglia, E. How realistic is your synthetic data? Constraining deep generative models for tabular data. arXiv 2024, arXiv:2402.04823. [Google Scholar] [CrossRef]
  99. Ren, F.; Aliper, A.; Chen, J.; Zhao, H.; Rao, S.; Kuppe, C.; Ozerov, I.V.; Peng, H.; Zhavoronkov, A. A small-molecule TNIK inhibitor targets fibrosis in preclinical and clinical models. Nat. Biotechnol. 2024, 42, 63–75. [Google Scholar] [CrossRef]
  100. Zhang, K.; Zhou, F.; Wu, L.; Xie, N.; He, Z. Semantic Understanding and Prompt Engineering for Large-Scale Traffic Data Imputation. Inf. Fusion 2024, 102, 102038. [Google Scholar] [CrossRef]
  101. Wang, J.; Ren, Y.; Song, Z.; Zhang, J.; Zheng, C.; Qiang, W. Hacking task confounder in meta-learning. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI-24), Jeju, Republic of Korea, 3–9 August 2024; Bessiere, C., Ed.; IJCAI Organization: California, CA, USA, 2024; pp. 5064–5072. [Google Scholar] [CrossRef]
  102. Wang, Y.; Yang, C.; Lan, S.; Zhu, L.; Zhang, Y. End-edge-cloud collaborative computing for deep learning: A comprehensive survey. IEEE Commun. Surv. Tutor. 2024, 26, 2647–2683. [Google Scholar] [CrossRef]
  103. Fan, W.; Su, Y.; Liu, J.; Li, S.; Huang, W.; Wu, F.; Liu, Y. Joint Task Offloading and Resource Allocation for Vehicular Edge Computing Based on V2I and V2V Modes. IEEE Trans. Intell. Transp. Syst. 2023, 24, 4277–4292. [Google Scholar] [CrossRef]
  104. Hu, H.; Jiang, C. Edge Intelligence: Challenges and Opportunities. In Proceedings of the 2020 International Conference on Computer, Information and Telecommunication Systems (CITS), Hangzhou, China, 5–7 October 2020; pp. 1–5. [Google Scholar] [CrossRef]
  105. Navardi, M.; Aalishah, R.; Fu, Y.; Lin, Y.; Li, H.; Chen, Y.; Mohsenin, T. GenAI at the Edge: Comprehensive Survey on Empowering Edge Devices. Proc. AAAI Symp. Ser. 2025, 5, 180–187. [Google Scholar] [CrossRef]
  106. Cheng, H.; Zhang, M.; Shi, J.Q. A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10558–10578. [Google Scholar] [CrossRef]
  107. He, Y.; Xiao, L. Structured Pruning for Deep Convolutional Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2900–2919. [Google Scholar] [CrossRef]
  108. He, Y.; Zhang, X.; Sun, J. Channel Pruning for Accelerating Very Deep Neural Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1398–1406. [Google Scholar] [CrossRef]
  109. Wang, H.; Zhang, W.-Q. Unstructured Pruning and Low Rank Factorisation of Self-Supervised Pre-Trained Speech Models. IEEE J. Sel. Top. Signal Process. 2024, 18, 1046–1058. [Google Scholar] [CrossRef]
  110. Su, W.; Li, Z.; Xu, M.; Kang, J.; Niyato, D.; Xie, S. Compressing Deep Reinforcement Learning Networks with a Dynamic Structured Pruning Method for Autonomous Driving. IEEE Trans. Veh. Technol. 2024, 73, 18017–18030. [Google Scholar] [CrossRef]
  111. Lu, Y.; Guan, Z.; Zhao, W.; Gong, M.; Wang, W.; Sheng, K. SNPF: Sensitiveness-Based Network Pruning Framework for Efficient Edge Computing. IEEE Internet Things J. 2024, 11, 6972–6991. [Google Scholar] [CrossRef]
  112. Matos, J.B.P.; de Lima Filho, E.B.; Bessa, I.; Manino, E.; Song, X.; Cordeiro, L.C. Counterexample Guided Neural Network Quantization Refinement. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 1121–1134. [Google Scholar] [CrossRef]
  113. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713. [Google Scholar] [CrossRef]
  114. Tai, Y.-S.; Chang, C.-Y.; Teng, C.-F.; Chen, Y.-T.; Wu, A.-Y. Joint Optimization of Dimension Reduction and Mixed-Precision Quantization for Activation Compression of Neural Networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4025–4037. [Google Scholar] [CrossRef]
  115. Motetti, B.A.; Risso, M.; Burrello, A.; Macii, E.; Poncino, M.; Pagliari, D.J. Joint Pruning and Channel-Wise Mixed-Precision Quantization for Efficient Deep Neural Networks. IEEE Trans. Comput. 2024, 73, 2619–2633. [Google Scholar] [CrossRef]
  116. Peng, J.; Liu, H.; Zhao, Z.; Li, Z.; Liu, S.; Li, Q. CMQ: Crossbar-Aware Neural Network Mixed-Precision Quantization via Differentiable Architecture Search. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 4124–4133. [Google Scholar] [CrossRef]
  117. Kim, N.; Shin, D.; Choi, W.; Kim, G.; Park, J. Exploiting Retraining-Based Mixed-Precision Quantization for Low-Cost DNN Accelerator Design. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2925–2938. [Google Scholar] [CrossRef]
  118. Yang, S.; Xu, L.; Zhou, M.; Yang, X.; Yang, J.; Huang, Z. Skill-Transferring Knowledge Distillation Method. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6487–6502. [Google Scholar] [CrossRef]
  119. Gou, J.; Sun, L.; Yu, B.; Du, L.; Ramamohanarao, K.; Tao, D. Collaborative Knowledge Distillation via Multiknowledge Transfer. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 6718–6730. [Google Scholar] [CrossRef]
  120. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  121. Tu, Z.; Liu, X.; Xiao, X. A General Dynamic Knowledge Distillation Method for Visual Analytics. IEEE Trans. Image Process. 2022, 31, 6517–6531. [Google Scholar] [CrossRef]
  122. Gou, J.; Chen, Y.; Yu, B.; Liu, J.; Du, L.; Wan, S.; Yi, Z. Reciprocal Teacher-Student Learning via Forward and Feedback Knowledge Distillation. IEEE Trans. Multimed. 2024, 26, 7901–7916. [Google Scholar] [CrossRef]
  123. Li, S.; Li, S.; Sun, Y.; Wang, B.; Wang, B.; Zhang, B. Digital Twin-Assisted Computation Offloading and Resource Allocation for Multi-Device Collaborative Tasks in Industrial Internet of Things. IEEE Trans. Netw. Sci. Eng. 2025, 1–16. [Google Scholar] [CrossRef]
  124. Sun, W.; Wang, P.; Xu, N.; Wang, G.; Zhang, Y. Dynamic Digital Twin and Distributed Incentives for Resource Allocation in Aerial-Assisted Internet of Vehicles. IEEE Internet Things J. 2022, 9, 5839–5852. [Google Scholar] [CrossRef]
  125. Mehdipourchari, K.; Askarizadeh, M.; Nguyen, K.K. Shared-Resource Generative Adversarial Network (GAN) Training for 5G URLLC Deep Reinforcement Learning Augmentation. In Proceedings of the ICC 2024—IEEE International Conference on Communications, Denver, CO, USA, 9–13 June 2024; pp. 2998–3003. [Google Scholar] [CrossRef]
  126. Naeem, F.; Seifollahi, S.; Zhou, Z.; Tariq, M. A Generative Adversarial Network Enabled Deep Distributional Reinforcement Learning for Transmission Scheduling in Internet of Vehicles. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4550–4559. [Google Scholar] [CrossRef]
  127. Gu, R.; Zhang, J. GANSlicing: A GAN-Based Software Defined Mobile Network Slicing Scheme for IoT Applications. In Proceedings of the 2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; pp. 1–7. [Google Scholar] [CrossRef]
  128. Singh, P.; Hazarika, B.; Singh, K.; Huang, W.-J.; Duong, T.Q. Digital Twin-Assisted Adaptive Federated Multi-Agent DRL with GenAI for Optimized Resource Allocation in IoV Networks. In Proceedings of the 2025 IEEE Wireless Communications and Networking Conference (WCNC), Milan, Italy, 24–27 March 2025; pp. 1–6. [Google Scholar] [CrossRef]
  129. Fang, J.; He, Y.; Yu, F.R.; Du, J. Resource Allocation for Video Diffusion Task Offloading in Cloud-Edge Networks: A Deep Active Inference Approach. In Proceedings of the GLOBECOM 2024—2024 IEEE Global Communications Conference, Cape Town, South Africa, 8–12 December 2024; pp. 2021–2026. [Google Scholar] [CrossRef]
  130. Li, M.; Gao, J.; Zhou, C.; Zhao, L.; Shen, X. Digital-Twin-Empowered Resource Allocation for On-Demand Collaborative Sensing. IEEE Internet Things J. 2024, 11, 37942–37958. [Google Scholar] [CrossRef]
  131. Liu, Z.; Du, H.; Lin, J.; Gao, Z.; Huang, L.; Hosseinalipour, S.; Niyato, D. DNN Partitioning, Task Offloading, and Resource Allocation in Dynamic Vehicular Networks: A Lyapunov-Guided Diffusion-Based Reinforcement Learning Approach. IEEE Trans. Mob. Comput. 2025, 24, 1945–1962. [Google Scholar] [CrossRef]
  132. Zhang, Z.; Wang, J.; Chen, J.; Fu, H.; Tong, Z.; Jiang, C. Diffusion-Based Reinforcement Learning for Cooperative Offloading and Resource Allocation in Multi-UAV Assisted Edge-Enabled Metaverse. IEEE Trans. Veh. Technol. 2025, 74, 11281–11293. [Google Scholar] [CrossRef]
  133. Wang, L.; Liang, H.; Mao, G.; Zhao, D.; Liu, Q.; Yao, Y.; Zhang, H. Resource Allocation for Dynamic Platoon Digital Twin Networks: A Multi-Agent Deep Reinforcement Learning Method. IEEE Trans. Veh. Technol. 2024, 73, 15609–15620. [Google Scholar] [CrossRef]
  134. Tang, L.; Wang, A.; Xia, B.; Tang, Y.; Chen, Q. Research on Integrated Sensing, Communication Resource Allocation, and Digital Twin Placement Based on Digital Twin in IoV. IEEE Internet Things J. 2025, 12, 17300–17315. [Google Scholar] [CrossRef]
  135. Ji, B.; Dong, B.; Li, D.; Wang, Y.; Yang, L.; Tsimenidis, C.; Menon, V.G. Optimization of Resource Allocation for V2X Security Communication Based on Multi-Agent Reinforcement Learning. IEEE Trans. Veh. Technol. 2025, 74, 1849–1861. [Google Scholar] [CrossRef]
  136. Yin, J.; Zhang, Y.; Li, X.; Zhao, Y.; Liu, Z.; Tang, M. QoS-Aware Energy-Efficient Multi-UAV Offloading Ratio and Trajectory Control Algorithm in Mobile-Edge Computing. IEEE Internet Things J. 2024, 11, 40588–40602. [Google Scholar] [CrossRef]
  137. Guo, Q.; Tang, F.; Kato, N. Federated Reinforcement Learning-Based Resource Allocation for D2D-Aided Digital Twin Edge Networks in 6G Industrial IoT. IEEE Trans. Ind. Inf. 2023, 19, 7228–7236. [Google Scholar] [CrossRef]
  138. Peng, H.; Shen, X. Multi-Agent Reinforcement Learning Based Resource Management in MEC- and UAV-Assisted Vehicular Networks. IEEE J. Sel. Areas Commun. 2021, 39, 131–141. [Google Scholar] [CrossRef]
  139. Kusiak, A. Generative artificial intelligence in smart manufacturing. J. Intell. Manuf. 2025, 36, 1–3. [Google Scholar] [CrossRef]
  140. Wang, H.; Wang, C.; Liu, Q.; Zhang, X.; Liu, M.; Ma, Y.; Yan, F.; Shen, W. A data and knowledge driven autonomous intelligent manufacturing system for intelligent factories. J. Manuf. Syst. 2024, 74, 512–526. [Google Scholar] [CrossRef]
  141. Alavian, P.; Eun, Y.; Meerkov, S.; Zhang, L. Smart production systems: Automating decision-making in manufacturing environment. Int. J. Prod. Res. 2019, 58, 828–845. [Google Scholar] [CrossRef]
  142. Leng, J.; Sha, W.; Lin, Z.; Jing, J.; Liu, Q.; Chen, X. Blockchained smart contract pyramid-driven multi-agent autonomous process control for resilient individualised manufacturing towards Industry 5.0. Int. J. Prod. Res. 2022, 61, 4302–4321. [Google Scholar] [CrossRef]
  143. Lin, H.; Guo, R.; Ma, D.; Kuai, X.; Yuan, Z.; Du, Z.; He, B. Digital-twin-based multi-scale simulation supports urban emergency management: A case study of urban epidemic transmission. Int. J. Digit. Earth 2024, 17, 2421950. [Google Scholar] [CrossRef]
  144. Wang, H.; Chen, X.; Jia, F.; Cheng, X. Digital twin-supported smart city: Status, challenges and future research directions. Expert Syst. Appl. 2023, 217, 119531. [Google Scholar] [CrossRef]
  145. Hu, X.; Li, S.; Huang, T.; Tang, B.; Huai, R.; Chen, L. How Simulation Helps Autonomous Driving: A Survey of Sim2real, Digital Twins, and Parallel Intelligence. IEEE Trans. Intell. Veh. 2024, 9, 593–612. [Google Scholar] [CrossRef]
  146. Niaz, A.; Shoukat, M.U.; Jia, Y.; Khan, S.; Niaz, F.; Raza, M.U. Autonomous driving test method based on digital twin: A survey. In Proceedings of the 2021 International Conference on Computing, Electronic and Electrical Engineering (ICE Cube), Quetta, Pakistan, 26–27 October 2021; pp. 1–7. [Google Scholar] [CrossRef]
  147. Björnsson, B.; Borrebaeck, C.; Elander, N.; Gasslander, T.; Gawel, D.; Gustafsson, M.; Jörnsten, R.; Lee, E.; Li, X.; Lilja, S.; et al. Digital twins to personalize medicine. Genome Med. 2020, 12, 4. [Google Scholar] [CrossRef]
  148. Rajapakse, V.; Karunanayake, I.; Ahmed, N. Intelligence at the extreme edge: A survey on reformable TinyML. ACM Comput. Surv. 2023, 55, 1–30. [Google Scholar] [CrossRef]
  149. Hannou, F.Z.; Lefrançois, M.; Jouvelot, P.; Charpenay, V.; Zimmermann, A. A Survey on IoT Programming Platforms: A Business-Domain Experts Perspective. ACM Comput. Surv. 2024, 57, 1–37. [Google Scholar] [CrossRef]
  150. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 2025, 43, 1–55. [Google Scholar] [CrossRef]
  151. Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Yeung, J.A.; Pimenta, D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit. Med. 2025, 8, 274. [Google Scholar] [CrossRef]
  152. Farquhar, S.; Kossen, J.; Kuhn, L.; Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 2024, 630, 625–630. [Google Scholar] [CrossRef]
  153. Pati, S.; Kumar, S.; Varma, A.; Edwards, B.; Lu, C.; Qu, L.; Wang, J.J.; Lakshminarayanan, A.; Wang, S.H.; Sheller, M.J.; et al. Privacy preservation for federated learning in health care. Patterns 2024, 5, 100850. [Google Scholar] [CrossRef]
  154. Wang, F.; Li, B. Data reconstruction and protection in federated learning for fine-tuning large language models. IEEE Trans. Big Data 2024, 1–12. [Google Scholar] [CrossRef]
  155. Rehan, M.W.; Rehan, M.M. Survey, taxonomy, and emerging paradigms of societal digital twins for public health preparedness. npj Digit. Med. 2025, 8, 520. [Google Scholar] [CrossRef]
  156. Kuruppu Appuhamilage, G.D.K.; Hussain, M.; Zaman, M.; Ali Khan, W. A health digital twin framework for discrete event simulation based optimised critical care workflows. npj Digit. Med. 2025, 8, 376. [Google Scholar] [CrossRef]
  157. Wei, K.; Li, J.; Ding, M.; Ma, C.; Yang, H.H.; Farhad, F.; Jin, S.; Quek, T.; Poor, V. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Trans. Inf. Forensics Secur. 2020, 15, 3454–3469. [Google Scholar] [CrossRef]
  158. Andrew, G.; Thakkar, O.; McMahan, B.; Ramaswamy, S. Differentially private learning with adaptive clipping. Adv. Neural Inf. Process. Syst. 2021, 34, 17455–17466. [Google Scholar]
  159. Agarwal, N.; Kairouz, P.; Liu, Z. The Skellam mechanism for differentially private federated learning. Adv. Neural Inf. Process. Syst. 2021, 34, 5052–5064. [Google Scholar]
  160. Xin, B.; Geng, Y.; Hu, T.; Chen, S.; Yang, W.; Wang, S.; Huang, L. Federated synthetic data generation with differential privacy. Neurocomputing 2022, 468, 1–10. [Google Scholar] [CrossRef]
  161. Qi, Y.; Hossain, M.S. Semi-supervised federated learning for digital twin 6G-enabled IIoT: A Bayesian estimated approach. J. Adv. Res. 2024, 66, 47–57. [Google Scholar] [CrossRef]
  162. Hu, R.; Guo, Y.; Li, H.; Pei, Q.; Gong, Y. Personalized federated learning with differential privacy. IEEE Internet Things J. 2020, 7, 9530–9539. [Google Scholar] [CrossRef]
  163. Shen, X.; Liu, Y.; Zhang, Z. Performance-enhanced federated learning with differential privacy for internet of things. IEEE Internet Things J. 2022, 9, 24079–24094. [Google Scholar] [CrossRef]
  164. David, I.; Shao, G.; Gomes, C.; Tilbury, D.; Zarkout, B. Interoperability of Digital Twins: Challenges, Success Factors, and Future Research Directions; Springer: Berlin/Heidelberg, Germany, 2024; pp. 27–46. [Google Scholar]
  165. Xu, H.; Wu, J.; Pan, Q.; Guan, X.; Guizani, M. A survey on digital twin for industrial internet of things: Applications, technologies and tools. IEEE Commun. Surv. Tutor. 2023, 25, 2569–2598. [Google Scholar] [CrossRef]
  166. Sharan, S.P.; Choi, M.; Shah, S.; Goel, H.; Omama, M.; Chinchali, S. Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 8395–8405. [Google Scholar] [CrossRef]
  167. Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. Consistency Models. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 23–29 July 2023. [Google Scholar] [CrossRef]
  168. Geng, Z.; Pokle, A.; Kolter, J.Z. One-Step Diffusion Distillation via Deep Equilibrium Models. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar] [CrossRef]
  169. Kim, H.; Yoo, J. Singular Value Scaling: Efficient Generative Model Compression via Pruned Weights Refinement. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; Volume 39, pp. 17859–17867. [Google Scholar] [CrossRef]
  170. Xiong, K.; Wang, Z.; Leng, S.; Yang, K.; Chen, Y. A digital-twin-empowered lightweight model-sharing scheme for multirobot systems. IEEE Internet Things J. 2023, 10, 17231–17242. [Google Scholar] [CrossRef]
  171. Chiaro, D.; Qi, P.; Pescapè, A.; Piccialli, F. Generative AI-Empowered Digital Twin: A Comprehensive Survey with Taxonomy. IEEE Trans. Ind. Inform. 2025, 21, 4287–4295. [Google Scholar] [CrossRef]
  172. Schwartz, R.; Dodge, J.; Smith, N.A.; Etzioni, O. Green AI. Commun. ACM 2020, 63, 54–63. [Google Scholar] [CrossRef]
  173. Bolón-Canedo, V.; Morán-Fernández, L.; Cancela, B.; Alonso-Betanzos, A. A review of green artificial intelligence: Towards a more sustainable future. Neurocomputing 2024, 599, 128096. [Google Scholar] [CrossRef]
  174. De Lange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; Tuytelaars, T. A Continual Learning Survey: Defying Forgetting in Classification Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3366–3385. [Google Scholar] [CrossRef]
  175. Lee, D.; Yoo, M.; Kim, W.K.; Choi, W.; Woo, H. Incremental Learning of Retrievable Skills For Efficient Continual Task Adaptation. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024); Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024. [Google Scholar] [CrossRef]
  176. Meng, Y.; Bing, Z.; Yao, X.; Chen, K.; Huang, K.; Gao, Y.; Sun, F.; Knoll, A. Preserving and combining knowledge in robotic lifelong reinforcement learning. Nat. Mach. Intell. 2025, 7, 256–269. [Google Scholar] [CrossRef]
  177. Park, J.; Ji, A.; Park, M.; Rahman, M.S.; Oh, S.E. MalCL: Leveraging GAN-based generative replay to combat catastrophic forgetting in malware classification. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence (AAAI’25/IAAI’25/EAAI’25), Philadelphia, PA, USA, 25 February–4 March 2025; pp. 658–666, Article 74. [Google Scholar] [CrossRef]
  178. Papagiannidis, E.; Mikalef, P.; Conboy, K. Responsible artificial intelligence governance: A review and research framework. J. Strateg. Inf. Syst. 2025, 34, 101885. [Google Scholar] [CrossRef]
Figure 1. PRISMA flow diagram.
Figure 2. SMGA framework.
Table 1. Representative surveys on DTs and GAI: scope and focus.

Ref | Year | Description | Focus
[1] | 2023 | A survey on multimodal learning with transformers, covering their architectures, pretraining strategies, applications across vision, language, and speech, and open research challenges. | Multimodal generative artificial intelligence
[2] | 2024 | A comprehensive survey on multimodal learning from a data-centric perspective, reviewing datasets, benchmarks, and challenges in aligning and integrating vision with other modalities. | Multimodal generative artificial intelligence
[3] | 2024 | A survey on deep multimodal data fusion, discussing methods, applications across diverse domains, and key challenges in effectively integrating heterogeneous data. | Multimodal generative artificial intelligence
[4] | 2024 | Reviews multimodal large language models (MLLMs) focusing on their architectures, training strategies, datasets, and evaluation methods, highlighting emergent capabilities beyond traditional multimodal models. | Multimodal generative artificial intelligence
[5] | 2025 | A survey of AI-generated content (AIGC), outlining its evolution, core techniques, applications across text, vision, audio and code, and key challenges in alignment and evaluation. | Multimodal generative artificial intelligence
[6] | 2021 | This survey introduces digital twin networks, outlining their architectures, enabling technologies, representative applications, and key open issues for future intelligent networked systems. | Digital twins
[7] | 2023 | This review analyzes physical-based digital twins, emphasizing architectures, cross-domain applications, and the challenges of integrating physics-driven models with data-driven approaches. | Digital twins
[8] | 2024 | A systematic review exploring the interplay of AI, AIoT, and urban digital twins to enhance data-driven strategies for environmental sustainability in smart cities. | Digital twins
[9] | 2024 | A survey on applying machine and deep learning to digital twin networks for monitoring, optimization, and security. | Digital twins
[10] | 2025 | A survey on digital twin networks covering architecture, enabling technologies, applications, and research challenges. | Digital twins
Table 2. Survey comparison matrix: DTs, GAI, and multimodal fusion.

Reference | Research Object | Mathematical/Modeling Focus | Strengths/Limitations | Application Domains
Wu et al. (2021) [6] | DT networks | Network modeling, graph structures, performance equations | Comprehensive survey of DT networks; lacks cross-modal AI integration | IoT, communication networks
Liu et al. (2023) [7] | DT foundations | Categorization: entity, virtual model, twin data, application | Clear taxonomy; limited mathematical formalization | Engineering, manufacturing
Bibri et al. (2024) [8] | DTs + AI for smart cities | System-level conceptual modeling | Good integration vision; limited quantitative models | Smart cities, sustainability
Qin et al. (2024) [9] | DTs with ML/DL | ML/DL optimization functions for DT networks | Mathematical rigor; lacks GAI perspective | IoT networks
Pan et al. (2025) [10] | DT networks survey | Architectural and protocol modeling | Comprehensive but largely descriptive | IoT, edge networks
Zhu et al. (2024) [2] | Multimodal fusion | Cross-modal alignment functions, embedding models | Detailed taxonomy; lacks DT integration | Computer vision, multimedia
Xu et al. (2023) [1] | Multimodal transformers | Transformer architectures, attention functions | Mathematical clarity; limited DT linkage | Vision, speech, language
Zhao et al. (2024) [3] | Deep multimodal data fusion | Fusion operators, joint embedding, objective functions | Strong mathematical synthesis; lacks application scenarios | Multimedia, healthcare
Yin et al. (2024) [4] | Multimodal LLMs | Pre-training loss, alignment objectives | Good coverage; little DT consideration | LLMs, cross-modal AI
Cao et al. (2025) [5] | AIGC overview | GAN/diffusion model equations, generative objectives | Captures model diversity; limited DT context | AIGC, creative industries
Table 3. Sectoral validation summary for DT + GAI applications.

Sector | Typical Validation (Examples) | Representative Works (by Ref. No.) | Maturity | Key Challenges
Smart Manufacturing | Case studies (throughput via PMA); DT + GAI predictive maintenance; simulation benchmarks for design optimization | [139,140,141,142] | High (pilots on selected production lines) | Standardization across lines; data integration; cross-plant generalization
Smart Cities | City-scale traffic twins (real-time control); DT-enhanced evacuation simulations (response time); what-if analysis | [143,144] | Medium (many pilots; limited citywide operations) | Real-time fusion; multi-agent coordination; unified simulation standards
Autonomous Driving | Twin-driven RL training; sim2real evaluation; vehicle-in-the-loop via digital siblings | [145,146] | Emerging to medium | Scenario realism; sim2real gap; safety certification
Healthcare | Patient-specific DTs; surgical rehearsal; virtual clinical trials (in silico cohorts) | [74,75,147] | High in research; medium in clinical translation | Privacy and ethics; interoperability; computational scalability
