MDPI - Publisher of Open Access Journals

18 pages, 1777 KB

Open Access Article

DeepFakeX: A Comprehensive Multimodal Deepfake Dataset for Research and Analysis

by Sonia Salman, Jawwad Ahmed Shamsi and Rizwan Qureshi

Data 2026, 11(6), 141; https://doi.org/10.3390/data11060141 - 11 Jun 2026

Viewed by 189

The expanding capabilities of deep learning-based media synthesis have intensified concerns regarding the authenticity of digital content and the reliability of forensic analysis tools. In response to these challenges, this work introduces DeepFakeX, a collection of 800 synthetically generated videos available under controlled access for research purposes. The dataset encompasses four distinct categories of AI-driven synthesis: facial identity replacement, audio track substitution, neural voice cloning, and combined audiovisual alteration. Unlike existing deepfake datasets that predominantly focus on facial synthesis, DeepFakeX covers a broader range of manipulation modalities, reflecting the diversity of synthetic media encountered in real-world settings. All deepfakes were generated using state-of-the-art, publicly available tools. Standardized post-processing procedures were applied to each video to ensure uniformity in terms of quality, duration and encoding format. DeepFakeX also emphasizes diversity in gender, age, ethnicity, and language. Video contexts span speeches, informational videos, movie clips, news broadcasts, and interviews that reflect content scenarios commonly encountered in real-world online environments. The dataset includes videos in both English and Urdu. The dataset’s quality and structural variability were assessed through visual and audio analyses using the Structural Similarity Index Measure (SSIM), Mel-Frequency Cepstral Coefficients (MFCCs), and Principal Component Analysis (PCA). The evaluation results revealed substantial variability within each manipulation category, along with clearly distinguishable patterns specific to each modality. DeepFakeX has been developed to facilitate rigorous and transparent research in deepfake detection, cross-modal forensic analysis, and AI-driven media forensics. It is hosted on Zenodo under controlled access for research use. Full article
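
The SSIM-based visual analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes grayscale frames given as nested lists and treats the whole frame as a single window, whereas the standard SSIM averages the statistic over local windows. The function name `global_ssim` and the constants 0.01 and 0.03 (the usual SSIM defaults) are assumptions for illustration, not details taken from the paper.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale images.

    x, y: lists of rows of pixel intensities.
    Simplification (assumption): the whole frame is one window.
    """
    px = [float(v) for row in x for v in row]
    py = [float(v) for row in y for v in row]
    if not px or len(px) != len(py):
        raise ValueError("images must be non-empty and the same size")

    n = len(px)
    mean_x = sum(px) / n
    mean_y = sum(py) / n
    var_x = sum((a - mean_x) ** 2 for a in px) / n
    var_y = sum((b - mean_y) ** 2 for b in py) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(px, py)) / n

    # Stabilising constants with the commonly used factors 0.01 and 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


original = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
same_score = global_ssim(original, original)
inverted_score = global_ssim(original, inverted)
```

A frame compared with itself scores 1, while structurally opposite frames score far lower, which is the kind of contrast an SSIM comparison between source and manipulated frames would surface.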

31 pages, 30018 KB

Open Access Article

Sensors-Driven Multimodal Deepfake Detection: A Cross-Attention Fusion Approach with Adaptive Modality Gating

by Syeda Sitara Waseem, Noman Shabbir, Syed Rizwan Hassan and KangYoon Lee

Sensors 2026, 26(12), 3695; https://doi.org/10.3390/s26123695 - 10 Jun 2026

Viewed by 163

Abstract

Deepfakes threaten sensor-based authentication systems, including biometric sensors, surveillance cameras, and IoT edge devices. Unimodal detectors remain vulnerable to modality-specific attacks. We propose a multimodal deepfake detection framework optimized for resource-constrained edge devices, featuring a novel cross-modal attention fusion mechanism with adaptive gating. The architecture combines enhanced Res2Net for audio, temporal 3D CNN with SE attention for video, and bidirectional cross-modal attention with quality-based gates. On our benchmark (5472 audio + 1842 video samples), the fusion model achieves 96.7% accuracy, 96.6% F1-score, 0.988 AUC-ROC, and 3.3% EER. Adversarial testing shows 92.3% accuracy under the Fast Gradient Sign Method (FGSM) attack. The model has a 30.3 MB footprint and runs at 20 FPS on edge hardware. Modality contribution analysis reveals adaptive weighting (72% audio for TTS forgery, 78% video for lip-synced attacks). Cross-dataset evaluation on FakeAVCeleb achieves 92.3% overall accuracy, confirming generalization. Full article
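
The idea of quality-based gating can be sketched in a few lines. This is a deliberately simplified illustration, not the architecture described in the abstract: it assumes each modality already produces a single "fake" logit and a scalar quality score, and it omits the cross-modal attention stage entirely. The names `gated_fusion`, `audio_quality`, and `video_quality` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(audio_logit, video_logit, audio_quality, video_quality):
    """Fuse per-modality 'fake' logits using weights derived from quality.

    Assumption: a higher quality score means the modality is more reliable.
    Returns the fused probability of 'fake' and the two gate weights.
    """
    w_audio, w_video = softmax([audio_quality, video_quality])
    fused_logit = w_audio * audio_logit + w_video * video_logit
    prob_fake = 1.0 / (1.0 + math.exp(-fused_logit))
    return prob_fake, (w_audio, w_video)


# Hypothetical case: clean audio, degraded video.
prob, (w_a, w_v) = gated_fusion(
    audio_logit=2.0, video_logit=-0.5, audio_quality=1.5, video_quality=0.2
)
```

With equal quality scores the gate reduces to a plain average. As one score grows, the fused decision leans toward that modality, which is the behaviour the reported modality contribution analysis describes.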

(This article belongs to the Special Issue Secure and Resilient Solutions for CCTV, Small Sensor and IoT Device Security)

32 pages, 2468 KB

Open Access Article

Sustainable Adoption of AI-Generated Instructional Videos: An Empirical Evaluation of the LBUC Model via NotebookLM

by Levent Çallı and Büşra Alma Çallı

Systems 2026, 14(6), 631; https://doi.org/10.3390/systems14060631 - 2 Jun 2026

Viewed by 211

Abstract

This study examines how learning input quality shapes students’ trust, perceived learning value, and post-exposure behavioural intentions, and whether AI-supported instructional content contributes to conceptual learning. Grounded in the Technology Acceptance Model, trust theory, and the Cognitive Theory of Multimedia Learning, the study proposes the Learning Input Quality, Belief, User Learning Experience, and Continuance (LBUC) model. Data were collected from 320 university students in Türkiye via an online survey. To evaluate the proposed framework in an authentic instructional setting, participants watched NotebookLM-generated instructional videos in Turkish and completed pre-test and post-test knowledge measures together with Likert-type scales assessing Audio and Narration Quality, Perceived Visual Design Quality, AI Trust and Persuasion, Instructional Design Effectiveness, Perceived Learning Value, Using Intention, and Recommendation Intention. Learning gains were assessed using paired-samples t-tests, and the proposed LBUC model was evaluated using Partial Least Squares Structural Equation Modelling. The findings showed a significant within-group increase in post-test scores, suggesting short-term conceptual gains after exposure to the videos. In the structural model, Audio and Narration Quality strongly predicted AI Trust and Persuasion, whereas Perceived Visual Design Quality significantly predicted Instructional Design Effectiveness but did not directly predict trust. Both AI Trust and Persuasion and Instructional Design Effectiveness positively influenced Perceived Learning Value, which in turn strongly predicted Using Intention and Recommendation Intention. 
The results suggest that students’ immediate post-exposure Using Intention and Recommendation Intention are associated less with visual appeal alone than with pedagogically coherent narration, AI Trust and Persuasion, and Perceived Learning Value in the context of NotebookLM-generated instructional videos. Full article
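
The paired-samples t-test used for the learning gains can be written out directly. The sketch below computes only the t statistic and degrees of freedom from pre-test and post-test scores. The scores shown are made-up placeholders rather than data from the study, and the function name `paired_t_statistic` is an assumption.

```python
import math


def paired_t_statistic(pre, post):
    """Return (t, degrees_of_freedom) for a paired-samples t-test.

    t = mean(d) / (sd(d) / sqrt(n)), where d = post - pre per participant
    and sd uses the n - 1 (sample) denominator.
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need at least two matched pre/post pairs")
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1


# Placeholder scores for three hypothetical participants.
t_value, dof = paired_t_statistic(pre=[5, 6, 7], post=[6, 8, 10])
```

Turning `t_value` into a p-value requires the t distribution with `dof` degrees of freedom, which a statistics package would normally supply.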

(This article belongs to the Special Issue Generative AI Transformation in Education: Current Issues and Challenges)

25 pages, 3409 KB

Open Access Article

Edge-Hosted LLM-Assisted NICU Discharge Summary Generation: Field-Level Evaluation Using a Clinician-Defined Rubric

by Harpreet Singh, Ravneet Kaur, Satish Saluja, Su Jin Cho, Yao Sun and Ryan M. McAdams

Healthcare 2026, 14(11), 1457; https://doi.org/10.3390/healthcare14111457 - 25 May 2026

Viewed by 314

Abstract

Objective: To develop and evaluate an edge-hosted Large Language Model (LLM)-assisted system for automated Neonatal Intensive Care Unit (NICU) discharge summary generation using an evidence-grounded, field-level evaluation framework. Methods: This implementation and evaluation study was conducted in a Level III NICU in India. Longitudinal patient records were constructed from integrated bedside physiologic data (ARCHITECT) and a structured electronic medical record (EMR) platform. Although an embedded audio–video module was present, it was not used in this study. Automated discharge summaries were generated by MORPHEUS, an edge-hosted orchestration pipeline running on NVIDIA Jetson AGX Orin hardware with JetPack 6.2. Local orchestration, preprocessing, and workflow execution were performed on the edge device, while language generation inference was performed using the OpenAI gpt-4o-mini API. Documentation quality was assessed with an LLM-based evaluator guided by a clinician-defined rubric comprising 72 fields organized across 14 section contexts and scored on five dimensions: clinical accuracy, completeness, actionability, coherence, and non-hallucination. Paired, field-level comparisons were performed against clinician-authored summaries. Of 549 NICU admissions screened between 1 October 2024 and 3 November 2025, 401 met the inclusion criteria for evaluation. Prompt refinement was performed iteratively using omission-derived feedback without model weight updates. Results: Across 401 evaluated admissions, MORPHEUS-generated summaries demonstrated higher rubric-based scores and lower omission burden than clinician-authored summaries within the structured evaluation framework used in this study, with mean scores of 0.93 versus 0.75 for accuracy, 0.91 versus 0.67 for completeness, 0.93 versus 0.72 for actionability, 0.94 versus 0.74 for coherence, and 0.95 versus 0.78 for non-hallucination, with the largest absolute advantage observed for completeness. 
Error taxonomy analysis demonstrated fewer omissions, unsupported assertions, and contradictions in AI-generated summaries than in clinician-authored summaries. Iterative prompt refinement was associated with directional improvement across quality dimensions and reduced omission burden, with omission rate per patient decreasing from 2.484 to 1.807 in the later iteration. Conclusions: An edge-hosted LLM-assisted pipeline can generate NICU discharge summaries that meet or exceed clinician-authored documentation quality under a reproducible, clinician-grounded evaluation framework. These findings support the feasibility of deploying edge-orchestrated generative AI systems for high-stakes neonatal clinical documentation using a clinician-grounded field-level evaluation framework. Full article
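
The arithmetic behind the reported comparison can be checked with a short script. It uses only the mean scores and omission rates quoted in the abstract. The variable and function names are illustrative assumptions, and the rounding to two decimals is a presentation choice.

```python
# Mean rubric scores quoted in the abstract: (MORPHEUS, clinician-authored).
reported_means = {
    "accuracy": (0.93, 0.75),
    "completeness": (0.91, 0.67),
    "actionability": (0.93, 0.72),
    "coherence": (0.94, 0.74),
    "non-hallucination": (0.95, 0.78),
}


def absolute_advantages(means):
    """Per-dimension difference between generated and clinician scores."""
    return {dim: round(ai - human, 2) for dim, (ai, human) in means.items()}


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


advantages = absolute_advantages(reported_means)
largest = max(advantages, key=advantages.get)

# Omission rate per patient across prompt-refinement iterations (from the abstract).
omission_drop = relative_reduction(2.484, 1.807)
```

Here `largest` identifies the dimension with the widest gap, and `omission_drop` expresses the change in omission rate as a fraction of the earlier value.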

19 pages, 1739 KB

Open Access Article

Video-Supported Remote Cognitive Assessment in General Practice—A Pilot Mixed-Method Study on Usability, Acceptability and Feasibility

by Alexa Holfelder, Esther Brill, Rachid Guerchouche, Minh Tran-Duc, Jacob Lahr and Stefan Klöppel

Healthcare 2026, 14(11), 1452; https://doi.org/10.3390/healthcare14111452 - 25 May 2026

Viewed by 460

Abstract

Background/Objectives: Access to specialists and diagnostic resources continues to limit differential diagnosis of cognitive impairment in primary care. This pilot study examined the feasibility, usability, and clinical integration of a digitally supported Remote Cognitive Assessment (RCA) model embedded in general practice settings. Methods: A mixed-method design was used, combining structured quantitative surveys from patients (n = 10; mean age = 77.03; SD = 14.1) and neuropsychologists (10 RCAs completed by three neuropsychologists) with qualitative interviews from general practitioners (GP; n = 4). Patients were assessed remotely via a secure videoconference system operated by trained neuropsychologists. Assessments were conducted in the GP’s office, supported by local staff, to facilitate the process. Results: Patients reported high satisfaction with audio (M = 8; SD = 2.28) and video quality (M = 9.17; SD = 1.17) and expressed a strong willingness to recommend RCA (M = 8.83; SD = 1.17) on a 10-point Likert scale. Despite moderate scores for perceived simplicity (M = 5; SD = 3.41) and effectiveness (M = 5.83; SD = 2.14), overall acceptance (M = 8.33; SD = 0.82) was favorable, especially given the older age of participants. Neuropsychologists rated technical functionality positively (audio quality M = 8.17; SD = 1.18; video quality M = 8; SD = 1.67) but raised concerns about clinical utility and diagnostic depth (effectiveness M = 2.83; SD = 2.71). GPs highlighted the benefits of local facilitation, early screening, and improved access to specialist input while also noting space limitations, communication gaps, and the need for sustainable infrastructure. Conclusions: The RCA model was well accepted by patients and GPs, and technically feasible for neuropsychologists. However, neuropsychologists reported important reservations regarding usability and effectiveness. 
The results suggest an important mismatch between patient satisfaction and clinical confidence, and RCA cannot yet be recommended for routine clinical implementation based on patient acceptability alone. This model holds promise for hybrid cognitive care, particularly in underserved or rural areas, but future development must prioritize diagnostic confidence and clinician workflow usability before scalable integration into rural cognitive care pathways can be established. Full article

(This article belongs to the Section Digital Health Technologies)

17 pages, 259 KB

Open Access Article

Supporting Advance Care Planning Among Mandarin and Cantonese Speaking Communities: A Qualitative Exploratory Study

by Upma Chitkara, Ashfaq Chauhan, Ramya Walsan, Mary Li, Eric Yeung, Ursula M. Sansom-Daly and Reema Harrison

Curr. Oncol. 2026, 33(5), 288; https://doi.org/10.3390/curroncol33050288 - 14 May 2026

Viewed by 301

Abstract

Whilst advance care planning (ACP) is important to ensure person-centred end-of-life care, there is sparse evidence about factors contributing towards engagement for people from Mandarin and Cantonese speaking backgrounds (MCSB) affected by cancer. This study aimed to establish barriers and facilitators for quality ACP among people from MCSB with cancer and carers. A qualitative study utilising semi-structured interviews and focus groups was conducted. Participants were adult community members from MCSB in New South Wales who had accessed cancer care services in Australia as a support person or a patient in the last five years and were recruited purposively. Data collected from eligible consenting participants were audio/video recorded, transcribed verbatim and analysed using the Framework Method applying the Theoretical Domains Framework. Eighteen people participated (11 in two focus groups, seven individual interviews). Key barriers to engagement with ACP were unclear understanding of process and conduct, poor quality communication by healthcare staff, resource constraints and cultural misalignment of ACP concepts. The main facilitators were openness of participants to discussions, culturally informed community resources and dedicated ACP services. Co-design provides a useful approach to address the varied factors identified. At the system and service level, co-design with these communities and healthcare providers could potentially develop resources to assist these communities in engaging with ACP, including preparing for ACP communication. Understanding and acknowledging cultural factors that impact ACP and integrating this knowledge in ACP communication may enhance engagement. Full article

(This article belongs to the Section Palliative and Supportive Care)

12 pages, 1542 KB

Open Access Article

A Pilot Study of Telerobotic Radical Thyroidectomy for Thyroid Cancer Using a 5G Network

by Bing Wang, Chen Li, Zheng Wan, Jian Zhu, Meng Wang, Yanbing Jian, Zelong Yang, Xin Miao, Linlin Zhang, Fei Kuang, Lin Liu, Guolou Li, Qingqing He, Jing Yao and Wen Tian

J. Clin. Med. 2026, 15(10), 3591; https://doi.org/10.3390/jcm15103591 - 8 May 2026

Viewed by 407

Abstract

Background: The incidence of thyroid cancer has increased globally. In recent years, robotic surgical systems have been applied in thyroid surgery, and the rapid development of fifth-generation (5G) communication technology has laid a solid foundation for the smooth implementation of remote surgery. Objective: The aim was to explore the feasibility and safety of telerobotic radical thyroidectomy using 5G communication technology to treat thyroid cancer. Methods: From August 2024 to October 2024, telerobotic radical thyroidectomy was performed on seven female patients using a 5G wireless network and a dedicated line network (or ordinary wired broadband) spanning 22–2200 km. The patients’ clinical and information transmission data were analyzed. Results: All patients (papillary thyroid carcinoma, female, with an average age of 44.0 ± 4.6 years) underwent uneventful surgical procedures without any conversion to open surgery or complications. The average surgical duration was 91.3 ± 11.8 min, the average blood loss was 11.4 ± 4.8 mL, and the average postoperative hospital stay was 3.6 ± 0.8 days. All subjects were successfully discharged within 5 days after surgery. The average total latency time of the intraoperative network was 137.5 (range, 121–159) ms, and there were no adverse events, such as network disconnection, frame loss, or network attacks. The operator worked smoothly without any obvious delay or lag, and the recorded audio and video were clear. Conclusions: Telerobotic radical thyroidectomy for thyroid cancer over a 5G network demonstrates promising feasibility and safety. With stable network transmission and a clear surgical field, the precise operations required in thyroid surgery can be performed reliably. These findings suggest that this technology can facilitate high-quality surgical care in remote areas, contributing to a more balanced distribution of medical resources. Full article

(This article belongs to the Section Machine Learning and Artificial Intelligence in Clinical Medicine)

17 pages, 3615 KB

Open Access Article

FastTalk: Speech-Driven Lip Synchronization Video Generation for Chinese-Language Scenarios

by Yizhang Liu, Tao Fan, Xu Zhao and Guozhong Wang

Appl. Sci. 2026, 16(9), 4438; https://doi.org/10.3390/app16094438 - 1 May 2026

Viewed by 798

Abstract

Speech-driven lip synchronization is an important technique for talking-face video generation, with broad application potential in virtual humans, video dubbing, digital media, and human–computer interaction. However, existing methods still face challenges in achieving reliable lip synchronization while maintaining stable identity preservation, high visual fidelity, and efficient inference, especially in Chinese-language scenarios where related research remains relatively limited. To address these issues, we propose FastTalk, a speech-driven lip synchronization method for Chinese-language scenarios. The proposed framework performs latent-space restoration for efficient video synthesis, uses a fixed-mask strategy to suppress shortcut visual cues and strengthen audio-driven lip-shape prediction, and adopts a two-stage training scheme to reduce the gap between training and inference. This design improves generation stability while preserving efficiency. Experimental results show that FastTalk achieves competitive lip synchronization performance while improving visual quality and identity preservation. These results indicate that FastTalk provides an effective solution for Chinese speech-driven lip synchronization video generation. Full article

29 pages, 4742 KB

Open Access Article

DistSense: A Distributed P2P System for Privacy-Preserving and Robust Audiovisual Activity Recognition in Smart Homes

by José Manuel Torres, Luis P. Mota, Rui S. Moreira, Christophe Soares and Pedro Sobral

Appl. Sci. 2026, 16(9), 4407; https://doi.org/10.3390/app16094407 - 30 Apr 2026

Viewed by 603

Abstract

Ambient Assisted Living (AAL) systems have become increasingly relevant as aging populations intensify the demand for technologies that promote autonomy, safety, and quality of life. However, the widespread adoption of audiovisual sensing in smart homes raises critical concerns regarding data protection, privacy, and user trust. Ensuring secure processing while maintaining accurate activity recognition remains a key challenge. This work introduces DistSense, a distributed Peer-to-Peer (P2P) system designed to enhance activity detection in domestic environments through collaborative inference among intelligent audiovisual sensors. DistSense prioritizes privacy by performing local processing, sharing only high-level events, and leveraging distributed ledger mechanisms to ensure data integrity and auditability and support cross-device validation. This collaborative strategy reduces false positives caused by occlusions, illumination variability, and acoustic noise. To assess the system, functional tests were conducted for each module, followed by two use cases evaluated in both simulated and real edge hardware environments. The trained models achieved 88% accuracy for audio and 80% for video, and the system demonstrated effective performance in detecting daily activities and domestic hazards under varying noise conditions. Results indicate that DistSense successfully balances security, user acceptance, and inference robustness, positioning it as a viable solution for privacy-preserving activity monitoring in smart home contexts. Full article
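
Sharing only high-level events while keeping them tamper-evident can be illustrated with a hash-chained log. This is a much simpler stand-in for a distributed ledger and is not the mechanism the paper implements: it runs on a single node, has no consensus step, and the event fields (`sensor`, `label`) are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first entry


def _digest(prev_hash, event):
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(chain, event):
    """Append a high-level event, linking it to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(
        {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    )
    return chain


def verify_chain(chain):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_event(log, {"sensor": "kitchen-cam", "label": "stove_on"})
append_event(log, {"sensor": "kitchen-mic", "label": "alarm_sound"})
intact = verify_chain(log)
```

Only labels such as `stove_on` enter the log, never raw audio or video, which mirrors the privacy goal stated in the abstract. A peer holding a copy of the log can rerun `verify_chain` to audit it.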

(This article belongs to the Special Issue Blockchain Technologies: Trends, Challenges, Potentials and Applications)

23 pages, 47800 KB

Open Access Article

AIGC-Driven Short Video Generation Based on the Controllable Multimodal Fusion Architecture

by Yan Zhu, Wei Li, Caixia Fan and Lu Yu

Electronics 2026, 15(9), 1783; https://doi.org/10.3390/electronics15091783 - 22 Apr 2026

Viewed by 838

Abstract

The utilization of Artificial Intelligence-Generated Content (AIGC) has attracted widespread attention in video content creation. To generate high-quality videos, this paper presents a controllable multimodal fusion architecture for AIGC-driven short-video production. This architecture employs hierarchical constraint mechanisms and a multimodal attention fusion mechanism to enhance video content coherence and user controllability. Specifically, a scene coherence scheme is first designed to construct graph-based global and transition-level constraints by integrating text descriptions, reference images, and audio features. By leveraging the extracted style vector data, preliminary video clips are then generated through a combination of the cross-modal fusion unit and the spatio-temporal consistency unit. Finally, a fine-grained adjustment mechanism is implemented to ensure logical consistency and stylistic uniformity in the AIGC-generated videos. Experimental results indicate that the proposed architecture improves generation quality, controllability, and cross-segment coherence under the adopted evaluation settings. Full article

(This article belongs to the Topic Advanced Development and Applications of AI-Generated Content (AIGC))

30 pages, 6186 KB

Open Access Article

CABIF-Net: Robust Confidence-Based Audio-Visual Fusion for Fine-Grained Bird Recognition

by Zilong Li, Yan Zhang, Danju Lv and Yueyun Yu

Biology 2026, 15(8), 661; https://doi.org/10.3390/biology15080661 - 21 Apr 2026

Viewed by 537

Abstract

Fine-grained bird identification is crucial for ecosystem monitoring, species conservation, and habitat assessment. However, in real-world environments, there are challenges such as imbalances in modality quality and interference from background noise. To improve fine-grained audio-visual bird classification under heterogeneous modality conditions, we propose an audio-visual feature fusion framework named CABIF-Net. This framework introduces a confidence-based Top-K mean pooling module to select key frames to optimize the visual representations at the video level. Through a Confidence Calibration module, it dynamically assesses the reliability of the visual and audio modalities and integrates a Bidirectional Inter-modulation Fusion module to achieve controllable cross-modal information interaction. Experiments were conducted on the publicly available SSW60 dataset, characterized by severe noise and imbalance in modality quality, and the self-built Birds21 dataset with balanced modality quality. The experimental results show that the classification accuracies were 85.76% and 96.67%, respectively, outperforming existing unimodal methods and several mainstream fusion strategies. Weight distribution and visualization analyses further indicate that the proposed method can adaptively adjust the modality contributions based on discriminative evidence at the sample level. This study provides an effective framework for fine-grained audio-visual bird species recognition. Full article
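
Confidence-based Top-K mean pooling can be sketched as follows. The example assumes each frame already has a feature vector and a scalar confidence (for instance, the maximum class probability from a frame-level classifier). How CABIF-Net actually derives the confidence is not stated in the abstract, so that part is an assumption, as is the function name `topk_mean_pool`.

```python
def topk_mean_pool(frame_features, frame_confidences, k):
    """Average the feature vectors of the k most confident frames.

    frame_features: list of equal-length feature vectors, one per frame.
    frame_confidences: one scalar per frame; higher means more reliable.
    """
    if not frame_features or len(frame_features) != len(frame_confidences):
        raise ValueError("need one confidence per frame and at least one frame")
    k = max(1, min(k, len(frame_features)))

    # Indices of frames sorted from most to least confident.
    ranked = sorted(
        range(len(frame_features)),
        key=lambda i: frame_confidences[i],
        reverse=True,
    )
    selected = ranked[:k]

    dim = len(frame_features[0])
    return [sum(frame_features[i][d] for i in selected) / k for d in range(dim)]


# Hypothetical clip: the third frame is an outlier with low confidence.
features = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]]
confidences = [0.9, 0.8, 0.1]
video_vector = topk_mean_pool(features, confidences, k=2)
```

The low-confidence frame is excluded from the average, so a blurred or occluded frame does not dominate the video-level representation.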

(This article belongs to the Topic Wildlife Intelligent Monitoring: Advancing Conservation Through Visual and Acoustic Monitoring Technologies)

35 pages, 3098 KB

Open Access Article

ImmerseFM-3D: A Foundation Model Framework for Generalizable 360-Degree Video Streaming with Cross-Modal Scene Understanding

by Reka Sandaruwan Gallena Watthage and Anil Fernando

Appl. Sci. 2026, 16(7), 3424; https://doi.org/10.3390/app16073424 - 1 Apr 2026

Viewed by 353

Abstract

Current 360-degree video streaming systems treat viewport prediction, adaptive bitrate allocation, tile selection, and quality-of-experience (QoE) estimation as independent tasks, yielding fragmented pipelines that scale poorly across content types, network conditions, and individual users. We propose ImmerseFM-3D, a foundation model that jointly solves all four sub-tasks through a single shared representation. Seven input modalities, namely video frames, network traces, head-motion trajectories, ambisonics audio, depth maps, eye-tracking signals, and CLIP scene semantics, are fused by four-layer cross-modal attention and compressed into a 256-dimensional bottleneck latent via a variational information bottleneck. Four task-specific decoders operate on this shared latent simultaneously. A model-agnostic meta-learning adapter augmented with episodic memory and a hypernetwork personalizes the model from as little as 1 s of user interaction data. An extended branch supports six-degrees-of-freedom volumetric content through spherical harmonic viewport decoding and depth-aware tile importance weighting. Trained and evaluated on the IMMERSE-1M combined dataset (1000 h of 360° and volumetric video, 524 users, and over 50,000 mean opinion scores), ImmerseFM-3D reduces the mean angular viewport error by 34%, lowers the bandwidth violation rate from 8.3% to 3.1%, and achieves a QoE Pearson correlation of 0.891. The personalization adapter reaches 90% of peak performance in 22 s, while zero-shot cross-format transfer attains 72% of full in-domain accuracy. Full article
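
The mean angular viewport error quoted above is typically the great-circle angle between predicted and actual viewing directions. The sketch below assumes viewports are given as (yaw, pitch) pairs in degrees. That representation and the function names are assumptions, since the abstract does not define the metric precisely.

```python
import math


def direction(yaw_deg, pitch_deg):
    """Unit vector for a viewing direction given yaw and pitch in degrees."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (
        math.cos(pitch) * math.cos(yaw),
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
    )


def angular_error_deg(predicted, actual):
    """Angle in degrees between two (yaw, pitch) viewing directions."""
    a, b = direction(*predicted), direction(*actual)
    dot = sum(p * q for p, q in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


def mean_angular_error_deg(predictions, actuals):
    """Average angular error over matched prediction/ground-truth pairs."""
    if not predictions or len(predictions) != len(actuals):
        raise ValueError("need matched, non-empty prediction and actual lists")
    errors = [angular_error_deg(p, a) for p, a in zip(predictions, actuals)]
    return sum(errors) / len(errors)


wrap_error = angular_error_deg((350.0, 0.0), (10.0, 0.0))
```

Working with unit vectors handles the yaw wrap-around at 360 degrees without special cases, which a naive difference of yaw angles would get wrong.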

(This article belongs to the Section Computing and Artificial Intelligence)

22 pages, 1747 KB

Open Access Review

Talking Head Generation Through Generative Models and Cross-Modal Synthesis Techniques

by Hira Nisar, Salman Masood, Zaki Malik and Adnan Abid

J. Imaging 2026, 12(3), 119; https://doi.org/10.3390/jimaging12030119 - 10 Mar 2026

Viewed by 1417

Abstract

Talking Head Generation (THG) is a rapidly advancing field at the intersection of computer vision, deep learning, and speech synthesis, enabling the creation of animated human-like heads that can produce speech and express emotions with high visual realism. The core objective of THG systems is to synthesize coherent and natural audio–visual outputs by modeling the intricate relationship between speech signals, facial dynamics, and emotional cues. These systems find widespread applications in virtual assistants, interactive avatars, video dubbing for multilingual content, educational technologies, and immersive virtual and augmented reality environments. Moreover, the development of THG has significant implications for accessibility technologies, cultural preservation, and remote healthcare interfaces. This survey paper presents a comprehensive and systematic overview of the technological landscape of Talking Head Generation. We begin by outlining the foundational methodologies that underpin the synthesis process, including generative adversarial networks (GANs), motion-aware recurrent architectures, and attention-based models. A taxonomy is introduced to organize the diverse approaches based on the nature of input modalities and generation goals. We further examine the contributions of various domains such as computer vision, speech processing, and human–robot interaction, each of which plays a critical role in advancing the capabilities of THG systems. The paper also provides a detailed review of datasets used for training and evaluating THG models, highlighting their coverage, structure, and relevance. In parallel, we analyze widely adopted evaluation metrics, categorized by their focus on image quality, motion accuracy, synchronization, and semantic fidelity. Operating parameters such as latency, frame rate, resolution, and real-time capability are also discussed to assess deployment feasibility. 
Special emphasis is placed on the integration of generative artificial intelligence (GenAI), which has significantly enhanced the adaptability and realism of talking head systems through more powerful and generalizable learning frameworks.

(This article belongs to the Special Issue AI-Driven Multimodal Image and Video Processing: Advances and Applications)

24 pages, 879 KB

Open Access Review

A Survey of Diffusion Models: Methods and Applications

by HaoYu Ma and Hon-Cheng Wong

Appl. Sci. 2026, 16(5), 2482; https://doi.org/10.3390/app16052482 - 4 Mar 2026

Cited by 2 | Viewed by 3504

Abstract

Diffusion models have emerged as the state-of-the-art generative paradigm, surpassing GANs in synthesizing high-fidelity images, videos, and audio. However, their reliance on iterative denoising processes imposes substantial computational burdens and memory overheads, creating a significant barrier to their deployment on resource-constrained edge devices. Unlike existing surveys that broadly cover general methodologies, this paper provides a focused review with a specific emphasis on efficient and lightweight diffusion models. We systematically analyze the trade-offs between generation quality and computational cost, categorizing acceleration techniques into sampling optimization, architectural compression, and knowledge distillation. Furthermore, we explore the integration of diffusion models with emerging architectures (e.g., Mamba) and their evolution towards general-purpose world simulators. This survey aims to provide a roadmap for “Green AI,” bridging the gap between high-end academic research and practical, real-world applications.

(This article belongs to the Section Computing and Artificial Intelligence)

21 pages, 2964 KB

Open Access Article

MEMA: Multimodal Aesthetic Evaluation of Music in Visual Contexts

by Huaye Zhang, Chenglizhao Chen, Mengke Song, Tingting Chen, Diqiong Jiang, Lichun Liu and Xinyu Liu

Sensors 2026, 26(4), 1395; https://doi.org/10.3390/s26041395 - 23 Feb 2026

Viewed by 1018

Abstract

Recent technologies such as music retrieval, soundtrack generation, and video understanding have developed rapidly. Consequently, the aesthetic evaluation of video soundtracks has become an important research topic in academia. Soundtracks are key elements in shaping the emotional atmosphere and driving the narrative rhythm. Therefore, they require systematic methods to assess their artistic coordination with visual content. However, existing approaches mostly focus on evaluating the quality of the music itself. They often lack the ability to model the deeper aesthetic synergy between audio and visuals. To address this gap, we propose MEMA, a new soundtrack aesthetic evaluation model. MEMA employs a two-stage training strategy. The first stage builds a crossmodal imagination mechanism using a Conditional Variational Autoencoder. This method achieves bidirectional semantic reconstruction between audio and visuals. The second stage introduces a Guided Cross-Attention Alignment Module. This module enhances the model’s focus on key narrative moments in video. To facilitate this research, we also construct VMAE-Sets. It is the first large-scale dataset dedicated to soundtrack aesthetic evaluation. Finally, MEMA performs scoring and textual evaluation along three core aesthetic dimensions. Experimental results demonstrate that MEMA outperforms existing methods, achieving average improvements of 18.137% in LCC and 17.866% in SRCC compared to the strongest baseline. These findings confirm its superior audio–visual narrative alignment, demonstrating high consistency with human judgments.

(This article belongs to the Special Issue Music Acquisition and Automatic Processing for Machine Learning-Based Applications)
