MDPI - Publisher of Open Access Journals

17 pages, 10311 KB

Open Access Article

DeepFakeX: A Comprehensive Multimodal Deepfake Dataset for Research and Analysis

by Sonia Salman, Jawwad Ahmed Shamsi and Rizwan Qureshi

Data 2026, 11(6), 141; https://doi.org/10.3390/data11060141 - 11 Jun 2026

Viewed by 587

The expanding capabilities of deep learning-based media synthesis have intensified concerns regarding the authenticity of digital content and the reliability of forensic analysis tools. In response to these challenges, this work introduces DeepFakeX, a collection of 800 synthetically generated videos available under controlled access for research purposes. The dataset encompasses four distinct categories of AI-driven synthesis: facial identity replacement, audio track substitution, neural voice cloning, and combined audiovisual alteration. Unlike existing deepfake datasets that predominantly focus on facial synthesis, DeepFakeX covers a broader range of manipulation modalities, reflecting the diversity of synthetic media encountered in real-world settings. All deepfakes were generated using state-of-the-art, publicly available tools. Standardized post-processing procedures were applied to each video to ensure uniformity in terms of quality, duration and encoding format. DeepFakeX also emphasizes diversity in gender, age, ethnicity, and language. Video contexts span speeches, informational videos, movie clips, news broadcasts, and interviews that reflect content scenarios commonly encountered in real-world online environments. The dataset includes videos in both English and Urdu. The dataset’s quality and structural variability were assessed through visual and audio analyses using the Structural Similarity Index Measure (SSIM), Mel-Frequency Cepstral Coefficients (MFCCs), and Principal Component Analysis (PCA). The evaluation results revealed substantial variability within each manipulation category, along with clearly distinguishable patterns specific to each modality. DeepFakeX has been developed to facilitate rigorous and transparent research in deepfake detection, cross-modal forensic analysis, and AI-driven media forensics. It is hosted on Zenodo under controlled access for research use. Full article
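
The SSIM analysis mentioned in this abstract can be illustrated with a minimal sketch. It assumes a single global window over two grayscale frames given as flat pixel lists, with the conventional constants K1 = 0.01 and K2 = 0.03. Production implementations usually average SSIM over local sliding windows, and the abstract does not say which variant the authors used, so the function name `global_ssim` and all parameters here are hypothetical.

```python
from statistics import fmean


def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM for two equal-length grayscale pixel sequences.

    Simplification (assumption): one global window and population
    statistics, rather than the usual locally windowed average.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Stabilising constants from the standard SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean([(a - mu_x) ** 2 for a in x])
    var_y = fmean([(b - mu_y) ** 2 for b in y])
    cov_xy = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


frame = [10, 50, 90, 130, 170, 210]
brighter = [v + 20 for v in frame]   # same structure, shifted luminance
inverted = [255 - v for v in frame]  # reversed structure

print(global_ssim(frame, frame))
print(global_ssim(frame, brighter))
print(global_ssim(frame, inverted))
```

An identical pair scores 1, a brightness shift lowers the score slightly, and an inverted frame yields a negative score because the covariance term changes sign.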

31 pages, 30018 KB

Open Access Article

Sensors-Driven Multimodal Deepfake Detection: A Cross-Attention Fusion Approach with Adaptive Modality Gating

by Syeda Sitara Waseem, Noman Shabbir, Syed Rizwan Hassan and KangYoon Lee

Sensors 2026, 26(12), 3695; https://doi.org/10.3390/s26123695 - 10 Jun 2026

Viewed by 212

Abstract

Deepfakes threaten sensor-based authentication systems, including biometric sensors, surveillance cameras, and IoT edge devices. Unimodal detectors remain vulnerable to modality-specific attacks. We propose a multimodal deepfake detection framework optimized for resource-constrained edge devices, featuring a novel cross-modal attention fusion mechanism with adaptive gating. The architecture combines enhanced Res2Net for audio, temporal 3D CNN with SE attention for video, and bidirectional cross-modal attention with quality-based gates. On our benchmark (5472 audio + 1842 video samples), the fusion model achieves 96.7% accuracy, 96.6% F1-score, 0.988 AUC-ROC, and 3.3% EER. Adversarial testing shows 92.3% accuracy under the Fast Gradient Sign Method (FGSM) attack. The model has a 30.3 MB footprint and runs at 20 FPS on edge hardware. Modality contribution analysis reveals adaptive weighting (72% audio for TTS forgery, 78% video for lip-synced attacks). Cross-dataset evaluation on FakeAVCeleb achieves 92.3% overall accuracy, confirming generalization. Full article
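
To make the idea of quality-based modality gating concrete, here is a deliberately simplified sketch. It assumes each modality already produces a fake-probability and a scalar quality score, and it turns the quality scores into fusion weights with a softmax. The paper's actual model uses learned cross-modal attention and gates whose details are not given in this abstract, so `gated_fusion` and its inputs are hypothetical.

```python
import math


def gated_fusion(audio_prob, video_prob, audio_quality, video_quality):
    """Fuse two fake-probabilities using softmax gates over quality scores.

    Returns (fused_prob, audio_weight, video_weight).
    Assumption: quality scores are plain scalars; a real system would
    learn the gates jointly with the feature extractors.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(audio_quality, video_quality)
    exp_audio = math.exp(audio_quality - peak)
    exp_video = math.exp(video_quality - peak)

    audio_weight = exp_audio / (exp_audio + exp_video)
    video_weight = exp_video / (exp_audio + exp_video)

    fused = audio_weight * audio_prob + video_weight * video_prob
    return fused, audio_weight, video_weight


# A clean audio track and a degraded video stream shift weight toward audio.
fused, w_audio, w_video = gated_fusion(0.9, 0.2, audio_quality=2.0, video_quality=0.5)
print(fused, w_audio, w_video)
```

With equal quality scores the gates reduce to a plain average; unequal scores move the weighting toward the more reliable modality, which is the behaviour the reported modality contribution analysis describes.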

(This article belongs to the Special Issue Secure and Resilient Solutions for CCTV, Small Sensor and IoT Device Security)

25 pages, 1735 KB

Open Access Article

WAFF: A Synergetic Face Forgery Video Detection Method via Weakly Supervised EfficientNet

by Zhengzhuo Pan, Bohan Chen, Longxiang Ma, Dawei Jin, Yu Zhou and Yudi Huang

J. Imaging 2026, 12(6), 240; https://doi.org/10.3390/jimaging12060240 - 29 May 2026

Viewed by 317

Abstract

Deepfake detection has become an essential task for ensuring the authenticity and security of digital media. Although recent approaches have achieved notable progress, most existing detectors still exhibit limited generalization to unseen forgery techniques and remain vulnerable to common perturbations such as compression, noise, and adversarial attacks. To overcome these issues, we propose Weakly Supervised EfficientNet Augmented Face Forgery Detector (WAFF), a novel framework that integrates fine-grained per-frame analysis with adaptive video-level fusion. Specifically, WAFF integrates WSEffiNet, an EfficientNet-B3-based backbone enhanced with a Weakly Supervised Data Augmentation Network (WS-DAN). This design generates attention maps to emphasize subtle facial forgery artifacts while encouraging complementary local–global feature learning. At the video level, WAFF incorporates a multi-strategy fusion scheme that combines fake-frame counting, confidence averaging, and attention-guided voting to strike a balance between sensitivity and stability. Extensive experiments on FaceForensics++, Celeb-DF v2, DFD, DFDC, and FFIW-10K demonstrate that WAFF can achieve state-of-the-art performance under both high- and low-quality compression, while also enhancing cross-dataset generalization. Full article
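
The three video-level signals named in this abstract (fake-frame counting, confidence averaging, and attention-guided voting) can be sketched as follows. The abstract does not state how WAFF combines them, so the equal-weight average used here, the 0.5 threshold, and the function name `fuse_video_scores` are all assumptions made for illustration.

```python
def fuse_video_scores(frame_probs, attention, threshold=0.5):
    """Combine per-frame fake-probabilities into one video-level score.

    frame_probs: per-frame probability that the frame is fake.
    attention:   non-negative per-frame attention weights.
    Assumption: the three strategies are averaged with equal weight.
    """
    if len(frame_probs) != len(attention) or not frame_probs:
        raise ValueError("inputs must be non-empty and of equal length")
    total_attention = sum(attention)
    if total_attention <= 0:
        raise ValueError("attention weights must sum to a positive value")

    votes = [1.0 if p > threshold else 0.0 for p in frame_probs]

    # 1) Fake-frame counting: fraction of frames voted fake.
    count_score = sum(votes) / len(votes)
    # 2) Confidence averaging: mean fake-probability.
    mean_score = sum(frame_probs) / len(frame_probs)
    # 3) Attention-guided voting: votes weighted by normalised attention.
    attn_score = sum(v * a for v, a in zip(votes, attention)) / total_attention

    return (count_score + mean_score + attn_score) / 3.0


score = fuse_video_scores([0.9, 0.8, 0.2, 0.7], attention=[1.0, 1.0, 2.0, 0.0])
print(score)
```

Counting is sensitive to a few confidently fake frames, averaging smooths noisy predictions, and attention weighting lets frames judged informative dominate, which is one way to read the stated balance between sensitivity and stability.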

(This article belongs to the Special Issue AI-Driven Image and Video Understanding)

20 pages, 2215 KB

Open Access Article

Frame Selection Strategies for Video Deepfake Detection: Benchmarking Accuracy and Runtime Trade-Offs

by Artūras Serackis, Mindaugas Jankauskas, Anastasija Grubinskienė and Vytautas Abromavičius

Appl. Sci. 2026, 16(11), 5364; https://doi.org/10.3390/app16115364 - 27 May 2026

Viewed by 348

Abstract

This study evaluates frame selection during inference as an independent factor in video deepfake detection while keeping the downstream detectors fixed. We compare twelve frame selection strategies, ranging from simple temporal and quality baselines to landmark-aware policies, using four validated pretrained detectors: Self-Blended Images (SBIs), Frequency-Enhanced Self-Blended Images (FSBIs), Generative Convolutional Vision Transformer (GenConViT), and GenD. The primary experiment is a complete factorial benchmark with 300 videos and five frame budgets (2, 4, 8, 16, and 32 selected frames), which provides the reference results at 32 frames. To address sample size limitations, an additional validation experiment uses a deduplicated split of 1180 Celeb-DF++ and FaceForensics++ videos, with complete results for 2, 4, and 8 selected frames and a reported subset for 16 selected frames. In the complete 300-video benchmark, 32 frames achieved the strongest average AUC, while 8 and 16 frames recovered most of the attainable performance with lower runtime. The best single validated configuration was GenD with Shot-aware sampling at 32 frames, yielding an AUC of 0.9607 and a balanced accuracy of 0.9133. The study therefore does not claim that smaller budgets universally outperform 32 frames; instead, it quantifies the trade-off between accuracy and runtime and shows that frame selection remains a meaningful design variable under constrained inference budgets.
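
As an illustration of what a simple temporal baseline under a fixed frame budget might look like, the sketch below picks the centre frame of each of `budget` equal-length segments. The abstract does not specify the twelve strategies, so this is a generic uniform sampler rather than any of the paper's policies, and `uniform_frame_indices` is a hypothetical name.

```python
def uniform_frame_indices(num_frames, budget):
    """Return `budget` frame indices spread evenly across a video.

    Each index is the centre of one of `budget` equal-length segments.
    If the budget covers the whole video, every frame is returned.
    """
    if num_frames <= 0 or budget <= 0:
        raise ValueError("num_frames and budget must be positive")
    if budget >= num_frames:
        return list(range(num_frames))

    segment = num_frames / budget
    return [int((i + 0.5) * segment) for i in range(budget)]


# The five budgets evaluated in the study, applied to a 300-frame clip.
for budget in (2, 4, 8, 16, 32):
    print(budget, uniform_frame_indices(300, budget))
```

Because runtime scales roughly with the number of frames passed to the detector, a sampler like this makes the budget an explicit knob that can be traded against accuracy.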

(This article belongs to the Special Issue Integration of AI in Signal and Image Processing)

19 pages, 1771 KB

Open Access Article

Dynamic Spatial-Temporal Inconsistency Learning for General Deepfake Detection in Visual Understanding

by Jicheng Li, Guangjun Liao, Yufei Wang, Xing Liu and Beibei Liu

Mathematics 2026, 14(10), 1612; https://doi.org/10.3390/math14101612 - 9 May 2026

Viewed by 376

Abstract

Generalizable deepfake detection is essential for trustworthy visual understanding in real-world computer vision applications. This paper presents a dynamic spatial-temporal inconsistency learning algorithm designed to achieve high generalization in deepfake video detection. Current video-based detection approaches tend to either isolate spatial artifacts or merely exploit coarse temporal inconsistencies when identifying deepfake videos, which impedes the acquisition of fine-grained spatial-temporal clues and consequently limits their generalization capability. To this end, we propose the dynamic spatial-temporal network (DST-Net), a deep architecture that systematically mines comprehensive inconsistency cues through three synergistic modules. The short-term temporal modality extraction (STME) module captures temporal dynamics from adjacent frames. The short-term spatial-temporal inconsistency extraction (SSTIE) module with pixel-wise supervision learns semantically meaningful inconsistency features resistant to perturbations. The dynamic-term spatial-temporal inconsistency extraction (DSTIE) module adaptively aggregates these features across timescales, building robust multi-scale representations. This design ensures that the learned representations capture intrinsic forgery patterns, enhancing generalization and robustness. Comprehensive evaluations conducted on five widely adopted benchmark datasets reveal that our method surpasses nine representative competitors, with superior robustness to common image perturbations. This work advances the application of deep learning algorithms to reliable visual understanding in multimedia forensics. Full article

(This article belongs to the Special Issue Recent Advances in Deep Learning Algorithms for Computer Vision and Image Analysis)

18 pages, 741 KB

Open Access Review

A Review of Tools and Technologies to Combat Deepfakes

by Dmitry Erokhin and Nadejda Komendantova

Information 2026, 17(4), 347; https://doi.org/10.3390/info17040347 - 3 Apr 2026

Cited by 1 | Viewed by 2755

Abstract

Deepfakes and adjacent synthetic-media capabilities have become a systemic challenge for information integrity, security, and digital trust. Countermeasures now span passive detection methods that infer manipulation from content traces, active provenance systems that cryptographically bind metadata to media, and watermarking approaches that embed detectable signals into content or generative processes. This review presents a rigorous synthesis of tools and technologies to combat deepfakes across modalities (image, video, audio, and selected multimodal settings), drawing primarily from the peer-reviewed literature, standardized benchmarks, and official technical specifications and reports. The review analyzes detection methods; provenance and authentication technologies, with emphasis on cryptographic manifests and threat models; watermarking and content provenance, including diffusion-era watermarking and industrial deployments; adversarial robustness and attacker adaptation; datasets and benchmarks; evaluation metrics across tasks; and deployment and scalability constraints. A dedicated section addresses legal, ethical, and policy issues, focusing on emerging transparency obligations and platform governance. The review finds that no single countermeasure is sufficient in realistic adversarial settings. The strongest practical approach is a layered defense that combines provenance, watermarking, content-based detection, and human oversight. The study concludes with limitations of the current evidence base and prioritized research directions to improve generalization, interoperability, and trustworthy user experiences.

(This article belongs to the Special Issue Surveys in Information Systems and Applications)

10 pages, 375 KB

Open Access Entry

Deepfakes

by Sean William Maher

Encyclopedia 2026, 6(4), 80; https://doi.org/10.3390/encyclopedia6040080 - 2 Apr 2026

Viewed by 87371

Definition

Deepfakes have emerged as one of the most significant developments in contemporary computational media, representing a sophisticated convergence of machine learning, computer vision, and audiovisual synthesis. Enabled primarily by deep neural networks such as generative adversarial networks (GANs) and transformer-based architectures, Deepfakes are realistic video fabrications, produced through sound and image alteration and substitution, that synthesise human likeness, speech, and behaviours. Deepfakes function simultaneously as creative tools, political instruments, security risks, and epistemic disruptors. They have generated widespread scholarly, regulatory, and public concern by contributing to the reshaping of visual communication and posing significant challenges to established norms of authenticity. This entry defines Deepfakes, outlines their technological foundations, synthesises insights from current research, and assesses implications for media industries, journalism, documentary, disinformation, governance, and digital culture.

(This article belongs to the Section Social Sciences)

18 pages, 1850 KB

Open Access Article

AT-HSTNet: An Efficient Hierarchical Action-Transformer Framework for Deepfake Video Detection

by Sameena Javaid, Marwa Chendeb El Rai, Abeer Elkhouly, Obada Al-Khatib, Aicha Beya Far and May El Barachi

Appl. Sci. 2026, 16(7), 3450; https://doi.org/10.3390/app16073450 - 2 Apr 2026

Viewed by 480

Abstract

The rapid advancement of deepfake generation technologies presents significant challenges to the verification of digital video authenticity. The time-dependent artifacts that these technologies introduce are difficult to detect using conventional frame-based detection approaches. This paper introduces AT-HSTNet, an Action-Transformer-based Hierarchical Spatiotemporal Network designed for robust and computationally efficient deepfake video detection. The proposed framework adopts a multi-stage hierarchical architecture in which frame-level visual features are extracted using an EfficientNet-B0 backbone, short- and medium-range temporal patterns are modeled through Bidirectional Long Short-Term Memory (BiLSTM) networks, and long-range temporal dependencies are captured using an action-aware Transformer operating on temporally aggregated representations. Unlike conventional video transformers that apply self-attention directly to raw frame-level features, the proposed action-aware attention mechanism reduces redundant computation and improves stability in temporal reasoning. Extensive experiments on the balanced FFIW-10K dataset demonstrate that AT-HSTNet achieves an accuracy of 98.7%, with 98.0% precision, 96.0% recall, and a 96.9% F1-score, outperforming representative CNN–BiLSTM and CNN–Transformer baseline architectures. In addition, AT-HSTNet is highly efficient, requiring only 0.45 GFLOPs and achieving an inference speed of approximately 30 FPS on consumer-grade GPU hardware. The study finds that hierarchical temporal modeling is more effective for deepfake video detection when combined with action-aware attention.
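
The reported precision, recall, and F1-score can be cross-checked with the standard definition of F1 as the harmonic mean of precision and recall. The short sketch below applies that definition to the rounded figures from the abstract; it is a sanity check on the arithmetic, not a reproduction of the authors' evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in the abstract.
reported_precision = 0.980
reported_recall = 0.960
f1 = f1_score(reported_precision, reported_recall)
print(f1)
```

The result is approximately 0.970, in line with the reported 96.9% once rounding of the published precision and recall is taken into account.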

22 pages, 3493 KB

Open Access Article

Deepfake Detection Using Multimodal CLIP-Based SigLIP-2 Vision Transformers

by Joe Soundararajan and Dong Xu

AI 2026, 7(3), 115; https://doi.org/10.3390/ai7030115 - 19 Mar 2026

Viewed by 3570

Abstract

Background: Deepfakes pose a growing threat to the integrity of visual media, motivating detectors that remain reliable as forgeries become increasingly realistic. Methods: We propose a deepfake detection framework built on CLIP-derived SigLIP-2 vision transformers and a multi-task design that jointly performs (i) classification and (ii) manipulated-region localization when pixel-level supervision is available. We evaluated the approach on three public benchmarks of increasing complexity—HiDF, SID_Set (SIDA), and CiFake—using each dataset’s official partitions where provided (SID_Set uses the predefined train/validation split) and a standardized preprocessing and training pipeline across experiments. Results: On HiDF, our model achieved strong performance on both video and image tracks (AUC up to 0.931 on video and 0.968 on images), yielding large gains relative to previously reported HiDF baselines under their published settings. On SID_Set, the model achieved 99.1% three-class accuracy (real/synthetic/tampered) and produced accurate localization masks for many tampered regions, while we explicitly documented the split protocol and leakage checks to support the validity of the evaluation. On CiFake, the model exceeded 95% accuracy and attained an AUC of 0.986. Conclusions: Overall, the results indicate that SigLIP-2 representations combined with multi-task training can deliver high detection accuracy and interpretable localization on challenging, realistic forgeries, while highlighting the importance of clearly stated evaluation protocols for fair comparison. Full article

(This article belongs to the Section AI Systems: Theory and Applications)

18 pages, 5241 KB

Open Access Viewpoint

The Generative AI Paradox: GenAI and the Erosion of Trust, the Corrosion of Information Verification, and the Demise of Truth

by Emilio Ferrara

Future Internet 2026, 18(2), 73; https://doi.org/10.3390/fi18020073 - 1 Feb 2026

Cited by 2 | Viewed by 4146

Abstract

Generative AI (GenAI) now produces text, images, audio, and video that can be perceptually convincing at scale and at negligible marginal cost. While public debate often frames the associated harms as “deepfakes” or incremental extensions of misinformation and fraud, this view misses a broader socio-technical shift: GenAI enables synthetic realities—coherent, interactive, and potentially personalized information environments in which content, identity, and social interaction are jointly manufactured and mutually reinforcing. We argue that the most consequential risk is not merely the production of isolated synthetic artifacts, but the progressive erosion of shared epistemic ground and institutional verification practices as synthetic content, synthetic identity, and synthetic interaction become easy to generate and hard to audit. This paper (i) formalizes synthetic reality as a layered stack (content, identity, interaction, institutions), (ii) expands a taxonomy of GenAI harms spanning personal, economic, informational, and socio-technical risks, (iii) articulates the qualitative shifts introduced by GenAI (cost collapse, throughput, customization, micro-segmentation, provenance gaps, and trust erosion), and (iv) synthesizes recent risk realizations (2023–2025) into a compact case bank illustrating how these mechanisms manifest in fraud, elections, harassment, documentation, and supply-chain compromise. We then propose a mitigation stack that treats provenance infrastructure, platform governance, institutional workflow redesign, and public resilience as complementary rather than substitutable, and outline a research agenda focused on measuring epistemic security. We conclude with the Generative AI Paradox: as synthetic media becomes ubiquitous, societies may rationally discount digital evidence altogether, raising the cost of truth for everyday life and for democratic and economic institutions. Full article

(This article belongs to the Special Issue 2024 and 2025 Feature Papers from Future Internet’s Editorial Board Members)

25 pages, 2900 KB

Open Access Article

SDEQ-Net: A Deepfake Video Anomaly Detection Method Integrating Stochastic Differential Equations and Hermitian-Symmetric Quantum Representations

by Ruixing Zhang, Bin Li and Degang Xu

Symmetry 2026, 18(2), 259; https://doi.org/10.3390/sym18020259 - 30 Jan 2026

Viewed by 689

Abstract

With the rapid advancement of deepfake generation technologies, forged videos have become increasingly realistic in visual quality and temporal consistency, posing serious threats to multimedia security. Existing detection methods often struggle to effectively model temporal dynamics and capture subtle inter-frame anomalies. To address these challenges, we propose a Stochastic Differential Equation and Quantum Uncertainty Network (SDEQ-Net), a novel deepfake video anomaly detection framework that integrates continuous-time stochastic modeling with quantum uncertainty mechanisms. First, a Continuous Time Neural Stochastic Differential Filtering Module (CNSDFM) is introduced to characterize the continuous evolution of latent inter-frame states using neural stochastic differential equations, enabling robust temporal filtering and uncertainty estimation. Second, a Quantum Uncertainty Aware Fusion Module (QUAFM) incorporates Hermitian-symmetric density matrix representations and von Neumann entropy to enhance feature fusion under uncertainty, leveraging the mathematical symmetry properties of quantum state representations for principled uncertainty quantification. Third, a Fractional Order Temporal Anomaly Detection Module (FOTADM) is proposed to generate fine-grained temporal anomaly scores based on fractional-order residuals, which are used as dynamic weights to guide attention toward anomalous frames. Extensive experiments on three benchmark datasets (FaceForensics++, Celeb-DF, and DFDC) demonstrate the effectiveness of the proposed method. SDEQ-Net achieves AUC scores of 99.81% on FF++ (c23) and 97.91% on FF++ (c40). In cross-dataset evaluations, it obtains 89.55% AUC on Celeb-DF and 86.21% AUC on DFDC, consistently outperforming existing state-of-the-art methods in both detection accuracy and generalization capability.
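
For readers unfamiliar with the von Neumann entropy mentioned in this abstract: for a density matrix ρ it is defined as S(ρ) = -Tr(ρ ln ρ), which equals -Σ λ_i ln λ_i over the eigenvalues λ_i of ρ. The sketch below evaluates that quantity from a list of eigenvalues. How QUAFM builds its density matrices and uses the entropy is not described here, so the function `von_neumann_entropy` and its tolerance parameter are illustrative assumptions only.

```python
import math


def von_neumann_entropy(eigenvalues, tol=1e-12):
    """Entropy -sum(l * ln l) of a density matrix given its eigenvalues.

    The eigenvalues must be non-negative and sum to 1 (unit trace).
    Terms with eigenvalue ~0 are skipped, using the limit 0 * ln 0 = 0.
    """
    if any(lam < -tol for lam in eigenvalues):
        raise ValueError("eigenvalues of a density matrix must be non-negative")
    if abs(sum(eigenvalues) - 1.0) > 1e-9:
        raise ValueError("eigenvalues of a density matrix must sum to 1")

    return -sum(lam * math.log(lam) for lam in eigenvalues if lam > tol)


pure_state = von_neumann_entropy([1.0, 0.0])       # no uncertainty
maximally_mixed = von_neumann_entropy([0.5, 0.5])  # maximal for 2 levels
print(pure_state, maximally_mixed)
```

A pure state has zero entropy and a maximally mixed two-level state has entropy ln 2, so the value can serve as a scalar uncertainty measure of the kind the abstract alludes to.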

(This article belongs to the Section Computer)

19 pages, 1747 KB

Open Access Article

Video Deepfake Detection Based on Multimodality Semantic Consistency Fusion

by Fang Sun, Xiaoxuan Guo, Tong Zhang, Yang Liu and Jing Zhang

Future Internet 2026, 18(2), 67; https://doi.org/10.3390/fi18020067 - 23 Jan 2026

Cited by 1 | Viewed by 1256

Abstract

Deepfake detection in video data typically relies on mining deep embedded representations across multiple modalities to obtain discriminative fused features and thereby improve detection accuracy. However, existing approaches predominantly focus on how to exploit complementary information across modalities to ensure effective fusion, while often overlooking the impact of noise and interference present in the data. For instance, issues such as small objects, blurring, and occlusions in the visual modality can disrupt the semantic consistency of the fused features. To address this, we propose a Multimodality Semantic Consistency Fusion model for video forgery detection. The model introduces a semantic consistency gating mechanism to enhance the embedding of semantically aligned information across modalities, thereby improving the discriminability of the fused representations. Furthermore, we incorporate an event-level weakly supervised loss to strengthen the global semantic discrimination of the video data. Extensive experiments on standard video forgery detection benchmarks demonstrate the effectiveness of the proposed method, achieving superior performance in both forgery event detection and localization compared to state-of-the-art approaches. Full article

(This article belongs to the Special Issue Information and Future Internet Security, Trust and Privacy—4th Edition)

21 pages, 1055 KB

Open Access Article

FAIR-VID: A Multimodal Pre-Processing Pipeline for Student Application Analysis

by Algirdas Laukaitis, Diana Kalibatienė, Dovilė Jodenytė, Kęstutis Normantas, Julius Jancevičius, Mindaugas Jankauskas and Artūras Serackis

Appl. Sci. 2025, 15(24), 13127; https://doi.org/10.3390/app152413127 - 13 Dec 2025

Cited by 1 | Viewed by 1452

Abstract

The shift toward remote and automated admission processes in higher education introduces new challenges, including evaluator subjectivity and risks of applicant fraud. The FAIR-VID project addresses these issues by developing an artificial intelligence system that integrates multimodal data fusion with semi-supervised deep learning to assess applicant video interviews, submitted documents, and form data. This paper presents the project’s data preprocessing pipeline, designed to fuse heterogeneous modalities and to support seamless interaction between AI agents and human decision-makers throughout the admission workflow. The proposed process is intentionally general, making it applicable not only to international university admissions but also to broader human resource management and hiring contexts. Emphasis is placed on the need for robust and transparent AI adoption in admission and recruitment, supported by open-source modules and models at every stage of interaction between applicants and institutions. As a proof of concept, we provide open-source solutions for the analysis of video interviews, images, and documents enriched with semantic descriptions generated by large multimodal and complementary AI models. The paper details the multi-phase implementation of this pipeline to create structured, semantically rich datasets suitable for training advanced deep learning systems for comprehensive applicant assessment and fraud detection. Full article

(This article belongs to the Section Computing and Artificial Intelligence)

18 pages, 1001 KB

Open Access Article

Artificial Intelligence Physician Avatars for Patient Education: A Pilot Study

by Syed Ali Haider, Srinivasagam Prabha, Cesar Abraham Gomez-Cabello, Ariana Genovese, Bernardo Collaco, Nadia Wood, Mark A. Lifson, Sanjay Bagaria, Cui Tao and Antonio Jorge Forte

J. Clin. Med. 2025, 14(23), 8595; https://doi.org/10.3390/jcm14238595 - 4 Dec 2025

Cited by 3 | Viewed by 3125

Abstract

Background: Generative AI and synthetic media have enabled realistic human Embodied Conversational Agents (ECAs) or avatars. A subset of this technology replicates faces and voices to create realistic likenesses. When combined with avatars, these methods enable the creation of “digital twins” of physicians, offering patients scalable, 24/7 clinical communication outside the immediate clinical environment. This study evaluated surgical patient perceptions of an AI-generated surgeon avatar for postoperative education. Methods: We conducted a pilot feasibility study with 30 plastic surgery patients at Mayo Clinic, USA (July–August 2025). A bespoke interactive surgeon avatar was developed in Python using the HeyGen IV model to reproduce the surgeon’s likeness. Patients interacted with the avatar through natural voice queries, which were mapped to predetermined, pre-recorded video responses covering ten common postoperative topics. Patient perceptions were assessed using validated scales of usability, engagement, trust, eeriness, and realism, supplemented by qualitative feedback. Results: The avatar system reliably answered 297 of 300 patient queries (99%). Usability was excellent (mean System Usability Scale score = 87.7 ± 11.5) and engagement high (mean 4.27 ± 0.23). Trust was the highest-rated domain, with all participants (100%) finding the avatar trustworthy and its information believable. Eeriness was minimal (mean = 1.57 ± 0.48), and 96.7% found the avatar visually pleasing. Most participants (86.6%) recognized the avatar as their surgeon, although many still identified it as artificial; voice resemblance was less convincing (70%). Interestingly, participants with prior exposure to deepfakes demonstrated consistently higher acceptance, rating usability, trust, and engagement 5–10% higher than those without prior exposure. Qualitative feedback highlighted clarity, efficiency, and convenience, while noting limitations in realism and conversational scope. 
Conclusions: The AI-generated physician avatar achieved high patient acceptance without triggering uncanny valley effects. Transparency about the synthetic nature of the technology enhanced, rather than diminished, trust. Familiarity with the physician and institutional credibility likely played a key role in the high trust scores observed. When implemented transparently and with appropriate safeguards, synthetic physician avatars may offer a scalable solution for postoperative education while preserving trust in clinical relationships.

(This article belongs to the Special Issue Advancing Clinical Medicine Through Artificial Intelligence (AI) and Digital Technology: 2nd Edition)

26 pages, 2820 KB

Open Access Article

Forensic Analysis of Manipulated Images and Videos

by Sergio A. Falcón-López, Llanos Tobarra, Antonio Robles-Gómez and Rafael Pastor-Vargas

Appl. Sci. 2025, 15(23), 12664; https://doi.org/10.3390/app152312664 - 29 Nov 2025

Cited by 1 | Viewed by 2483

Abstract

The transition from Industry 4.0 to Industry 5.0 emphasizes the need for ethical, transparent, and human-centric artificial intelligence systems. In this context, ensuring the authenticity of digital information has become crucial for maintaining societal trust. This study addresses the challenge of detecting manipulated multimedia content, including synthetic images, videos, and audio generated by artificial intelligence, commonly known as Deepfakes. We analyze and compare general-purpose and Deepfake-specific detection methods to assess their effectiveness in real-world scenarios. This work introduces a refined reference model that integrates both application-oriented and methodological criteria, grouping tools into Blind Forensic, Handcrafted Machine Learning, Deep Learning-based methods, and Toolkits. This structured taxonomy provides a clearer comparative framework than existing works, which typically classify detectors using only one of these dimensions. To ensure reproducible evaluation, all experiments were performed using the SAFL dataset, which consolidates real and synthetic multimedia content generated with publicly available tools under a unified protocol. Among the tested tools, Forensically achieved the highest accuracy in image forgery detection (86.9%), while Autopsy reached 69.5% among Deepfake-specific image detectors. In video analysis, Forensically obtained 98.6% accuracy, whereas Deepware Scanner achieved 91.2% as the most effective Deepfake-focused tool. These results highlight that general-purpose methods remain robust for images, while specialized detectors perform competitively in videos. Overall, the proposed model and dataset establish a consistent foundation for advancing hybrid detection strategies aligned with the ethical and transparent AI principles envisioned in Industry 5.0.

(This article belongs to the Special Issue AI from Industry 4.0 to Industry 5.0: Engineering for Social Change)

Search Results (98)
