Search Results (185)

Search Parameters:
Keywords = deepfake detection

18 pages, 1606 KB  
Article
Multi-Scale Dynamic Perception and Context Guidance Modulation for Efficient Deepfake Detection
by Yuanqing Ding, Fanliang Bu and Hanming Zhai
Electronics 2026, 15(8), 1569; https://doi.org/10.3390/electronics15081569 - 9 Apr 2026
Viewed by 264
Abstract
Deepfake technology poses significant threats to information authenticity and social trust, necessitating effective detection methods. However, existing detection approaches predominantly rely on high-complexity network architectures that, while accurate in controlled environments, suffer from prohibitive computational costs that hinder deployment in resource-constrained scenarios such as social media platforms. To address this efficiency-accuracy dilemma, we propose a lightweight face forgery detection method that systematically learns multi-scale forgery traces. Our approach features a four-stage lightweight architecture that hierarchically extracts features from local textures to global semantics, mimicking the human visual system. Within each stage, a multi-scale dynamic perception mechanism divides feature channels into parallel groups equipped with lightweight attention modules to capture forgery cues spanning pixel-level anomalies, local structures, regional patterns, and semantic inconsistencies. Furthermore, rather than relying on conventional feature fusion that risks suppressing subtle artifacts, we introduce a novel Context-Guided Dynamic Convolution. This mechanism uses mid-level spatial anomalies as active anchors to dynamically modulate high-level semantic filters, with the goal of mitigating the disconnect between semantic content and forgery evidence. Our model achieves strong performance, yielding an AUC of 91.98% on FaceForensics++ and 93.50% on DeepFake Detection Challenge, outperforming current state-of-the-art lightweight methods. Furthermore, compared to heavy Vision Transformers, our model achieves a superior performance-efficiency trade-off, requiring only 3.06 M parameters and 1.36 G FLOPs, making it highly suitable for real-time, resource-constrained deployment.
(This article belongs to the Section Electronic Multimedia)
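
The abstract's channel-grouping idea translates naturally into a few lines of PyTorch. The sketch below is an illustration under stated assumptions, not the authors' code: the kernel sizes, the SE-style attention, and the module name are invented stand-ins for the "parallel groups equipped with lightweight attention modules".

```python
import torch
import torch.nn as nn

class MultiScaleDynamicPerception(nn.Module):
    """Split channels into parallel groups, each with a depthwise conv of a
    different kernel size, plus a lightweight (SE-style) channel attention,
    so branches respond to forgery cues at different spatial scales."""
    def __init__(self, channels: int, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        g = channels // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(g, g, k, padding=k // 2, groups=g),  # depthwise, scale k
                nn.BatchNorm2d(g),
                nn.SiLU(),
            )
            for k in kernel_sizes
        )
        # lightweight attention: squeeze-and-excitation over regrouped channels
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.SiLU(),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        groups = torch.chunk(x, len(self.branches), dim=1)
        y = torch.cat([b(g) for b, g in zip(self.branches, groups)], dim=1)
        return y * self.attn(y)  # re-weight channels to keep subtle artifacts

x = torch.randn(2, 64, 56, 56)
print(MultiScaleDynamicPerception(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```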

18 pages, 741 KB  
Review
A Review of Tools and Technologies to Combat Deepfakes
by Dmitry Erokhin and Nadejda Komendantova
Information 2026, 17(4), 347; https://doi.org/10.3390/info17040347 - 3 Apr 2026
Viewed by 622
Abstract
Deepfakes and adjacent synthetic-media capabilities have become a systemic challenge for information integrity, security, and digital trust. Countermeasures now span passive detection methods that infer manipulation from content traces, active provenance systems that cryptographically bind metadata to media, and watermarking approaches that embed detectable signals into content or generative processes. This review presents a rigorous synthesis of tools and technologies to combat deepfakes across modalities (image, video, audio, and selected multimodal settings), drawing primarily from the peer-reviewed literature, standardized benchmarks, and official technical specifications and reports. The review analyzes detection methods, provenance and authentication technologies, with emphasis on cryptographic manifests and threat models, watermarking and content provenance, including diffusion-era watermarking and industrial deployments, adversarial robustness and attacker adaptation, datasets and benchmarks, evaluation metrics across tasks, and deployment and scalability constraints. A dedicated section addresses legal, ethical, and policy issues, focusing on emerging transparency obligations and platform governance. The review finds that no single countermeasure is sufficient in realistic adversarial settings. The strongest practical approach is a layered defense that combines provenance, watermarking, content-based detection, and human oversight. The study concludes with limitations of the current evidence base and prioritized research directions to improve generalization, interoperability, and trustworthy user experiences.
(This article belongs to the Special Issue Surveys in Information Systems and Applications)
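
Of the countermeasure families this review covers, active provenance is the simplest to miniaturize: bind metadata to the media bytes with a keyed signature so any later edit invalidates the binding. A minimal standard-library sketch, with HMAC standing in for the public-key signatures that real manifest systems use; the key, field names, and manifest layout are all illustrative assumptions.

```python
import hashlib
import hmac
import json

SECRET = b"signer-key"  # stand-in for a real signing key / certificate

def bind_manifest(media: bytes, metadata: dict) -> dict:
    """Attach a manifest that cryptographically binds metadata to the media."""
    payload = json.dumps(
        {"sha256": hashlib.sha256(media).hexdigest(), "meta": metadata},
        sort_keys=True).encode()
    return {"payload": payload.decode(),
            "sig": hmac.new(SECRET, payload, hashlib.sha256).hexdigest()}

def verify(media: bytes, manifest: dict) -> bool:
    """Any edit to the media or the metadata breaks the hash or the signature."""
    payload = manifest["payload"].encode()
    ok_sig = hmac.compare_digest(
        manifest["sig"], hmac.new(SECRET, payload, hashlib.sha256).hexdigest())
    ok_hash = json.loads(payload)["sha256"] == hashlib.sha256(media).hexdigest()
    return ok_sig and ok_hash

video = b"...media bytes..."
m = bind_manifest(video, {"device": "cam-01", "captured": "2026-04-01"})
print(verify(video, m), verify(video + b"tampered", m))  # True False
```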

18 pages, 1850 KB  
Article
AT-HSTNet: An Efficient Hierarchical Action-Transformer Framework for Deepfake Video Detection
by Sameena Javaid, Marwa Chendeb El Rai, Abeer Elkhouly, Obada Al-Khatib, Aicha Beya Far and May El Barachi
Appl. Sci. 2026, 16(7), 3450; https://doi.org/10.3390/app16073450 - 2 Apr 2026
Viewed by 264
Abstract
The rapid advancement of deepfake generation technologies presents significant challenges to the verification of digital video authenticity. Forged videos often contain subtle time-dependent artifacts that are difficult to detect using conventional frame-based detection approaches. This paper introduces AT-HSTNet, an Action-Transformer-based Hierarchical Spatiotemporal Network designed for robust and computationally efficient deepfake video detection. The proposed framework adopts a multi-stage hierarchical architecture in which frame-level visual features are extracted using an EfficientNet-B0 backbone, short- and medium-range temporal patterns are modeled through Bidirectional Long Short-Term Memory (BiLSTM) networks, and long-range temporal dependencies are captured using an action-aware Transformer operating on temporally aggregated representations. Unlike conventional video transformers that apply self-attention directly to raw frame-level features, the proposed action-aware attention mechanism reduces redundant computation and improves stability in temporal reasoning. Extensive experiments on the balanced FFIW-10K dataset demonstrate that AT-HSTNet achieves an accuracy of 98.7%, with 98.0% precision, 96.0% recall, and a 96.9% F1-score, outperforming representative CNN–BiLSTM and CNN–Transformer baseline architectures. In addition, AT-HSTNet is highly efficient, requiring only 0.45 GFLOPs and achieving an inference speed of approximately 30 FPS on consumer-grade GPU hardware. These results indicate that hierarchical temporal modeling is more effective for deepfake video detection when combined with action-aware attention.
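
The described hierarchy (per-frame CNN features, BiLSTM for short/medium-range patterns, attention over temporally aggregated tokens) can be sketched as follows. Dimensions, segment length, and layer counts are guesses, and EfficientNet-B0 is abstracted to its 1280-dimensional pooled output; this is a sketch of the pipeline shape, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HierarchicalSpatioTemporalNet(nn.Module):
    """Rough sketch of the described hierarchy: per-frame features -> BiLSTM for
    short/medium-range patterns -> attention over temporally aggregated tokens."""
    def __init__(self, feat_dim=1280, hidden=256, segment=4):
        super().__init__()
        self.segment = segment                      # frames pooled per "action" token
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(2 * hidden, 1)        # real/fake logit

    def forward(self, frame_feats):                 # (B, T, feat_dim); T divisible by segment
        h, _ = self.bilstm(frame_feats)             # (B, T, 2*hidden)
        B, T, D = h.shape
        # aggregate consecutive frames so attention runs on fewer, action-level tokens
        tokens = h.view(B, T // self.segment, self.segment, D).mean(dim=2)
        z = self.transformer(tokens)                # long-range dependencies
        return self.head(z.mean(dim=1)).squeeze(-1)

feats = torch.randn(2, 32, 1280)                    # 32 frames of EfficientNet-B0-sized features
print(HierarchicalSpatioTemporalNet()(feats).shape) # torch.Size([2])
```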

21 pages, 13964 KB  
Article
Towards Generalizable Deepfake Detection via Facial Landmark-Guided Convolution and Local Structure Awareness
by Hao Chen, Zhengxu Zhang, Qin Li and Chunhui Feng
Algorithms 2026, 19(4), 270; https://doi.org/10.3390/a19040270 - 1 Apr 2026
Viewed by 373
Abstract
As deepfakes become increasingly realistic, there is a growing need for robust and highly accurate facial forgery detection algorithms. Existing studies show that global feature modeling approaches (Transformer, VMamba) are effective in capturing long-range dependencies, yet they often lack sufficient sensitivity to localized facial tampering artifacts. Meanwhile, traditional convolutional methods excel at extracting local image features but struggle to incorporate prior knowledge about facial anatomy, resulting in limited representational capability. To address these limitations, this paper proposes LGMamba, a novel detection framework that integrates global modeling with facial guidance focused on the key facial components and fine-grained detail regions commonly manipulated in deepfakes. First, we introduce an innovative Landmark-Guided Convolution (LGConv), which adaptively adjusts convolutional sampling positions using facial landmark information. This allows the model to attend to forgery-prone facial regions, such as the eyes and mouth. Second, we design a parallel Facial Structure Awareness Block (FSAB) to operate alongside the VMamba-based visual State-Space Model. Equipped with a multi-stage residual design and a CBAM attention mechanism, FSAB enhances the model’s sensitivity to subtle facial artifacts, enabling joint exploitation of global semantic consistency and fine-grained forgery cues within a unified architecture. The proposed LGMamba achieves superior performance compared to existing mainstream approaches. In cross-dataset evaluations, it attains AUC scores of 92.34% on CD1 and 96.01% on CD2, outperforming all compared methods.
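
A rough, hypothetical rendering of the landmark-guidance idea: sample the feature map at facial-landmark coordinates so downstream layers see descriptors from forgery-prone regions. The paper's LGConv additionally offsets the convolution's own sampling grid, which this sketch does not attempt; shapes and the projection layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkGuidedSampling(nn.Module):
    """Sample the feature map at facial-landmark positions (eyes, mouth, ...)
    so later layers attend to forgery-prone regions; a simplified stand-in for
    landmark-guided convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(channels, channels)

    def forward(self, feat, landmarks):
        # feat: (B, C, H, W); landmarks: (B, K, 2) normalized to [-1, 1]
        grid = landmarks.unsqueeze(2)                              # (B, K, 1, 2)
        sampled = F.grid_sample(feat, grid, align_corners=False)   # (B, C, K, 1)
        sampled = sampled.squeeze(-1).transpose(1, 2)              # (B, K, C)
        return self.proj(sampled)                                  # per-landmark descriptors

feat = torch.randn(2, 64, 28, 28)
lms = torch.rand(2, 68, 2) * 2 - 1                 # 68 landmarks in [-1, 1]
print(LandmarkGuidedSampling(64)(feat, lms).shape) # torch.Size([2, 68, 64])
```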

22 pages, 3493 KB  
Article
Deepfake Detection Using Multimodal CLIP-Based SigLIP-2 Vision Transformers
by Joe Soundararajan and Dong Xu
AI 2026, 7(3), 115; https://doi.org/10.3390/ai7030115 - 19 Mar 2026
Viewed by 1277
Abstract
Background: Deepfakes pose a growing threat to the integrity of visual media, motivating detectors that remain reliable as forgeries become increasingly realistic. Methods: We propose a deepfake detection framework built on CLIP-derived SigLIP-2 vision transformers and a multi-task design that jointly performs (i) classification and (ii) manipulated-region localization when pixel-level supervision is available. We evaluated the approach on three public benchmarks of increasing complexity—HiDF, SID_Set (SIDA), and CiFake—using each dataset’s official partitions where provided (SID_Set uses the predefined train/validation split) and a standardized preprocessing and training pipeline across experiments. Results: On HiDF, our model achieved strong performance on both video and image tracks (AUC up to 0.931 on video and 0.968 on images), yielding large gains relative to previously reported HiDF baselines under their published settings. On SID_Set, the model achieved 99.1% three-class accuracy (real/synthetic/tampered) and produced accurate localization masks for many tampered regions, while we explicitly documented the split protocol and leakage checks to support the validity of the evaluation. On CiFake, the model exceeded 95% accuracy and attained an AUC of 0.986. Conclusions: Overall, the results indicate that SigLIP-2 representations combined with multi-task training can deliver high detection accuracy and interpretable localization on challenging, realistic forgeries, while highlighting the importance of clearly stated evaluation protocols for fair comparison.
(This article belongs to the Section AI Systems: Theory and Applications)
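
The multi-task design (one global real/fake decision plus a manipulated-region mask where pixel supervision exists) can be sketched as two heads over patch tokens. The backbone (SigLIP-2 in the paper) is abstracted here to precomputed tokens; dimensions, head shapes, and the two-class default are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class MultiTaskForgeryHead(nn.Module):
    """Joint heads over ViT patch tokens: a global classifier plus a patch-level
    mask decoder for manipulated-region localization."""
    def __init__(self, dim=768, patches_per_side=14, num_classes=2):
        super().__init__()
        self.p = patches_per_side
        self.cls_head = nn.Linear(dim, num_classes)
        self.mask_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, tokens):                      # (B, N, dim) patch tokens
        logits = self.cls_head(tokens.mean(dim=1))  # pooled classification
        mask = self.mask_head(tokens)               # (B, N, 1) per-patch forgery score
        mask = mask.transpose(1, 2).reshape(-1, 1, self.p, self.p)
        return logits, torch.sigmoid(mask)          # coarse mask, upsample as needed

tokens = torch.randn(2, 196, 768)                   # 14x14 patch grid
logits, mask = MultiTaskForgeryHead()(tokens)
print(logits.shape, mask.shape)                     # (2, 2) and (2, 1, 14, 14)
```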

20 pages, 23952 KB  
Article
Deepfake Speech Detection Using Perceptual Pathological Features Related to Timbral Attributes and Deep Learning
by Anuwat Chaiwongyen, Khalid Zaman, Kai Li, Suradej Duangpummet, Jessada Karnjana, Waree Kongprawechnon and Masashi Unoki
Appl. Sci. 2026, 16(4), 2077; https://doi.org/10.3390/app16042077 - 20 Feb 2026
Viewed by 541
Abstract
The detection of deepfake speech has become a significant research area due to rapid advancements in generative AI for speech synthesis. These technologies pose significant security risks in applications such as biometric authentication, voice-controlled systems, and automatic speaker verification (ASV) systems. Therefore, enhancing the detection capabilities of such applications is essential to mitigate potential threats. This study investigates perceptual speech-pathological features, which are commonly used to evaluate the unnaturalness of voice disorders in clinical settings, as potential indicators for detecting deepfake speech. Specifically, the timbral attributes of hardness, depth, brightness, roughness, sharpness, warmth, boominess, and reverberation are examined. The analysis reveals that these attributes provide meaningful distinctions between genuine and synthetic speech. Furthermore, the detection performance is enhanced by extending the dimensional representation of timbral attributes, enabling a more comprehensive characterization of the speech signal. This paper proposes a method that combines two models: one utilizing the different dimensions of speech-pathological features with a deep neural network (DNN), and another employing a gammatone filterbank model that simulates the auditory processing mechanism of the human cochlea with a ResNet-18 architecture, thereby improving deepfake speech detection. The proposed method is evaluated on the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof) 2019 dataset. Experimental results demonstrate that the proposed approach outperforms baseline models in terms of Equal Error Rate (EER), achieving an EER of 5.93%.
(This article belongs to the Special Issue AI in Audio Analysis: Spectrogram-Based Recognition)
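
One plausible reading of the two-model combination is a learned late fusion of branch scores, sketched below: a small DNN on the eight timbral attributes and a head on pooled CNN features from a gammatone (cochleagram) input. The dimensions, fusion weight, and layer sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class TwoBranchSpoofDetector(nn.Module):
    """Late fusion of two branches: a DNN on hand-crafted timbral/pathological
    features and a head on gammatone-filterbank CNN embeddings."""
    def __init__(self, timbral_dim=8, cnn_embed=512):
        super().__init__()
        self.timbral_dnn = nn.Sequential(
            nn.Linear(timbral_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.cochlea_head = nn.Linear(cnn_embed, 1)   # on top of e.g. ResNet-18 features
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned fusion weight

    def forward(self, timbral_feats, cochleagram_embed):
        s1 = self.timbral_dnn(timbral_feats)          # timbral branch score
        s2 = self.cochlea_head(cochleagram_embed)     # auditory-model branch score
        return self.alpha * s1 + (1 - self.alpha) * s2

t = torch.randn(4, 8)        # hardness, depth, brightness, roughness, ...
e = torch.randn(4, 512)      # pooled ResNet-18 embedding of a gammatone spectrogram
print(TwoBranchSpoofDetector()(t, e).shape)  # torch.Size([4, 1])
```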

35 pages, 1304 KB  
Article
AI-Powered Social Engineering: Emerging Attack Vectors, Vulnerabilities, and Multi-Layered Defense Strategies
by Kely Gonzaga, Sérgio Serra, Marco Gomes and Silvestre Malta
Computers 2026, 15(2), 128; https://doi.org/10.3390/computers15020128 - 17 Feb 2026
Viewed by 2048
Abstract
In the past decade, a growing number of AI-enabled cyberattacks have been reported, achieving unprecedented levels of personalization, automation, and deception. For instance, recent industry surveys have reported sharp increases in unique social engineering attacks within a single month of 2023, coinciding with the public release of ChatGPT-3.5. This trend highlights how Artificial Intelligence (AI)-powered phishing campaigns have become a significant threat to digital ecosystems. The present study provides an integrative analysis of how generative and deepfake technologies have reshaped the landscape of Social Engineering (SE) attacks, categorizing the main attack strategies and examining their psychological, technological, and ethical implications. In addition to reviewing enabling technologies, our study conducts a comparative analysis of frameworks and analytical models across technical, empirical, and quantitative perspectives that model AI-driven SE operations and their defensive countermeasures. The convergence of these frameworks reveals three core capabilities—realism, personalization, and automation—that systematically amplify attack efficiency. Building on these insights, the study proposes the Unified Model for AI-Driven Social Engineering (UM-AISE), a conceptual framework that integrates these dimensions across the attack lifecycle and employs a theoretical Markov Decision Process (MDP) analysis. This formalization demonstrates how these capabilities can shift the attacker’s optimal strategy, offering a formal economic perspective distinct from empirical validation. Finally, the study discusses emerging ethical and regulatory challenges associated with AI-mediated deception, highlighting risks related to opacity, accountability, and large-scale manipulation. Taken together, these elements inform evolving approaches for detection, defense, and governance relevant to researchers, policymakers, and practitioners.
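
The MDP formalization can be made concrete with a toy attacker model: value iteration over a three-state attack lifecycle shows how raising a lure's success probability (the realism/personalization/automation effect) flips the optimal action toward AI-generated attacks despite higher per-attempt cost. All states, payoffs, and probabilities below are invented for illustration, not the paper's numbers.

```python
import numpy as np

# States: 0 = reconnaissance, 1 = engaged target, 2 = compromised (terminal).
# Actions: 0 = manual phishing, 1 = AI-generated phishing (higher success, higher cost).
P_SUCCESS = {0: 0.10, 1: 0.45}   # probability the lure advances the attack
COST = {0: 1.0, 1: 2.0}          # per-attempt cost (effort, tooling, compute)
REWARD = 100.0                   # payoff on compromise
GAMMA = 0.95                     # discount factor

def value_iteration(n_iter=500):
    V = np.zeros(3)
    for _ in range(n_iter):
        for s in (0, 1):
            # success moves the attack forward; failure keeps the current state
            q = [P_SUCCESS[a] * (REWARD if s == 1 else GAMMA * V[1])
                 + (1 - P_SUCCESS[a]) * GAMMA * V[s] - COST[a]
                 for a in (0, 1)]
            V[s] = max(q)
    # recover the greedy policy at the engagement state
    q1 = [P_SUCCESS[a] * REWARD + (1 - P_SUCCESS[a]) * GAMMA * V[1] - COST[a]
          for a in (0, 1)]
    return V, int(np.argmax(q1))

V, best = value_iteration()
print(f"optimal action when engaged: {'AI-generated' if best else 'manual'}")
```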

12 pages, 766 KB  
Article
Evaluation of the Human Capacity to Detect Spanish Deepfake Audios with a Paraguayan Accent
by María Vianella Giménez Ramos, Juan Pinto-Ríos, Pastor Pérez-Estigarribia and Enrique Dávalos
Appl. Sci. 2026, 16(4), 1910; https://doi.org/10.3390/app16041910 - 14 Feb 2026
Viewed by 587
Abstract
Deepfakes, synthetic multimedia files generated by artificial intelligence, are drastically undermining digital credibility. Their ability to manipulate our perception of reality has created a new and complex battleground for disinformation, posing a critical threat to non-English-speaking audiences with distinctive accents. Consequently, the objective of this study is to determine the human capacity to detect deepfake audio in Spanish with a Paraguayan accent through an experiment conducted with an Android application called ReFake (developed specifically for this research). In this experiment, 450 participants, aged 16–72, evaluated 10 audio samples of up to 15 s each, classifying them as authentic (belonging to Paraguayan journalists) or fake (generated with ElevenLabs). The findings suggest that the human ear is more accurate than artificial intelligence (AI) at detecting vocal ‘naturalness’. This ability is influenced by age and educational level, with younger people and those with postgraduate degrees demonstrating better performance. Conversely, gender and nationality do not influence detection, although the high prosodic quality of deepfakes still leads to errors in human judgment. Given these results, it is crucial to adapt and develop new strategies for a secure and resilient online ecosystem.
(This article belongs to the Section Computing and Artificial Intelligence)

42 pages, 3053 KB  
Review
A Comprehensive Review of Deepfake Detection Techniques: From Traditional Machine Learning to Advanced Deep Learning Architectures
by Ahmad Raza, Abdul Basit, Asjad Amin, Zeeshan Ahmad Arfeen, Muhammad I. Masud, Umar Fayyaz and Touqeer Ahmed Jumani
AI 2026, 7(2), 68; https://doi.org/10.3390/ai7020068 - 11 Feb 2026
Viewed by 5033
Abstract
Deepfake technology is causing unprecedented threats to the authenticity of digital media, and demand for reliable detection systems is high. This systematic review analyzes deepfake detection methods, from classical image processing and machine learning to deep learning approaches, published between 2018 and 2025, with a specific focus on the trade-off between accuracy, computing efficiency, and cross-dataset generalization. Through extensive analysis of peer-reviewed studies spanning three benchmark datasets (FaceForensics++, DFDC, Celeb-DF), we call several of the field’s prevailing assumptions into question. Our analysis produces three findings that reshape the understanding of detection capabilities and limitations. Transformer-based architectures generalize significantly better across datasets (11.33% performance decline) than CNN-based architectures (more than 15% decline), at the cost of 3–5× more computation. At the same time, deep learning is not uniformly superior: traditional machine learning methods (in our case, Random Forest) achieve comparable performance (99.64% accuracy on DFDC) with dramatically lower computational requirements, opening prospects for their application in resource-constrained deployment scenarios. Most critically, we demonstrate systematic performance deterioration (10–15% on average) across all methodological classes and provide empirical evidence that current detection systems largely learn dataset-specific compression artifacts rather than generalizable deepfake characteristics. These results highlight the importance of moving from accuracy-focused evaluation toward more comprehensive protocols that balance generalization capability, computational feasibility, and practical deployment constraints, directing future research toward detection systems that can be deployed in practical applications.
(This article belongs to the Section Medical & Healthcare AI)
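
The Random Forest finding is easy to reproduce in spirit with scikit-learn. Note the placeholder features below are random noise, so the AUC prints near chance level; in the reviewed studies the rows would be hand-crafted frame-level descriptors (e.g., texture or frequency statistics). The point is the footprint: this trains in seconds on a CPU.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data; replace with real per-frame descriptors and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))
y = rng.integers(0, 2, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
clf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```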

25 pages, 2900 KB  
Article
SDEQ-Net: A Deepfake Video Anomaly Detection Method Integrating Stochastic Differential Equations and Hermitian-Symmetric Quantum Representations
by Ruixing Zhang, Bin Li and Degang Xu
Symmetry 2026, 18(2), 259; https://doi.org/10.3390/sym18020259 - 30 Jan 2026
Viewed by 514
Abstract
With the rapid advancement of deepfake generation technologies, forged videos have become increasingly realistic in visual quality and temporal consistency, posing serious threats to multimedia security. Existing detection methods often struggle to effectively model temporal dynamics and capture subtle inter-frame anomalies. To address these challenges, we propose a Stochastic Differential Equation and Quantum Uncertainty Network (SDEQ-Net), a novel deepfake video anomaly detection framework that integrates continuous-time stochastic modeling with quantum uncertainty mechanisms. First, a Continuous Time Neural Stochastic Differential Filtering Module (CNSDFM) is introduced to characterize the continuous evolution of latent inter-frame states using neural stochastic differential equations, enabling robust temporal filtering and uncertainty estimation. Second, a Quantum Uncertainty Aware Fusion Module (QUAFM) incorporates Hermitian-symmetric density matrix representations and von Neumann entropy to enhance feature fusion under uncertainty, leveraging the mathematical symmetry properties of quantum state representations for principled uncertainty quantification. Third, a Fractional Order Temporal Anomaly Detection Module (FOTADM) is proposed to generate fine-grained temporal anomaly scores based on fractional-order residuals, which are used as dynamic weights to guide attention toward anomalous frames. Extensive experiments on three benchmark datasets, including FaceForensics++, Celeb-DF, and DFDC, demonstrate the effectiveness of the proposed method. SDEQ-Net achieves AUC scores of 99.81% on FF++ (c23) and 97.91% on FF++ (c40). In cross-dataset evaluations, it obtains 89.55% AUC on Celeb-DF and 86.21% AUC on DFDC, consistently outperforming existing state-of-the-art methods in both detection accuracy and generalization capability.
(This article belongs to the Section Computer)
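
The Hermitian-symmetric density-matrix idea admits a compact sketch: build a trace-one, positive semidefinite matrix from a batch of features and take S = -tr(ρ log ρ). How SDEQ-Net actually constructs ρ is not stated in the abstract; the outer-product construction below is one standard choice, used here purely for illustration.

```python
import torch

def von_neumann_entropy(feats: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """Build a Hermitian, trace-one density matrix from a batch of feature
    vectors and return its von Neumann entropy S = -tr(rho log rho)."""
    # rho = sum_i |f_i><f_i|, then normalize: symmetric and PSD by construction
    rho = feats.T @ feats                     # (D, D)
    rho = rho / rho.trace().clamp_min(eps)    # normalize to trace 1
    evals = torch.linalg.eigvalsh(rho)        # real eigenvalues of a symmetric matrix
    evals = evals.clamp_min(eps)              # guard against tiny numerical negatives
    return -(evals * evals.log()).sum()

feats = torch.randn(16, 32)                   # 16 temporal feature vectors, dim 32
print(von_neumann_entropy(feats))             # higher entropy = more uncertainty
```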

19 pages, 1747 KB  
Article
Video Deepfake Detection Based on Multimodality Semantic Consistency Fusion
by Fang Sun, Xiaoxuan Guo, Tong Zhang, Yang Liu and Jing Zhang
Future Internet 2026, 18(2), 67; https://doi.org/10.3390/fi18020067 - 23 Jan 2026
Viewed by 728
Abstract
Deepfake detection in video data typically relies on mining deep embedded representations across multiple modalities to obtain discriminative fused features and thereby improve detection accuracy. However, existing approaches predominantly focus on how to exploit complementary information across modalities to ensure effective fusion, while often overlooking the impact of noise and interference present in the data. For instance, issues such as small objects, blurring, and occlusions in the visual modality can disrupt the semantic consistency of the fused features. To address this, we propose a Multimodality Semantic Consistency Fusion model for video forgery detection. The model introduces a semantic consistency gating mechanism to enhance the embedding of semantically aligned information across modalities, thereby improving the discriminability of the fused representations. Furthermore, we incorporate an event-level weakly supervised loss to strengthen the global semantic discrimination of the video data. Extensive experiments on standard video forgery detection benchmarks demonstrate the effectiveness of the proposed method, achieving superior performance in both forgery event detection and localization compared to state-of-the-art approaches.
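
The semantic consistency gate might look roughly like the following: a learned gate modulated by cross-modal cosine agreement, so time steps where the visual and audio streams disagree (small objects, blur, occlusion) contribute less to the fused feature. Dimensions and the exact gating rule are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistencyGatedFusion(nn.Module):
    """Gate cross-modal fusion by semantic agreement: features that align pass
    through strongly; noisy or occluded time steps are down-weighted."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, visual, audio):           # both (B, T, dim)
        sim = F.cosine_similarity(visual, audio, dim=-1)            # (B, T)
        g = self.gate(torch.cat([visual, audio], dim=-1))           # learned gate
        g = g * sim.clamp_min(0).unsqueeze(-1)  # suppress inconsistent time steps
        return self.out(torch.cat([visual * g, audio * g], dim=-1))

v, a = torch.randn(2, 50, 256), torch.randn(2, 50, 256)
print(ConsistencyGatedFusion(256)(v, a).shape)  # torch.Size([2, 50, 256])
```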

30 pages, 6201 KB  
Article
AFAD-MSA: Dataset and Models for Arabic Fake Audio Detection
by Elsayed Issa
Computation 2026, 14(1), 20; https://doi.org/10.3390/computation14010020 - 14 Jan 2026
Viewed by 974
Abstract
As generative speech synthesis produces near-human synthetic voices and reliance on online media grows, robust audio-deepfake detection is essential to fight misuse and misinformation. In this study, we introduce the Arabic Fake Audio Dataset for Modern Standard Arabic (AFAD-MSA), a curated corpus of authentic and synthetic Arabic speech designed to advance research on Arabic deepfake and spoofed-speech detection. The synthetic subset is generated with four state-of-the-art proprietary text-to-speech and voice-conversion models. Rich metadata—covering speaker attributes and generation information—is provided to support reproducibility and benchmarking. To establish reference performance, we trained three AASIST models and compared their performance to two baseline transformer detectors (Wav2Vec 2.0 and Whisper). On the AFAD-MSA test split, AASIST-2 achieved perfect accuracy, surpassing the baseline models. However, its performance declined under cross-dataset evaluation. These results underscore the importance of data construction. Detectors generalize best when exposed to diverse attack types. In addition, continual or contrastive training that interleaves bona fide speech with large, heterogeneous spoofed corpora will further improve detectors’ robustness.
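
A minimal stand-in for the paper's Wav2Vec 2.0 baseline, using the Hugging Face transformers sequence-classification head. The checkpoint, label count, and synthetic waveform are placeholders, not the authors' configuration, and real use would fine-tune on AFAD-MSA before the printed probabilities mean anything.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_name = "facebook/wav2vec2-base"  # placeholder checkpoint
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name, num_labels=2)

waveform = torch.randn(16000)  # 1 s of 16 kHz audio stands in for an AFAD-MSA clip
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(-1))      # P(bona fide), P(spoofed) -- after fine-tuning
```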

43 pages, 2019 KB  
Review
Deep Learning for Image Watermarking: A Comprehensive Review and Analysis of Techniques, Challenges, and Applications
by Marta Bistroń, Jacek M. Żurada and Zbigniew Piotrowski
Sensors 2026, 26(2), 444; https://doi.org/10.3390/s26020444 - 9 Jan 2026
Viewed by 1540
Abstract
The growing demand for digital content protection has significantly increased the importance of image watermarking, particularly in light of the rising vulnerability of multimedia content to unauthorized modifications. In recent years, research has increasingly focused on leveraging deep learning architectures to enhance watermarking performance, addressing challenges related to transparency, robustness, and payload capacity. Numerous deep learning-based watermarking methods have demonstrated superior effectiveness compared to traditional approaches, particularly those based on Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), Transformers, and diffusion models. This paper presents a comprehensive survey of recent developments in both conventional and deep learning-based image watermarking techniques. While traditional methods remain prevalent, deep learning approaches offer notable improvements in embedding and extraction efficiency, particularly when facing complex attacks, including those generated by advanced AI models. Applications in areas such as deepfake detection, cybersecurity, and Internet of Things (IoT) systems highlight the practical significance of these advancements. Despite substantial progress, challenges remain in achieving an optimal balance between invisibility, robustness, and capacity, particularly in high-resolution and real-time scenarios. This study concludes by outlining future research directions toward developing robust, scalable, and efficient deep learning-based watermarking systems capable of addressing emerging threats in digital media environments.
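
The encoder/decoder pattern such surveys cover fits in a few lines: an encoder adds a small, message-dependent residual to the image; a decoder recovers the bits. Robust systems additionally train noise and attack layers between the two, which are omitted here. Everything below is a didactic sketch with invented sizes, not a deployable watermarker.

```python
import torch
import torch.nn as nn

class TinyWatermarker(nn.Module):
    """Encoder embeds a bit string as a low-amplitude image residual;
    decoder recovers bit logits from the watermarked image."""
    def __init__(self, msg_bits=32):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Conv2d(3 + msg_bits, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())
        self.decode = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, msg_bits))

    def forward(self, img, msg):  # img (B,3,H,W), msg (B,bits) in {0,1}
        m = msg[:, :, None, None].expand(-1, -1, *img.shape[-2:])
        watermarked = img + 0.05 * self.embed(torch.cat([img, m], dim=1))
        return watermarked, self.decode(watermarked)  # image + bit logits

img, msg = torch.rand(2, 3, 64, 64), torch.randint(0, 2, (2, 32)).float()
wm, logits = TinyWatermarker()(img, msg)
print(wm.shape, logits.shape)  # (2, 3, 64, 64) (2, 32)
```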

28 pages, 3179 KB  
Article
FakeVoiceFinder: An Open-Source Framework for Synthetic and Deepfake Audio Detection
by Cesar Pachon and Dora Ballesteros
Big Data Cogn. Comput. 2026, 10(1), 25; https://doi.org/10.3390/bdcc10010025 - 7 Jan 2026
Viewed by 1845
Abstract
AI-based audio generation has advanced rapidly, enabling deepfake audio to reach levels of naturalness that closely resemble real recordings and complicate the distinction between authentic and synthetic signals. While numerous CNN- and Transformer-based detection approaches have been proposed, most adopt a model-centric perspective in which the spectral representation remains fixed. Parallel data-centric efforts have explored alternative representations such as scalograms and CQT, yet the field still lacks a unified framework that jointly evaluates the influence of model architecture, its hyperparameters (e.g., learning rate, number of epochs), and the spectral representation along with its own parameters (e.g., representation type, window size). Moreover, there is no standardized approach for benchmarking custom architectures against established baselines under consistent experimental conditions. FakeVoiceFinder addresses this gap by providing a systematic framework that enables direct comparison of model-centric, data-centric, and hybrid evaluation strategies. It supports controlled experimentation, flexible configuration of models and representations, and comprehensive performance reporting tailored to the detection task. This framework enhances reproducibility and helps clarify how architectural and representational choices interact in synthetic audio detection.
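
The data-centric axis the framework exposes (same clip, different time-frequency representation) is concrete with librosa. The parameters below are common defaults, not FakeVoiceFinder's configuration, and the random waveform stands in for a real recording.

```python
import numpy as np
import librosa

y = np.random.randn(16000).astype(np.float32)  # placeholder for a 1 s, 16 kHz clip
sr = 16000

# Two candidate inputs for the same downstream CNN/Transformer detector:
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=84))

print("mel:", mel.shape)   # (80, frames)  mel-spectrogram representation
print("cqt:", cqt.shape)   # (84, frames)  constant-Q alternative
```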

13 pages, 2618 KB  
Article
Multi-Domain Perception Transformer for Generalized Forgery Image Detection
by Qiaoyue Man, Seok-Jeong Gee and Young-Im Cho
Appl. Sci. 2026, 16(1), 533; https://doi.org/10.3390/app16010533 - 5 Jan 2026
Cited by 1 | Viewed by 694
Abstract
With the rapid advancement of generative AI (AIGC) technology, synthetic images are increasingly approaching real images in terms of resolution and semantic consistency. Traditional detection methods face numerous challenges, such as insufficient cross-modal generalization capabilities and difficulty in identifying hidden generative traces. Existing solutions primarily design feature extractors for single generative models, struggling to address the complexity of multimodal forgeries. Therefore, we propose a multi-domain feature fusion Transformer network that integrates spatial, frequency, and wavelet transform features and introduce a cross-domain feature fusion module (CDAF) to detect subtle forgery traces in deepfake images. This model demonstrates superior detection performance on current forged images generated by generative adversarial networks (GANs) and diffusion models while exhibiting enhanced robustness.
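
The three domains can be extracted per image as below, using PyTorch's FFT plus PyWavelets for a one-level Haar decomposition. The fusion Transformer and the CDAF module themselves are omitted, and the specific transforms are assumptions about what "frequency" and "wavelet" features might mean here.

```python
import numpy as np
import torch
import pywt

def multi_domain_features(img: torch.Tensor):
    """Three views of one image for fusion: raw pixels (spatial), FFT
    log-magnitude (frequency), and Haar high-frequency sub-bands (wavelet)."""
    spatial = img                                            # (C, H, W)
    freq = torch.fft.fft2(img).abs().log1p()                 # log-magnitude spectrum
    ll, (lh, hl, hh) = pywt.dwt2(img.numpy(), "haar")        # per-channel 2D DWT
    wavelet = torch.from_numpy(np.stack([lh, hl, hh], 0))    # detail sub-bands
    return spatial, freq, wavelet

img = torch.rand(3, 64, 64)
s, f, w = multi_domain_features(img)
print(s.shape, f.shape, w.shape)  # (3, 64, 64) (3, 64, 64) (3, 3, 32, 32)
```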
