MDPI - Publisher of Open Access Journals

23 pages, 3410 KB

Open Access Article

Human Detection of Voice-Cloned Speech Under GSM, VoLTE and VoIP Conditions

by Jakub Warzych, Michał Łuczyński and Janusz Klink

Acoustics 2026, 8(2), 41; https://doi.org/10.3390/acoustics8020041 - 17 Jun 2026

Viewed by 6

The rapid progress of generative speech synthesis and voice-cloning technologies has enabled the creation of highly natural synthetic voices that pose a serious threat to telecommunication security. While most prior studies evaluate human ability to detect audio deepfakes using high-quality, studio-grade recordings, little is known about how real-world telecommunication channels affect perceptual detection. This study investigates the influence of three transmission scenarios—GSM (AMR-NB), VoLTE (AMR-WB), and VoIP with packet-loss modeling—on the human ability to distinguish natural speech from AI-generated speech. A custom speech corpus was developed, consisting of natural recordings from nine speakers and corresponding synthetic utterances generated using a state-of-the-art voice cloning system (ElevenLabs). All samples were processed through simulated telecommunication channels using real codec implementations. A listening test with 95 participants was conducted, involving binary classification (human vs. synthetic) and confidence ratings. Results show an overall detection accuracy of 54.8%, confirming that humans are poorly equipped to identify synthetic speech. Surprisingly, the highest accuracy was achieved for the narrowband GSM channel (63.7%), while VoLTE yielded the lowest performance (44.0%). The findings suggest that restricted bandwidth may emphasize prosodic irregularities typical of generative models, whereas high-quality channels mask synthetic artifacts, increasing susceptibility to voice spoofing. The results highlight the necessity of deploying additional security mechanisms in telecommunication systems relying on voice identity verification. Full article
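The VoIP condition mentioned above relies on packet-loss modeling. The abstract does not say which loss model was used, so the following is only a minimal sketch under stated assumptions: independent (Bernoulli) frame loss, 20 ms frames at 8 kHz (160 samples), and lost frames replaced by silence with no concealment. The function name and all parameters are hypothetical.

```python
import random


def apply_packet_loss(samples, frame_len=160, loss_prob=0.05, seed=0):
    """Zero out whole frames, each with independent probability loss_prob.

    Assumption: one packet carries one frame of frame_len samples, and a
    lost packet is played back as silence (no loss concealment).
    Returns the degraded signal and the number of lost frames.
    """
    rng = random.Random(seed)  # seeded so the degradation is repeatable
    out = list(samples)
    lost = 0
    for start in range(0, len(out), frame_len):
        if rng.random() < loss_prob:
            end = min(start + frame_len, len(out))
            out[start:end] = [0.0] * (end - start)
            lost += 1
    return out, lost


if __name__ == "__main__":
    signal = [1.0] * 1600  # ten 20 ms frames of a dummy signal
    degraded, lost = apply_packet_loss(signal, loss_prob=0.3, seed=1)
    print(f"lost {lost} of 10 frames")
```

Real VoIP loss is often bursty, so a two-state (Gilbert-Elliott style) model would be a natural replacement for the independent draw used here.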

18 pages, 1777 KB

Open Access Article

DeepFakeX: A Comprehensive Multimodal Deepfake Dataset for Research and Analysis

by Sonia Salman, Jawwad Ahmed Shamsi and Rizwan Qureshi

Data 2026, 11(6), 141; https://doi.org/10.3390/data11060141 - 11 Jun 2026

Viewed by 189

Abstract

The expanding capabilities of deep learning-based media synthesis have intensified concerns regarding the authenticity of digital content and the reliability of forensic analysis tools. In response to these challenges, this work introduces DeepFakeX, a collection of 800 synthetically generated videos available under controlled access for research purposes. The dataset encompasses four distinct categories of AI-driven synthesis: facial identity replacement, audio track substitution, neural voice cloning, and combined audiovisual alteration. Unlike existing deepfake datasets that predominantly focus on facial synthesis, DeepFakeX covers a broader range of manipulation modalities, reflecting the diversity of synthetic media encountered in real-world settings. All deepfakes were generated using state-of-the-art, publicly available tools. Standardized post-processing procedures were applied to each video to ensure uniformity in terms of quality, duration and encoding format. DeepFakeX also emphasizes diversity in gender, age, ethnicity, and language. Video contexts span speeches, informational videos, movie clips, news broadcasts, and interviews that reflect content scenarios commonly encountered in real-world online environments. The dataset includes videos in both English and Urdu. The dataset’s quality and structural variability were assessed through visual and audio analyses using the Structural Similarity Index Measure (SSIM), Mel-Frequency Cepstral Coefficients (MFCCs), and Principal Component Analysis (PCA). The evaluation results revealed substantial variability within each manipulation category, along with clearly distinguishable patterns specific to each modality. DeepFakeX has been developed to facilitate rigorous and transparent research in deepfake detection, cross-modal forensic analysis, and AI-driven media forensics. It is hosted on Zenodo under controlled access for research use. Full article

31 pages, 30018 KB

Open Access Article

Sensors-Driven Multimodal Deepfake Detection: A Cross-Attention Fusion Approach with Adaptive Modality Gating

by Syeda Sitara Waseem, Noman Shabbir, Syed Rizwan Hassan and KangYoon Lee

Sensors 2026, 26(12), 3695; https://doi.org/10.3390/s26123695 - 10 Jun 2026

Viewed by 163

Abstract

Deepfakes threaten sensor-based authentication systems, including biometric sensors, surveillance cameras, and IoT edge devices. Unimodal detectors remain vulnerable to modality-specific attacks. We propose a multimodal deepfake detection framework optimized for resource-constrained edge devices, featuring a novel cross-modal attention fusion mechanism with adaptive gating. The architecture combines enhanced Res2Net for audio, temporal 3D CNN with SE attention for video, and bidirectional cross-modal attention with quality-based gates. On our benchmark (5472 audio + 1842 video samples), the fusion model achieves 96.7% accuracy, 96.6% F1-score, 0.988 AUC-ROC, and 3.3% EER. Adversarial testing shows 92.3% accuracy under the Fast Gradient Sign Method (FGSM) attack. The model has a 30.3 MB footprint and runs at 20 FPS on edge hardware. Modality contribution analysis reveals adaptive weighting (72% audio for TTS forgery, 78% video for lip-synced attacks). Cross-dataset evaluation on FakeAVCeleb achieves 92.3% overall accuracy, confirming generalization. Full article

(This article belongs to the Special Issue Secure and Resilient Solutions for CCTV, Small Sensor and IoT Device Security)
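The adaptive modality weighting reported in the abstract above (for example, a larger audio share for TTS forgeries) can be illustrated with a much simpler stand-in. This sketch is not the paper's architecture: it omits the cross-modal attention entirely and assumes each modality comes with a scalar quality score, which a softmax turns into gates that sum to one. All names here are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(audio_emb, video_emb, audio_quality, video_quality):
    """Fuse two same-length embeddings using quality-derived gates.

    Assumption: the quality scores are produced elsewhere (e.g. by a small
    estimator network); here they are plain inputs.
    Returns the fused embedding and the (audio, video) gate weights.
    """
    g_audio, g_video = softmax([audio_quality, video_quality])
    fused = [g_audio * a + g_video * v for a, v in zip(audio_emb, video_emb)]
    return fused, (g_audio, g_video)


if __name__ == "__main__":
    fused, gates = gated_fusion([1.0, 0.0], [0.0, 1.0], 2.0, 1.0)
    print("gates (audio, video):", gates)
```

When the audio quality score is higher, the audio gate dominates and the fused vector leans toward the audio embedding, which is the behavior a modality contribution analysis would measure.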

18 pages, 741 KB

Open Access Review

A Review of Tools and Technologies to Combat Deepfakes

by Dmitry Erokhin and Nadejda Komendantova

Information 2026, 17(4), 347; https://doi.org/10.3390/info17040347 - 3 Apr 2026

Cited by 1 | Viewed by 2572

Abstract

Deepfakes and adjacent synthetic-media capabilities have become a systemic challenge for information integrity, security, and digital trust. Countermeasures now span passive detection methods that infer manipulation from content traces, active provenance systems that cryptographically bind metadata to media, and watermarking approaches that embed detectable signals into content or generative processes. This review presents a rigorous synthesis of tools and technologies to combat deepfakes across modalities (image, video, audio, and selected multimodal settings), drawing primarily from the peer-reviewed literature, standardized benchmarks, and official technical specifications and reports. The review analyzes detection methods, provenance and authentication technologies, with emphasis on cryptographic manifests and threat models, watermarking and content provenance, including diffusion-era watermarking and industrial deployments, adversarial robustness and attacker adaptation, datasets and benchmarks, evaluation metrics across tasks, and deployment and scalability constraints. A dedicated section addresses legal, ethical, and policy issues, focusing on emerging transparency obligations and platform governance. The review finds that no single countermeasure is sufficient in realistic adversarial settings. The strongest practical approach is a layered defense that combines provenance, watermarking, content-based detection, and human oversight. The study concludes with limitations of the current evidence base and prioritized research directions to improve generalization, interoperability, and trustworthy user experiences. Full article

(This article belongs to the Special Issue Surveys in Information Systems and Applications)

12 pages, 766 KB

Open Access Article

Evaluation of the Human Capacity to Detect Spanish Deepfake Audios with a Paraguayan Accent

by María Vianella Giménez Ramos, Juan Pinto-Ríos, Pastor Pérez-Estigarribia and Enrique Dávalos

Appl. Sci. 2026, 16(4), 1910; https://doi.org/10.3390/app16041910 - 14 Feb 2026

Viewed by 815

Abstract

Deepfakes, synthetic multimedia files generated by artificial intelligence, are drastically undermining digital credibility. Their ability to manipulate our perception of reality has created a new and complex battleground for disinformation, posing a critical threat to non-English-speaking audio with distinctive accents. Consequently, the objective of this study is to determine the human capacity to detect deepfake audio in Spanish with a Paraguayan accent through an experiment conducted with an Android application called ReFake (developed specifically for this research). In this experiment, 450 participants, aged 16–72, evaluated 10 audio samples of up to 15 s each, classifying them as authentic (belonging to Paraguayan journalists) or fake (generated with ElevenLabs). The findings suggest that the human ear is more accurate than artificial intelligence (AI) at detecting vocal ‘naturalness’. This ability is influenced by generational age and educational level, with younger people and those with postgraduate degrees demonstrating better performance. Conversely, gender and nationality do not influence detection, although the high prosodic quality of deepfakes still leads to errors in human judgment. Given these results, it is crucial to adapt and develop new strategies for a secure and resilient online ecosystem.

(This article belongs to the Section Computing and Artificial Intelligence)

30 pages, 6201 KB

Open Access Article

AFAD-MSA: Dataset and Models for Arabic Fake Audio Detection

by Elsayed Issa

Computation 2026, 14(1), 20; https://doi.org/10.3390/computation14010020 - 14 Jan 2026

Viewed by 1544

Abstract

As generative speech synthesis produces near-human synthetic voices and reliance on online media grows, robust audio-deepfake detection is essential to fight misuse and misinformation. In this study, we introduce the Arabic Fake Audio Dataset for Modern Standard Arabic (AFAD-MSA), a curated corpus of authentic and synthetic Arabic speech designed to advance research on Arabic deepfake and spoofed-speech detection. The synthetic subset is generated with four state-of-the-art proprietary text-to-speech and voice-conversion models. Rich metadata—covering speaker attributes and generation information—is provided to support reproducibility and benchmarking. To establish reference performance, we trained three AASIST models and compared their performance to two baseline transformer detectors (Wav2Vec 2.0 and Whisper). On the AFAD-MSA test split, AASIST-2 achieved perfect accuracy, surpassing the baseline models. However, its performance declined under cross-dataset evaluation. These results underscore the importance of data construction. Detectors generalize best when exposed to diverse attack types. In addition, continual or contrastive training that interleaves bona fide speech with large, heterogeneous spoofed corpora will further improve detectors’ robustness. Full article

(This article belongs to the Special Issue Recent Advances on Computational Linguistics and Natural Language Processing)

28 pages, 3179 KB

Open Access Article

FakeVoiceFinder: An Open-Source Framework for Synthetic and Deepfake Audio Detection

by Cesar Pachon and Dora Ballesteros

Big Data Cogn. Comput. 2026, 10(1), 25; https://doi.org/10.3390/bdcc10010025 - 7 Jan 2026

Viewed by 2927

Abstract

AI-based audio generation has advanced rapidly, enabling deepfake audio to reach levels of naturalness that closely resemble real recordings and complicate the distinction between authentic and synthetic signals. While numerous CNN- and Transformer-based detection approaches have been proposed, most adopt a model-centric perspective in which the spectral representation remains fixed. Parallel data-centric efforts have explored alternative representations such as scalograms and CQT, yet the field still lacks a unified framework that jointly evaluates the influence of model architecture, its hyperparameters (e.g., learning rate, number of epochs), and the spectral representation along with its own parameters (e.g., representation type, window size). Moreover, there is no standardized approach for benchmarking custom architectures against established baselines under consistent experimental conditions. FakeVoiceFinder addresses this gap by providing a systematic framework that enables direct comparison of model-centric, data-centric, and hybrid evaluation strategies. It supports controlled experimentation, flexible configuration of models and representations, and comprehensive performance reporting tailored to the detection task. This framework enhances reproducibility and helps clarify how architectural and representational choices interact in synthetic audio detection. Full article

20 pages, 1070 KB

Open Access Article

LJ-TTS: A Paired Real and Synthetic Speech Dataset for Single-Speaker TTS Analysis

by Viola Negroni, Davide Salvi, Luca Comanducci, Taiba Majid Wani, Madleen Uecker, Irene Amerini, Stefano Tubaro and Paolo Bestagini

Electronics 2026, 15(1), 169; https://doi.org/10.3390/electronics15010169 - 30 Dec 2025

Cited by 1 | Viewed by 1750

Abstract

In this paper, we present LJ-TTS, a large-scale single-speaker dataset of real and synthetic speech designed to support research in text-to-speech (TTS) synthesis and analysis. The dataset builds upon high-quality recordings of a single English speaker, alongside outputs generated by 11 state-of-the-art TTS models, including both autoregressive and non-autoregressive architectures. By maintaining a controlled single-speaker setting, LJ-TTS enables precise comparison of speech characteristics across different generative models, isolating the effects of synthesis methods from speaker variability. Unlike multi-speaker datasets lacking alignment between real and synthetic samples, LJ-TTS provides exact utterance-level correspondence, allowing fine-grained analyses that are otherwise impractical. The dataset supports systematic evaluation of synthetic speech across multiple dimensions, including deepfake detection, source tracing, and phoneme-level analyses. LJ-TTS provides a standardized resource for benchmarking generative models, assessing the limits of current TTS systems, and developing robust detection and evaluation methods. The dataset is publicly available to the research community to foster reproducible and controlled studies in speech synthesis and synthetic speech detection. Full article

(This article belongs to the Special Issue Emerging Trends in Generative-AI Based Audio Processing)

21 pages, 920 KB

Open Access Article

Audio Deepfake Detection via a Fuzzy Dual-Path Time-Frequency Attention Network

by Jinzi Li, Hexu Wang, Fei Xie, Xiaozhou Feng, Jiayao Chen, Jindong Liu and Juan Wang

Sensors 2025, 25(24), 7608; https://doi.org/10.3390/s25247608 - 15 Dec 2025

Viewed by 1372

Abstract

With the rapid advancement of speech synthesis and voice conversion technologies, audio deepfake techniques have posed serious threats to information security. Existing detection methods often lack robustness when confronted with environmental noise, signal compression, and ambiguous fake features, making it difficult to effectively identify highly concealed fake audio. To address this issue, this paper proposes a Dual-Path Time-Frequency Attention Network (DPTFAN) based on Pythagorean Hesitant Fuzzy Sets (PHFS), which dynamically characterizes the reliability and ambiguity of fake features through uncertainty modeling. It introduces a dual-path attention mechanism in both time and frequency domains to enhance feature representation and discriminative capability. Additionally, a Lightweight Fuzzy Branch Network (LFBN) is designed to achieve explicit enhancement of ambiguous features, improving performance while maintaining computational efficiency. On the ASVspoof 2019 LA dataset, the proposed method achieves an accuracy of 98.94%, and on the FoR (Fake or Real) dataset, it reaches an accuracy of 99.40%, significantly outperforming existing mainstream methods and demonstrating excellent detection performance and robustness. Full article

(This article belongs to the Section Sensor Networks)

21 pages, 1055 KB

Open Access Article

FAIR-VID: A Multimodal Pre-Processing Pipeline for Student Application Analysis

by Algirdas Laukaitis, Diana Kalibatienė, Dovilė Jodenytė, Kęstutis Normantas, Julius Jancevičius, Mindaugas Jankauskas and Artūras Serackis

Appl. Sci. 2025, 15(24), 13127; https://doi.org/10.3390/app152413127 - 13 Dec 2025

Viewed by 1424

Abstract

The shift toward remote and automated admission processes in higher education introduces new challenges, including evaluator subjectivity and risks of applicant fraud. The FAIR-VID project addresses these issues by developing an artificial intelligence system that integrates multimodal data fusion with semi-supervised deep learning to assess applicant video interviews, submitted documents, and form data. This paper presents the project’s data preprocessing pipeline, designed to fuse heterogeneous modalities and to support seamless interaction between AI agents and human decision-makers throughout the admission workflow. The proposed process is intentionally general, making it applicable not only to international university admissions but also to broader human resource management and hiring contexts. Emphasis is placed on the need for robust and transparent AI adoption in admission and recruitment, supported by open-source modules and models at every stage of interaction between applicants and institutions. As a proof of concept, we provide open-source solutions for the analysis of video interviews, images, and documents enriched with semantic descriptions generated by large multimodal and complementary AI models. The paper details the multi-phase implementation of this pipeline to create structured, semantically rich datasets suitable for training advanced deep learning systems for comprehensive applicant assessment and fraud detection. Full article

(This article belongs to the Section Computing and Artificial Intelligence)

26 pages, 2820 KB

Open Access Article

Forensic Analysis of Manipulated Images and Videos

by Sergio A. Falcón-López, Llanos Tobarra, Antonio Robles-Gómez and Rafael Pastor-Vargas

Appl. Sci. 2025, 15(23), 12664; https://doi.org/10.3390/app152312664 - 29 Nov 2025

Cited by 1 | Viewed by 2411

Abstract

The transition from Industry 4.0 to Industry 5.0 emphasizes the need for ethical, transparent, and human-centric artificial intelligence systems. In this context, ensuring the authenticity of digital information has become crucial for maintaining societal trust. This study addresses the challenge of detecting manipulated multimedia content, including synthetic images, videos, and audio generated by artificial intelligence, commonly known as Deepfakes. We analyze and compare general-purpose and Deepfake-specific detection methods to assess their effectiveness in real-world scenarios. This work introduces a refined reference model that integrates both application-oriented and methodological criteria, grouping tools into Blind Forensic, Handcrafted Machine Learning, Deep Learning-based methods, and Toolkits. This structured taxonomy provides a clearer comparative framework than existing works, which typically classify detectors using only one of these dimensions. To ensure reproducible evaluation, all experiments were performed using the SAFL dataset, which consolidates real and synthetic multimedia content generated with publicly available tools under a unified protocol. Among the tested tools, Forensically achieved the highest accuracy in image forgery detection (86.9%), while Autopsy reached 69.5% among Deepfake-specific image detectors. In video analysis, Forensically obtained 98.6% accuracy, whereas Deepware Scanner achieved 91.2% as the most effective Deepfake-focused tool. These results highlight that general-purpose methods remain robust for images, while specialized detectors perform competitively in videos. Overall, the proposed model and dataset establish a consistent foundation for advancing hybrid detection strategies aligned with the ethical and transparent AI principles envisioned in Industry 5.0.

(This article belongs to the Special Issue AI from Industry 4.0 to Industry 5.0: Engineering for Social Change)

18 pages, 3175 KB

Open Access Article

AudioFakeNet: A Model for Reliable Speaker Verification in Deepfake Audio

by Samia Dilbar, Muhammad Ali Qureshi, Serosh Karim Noon and Abdul Mannan

Algorithms 2025, 18(11), 716; https://doi.org/10.3390/a18110716 - 13 Nov 2025

Viewed by 1983

Abstract

Deepfake audio refers to the generation of voice recordings using deep neural networks that replicate a specific individual’s voice, often for deceptive or fraudulent purposes. Although this has been an area of research for quite some time, deepfakes still pose substantial challenges for reliable true speaker authentication. To address the issue, we propose AudioFakeNet, a hybrid deep learning architecture that uses Convolutional Neural Networks (CNNs) along with Long Short-Term Memory (LSTM) units and Multi-Head Attention (MHA) mechanisms for robust deepfake detection. The CNN extracts spatial and spectral features, the LSTM captures temporal dependencies, and MHA sharpens the focus on informative audio segments. The model is trained using Mel-Frequency Cepstral Coefficients (MFCCs) from a publicly available dataset and was validated on a self-collected dataset, ensuring reproducibility. Performance comparisons with state-of-the-art machine learning and deep learning models show that our proposed AudioFakeNet achieves higher accuracy, better generalization, and lower Equal Error Rate (EER). Its modular design allows for broader adaptability in fake-audio detection tasks, offering significant potential across diverse speech synthesis applications.

(This article belongs to the Section Algorithms for Multidisciplinary Applications)

22 pages, 1773 KB

Open Access Article

ACE-Net: A Fine-Grained Deepfake Detection Model with Multimodal Emotional Consistency

by Shaoqian Yu, Xingyu Chen, Yuzhe Sheng, Han Zhang, Xinlong Li and Sijia Yu

Electronics 2025, 14(22), 4420; https://doi.org/10.3390/electronics14224420 - 13 Nov 2025

Viewed by 1413

Abstract

The alarming realism of Deepfake presents a significant challenge to digital authenticity, yet its inherent difficulty in synchronizing the emotional cues between facial expressions and speech offers a critical opportunity for detection. However, most existing approaches rely on general-purpose backbones for unimodal feature extraction, resulting in an inadequate representation of fine-grained dynamic emotional expressions. Although a limited number of studies have explored cross-modal emotional consistency of deepfake detection, they typically employ shallow fusion techniques which limit latent expressiveness. To address this, we propose ACE-Net, a novel framework that identifies forgeries via multimodal emotional inconsistency. For the speech modality, we design a bidirectional cross-attention mechanism to fuse acoustic features from a lightweight CNN-based model with textual features, yielding a representation highly sensitive to fine-grained emotional dynamics. For the visual modality, a MobileNetV3-based perception head is proposed to adaptively select keyframes, yielding a representation focused on the most emotionally salient moments. For multimodal emotional consistency discrimination, we develop a multi-dimensional fusion strategy to deeply integrate high-level emotional features from different modalities within a unified latent space. For unimodal emotion recognition, both the audio and visual branches outperform baseline models on the CREMA-D dataset. Building on this, the complete ACE-Net model achieves a state-of-the-art AUC of 0.921 on the challenging DFDC benchmark. Full article

(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Machine Learning)

25 pages, 4385 KB

Open Access Article

Robust DeepFake Audio Detection via an Improved NeXt-TDNN with Multi-Fused Self-Supervised Learning Features

by Gul Tahaoglu

Appl. Sci. 2025, 15(17), 9685; https://doi.org/10.3390/app15179685 - 3 Sep 2025

Cited by 5 | Viewed by 5617

Abstract

Deepfake audio refers to speech that has been synthetically generated or altered through advanced neural network techniques, often with a degree of realism sufficient to convincingly imitate genuine human voices. As these manipulations become increasingly indistinguishable from authentic recordings, they present significant threats to security, undermine media integrity, and challenge the reliability of digital authentication systems. In this study, a robust detection framework is proposed, which leverages the power of self-supervised learning (SSL) and attention-based modeling to identify deepfake audio samples. Specifically, audio features are extracted from input speech using two powerful pretrained SSL models: HuBERT-Large and WavLM-Large. These distinctive features are then integrated through an Attentional Multi-Feature Fusion (AMFF) mechanism. The fused features are subsequently classified using a NeXt-Time Delay Neural Network (NeXt-TDNN) model enhanced with Efficient Channel Attention (ECA), enabling improved temporal and channel-wise feature discrimination. Experimental results show that the proposed method achieves a 0.42% EER and 0.01 min-tDCF on ASVspoof 2019 LA, a 1.01% EER on ASVspoof 2019 PA, and a pooled 6.56% EER on the cross-channel ASVspoof 2021 LA evaluation, thus highlighting its effectiveness for real-world deepfake detection scenarios. Furthermore, on the ASVspoof 5 dataset, the method achieved a 7.23% EER, outperforming strong baselines and demonstrating strong generalization ability. Moreover, the macro-averaged F1-score of 96.01% and balanced accuracy of 99.06% were obtained on the ASVspoof 2019 LA dataset, while the proposed method achieved a macro-averaged F1-score of 98.70% and balanced accuracy of 98.90% on the ASVspoof 2019 PA dataset. On the highly challenging ASVspoof 5 dataset, which includes crowdsourced, non-studio-quality audio, and novel adversarial attacks, the proposed method achieves macro-averaged metrics exceeding 92%, with a precision of 92.07%, a recall of 92.63%, an F1-measure of 92.35%, and a balanced accuracy of 92.63%.
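Several results in this listing are reported as Equal Error Rate (EER), the operating point where the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how an EER can be estimated from detector scores, assuming higher scores mean "more likely bona fide" and using a simple threshold sweep over the observed scores. It is an illustration only, not the official ASVspoof scoring tool, and the function name is hypothetical.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Estimate the EER by sweeping thresholds over all observed scores.

    Convention (an assumption): a trial is accepted as bona fide when its
    score is >= threshold. Returns (eer, threshold) at the threshold where
    the gap between the two error rates is smallest.
    """
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    bona = [0.9, 0.8, 0.7, 0.4]
    spoof = [0.1, 0.2, 0.3, 0.6]
    eer, thr = equal_error_rate(bona, spoof)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

With few trials the two error curves rarely cross exactly, so the sketch reports the average of the two rates at the closest threshold; production scoring code typically interpolates instead.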

21 pages, 2789 KB

Open Access Article

BIM-Based Adversarial Attacks Against Speech Deepfake Detectors

by Wendy Edda Wang, Davide Salvi, Viola Negroni, Daniele Ugo Leonzio, Paolo Bestagini and Stefano Tubaro

Electronics 2025, 14(15), 2967; https://doi.org/10.3390/electronics14152967 - 24 Jul 2025

Cited by 2 | Viewed by 2519

Abstract

Automatic Speaker Verification (ASV) systems are increasingly employed to secure access to services and facilities. However, recent advances in speech deepfake generation pose serious threats to their reliability. Modern speech synthesis models can convincingly imitate a target speaker’s voice and generate realistic synthetic audio, potentially enabling unauthorized access through ASV systems. To counter these threats, forensic detectors have been developed to distinguish between real and fake speech. Although these models achieve strong performance, their deep learning nature makes them susceptible to adversarial attacks, i.e., carefully crafted, imperceptible perturbations in the audio signal that make the model unable to classify correctly. In this paper, we explore adversarial attacks targeting speech deepfake detectors. Specifically, we analyze the effectiveness of Basic Iterative Method (BIM) attacks applied in both time and frequency domains under white- and black-box conditions. Additionally, we propose an ensemble-based attack strategy designed to simultaneously target multiple detection models. This approach generates adversarial examples with balanced effectiveness across the ensemble, enhancing transferability to unseen models. Our experimental results show that, although crafting universally transferable attacks remains challenging, it is possible to fool state-of-the-art detectors using minimal, imperceptible perturbations, highlighting the need for more robust defenses in speech deepfake detection. Full article

(This article belongs to the Special Issue Selected Papers from Young Researchers in Signal/Image/Video Coding and Processing, 2nd Edition)
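The Basic Iterative Method (BIM) named in the abstract above repeats small signed-gradient steps and, after each step, projects the perturbed input back into an L-infinity ball of radius eps around the original. The sketch below applies that update rule to a toy logistic-regression "detector" so the gradient can be written by hand. The detector, its weights, and every parameter value are assumptions for illustration; the paper attacks deep speech deepfake detectors in the time and frequency domains, which this does not reproduce.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def detector_score(x, w, b):
    """Toy white-box detector: probability that feature vector x is fake."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def bim_attack(x0, y, w, b, eps=0.2, alpha=0.02, steps=20):
    """Untargeted BIM: ascend the cross-entropy loss for the true label y.

    Each step moves by alpha along the sign of the input gradient, then
    clips so that every coordinate stays within eps of the original x0.
    """
    x = list(x0)
    for _ in range(steps):
        p = detector_score(x, w, b)
        # Gradient of binary cross-entropy w.r.t. x for a logistic model.
        grad = [(p - y) * wi for wi in w]
        x = [xi + alpha * sign(g) for xi, g in zip(x, grad)]
        # Project back into the L-infinity ball around x0.
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
    return x


if __name__ == "__main__":
    w, b = [2.0, -1.0, 0.5], 0.0
    x_fake = [0.2, -0.1, 0.1]  # initially scored as fake (score > 0.5)
    x_adv = bim_attack(x_fake, y=1, w=w, b=b)
    print("before:", detector_score(x_fake, w, b))
    print("after: ", detector_score(x_adv, w, b))
```

Setting `steps=1` and `alpha=eps` reduces this to the single-step Fast Gradient Sign Method, which is why BIM is often described as its iterative variant.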

Search Results (37)
