Article

Deepfake Voice Detection: An Approach Using End-to-End Transformer with Acoustic Feature Fusion by Cross-Attention

Department of Electrical and Electronic Engineering, Auckland University of Technology, Auckland 1010, New Zealand
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2025, 14(10), 2040; https://doi.org/10.3390/electronics14102040
Submission received: 8 April 2025 / Revised: 7 May 2025 / Accepted: 13 May 2025 / Published: 16 May 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Deepfake technology uses artificial intelligence to create highly realistic but fake audio, video, or images that are often difficult to distinguish from real content. Because of its potential use for misinformation, fraud, and identity theft, deepfake technology has raised serious concerns in the digital world. Recently, many works have reported on the detection of deepfake videos/images, but few studies have concentrated on developing robust deepfake voice detection systems. In most existing studies in this field, a deepfake voice detection system requires a large amount of training data and a robust backbone to distinguish bona fide audio from logical access (LA) attack audio. For acoustic feature extraction, Mel-frequency Filter Bank (MFB)-based approaches are better suited to speech signals than applying the raw spectrum as input. Recurrent Neural Networks (RNNs) have been successfully applied to Natural Language Processing (NLP), but these backbones suffer from vanishing or exploding gradients when processing long sequences. In addition, most deepfake voice recognition systems perform poorly in cross-dataset evaluation, which indicates limited robustness. To address these issues, we propose an acoustic feature-fusion method that combines Mel-spectrum and pitch representations via a cross-attention mechanism. We then combine a Transformer encoder with a convolutional neural network block to extract global and local features as a front end, and connect a back end with one linear layer for classification. We summarize the performance of several deepfake voice detectors on the silence-segment-processed ASVspoof 2019 dataset: our proposed method achieves an Equal Error Rate (EER) of 26.41%, while most existing methods yield EERs higher than 30%. We also tested our proposed method on the ASVspoof 2021 dataset, where it achieves an EER as low as 28.52%, while the EER values of existing methods are all higher than 28.9%.
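The two building blocks the abstract leans on — cross-attention fusion of two acoustic feature streams, and the EER metric used to score detectors — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the projection matrices, feature dimensions, and toy inputs below are illustrative assumptions, and the fusion is a single attention head with Mel-spectrum frames as queries attending over pitch frames as keys/values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, kv_feats, d_model=64, seed=0):
    """Fuse two feature streams: rows of `query_feats` attend over
    `kv_feats` (single head, random untrained projections)."""
    rng = np.random.default_rng(seed)
    d_q, d_kv = query_feats.shape[-1], kv_feats.shape[-1]
    W_q = rng.standard_normal((d_q, d_model)) / np.sqrt(d_q)
    W_k = rng.standard_normal((d_kv, d_model)) / np.sqrt(d_kv)
    W_v = rng.standard_normal((d_kv, d_model)) / np.sqrt(d_kv)
    Q, K, V = query_feats @ W_q, kv_feats @ W_k, kv_feats @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_model))   # (T_q, T_kv) weights
    return attn @ V                              # (T_q, d_model) fused

def equal_error_rate(scores, labels):
    """EER: operating point where the false-accept rate (spoof accepted)
    equals the false-reject rate (bona fide rejected).
    labels: 1 = bona fide, 0 = spoof; higher score = more bona fide."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)
        frr = np.mean(scores[labels == 1] < t)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy stand-ins: 100 frames of 80-bin Mel features, 100 frames of 1-D pitch
mel = np.random.default_rng(1).standard_normal((100, 80))
pitch = np.random.default_rng(2).standard_normal((100, 1))
fused = cross_attention(mel, pitch)
print(fused.shape)  # (100, 64)
```

A perfectly separating detector scores EER = 0.0; a detector that ranks every spoof above every bona fide utterance scores 1.0, so the reported 26.41% sits between those extremes.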
Keywords: end-to-end; transformer; cross attention; feature fusion; supervised learning; deepfake voice recognition

Share and Cite

MDPI and ACS Style

Gong, L.Y.; Li, X.J. Deepfake Voice Detection: An Approach Using End-to-End Transformer with Acoustic Feature Fusion by Cross-Attention. Electronics 2025, 14, 2040. https://doi.org/10.3390/electronics14102040

AMA Style

Gong LY, Li XJ. Deepfake Voice Detection: An Approach Using End-to-End Transformer with Acoustic Feature Fusion by Cross-Attention. Electronics. 2025; 14(10):2040. https://doi.org/10.3390/electronics14102040

Chicago/Turabian Style

Gong, Liang Yu, and Xue Jun Li. 2025. "Deepfake Voice Detection: An Approach Using End-to-End Transformer with Acoustic Feature Fusion by Cross-Attention" Electronics 14, no. 10: 2040. https://doi.org/10.3390/electronics14102040

APA Style

Gong, L. Y., & Li, X. J. (2025). Deepfake Voice Detection: An Approach Using End-to-End Transformer with Acoustic Feature Fusion by Cross-Attention. Electronics, 14(10), 2040. https://doi.org/10.3390/electronics14102040

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.