AudioFakeNet: A Model for Reliable Speaker Verification in Deepfake Audio
Abstract
1. Introduction
- Voice Conversion (VC) transforms or mimics a source speaker’s voice, often in real time, so that it sounds like the desired target speaker’s voice [7].
- Text-to-Speech (TTS) synthesizes speech from text input, generating an audio clip in the target speaker’s artificial voice; once trained, the model can synthesize any text in that speaker’s voice [8].
- Emotion Fake modifies the emotional tone of speech (e.g., happy to sad) while preserving the speaker’s identity and content, using methods based on either parallel or non-parallel data.
- Scene Fake alters the acoustic scene of speech (e.g., from an office to an airport) using speech enhancement technologies while maintaining the speaker’s identity and content.
- Partially Fake audio modifies specific words in an utterance using genuine or synthesized clips, while preserving the original speaker’s identity.
- To effectively capture the long-range temporal dependencies inherent in speech signals, a Long Short-Term Memory (LSTM)-based recurrent architecture is employed.
- To improve detection accuracy, an attention mechanism is integrated to selectively emphasize the most informative segments of the audio signal (a minimal sketch of these components follows this list).
- The performance of the proposed method is validated using both public and self-collected datasets, with results compared against other state-of-the-art methods.
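For concreteness, the sketch below shows one way these components can be combined in Keras: a small CNN front end, a BiLSTM for long-range temporal context, and multi-head attention for segment weighting. The LSTM width (128), dropout (0.3), and Leaky ReLU activation follow the hyperparameter table reported later; the convolution sizes, number of attention heads, and input shape are illustrative assumptions rather than the authors’ exact configuration.

```python
# Minimal sketch of a CNN + BiLSTM + multi-head-attention classifier.
# Only the LSTM units (128), dropout (0.3), and Leaky ReLU come from the
# paper's hyperparameter table; all other sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_audiofakenet(input_shape=(120, 13), num_classes=2):
    """input_shape: (time_frames, n_mfcc) MFCC matrix per clip (assumed)."""
    inputs = layers.Input(shape=input_shape)

    # CNN front end: local spectro-temporal patterns (Leaky ReLU per the paper).
    x = layers.Conv1D(64, kernel_size=3, padding="same")(inputs)
    x = layers.LeakyReLU()(x)
    x = layers.MaxPooling1D(pool_size=2)(x)

    # BiLSTM: long-range temporal dependencies (128 units per the paper).
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

    # Multi-head self-attention: emphasize the most informative segments.
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)
    x = layers.GlobalAveragePooling1D()(attn)

    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```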
2. Related Work
3. Methodology
3.1. Datasets
3.1.1. Self-Collected Dataset
3.1.2. Public Dataset
3.2. Computing MFCC Features
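As an illustration of this step, a minimal MFCC-extraction sketch using librosa is given below; the 16 kHz sample rate and 13 coefficients are common defaults assumed here, not settings confirmed by the text.

```python
# Minimal MFCC extraction sketch; sample rate and coefficient count are
# assumed defaults, not the paper's confirmed settings.
import librosa

def extract_mfcc(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr, mono=True)             # resample to a fixed rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    return mfcc.T                                            # (frames, n_mfcc): one vector per frame
```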
3.3. The AudioFakeNet Model
4. Experimentation
5. Results and Discussion
6. Ablation Study of Model Components
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Bird, J.J.; Lotfi, A. Real-Time Detection of AI-Generated Speech for DeepFake Voice Conversion. arXiv 2023, arXiv:2308.12734.
- Biswas, D.; Gil, J.-M. Design and Implementation for Research Paper Classification Based on CNN and RNN Models. J. Internet Technol. 2024, 25, 637–645.
- Rabhi, M.; Bakiras, S.; Di Pietro, R. Audio-Deepfake Detection: Adversarial Attacks and Countermeasures. Expert Syst. Appl. 2024, 250, 123941.
- Sun, C.; Jia, S.; Hou, S.; Lyu, S. AI-Synthesized Voice Detection Using Neural Vocoder Artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 904–912.
- Rana, S.; Qureshi, M.A.; Majeed, A.; Noon, S.K. Identification of True Speakers from Disguised Voices in Anti-Forensic Scenarios Using an Efficient Framework. Signal Image Video Process. 2024, 18, 7455–7471.
- Chitale, M.; Dhawale, A.; Dubey, M.; Ghane, S. A Hybrid CNN-LSTM Approach for Deepfake Audio Detection. In Proceedings of the 2024 3rd International Conference on Artificial Intelligence for Internet of Things (AIIoT), Vellore, India, 3–4 May 2024; pp. 1–6.
- Ashraf, M.; Abid, F.; Din, I.U.; Rasheed, J.; Yesiltepe, M.; Yeo, S.F.; Ersoy, M.T. A Hybrid CNN and RNN Variant Model for Music Classification. Appl. Sci. 2023, 13, 1476.
- Zaman, K.; Islam, J.S.; Sah, M.; Direkoglu, C.; Okada, S.; Unoki, M. Hybrid Transformer Architectures with Diverse Audio Features for Deepfake Speech Classification. IEEE Access 2024, 12, 149221–149237.
- Rana, S.; Qureshi, M.A. A Comprehensive Review of Forensic Phonetics Techniques. Asian Bull. Big Data Manag. 2024, 4, 284–301.
- Akhtar, Z.; Pendyala, T.L.; Athmakuri, V.S. Video and Audio Deepfake Datasets and Open Issues in Deepfake Technology: Being Ahead of the Curve. Forensic Sci. 2024, 4, 289–377.
- Ye, J.; Yan, D.; Fu, S.; Ma, B.; Xia, Z. One-Class Network Leveraging Spectro-Temporal Features for Generalized Synthetic Speech Detection. Speech Commun. 2025, 169, 103200.
- Bendiab, G.; Haiouni, H.; Moulas, I.; Shiaeles, S. Deepfakes in Digital Media Forensics: Generation, AI-Based Detection and Challenges. J. Inf. Secur. Appl. 2025, 88, 103935.
- Bisogni, C.; Loia, V.; Nappi, M.; Pero, C. Acoustic Features Analysis for Explainable Machine Learning-Based Audio Spoofing Detection. Comput. Vis. Image Underst. 2024, 249, 104145.
- Li, X.; Chen, P.-Y.; Wei, W. Where Are We in Audio Deepfake Detection? A Systematic Analysis over Generative and Detection Models. ACM Trans. Internet Technol. 2025, 25, 1–19.
- Nanmalar, M.; Joysingh, S.J.; Vijayalakshmi, P.; Nagarajan, T. A Feature Engineering Approach for Literary and Colloquial Tamil Speech Classification Using 1D-CNN. Speech Commun. 2025, 173, 103254.
- Ahmad, O.; Khan, M.S.; Jan, S.; Khan, I. Deepfake Audio Detection for Urdu Language Using Deep Neural Networks. IEEE Access 2025, 13, 97765–97778.
- Ahmadiadli, Y.; Zhang, X.-P.; Khan, N. Beyond Identity: A Generalizable Approach for Deepfake Audio Detection. arXiv 2025, arXiv:2505.06766. Available online: https://arxiv.org/abs/2505.06766 (accessed on 15 October 2025).
- Borodin, K.; Kudryavtsev, V.; Korzh, D.; Efimenko, A.; Mkrtchian, G.; Gorodnichev, M.; Rogov, O.Y. AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection Using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge. arXiv 2024, arXiv:2408.17352.
- Pianese, A.; Cozzolino, D.; Poggi, G.; Verdoliva, L. Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models. In Proceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security (IH & MMSec 2024), Baiona, Spain, 24–26 June 2024; ACM: New York, NY, USA, 2024; pp. 289–294. Available online: https://arxiv.org/abs/2405.02179 (accessed on 15 October 2025).
- Hamza, A.; Javed, A.R.; Iqbal, F.; Kryvinska, N.; Almadhor, A.S.; Jalil, Z.; Borghol, R. Deepfake Audio Detection via MFCC Features Using Machine Learning. IEEE Access 2022, 10, 134018–134028.
- Zhang, Q.; Zhang, X.; Sun, M.; Yang, J. A Transformer-Based Deep Learning Approach for Recognition of Forgery Methods in Spoofing Speech Attribution. Appl. Soft Comput. 2025, 171, 112798.
- Kumar, A.; Singh, D.; Jain, R.; Jain, D.K.; Gan, C.; Zhao, X. Advances in DeepFake Detection Algorithms: Exploring Fusion Techniques in Single and Multi-Modal Approach. Inf. Fusion 2025, 118, 102993.
- Almutairi, Z.M.; Elgibreen, H. Detecting Fake Audio of Arabic Speakers Using Self-Supervised Deep Learning. IEEE Access 2023, 11, 72134–72147.
- Mirza, A.R.; Al-Talabani, A.K. Spoofing Countermeasure for Fake Speech Detection Using Brute Force Features. Comput. Speech Lang. 2025, 90, 101732.
- Shaaban, O.A.; Yildirim, R.; Alguttar, A.A. Audio Deepfake Approaches. IEEE Access 2023, 11, 132652–132682.
- Liang, R.; Xie, Y.; Cheng, J.; Pang, C.; Schuller, B. A Non-Invasive Speech Quality Evaluation Algorithm for Hearing Aids with Multi-Head Self-Attention and Audiogram-Based Features. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2166–2176.
- Akter, R.; Islam, M.R.; Debnath, S.K.; Sarker, P.K.; Uddin, M.K. A Hybrid CNN-LSTM Model for Environmental Sound Classification: Leveraging Feature Engineering and Transfer Learning. Digit. Signal Process. 2025, 163, 105234.
- Xiong, D.; Wen, Z.; Zhang, C.; Ren, D.; Li, W. BMNet: Enhancing Deepfake Detection Through BiLSTM and Multi-Head Self-Attention Mechanism. IEEE Access 2025, 13, 21547–21556.
- Lavrentyeva, G.; Novoselov, S.; Tseren, A.; Volkova, M.; Gorlanov, A.; Kozlov, A. STC Antispoofing Systems for the ASVspoof2019 Challenge. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1033–1037. Available online: https://www.isca-archive.org/interspeech_2019/lavrentyeva19_interspeech.html (accessed on 15 October 2025).
- Huang, L.; Pun, C.-M. Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection. arXiv 2024, arXiv:2401.05614.
- Xie, Y.; Cheng, H.; Wang, Y.; Ye, L. Domain Generalization via Aggregation and Separation for Audio Deepfake Detection. IEEE Trans. Inf. Forensics Secur. 2023, 19, 344–358.
- Abdeldayem, M.; Mohamed, A. The Fake or Real Dataset. 2023. Available online: https://www.kaggle.com/datasets/mohammedabdeldayem/the-fake-or-real-dataset/data (accessed on 13 August 2025).
- Yi, J.; Wang, C.; Tao, J.; Zhang, X.; Zhang, C.Y.; Zhao, Y. Audio Deepfake Detection: A Survey. arXiv 2023, arXiv:2308.14970.
- Karthikeyan, V.; Suja Priyadharsini, S. Adaptive Boosted Random Forest-Support Vector Machine Based Classification Scheme for Speaker Identification. Appl. Soft Comput. 2022, 131, 109826.
- Liu, T.; Yan, D.; Wang, R.; Yan, N.; Chen, G. Identification of Fake Stereo Audio Using SVM and CNN. Information 2021, 12, 263.
- Chau, H.-H.; Chau, Y. Audio-Based Classification of Mild Cognitive Impairment Using XGBoost. In Proceedings of the 2024 IEEE 6th Eurasia Conference on Biomedical Engineering, Healthcare and Sustainability (ECBIOS), Tainan, Taiwan, 14–16 June 2024; pp. 263–265.
- Wani, T.M.; Qadri, S.A.A.; Comminiello, D.; Amerini, I. Detecting Audio Deepfakes: Integrating CNN and BiLSTM with Multi-Feature Concatenation. In Proceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security, Baiona, Spain, 24–26 June 2024; pp. 271–276.
- Doan, T.P.; Hong, K.; Jung, S. GAN Discriminator Based Audio Deepfake Detection. In Proceedings of the 2nd Workshop on Security Implications of Deepfakes and Cheapfakes, Melbourne, VIC, Australia, 10–14 July 2023; pp. 29–32.
- Lapates, J.M.; Gerardo, B.D.; Medina, R.P. Performance Evaluation of Enhanced DCGANs for Detecting Deepfake Audio across Selected FoR Datasets. In Proceedings of the 2024 15th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 16–18 October 2024; pp. 54–59.
- Wijethunga, R.; Matheesha, D.; Al Noman, A.; De Silva, K.; Tissera, M.; Rupasinghe, L. Deepfake Audio Detection: A Deep Learning Based Solution for Group Conversations. In Proceedings of the 2020 2nd International Conference on Advancements in Computing (ICAC), Colombo, Sri Lanka, 10–11 December 2020; pp. 192–197.
- Volkova, M.; Andzhukaev, T.; Lavrentyeva, G.; Novoselov, S.; Kozlov, A. Light CNN Architecture Enhancement for Different Types of Spoofing Attack Detection. In Proceedings of the International Conference on Speech and Computer, Istanbul, Turkey, 20–25 August 2019; Springer: Cham, Switzerland, 2019.
- Sheikholeslami, S.; Ghasemirahni, H.; Payberah, A.H.; Wang, T.; Dowling, J.; Vlassov, V. Utilizing Large Language Models for Ablation Studies in Machine Learning and Deep Learning. In Proceedings of the 5th Workshop on Machine Learning and Systems, Rotterdam, The Netherlands, 30 March–3 April 2025; pp. 230–237.
Versions of the Fake-or-Real (FoR) dataset:

| Version | Detail | Significance |
|---|---|---|
| for-original | Unprocessed audio samples (i.e., original) | Baseline version with class and gender imbalance. |
| for-norm | Normalized audio (adjusted sample rate, volume, and channels) | Reduces class and gender bias; useful for generalization. |
| for-2sec | 2-second clips truncated from the for-norm set | Fixed-length inputs allow uniform temporal modeling (see the sketch after this table). |
| for-rerec | Re-recorded version of for-2sec via external devices | Simulates real-world distortions; useful for robustness testing. |
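The for-2sec version relies on fixed-length inputs; the sketch below shows one common way to enforce a fixed 2-second duration by truncating or zero-padding each waveform. The 16 kHz sample rate is an assumption, not a detail taken from the dataset description.

```python
# Sketch of the kind of fixed-length preparation behind the for-2sec version:
# pad or truncate every waveform to exactly 2 s so that all MFCC matrices
# share one temporal size. The 16 kHz rate is an assumption.
import numpy as np

def to_fixed_length(y, sr=16000, seconds=2.0):
    target = int(sr * seconds)
    if len(y) >= target:
        return y[:target]                      # truncate long clips
    return np.pad(y, (0, target - len(y)))     # zero-pad short clips
```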
Training hyperparameters:

| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| Loss function | Categorical cross-entropy |
| Activation function | Leaky ReLU (CNN layers) |
| Epochs | 20 |
| Batch size | 32 |
| Dropout rate | 0.3 |
| LSTM units | 128 |
| Learning rate | 1 × 10⁻⁶ |
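Wired into a Keras training run, the tabulated settings might look as follows; `build_audiofakenet` refers to the architectural sketch given earlier, and the data arrays are placeholders rather than the authors’ pipeline.

```python
# Hooking the tabulated hyperparameters into a Keras training run; the data
# arrays (x_train, y_train, ...) are placeholders, not the authors' pipeline.
import tensorflow as tf

model = build_audiofakenet()  # sketch defined earlier
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),  # per the table
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=20, batch_size=32)
```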
Performance comparison with existing models (EER = equal error rate; lower is better):

| Model | Precision | Recall | F1-Score | EER | Accuracy |
|---|---|---|---|---|---|
| Random Forest [34] | 0.79 | 0.81 | 0.80 | 0.36 | 0.81 |
| SVM [35] | 0.80 | 0.88 | 0.84 | 0.32 | 0.84 |
| MLP [8] | 0.82 | 0.78 | 0.80 | 0.31 | 0.78 |
| XGBoost [36] | 0.79 | 0.84 | 0.81 | 0.34 | 0.84 |
| CNN [5] | 0.84 | 0.82 | 0.83 | 0.29 | 0.88 |
| CNN-BiLSTM [37] | 0.93 | 0.89 | 0.90 | 0.18 | 0.93 |
| GAN [38] | 0.48 | 0.45 | 0.46 | 0.77 | 0.46 |
| DCGAN [39] | 0.52 | 0.49 | 0.50 | 0.67 | 0.50 |
| Dense model [39] | 0.70 | 0.77 | 0.73 | 0.38 | 0.79 |
| RNN [40] | 0.88 | 0.90 | 0.90 | 0.18 | 0.92 |
| RawNet2 [30] | 0.81 | 0.83 | 0.81 | 0.28 | 0.81 |
| AASIST3 [18] | 0.87 | 0.87 | 0.87 | 0.22 | 0.88 |
| LCNN [41] | 0.90 | 0.92 | 0.92 | 0.15 | 0.94 |
| Proposed AudioFakeNet | 0.95 | 0.92 | 0.94 | 0.14 | 0.96 |
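The EER column reports the equal error rate, the operating point at which the false-positive and false-negative rates coincide. A standard way to approximate it from per-utterance detection scores, using scikit-learn’s ROC utilities (not necessarily the authors’ exact procedure), is:

```python
# Equal Error Rate (EER): the threshold where the false-positive rate equals
# the false-negative rate. A common approximation from ROC points; not
# necessarily the exact procedure used in the paper.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = fake (positive class)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # ROC point where FPR ~ FNR
    return (fpr[idx] + fnr[idx]) / 2.0
```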
AudioFakeNet performance on the self-collected dataset:

| Metric | Value |
|---|---|
| Precision | 0.86 |
| Recall | 0.85 |
| F1-Score | 0.85 |
| EER | 0.23 |
| Accuracy | 0.88 |
Ablation study of model components:

| Model Variant | Precision | Recall | F1-Score | EER | Accuracy |
|---|---|---|---|---|---|
| V1. CNN Only | 0.84 | 0.82 | 0.83 | 0.29 | 0.88 |
| V2. CNN + BiLSTM (No MHA) | 0.93 | 0.89 | 0.90 | 0.18 | 0.93 |
| V3. CNN + MHA (No BiLSTM) | 0.90 | 0.88 | 0.89 | 0.16 | 0.95 |
| V4. Proposed AudioFakeNet (Full Model) | 0.95 | 0.92 | 0.94 | 0.14 | 0.96 |
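The four variants can be reproduced by toggling the BiLSTM and attention blocks on a shared CNN front end, as in the illustrative builder below; the wiring mirrors the earlier sketch and the layer sizes remain assumptions.

```python
# Illustrative construction of ablation variants V1-V4 by toggling the BiLSTM
# and multi-head-attention blocks; wiring mirrors the earlier sketch.
from tensorflow.keras import layers, models

def build_variant(use_bilstm=True, use_mha=True, input_shape=(120, 13)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv1D(64, 3, padding="same")(inputs)   # shared CNN front end (V1 baseline)
    x = layers.LeakyReLU()(x)
    x = layers.MaxPooling1D(2)(x)
    if use_bilstm:                                     # present in V2 and V4
        x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    if use_mha:                                        # present in V3 and V4
        x = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.3)(x)
    return models.Model(inputs, layers.Dense(2, activation="softmax")(x))
```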