FakeVoiceFinder: An Open-Source Framework for Synthetic and Deepfake Audio Detection
Abstract
1. Introduction
- It enables comprehensive model-centric and data-centric experimentation by allowing the user to fix the architecture while varying the spectral representation, or to fix the representation while exploring multiple architectures. The library supports four transformations: mel-spectrogram, log-spectrogram, scalogram, and the Constant-Q Transform (CQT).
- It offers a hybrid search space in which custom architectures and benchmark models can be systematically combined with the four available spectral representations. This unified experimentation pipeline facilitates controlled comparisons that are rarely addressed explicitly in the audio deepfake detection literature.
- It allows rapid and reproducible comparison between custom solutions and benchmark architectures (e.g., evaluating a custom ConvNext model against ResNet, VGG, or EfficientNet-based baselines) under matched or varied transformation types and hyperparameter configurations. This design enables principled selection of optimal architecture–representation pairs for a given detection scenario.
- It includes an inference module in which a trained model estimates the probability that an input audio sample is synthetic or natural. This supports standalone audio evaluation by end-users and enables robustness testing under adversarial or intentionally altered audio conditions.
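As a concrete illustration of this inference workflow, the sketch below loads a trained two-class classifier, converts an input clip to a mel-spectrogram image, and reports the softmax probability of the fake class (class index 1 is assumed to be "fake"). The helper names, checkpoint path, and preprocessing choices are illustrative assumptions, not the actual FakeVoiceFinder API.

```python
# Illustrative sketch only: the helper names and checkpoint path below are
# hypothetical, not the actual FakeVoiceFinder API.
import librosa
import numpy as np
import torch
import torch.nn.functional as F
from torchvision import models, transforms

def audio_to_mel_image(path, sr=22050, n_fft=2048, hop_length=512, n_mels=128):
    """Load a clip and convert it to a 3-channel mel-spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Min-max normalize to [0, 1] and replicate to 3 channels for ImageNet-style CNNs.
    mel_norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    img = torch.tensor(mel_norm, dtype=torch.float32).unsqueeze(0).repeat(3, 1, 1)
    return transforms.Resize((224, 224))(img)

def probability_fake(model, path):
    """Return P(fake) for a single audio file, assuming class index 1 = fake."""
    model.eval()
    with torch.no_grad():
        logits = model(audio_to_mel_image(path).unsqueeze(0))
        return F.softmax(logits, dim=1)[0, 1].item()

# Example: a trained ResNet18 with a 2-class head (checkpoint path is hypothetical).
model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model.load_state_dict(torch.load("checkpoints/resnet18_mel.pt", map_location="cpu"))
print(f"P(fake) = {probability_fake(model, 'sample.wav'):.3f}")
```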
2. Background
2.1. Technologies of Synthetic Audio Generation (TTS and Voice-to-Voice Conversion)
2.2. Data-Centric Solutions: Selection and Optimization of Spectral Representations
- Selection of the spectral representation. The representation determines the structure of the time–frequency information provided to the classifier. Common options include:
  - STFT-based spectrograms: obtained using the Short-Time Fourier Transform, which decomposes the signal into frequency components across short temporal windows. Two widely adopted variants include:
    - Mel-spectrograms, which apply a mel-scale filterbank to approximate human auditory perception, offering high resolution at low frequencies and lower resolution at high frequencies. They capture formants, harmonics, and speech-relevant cues, and have been extensively used in spoofing detection challenges [40].
    - Log-spectrograms, which emphasize spectral energy variations through log compression, highlighting subtle amplitude differences that may reveal artifacts introduced by vocoders or TTS systems [41].
  - Wavelet-based scalograms (DWT-based): multi-resolution representations computed using the Discrete Wavelet Transform (DWT). Compared with CWT-based scalograms, the DWT is computationally efficient and enables explicit control over the resolution at each decomposition level. Mother wavelets such as Daubechies or Symlets provide sensitivity to transient and non-stationary artifacts that may not be well captured by Fourier-based methods. Wavelet packet variants and scalogram-like representations have also shown promise in spoofing detection [42].
  - CQT (Constant-Q Transform): a logarithmically spaced representation aligned with speech perception and harmonic structure. Its constant frequency-to-resolution ratio makes it suitable for capturing formant structure and harmonic distortions typical of synthetic audio [41].
- Selection of representation hyperparameters. Each representation requires choosing a set of hyperparameters that determine its ability to reveal synthesis artifacts (a code sketch covering the four transformations is given after this list):
  - STFT parameters: window size, hop length, FFT size, and window type (Hann, Hamming, Blackman).
  - Mel-spectrogram parameters: number of mel filters, window size, hop length, and FFT size.
  - DWT-based scalogram parameters: mother wavelet (Daubechies, Symlet), number of decomposition levels, filter lengths, and dimensionality normalization rules per scale.
  - CQT parameters: bins per octave, minimum frequency, window overlap, and kernel selection.
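To make these options concrete, the following minimal sketch computes the four representations with librosa and PyWavelets, using hyperparameter values matching the framework's defaults (n_fft = 2048, hop_length = 512, n_mels = 128; db4 wavelet with 4 decomposition levels; 96 CQT bins at 24 bins per octave). The example signal and the simple per-level resampling of the DWT coefficients are illustrative choices, not the library's internal preprocessing.

```python
# Sketch of the four time-frequency representations, using librosa and
# PyWavelets with hyperparameters matching the framework's defaults.
import librosa
import numpy as np
import pywt

y, sr = librosa.load(librosa.example("trumpet"), sr=22050)  # stand-in for a voice clip

# 1) Mel-spectrogram: mel filterbank applied to the STFT power spectrum.
mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128))

# 2) Log-spectrogram: log-compressed STFT magnitude.
log_spec = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=2048, hop_length=512)))

# 3) DWT-based scalogram: stack per-level coefficients, resampled to a common
#    length so the result forms a 2-D image.
coeffs = pywt.wavedec(y, wavelet="db4", level=4, mode="constant")
width = max(len(c) for c in coeffs)
scalogram = np.stack([np.interp(np.linspace(0, len(c) - 1, width),
                                np.arange(len(c)), np.abs(c)) for c in coeffs])

# 4) Constant-Q Transform: logarithmically spaced frequency bins.
cqt = librosa.amplitude_to_db(
    np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=96, bins_per_octave=24)))

for name, rep in [("mel", mel), ("log", log_spec), ("dwt", scalogram), ("cqt", cqt)]:
    print(name, rep.shape)
```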
2.3. Model-Centric Solutions: Architecture Design, Training Protocols, and Performance Optimization
- Architecture selection. The choice of architecture determines how the model extracts discriminative patterns from time–frequency representations such as spectrograms or scalograms. Prior work has explored several families of convolutional and attention-based architectures, including the following:
  - Sequential CNNs (AlexNet, VGG): simple yet computationally heavy architectures that remain effective for capturing low- and mid-level spectral cues introduced by vocoder artifacts [43].
  - Residual CNNs (ResNet): introduce skip connections to enable deeper networks and alleviate vanishing gradients [44]. Their hierarchical representations help detect subtle synthesis distortions distributed across frequency bands.
  - Multi-branch CNNs (Inception): apply parallel convolutional kernels of different sizes, making them capable of capturing multi-scale spectral patterns relevant to detecting artifacts occurring at varying resolutions [45].
  - Efficient CNNs (MobileNet, EfficientNet): lightweight architectures that reduce computational cost through depthwise separable convolutions and compound model scaling, making them suitable when detectors must run under tight resource budgets [46,47].
  - Modern CNNs (ConvNext): redesign traditional convolutional blocks by integrating concepts from Transformers, such as large kernels, depthwise convolutions, and layer normalization, achieving performance comparable to state-of-the-art attention-based models [48].
  - Transformers (ViT): use self-attention to model long-range dependencies across spectrogram patches. They are effective at capturing prosody, speaker consistency, and temporal correlations that extend beyond local convolutional filters, though they typically require larger datasets and more computational resources.
- Training hyperparameters. Once an architecture is selected, model-centric optimization focuses on tuning its training hyperparameters. Key factors include learning rate schedules, batch size, optimizer choice, number of epochs, regularization strategies, and transfer learning techniques such as partial layer freezing or full fine-tuning [49]. These choices directly influence model convergence, training stability, and generalization capacity.
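The sketch below illustrates these model-centric choices in PyTorch: a benchmark backbone is selected, its classification head is replaced for the binary real/fake task, early layers are optionally frozen, and the optimizer and learning-rate schedule are configured. The weights identifiers and default values are assumptions based on standard torchvision usage, not a prescribed FakeVoiceFinder configuration.

```python
# Minimal PyTorch sketch of the model-centric choices above; names and
# defaults are illustrative, not tied to a specific FakeVoiceFinder release.
import torch
from torch import nn, optim
from torchvision import models

def build_detector(arch="resnet18", pretrained=True, freeze_backbone=False):
    """Return a 2-class classifier built on a torchvision backbone."""
    if arch == "resnet18":
        model = models.resnet18(weights="IMAGENET1K_V1" if pretrained else None)
        model.fc = nn.Linear(model.fc.in_features, 2)
    elif arch == "convnext_tiny":
        model = models.convnext_tiny(weights="IMAGENET1K_V1" if pretrained else None)
        model.classifier[2] = nn.Linear(model.classifier[2].in_features, 2)
    elif arch == "vit_b_16":
        model = models.vit_b_16(weights="IMAGENET1K_V1" if pretrained else None)
        model.heads.head = nn.Linear(model.heads.head.in_features, 2)
    else:
        raise ValueError(f"unknown architecture: {arch}")
    if freeze_backbone:  # partial layer freezing: train only the new head
        for name, p in model.named_parameters():
            p.requires_grad = name.startswith(("fc", "classifier.2", "heads.head"))
    return model

model = build_detector("resnet18", pretrained=True, freeze_backbone=True)
optimizer = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
criterion = nn.CrossEntropyLoss()
```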
3. Gaps and Motivation
- Absence of unified model-centric and data-centric comparison frameworks: Existing studies tend to emphasize either architectural improvements or the exploration of specific spectral representations, but seldom offer a structured environment to jointly analyze both dimensions. This lack of integration makes it difficult to disentangle how much of the detection performance is attributable to the model architecture versus the choice of data transformation, particularly now that four widely used representations (mel, log, scalogram, and CQT) coexist in modern pipelines.
- Lack of standardized benchmarking for custom architectures: Researchers frequently design custom models tailored to specific datasets or constraints, yet few platforms allow these models to be directly and fairly compared against established benchmarks such as ResNet, VGG, EfficientNet, or ConvNext. In the absence of such controlled environments, evaluating the true merit of new architectures becomes inconsistent and often irreproducible.
- Limited tools for robustness and adversarial vulnerability analysis: Although adversarial attacks on synthetic audio detectors are an emerging concern, current resources rarely provide mechanisms to expose trained models to intentional perturbations while monitoring probabilistic outputs. Without such tools, it remains challenging to assess the stability, vulnerability, and operational reliability of detectors under realistic threat conditions.
- Fragmentation of datasets, models, and evaluation pipelines: Most available resources address isolated components (e.g., datasets for training, pre-trained models for inference, or scripts for metric evaluation), but few combine these elements within a single unified workflow. This fragmentation complicates reproducibility, slows down systematic benchmarking across architectures and representations, and limits transparent reporting of probabilistic detection outcomes.
4. FakeVoiceFinder
4.1. Data-Centric Approach
- Clip duration: the fixed or minimum length applied to all audio samples.
- Time–frequency transformation: the selected representation (mel, log-spectrogram, scalogram, or CQT).
- Transformation hyperparameters: the specific parameters required to generate the chosen representation.
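A minimal sketch of how these three settings might be expressed and applied is given below; the configuration keys and the preprocess helper are hypothetical, and only the mel branch is shown.

```python
# Sketch of the data-centric settings above: enforce a fixed clip duration and
# apply the chosen time-frequency transformation with its hyperparameters.
import librosa
import numpy as np

data_config = {
    "clip_seconds": 4.0,              # fixed length applied to every sample
    "transform": "mel",               # one of: mel, log, dwt, cqt
    "params": {"n_fft": 2048, "hop_length": 512, "n_mels": 128},
}

def preprocess(path, cfg=data_config, sr=22050):
    y, _ = librosa.load(path, sr=sr)
    target = int(cfg["clip_seconds"] * sr)
    # Trim long clips and zero-pad short ones so every sample has equal length.
    y = np.pad(y[:target], (0, max(0, target - len(y))))
    if cfg["transform"] == "mel":
        rep = librosa.feature.melspectrogram(y=y, sr=sr, **cfg["params"])
        return librosa.power_to_db(rep, ref=np.max)
    raise NotImplementedError("log/dwt/cqt branches follow the same pattern")
```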
4.2. Model-Centric Approach
- Convolutional Neural Networks (CNNs): Particularly effective at capturing local patterns in spectrograms. The selected architectures range from classic models (e.g., AlexNet, VGG) to more advanced networks (e.g., EfficientNet). A special case is the ConvNext family, which, while remaining convolutional, incorporates Transformer-inspired design principles such as larger kernel sizes, depthwise convolutions, and layer normalization. This makes ConvNext models an attractive option, as they combine the efficiency of CNNs with performance levels comparable to state-of-the-art Transformers.
- Transformers: Excel at modeling long-range dependencies through self-attention, enabling the capture of global relationships in spectrograms that CNNs may overlook. This makes them well-suited for complex audio signals with long-term temporal patterns.
- Architecture type: the category of the model, such as CNN or Transformer.
- Specific architecture: the particular model within the chosen category, e.g., ResNet18, ViT_B_16, ConvNext_base.
- Training hyperparameters: values used to configure the training of the selected model, such as learning rate, batch size, and number of epochs.
- Input image size: the dimensions of the input image expected by the model, which must match the architecture’s design to ensure proper functioning.
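The following hypothetical experiment specification groups these model-centric settings, reusing the hyperparameter names listed in the training configuration table later in the paper; the actual schema used by FakeVoiceFinder may differ.

```python
# Hypothetical experiment specification; keys mirror the hyperparameter names
# in the training configuration table, but the library's schema may differ.
experiment = {
    "architecture_type": "CNN",          # CNN or Transformer
    "architecture": "ResNet18",          # e.g., ResNet18, ViT_B_16, ConvNext_base
    "input_size": (224, 224),            # must match the architecture's expected input
    "train": {
        "epochs": 50,
        "lr": 0.001,
        "bs": 32,
        "optim_name": "adam",            # sgd or adam
        "patience": 10,                  # early-stopping patience
        "seed": 42,
        "type_train": "pretrained",      # scratch, pretrained, or both
    },
}
```

Keeping the specification declarative makes it straightforward to sweep architecture–representation pairs while holding the remaining settings fixed.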
4.3. Custom Models
5. Results
5.1. Metrics in FakeVoiceFinder
5.2. Performance Plots: Hybrid-Approach
- Model-centric: bar charts show, for a fixed transformation type, the performance of each model obtained by combining architecture and training type (scratch/pretrained).
- Data-centric: bar charts are again used, but here the model type (architecture + training type) is fixed and compared across the four transformations (mel, log, DWT, CQT).
- Hybrid-approach: a heatmap varies both the model (architecture + training type) and the transformation type (mel, log, DWT, CQT); a plotting sketch is given after this list.
- The CQT representation favors pretrained convolutional models such as AlexNet, ConvNext Tiny, and ResNet18, all above 96%, while transformer architectures drop notably, indicating limited compatibility with this transform.
- With DWT, the pretrained and scratch variants of ResNet18 reach 90.8% and 89.6%, followed by VGG16 pretrained and AlexNet scratch near 89%, reflecting the advantage of strong mid-level feature extraction.
- The log-spectrogram benefits classical convolutional models, with ResNet18 pretrained reaching 99.2% and VGG16 pretrained 98.8%, while SimpleCNN and ResNet18 scratch remain above 87%.
- The mel-spectrogram provides stable high performance, with ResNet18 reaching 97.9% and 97.5% for pretrained and scratch variants, and VGG16 pretrained reaching 97.1%.
- ResNet18 is the most robust architecture, consistently above 94% across all representations, indicating strong generalization.
- VGG16 pretrained performs best under log and mel, while presenting moderate decreases under CQT and DWT.
- AlexNet and ConvNext Tiny show competitive results when pretrained, particularly with CQT, but their scratch versions have more variable performance.
- ViT_B_16 is highly sensitive to the representation, performing moderately with mel and DWT but degrading sharply with CQT and log.
- SimpleCNN remains stable between 85% and 88%, illustrating how lightweight custom models can also be systematically evaluated in the framework.
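As a sketch of the hybrid-approach visualization, the snippet below draws an accuracy heatmap over model–transformation combinations with matplotlib; the values are random placeholders rather than the results reported above.

```python
# Heatmap sketch for the hybrid approach: accuracy for every combination of
# model (architecture + training type) and transformation. Placeholder values.
import matplotlib.pyplot as plt
import numpy as np

models_ = ["ResNet18-pre", "ResNet18-scr", "VGG16-pre", "ViT_B_16-pre", "SimpleCNN"]
transforms_ = ["mel", "log", "DWT", "CQT"]
acc = np.random.uniform(0.80, 0.99, size=(len(models_), len(transforms_)))  # placeholder

fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(acc, cmap="viridis", vmin=0.75, vmax=1.0)
ax.set_xticks(range(len(transforms_)), transforms_)
ax.set_yticks(range(len(models_)), models_)
for i in range(len(models_)):
    for j in range(len(transforms_)):
        ax.text(j, i, f"{acc[i, j]:.2f}", ha="center", va="center", color="white")
fig.colorbar(im, label="Accuracy")
ax.set_title("Accuracy by model and transformation")
fig.tight_layout()
plt.show()
```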
5.3. Inference Module
5.4. Comparison with Existing Frameworks
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| ACC | Accuracy |
| AI | Artificial Intelligence |
| AUC | Area Under the ROC Curve |
| CNN | Convolutional Neural Network |
| DL | Deep Learning |
| EER | Equal Error Rate |
| LSTM | Long Short-Term Memory |
| GRU | Gated Recurrent Unit |
| TTS | Text to Speech |
| V2V | Voice to Voice |
| ViT | Vision Transformer |
References
- Huang, W.C.; Hayashi, T.; Wu, Y.C.; Kameoka, H.; Toda, T. Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 4676–4680. [Google Scholar] [CrossRef]
- Patel, A.; Madnani, H.; Tripathi, S.; Sharma, P.; Shukla, V. Real-Time Voice Cloning: Artificial Intelligence to Clone and Generate Human Voice. In Intelligent Solutions for Smart Adaptation in Digital Era (InCITe 2024); Lecture Notes in Electrical Engineering; Hasteer, N., Blum, C., Mehrotra, D., Pandey, H., Eds.; Springer: Singapore, 2025; Volume 1278. [Google Scholar] [CrossRef]
- Khan, A.A.; Laghari, A.A.; Inam, S.A.; Ullah, S.; Shahzad, M.; Syed, D. A survey on multimedia-enabled deepfake detection: State-of-the-art tools and techniques, emerging trends, current challenges & limitations, and future directions. Discov. Comput. 2025, 28, 48. [Google Scholar]
- Patel, Y.; Tanwar, S.; Gupta, R.; Bhattacharya, P.; Davidson, I.E.; Nyameko, R.; Aluvala, S.; Vimal, V. Deepfake generation and detection: Case study and challenges. IEEE Access 2023, 11, 143296–143323. [Google Scholar] [CrossRef]
- Yamagishi, J.; Todisco, M.; Sahidullah, M.; Delgado, H.; Wang, X.; Evans, N.; Kinnunen, T.; Lee, K.A.; Vestman, V.; Nautsch, A. Asvspoof 2019: The 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge Database. 2019. Available online: https://datashare.ed.ac.uk/handle/10283/3336 (accessed on 5 October 2025).
- Yamagishi, J.; Wang, X.; Todisco, M.; Sahidullah, M.; Patino, J.; Nautsch, A.; Liu, X.; Lee, K.A.; Kinnunen, T.; Evans, N.; et al. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. arXiv 2021, arXiv:2109.00537. [Google Scholar] [CrossRef]
- Delgado, H.; Evans, N.; Jung, J.W.; Kinnunen, T.; Kukanov, I.; Lee, K.A.; Liu, X.; Shim, H.j.; Sahidullah, M.; Tak, H.; et al. Asvspoof 5 Evaluation Plan. 2024. Available online: https://www.asvspoof.org/file/ASVspoof5___Evaluation_Plan_Phase2.pdf (accessed on 5 October 2025).
- Yan, Z.; Zhao, Y.; Wang, H. VoiceWukong: Benchmarking Deepfake Voice Detection. In Proceedings of the 34th USENIX Security Symposium (USENIX Security 25), Seattle, WA, USA, 13–15 August 2025; pp. 4561–4580. [Google Scholar]
- Dsouza, D.J.; Rodrigues, A.P.; Fernandes, R. Multi-modal Comparative Analysis on Audio Dub Detection using Artificial Intelligence. IEEE Access 2025, 13, 128856–128878. [Google Scholar] [CrossRef]
- Xie, Z.; Li, B.; Xu, X.; Liang, Z.; Yu, K.; Wu, M. FakeSound: Deepfake general audio detection. arXiv 2024, arXiv:2406.08052. [Google Scholar] [CrossRef]
- Cheng, H.; Li, K.; Ye, L.; Wang, J. EnvFake: An Initial Environmental-Fake Audio Dataset for Scene-Consistency Detection. In Proceedings of the 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), Beijing, China, 7–10 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 81–85. [Google Scholar]
- Sun, C.; Jia, S.; Hou, S.; Lyu, S. Ai-synthesized voice detection using neural vocoder artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 904–912. [Google Scholar]
- Ahmad, O.; Khan, M.S.; Jan, S.; Khan, I. Deepfake Audio Detection for Urdu Language Using Deep Neural Networks. IEEE Access 2025, 13, 97765–97778. [Google Scholar] [CrossRef]
- Pintelas, E.; Livieris, I.E. Convolutional neural network framework for deepfake detection: A diffusion-based approach. Comput. Vis. Image Underst. 2025, 257, 104375. [Google Scholar] [CrossRef]
- Tahaoglu, G.; Baracchi, D.; Shullani, D.; Iuliani, M.; Piva, A. Deepfake audio detection with spectral features and ResNeXt-based architecture. Knowl.-Based Syst. 2025, 323, 113726. [Google Scholar] [CrossRef]
- Gulsoy, T.; Gulsoy, E.K.; Ustubioglu, A.; Ustubioglu, B.; Kablan, E.B.; Ayas, S.; Ulutas, G.; Tahaoglu, G.; Elhoseny, M. Detecting audio splicing forgery: A noise-robust approach with Swin Transformer and cochleagram. J. Inf. Secur. Appl. 2025, 93, 104130. [Google Scholar] [CrossRef]
- Zaman, K.; Samiul, I.J.A.M.; Sah, M.; Direkoglu, C.; Okada, S.; Unoki, M. Hybrid Transformer Architectures With Diverse Audio Features for Deepfake Speech Classification. IEEE Access 2024, 12, 149221–149237. [Google Scholar] [CrossRef]
- Zaman, K.; Li, K.; Sah, M.; Direkoglu, C.; Okada, S.; Unoki, M. Transformers and audio detection tasks: An overview. Digit. Signal Process. 2025, 158, 104956. [Google Scholar]
- Tang, Y.; Mu, J. ConvTrans-DF: A Deep Fake Detection Method Combining CNN and Transformer. In Proceedings of the International Conference on Intelligent Computing, Ningbo, China, 26–29 July 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 334–345. [Google Scholar]
- Petmezas, G.; Vanian, V.; Konstantoudakis, K.; Almaloglou, E.E.; Zarpalas, D. Video deepfake detection using a hybrid CNN-LSTM-Transformer model for identity verification. Multimed. Tools Appl. 2025, 84, 40617–40636. [Google Scholar]
- Wang, C.; Yi, J.; Tao, J.; Zhang, C.; Zhang, S.; Chen, X. Detection of cross-dataset fake audio based on prosodic and pronunciation features. arXiv 2023, arXiv:2305.13700. [Google Scholar] [CrossRef]
- Liu, T.; Kukanov, I.; Pan, Z.; Wang, Q.; Sailor, H.B.; Lee, K.A. Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing. arXiv 2024, arXiv:2409.08346. [Google Scholar] [CrossRef]
- Shi, H.; Shi, X.; Dogan, S.; Alzubi, S.; Huang, T.; Zhang, Y. Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios. arXiv 2025, arXiv:2504.12423. [Google Scholar] [CrossRef]
- Ballesteros, D.M.; Rodriguez-Ortega, Y.; Renza, D.; Arce, G. Deep4SNet: Deep learning for fake speech classification. Expert Syst. Appl. 2021, 184, 115465. [Google Scholar] [CrossRef]
- Camacho, S.; Ballesteros, D.M.; Renza, D. Fake speech recognition using deep learning. In Proceedings of the Workshop on Engineering Applications, Medellín, Colombia, 6–8 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 38–48. [Google Scholar]
- Zhang, T.; Feng, G.; Liang, J.; An, T. Acoustic scene classification based on Mel spectrogram decomposition and model merging. Appl. Acoust. 2021, 182, 108258. [Google Scholar] [CrossRef]
- Wani, T.M.; Amerini, I. Deepfakes audio detection leveraging audio spectrogram and convolutional neural networks. In Proceedings of the International Conference on Image Analysis and Processing, Udine, Italy, 11–15 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 156–167. [Google Scholar]
- Mehra, S.; Ranga, V.; Agarwal, R. A deep learning approach to dysarthric utterance classification with BiLSTM-GRU, speech cue filtering, and log mel spectrograms. J. Supercomput. 2024, 80, 14520–14547. [Google Scholar]
- Fathan, A.; Alam, J.; Kang, W. Multiresolution decomposition analysis via wavelet transforms for audio deepfake detection. In Proceedings of the International Conference on Speech and Computer, Gurugram, India, 14–16 November 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 188–200. [Google Scholar]
- Zbezhkhovska, U.; Khapilin, O. Deepfake Audio Detection with Sinc and Wavelet Filters in RawNet2. In Proceedings of the International Conference on Information and Communication Technologies in Education, Research, and Industrial Applications, Lviv, Ukraine, 23–27 September 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 273–284. [Google Scholar]
- Singh, S.; Bharadwaj, N.K. Waveform and Mel-Frequency Cepstral Coefficients (MFCC) approach for Deepfake Audio Detection. In Proceedings of the 2025 IEEE International Conference on Emerging Technologies and Applications (MPSec ICETA), Gwalior, India, 21–23 February 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–6. [Google Scholar]
- Wang, Y.; Skerry-Ryan, R.; Stanton, D.; Wu, Y.; Weiss, R.J.; Jaitly, N.; Yang, Z.; Xiao, Y.; Chen, Z.; Bengio, S.; et al. Tacotron: Towards end-to-end speech synthesis. arXiv 2017, arXiv:1703.10135. [Google Scholar]
- Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R.; et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4779–4783. [Google Scholar]
- Kim, J.; Kong, J.; Son, J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Breckenridge, CO, USA, 2021; pp. 5530–5540. [Google Scholar]
- Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.Y. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv 2020, arXiv:2006.04558. [Google Scholar]
- Lee, S.G.; Ping, W.; Ginsburg, B.; Catanzaro, B.; Yoon, S. Bigvgan: A universal neural vocoder with large-scale training. arXiv 2022, arXiv:2206.04658. [Google Scholar]
- Dhar, S.; Jana, N.D.; Das, S. Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements. arXiv 2025, arXiv:2504.19197. [Google Scholar] [CrossRef]
- Choi, J.E.; Schäfer, K.; Steinebach, M. The Sound of Language: A Bilingual Analysis of Voice Conversion and Text-to-Speech Synthesis. In Proceedings of the ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar]
- Walczyna, T.; Piotrowski, Z. Overview of voice conversion methods based on deep learning. Appl. Sci. 2023, 13, 3100. [Google Scholar] [CrossRef]
- Lavrentyeva, G.; Novoselov, S.; Tseren, A.; Volkova, M.; Gorlanov, A.; Kozlov, A. STC Anti-spoofing Systems for the ASVspoof2019 Challenge. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1033–1037. [Google Scholar] [CrossRef]
- Todisco, M.; Delgado, H.; Evans, N.W.D. Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification. In Proceedings of the Odyssey: The Speaker and Language Recognition Workshop, Bilbao, Spain, 21–24 June 2017; pp. 27–30. [Google Scholar] [CrossRef]
- Wu, Z.; Das, R.K.; Yang, J.; Li, H. Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 794–798. [Google Scholar] [CrossRef]
- Shaaban, O.A.; Yildirim, R. Audio Deepfake Detection Using Deep Learning. Eng. Rep. 2025, 7, e70087. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
- Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; PMLR: Breckenridge, CO, USA, 2019; Volume 97, pp. 6105–6114. [Google Scholar]
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar] [CrossRef]
- Abuhmida, M.; Whittey, R.; Hossain, M.M. Enhancing Audio Deepfake Detection: A Study of Deep Learning Parameters. In Proceedings of the International Conference for Emerging Technologies in Computing, Essex, UK, 15–16 August 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 140–160. [Google Scholar]
- Beltran, M.; Ballesteros L, D.M. Fake Audio Dataset (ElevenLabs & Respeecher). Mendeley Data, V1. 2025. Available online: https://data.mendeley.com/datasets/79g59sp69z/1 (accessed on 5 October 2025).
- Gonzalez, B.; Ballesteros L, D.M. TTS/V2V Audio Deepfake Dataset. Mendeley Data, V2. 2025. Available online: https://data.mendeley.com/datasets/h4zbs27tkr/2 (accessed on 5 October 2025).
- An, W.; Li, R.; Ge, H.; Li, M.; Li, H. An End-to-End Audio Transformer with Multi-student Knowledge Distillation algorithm for Deepfake Speech Detection. In Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition, Tianjin, China, 25–27 October 2024; pp. 366–371. [Google Scholar]
- Rabhi, M.; Bakiras, S.; Di Pietro, R. Audio-deepfake detection: Adversarial attacks and countermeasures. Expert Syst. Appl. 2024, 250, 123941. [Google Scholar] [CrossRef]
| Transformation | Hyperparameter | Description | Default Value |
|---|---|---|---|
| Mel-scale | n_fft | Size of the FFT window. | 2048 |
| | hop_length | Number of samples between STFT frames. | 512 |
| | n_mels | Number of mel bands. | 128 |
| Log | n_fft | Size of the FFT window. | 2048 |
| | hop_length | Number of samples between STFT frames. | 512 |
| Scalogram (DWT) | wavelet | Type of mother wavelet. | db4 |
| | level | Number of decomposition levels. | 4 |
| | mode | Boundary extension mode. | constant |
| CQT | hop_length | Number of samples between CQT frames. | 256 |
| | n_bins | Total number of CQT frequency bins. | 96 |
| | bins_per_octave | Resolution of the frequency axis. | 24 |
| | scale | Scaling option producing a more stable spectral distribution. | True |
| Architecture Type | Available Options |
|---|---|
| CNN | AlexNet, ResNet18, ResNet34 |
| | VGG16, VGG19, DenseNet121 |
| | MobileNet_v2, EfficientNet_b0 |
| | SqueezeNet1_0, Inception_v3 |
| | ConvNext_tiny, ConvNext_small, ConvNext_base |
| Transformer | ViT_B_16 |
| Hyperparameter | Type | Description | Example |
|---|---|---|---|
| epochs | Integer | Number of training epochs. | 50 |
| lr | Float | Learning rate used by the optimizer. | 0.001 |
| bs | Integer | Batch size used during training. | 32 |
| optim_name | String | Optimizer to be used: sgd or adam. | adam |
| patience | Integer | Number of epochs without improvement before early stopping. | 10 |
| seed | Integer | Random seed for reproducibility of results. | 42 |
| type_train | String | Type of training: scratch, pretrained, or both. | pretrained |
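A minimal training loop consuming these hyperparameters is sketched below (the batch size bs is applied when building the DataLoader objects, which are assumed to yield image–label batches); this illustrates the typical pattern rather than the framework's internal trainer.

```python
# Minimal training loop sketch showing how epochs, lr, optim_name, patience,
# and seed are typically consumed; not the framework's internal trainer.
import copy
import torch
from torch import nn, optim

def train(model, train_loader, val_loader, epochs=50, lr=1e-3,
          optim_name="adam", patience=10, seed=42, device="cpu"):
    torch.manual_seed(seed)
    model.to(device)
    opt_cls = {"adam": optim.Adam, "sgd": optim.SGD}[optim_name]
    optimizer = opt_cls(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_acc, best_state, stale = 0.0, copy.deepcopy(model.state_dict()), 0

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()

        # Validation accuracy drives early stopping.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                pred = model(x.to(device)).argmax(dim=1).cpu()
                correct += (pred == y).sum().item()
                total += y.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc, best_state, stale = acc, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:   # stop after `patience` epochs without improvement
                break
    model.load_state_dict(best_state)
    return model, best_acc
```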
| Layer (Custom Model) | Height | Width | Channels | Filter Height | Filter Width | Vector Length |
|---|---|---|---|---|---|---|
| Input | 224 | 224 | 1 | – | – | – |
| Conv1 | 224 | 224 | 32 | 3 | 3 | – |
| MaxPool1 | 112 | 112 | 32 | 2 | 2 | – |
| Conv2 | 112 | 112 | 64 | 3 | 3 | – |
| MaxPool2 | 56 | 56 | 64 | 2 | 2 | – |
| Conv3 | 56 | 56 | 128 | 3 | 3 | – |
| GAP + Flatten | 1 | 1 | 128 | – | – | 128 |
| Linear (output) | – | – | – | – | – | 2 |
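A PyTorch implementation consistent with this custom-model specification (the SimpleCNN referenced in the results) is sketched below; activation functions are not stated in the table, so ReLU is assumed.

```python
# PyTorch sketch of the custom model described in the table above. The table
# does not specify activations; ReLU is assumed here.
import torch
from torch import nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),    # 224x224x32
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # 112x112x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # 112x112x64
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # 56x56x64
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # 56x56x128
            nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)                 # global average pooling -> 1x1x128
        self.classifier = nn.Linear(128, num_classes)      # 2 output logits (real/fake)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.gap(self.features(x)).flatten(1)          # -> (batch, 128)
        return self.classifier(x)

print(SimpleCNN()(torch.randn(1, 1, 224, 224)).shape)      # torch.Size([1, 2])
```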
| | Pred 0 (Real) | Pred 1 (Fake) |
|---|---|---|
| Actual 0 (Real) | TN = 440 | FP = 160 |
| Actual 1 (Fake) | FN = 80 | TP = 520 |
| Metric | Value |
|---|---|
| Precision (class 1, fake) | 0.7647 |
| Recall (class 1, fake) | 0.8667 |
| F1 (class 1, fake) | 0.8125 |
| Accuracy (global) | 0.8000 |
| F1 Macro | 0.7991 |
| F1 Micro | 0.8000 |
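The snippet below recomputes these metrics directly from the confusion matrix; the per-class F1 for the fake class and the macro-F1 follow exactly from TP, FP, FN, and TN (0.8125 and 0.7991, respectively).

```python
# Recompute the metrics above from the confusion matrix (TN=440, FP=160,
# FN=80, TP=520), using the standard definitions.
tn, fp, fn, tp = 440, 160, 80, 520

precision_fake = tp / (tp + fp)                                               # 0.7647
recall_fake = tp / (tp + fn)                                                  # 0.8667
f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)  # 0.8125
precision_real = tn / (tn + fn)
recall_real = tn / (tn + fp)
f1_real = 2 * precision_real * recall_real / (precision_real + recall_real)  # 0.7857
accuracy = (tp + tn) / (tp + tn + fp + fn)                                    # 0.8000
f1_macro = (f1_fake + f1_real) / 2                                            # 0.7991
f1_micro = accuracy   # micro-F1 equals accuracy for single-label binary classification

for name, value in [("precision(fake)", precision_fake), ("recall(fake)", recall_fake),
                    ("F1(fake)", f1_fake), ("accuracy", accuracy),
                    ("F1 macro", f1_macro), ("F1 micro", f1_micro)]:
    print(f"{name}: {value:.4f}")
```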
| Category | Detail | Count/Value |
|---|---|---|
| Generation Tool | ElevenLabs (V2V) | 282 |
| | ElevenLabs (TTS) | 53 |
| | Respeecher (V2V) | 210 |
| | Respeecher (TTS) | 55 |
| Total Audios | V2V | 492 |
| | TTS | 108 |
| Gender | Male | 49% |
| | Female | 51% |
| Other Characteristics | Duration | 8–10 s |
| | Sampling rate | 22,050 Hz |
| Framework | Focus | Key Features |
|---|---|---|
| ASVspoof (2019–2025) | Benchmarking anti-spoofing | Fixed datasets, baselines (GMM, CNN, LCNN), standardized protocols |
| VoiceWukong (2025) | Security benchmarking | Multi-attack scenarios, robust evaluation |
| Deep-O-Meter (UB, 2024) | Multimodal deepfake detection | Online platform (image, video, audio); evaluation interface for uploaded content |
| Audio Deepfake Detection (2025 [43]) | CNN-based architectures | Detection of TTS and VC audio |
| End-to-End Audio Transformer (2024 [52]) | Transformer + distillation | End-to-end pipeline for deepfake detection |
| Adversarial Testing (2024 [53]) | Robustness analysis | Evaluation under adversarial attacks |
| FakeVoiceFinder (2025, own) | Model- and data-centric hybrid analysis | Flexible benchmarking with user datasets, probabilistic inference, visualization modes |