Article

FakeVoiceFinder: An Open-Source Framework for Synthetic and Deepfake Audio Detection

Facultad de Ingenieria, Universidad Militar Nueva Granada, Bogota 110111, Colombia
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(1), 25; https://doi.org/10.3390/bdcc10010025
Submission received: 6 October 2025 / Revised: 27 December 2025 / Accepted: 31 December 2025 / Published: 7 January 2026

Abstract

AI-based audio generation has advanced rapidly, enabling deepfake audio to reach levels of naturalness that closely resemble real recordings and complicate the distinction between authentic and synthetic signals. While numerous CNN- and Transformer-based detection approaches have been proposed, most adopt a model-centric perspective in which the spectral representation remains fixed. Parallel data-centric efforts have explored alternative representations such as scalograms and CQT, yet the field still lacks a unified framework that jointly evaluates the influence of model architecture, its hyperparameters (e.g., learning rate, number of epochs), and the spectral representation along with its own parameters (e.g., representation type, window size). Moreover, there is no standardized approach for benchmarking custom architectures against established baselines under consistent experimental conditions. FakeVoiceFinder addresses this gap by providing a systematic framework that enables direct comparison of model-centric, data-centric, and hybrid evaluation strategies. It supports controlled experimentation, flexible configuration of models and representations, and comprehensive performance reporting tailored to the detection task. This framework enhances reproducibility and helps clarify how architectural and representational choices interact in synthetic audio detection.

1. Introduction

The rapid progress of AI-based speech synthesis technologies has enabled the creation of highly realistic synthetic audio, often referred to as deepfake audio. Techniques such as Text-to-Speech (TTS) and Voice Conversion (VC) have advanced considerably in recent years, producing speech that is becoming increasingly difficult to distinguish from natural human voices [1,2]. This growing naturalness raises concerns about impersonation, malicious misuse, and the spread of fake news, and it complicates audio forensics [3,4].
In recent years, research on fake audio detection has intensified, leading to the development of multiple datasets, organized challenges, and both academic and commercial tools. Benchmarking initiatives, such as the ASVspoof challenge series, have played an important role in standardizing evaluation protocols and fostering the progress of countermeasure systems [5,6,7]. Additional datasets, including In the Wild corpora and language-specific resources, have further extended the scope of the evaluation to different acoustic and linguistic conditions [8]. At the same time, several open-source libraries and verification tools have been introduced, providing reproducible pipelines for synthetic speech detection and supporting the implementation of countermeasures in practical scenarios [9,10,11,12].
Most existing approaches can be categorized as model-centric, where the main effort is placed on the design and optimization of deep learning architectures. Convolutional Neural Networks (CNNs) and Vision Transformers are often used for their ability to capture local spectral patterns in time–frequency representations [13,14,15,16], while recurrent and attention-based models such as LSTMs, GRUs, and Transformers are applied to exploit temporal dependencies and long-range correlations in audio signals [17,18]. More recent work has examined hybrid CNN–Transformer architectures, residual networks, and self-supervised embeddings to improve robustness against unseen spoofing attacks [19,20]. To assess performance, researchers typically report accuracy, precision, recall, F1-score, Area Under the ROC Curve (AUC), and Equal Error Rate (EER). Beyond these metrics, recent studies highlight the importance of robustness evaluation, examining model generalization across datasets [21], languages [22], and real-world communication conditions such as compression and channel noise [23].
In parallel, data-centric approaches focus on the way audio input is represented before being processed by the models [24,25]. Instead of relying only on raw waveforms, which contain fine-grained temporal details but often require large datasets to generalize, researchers frequently adopt time–frequency transformations. Mel-spectrograms are widely used to approximate human auditory perception, while log-spectrograms emphasize subtle energy variations across frequency bands [26,27,28]. More recently, scalograms derived from continuous and discrete wavelet transforms have been investigated to provide multi-resolution analysis, offering complementary insights beyond Fourier-based methods [29,30,31]. These representations, combined with CNNs, Transformers, or hybrid architectures, have shown a substantial impact on detection performance, sometimes even greater than improvements in model design. This line of research highlights the importance of representation learning, confirming that the choice of spectrograms at different scales, or wavelet-based representations, plays a central role in distinguishing synthetic from natural speech.
Despite these contributions, only a limited number of studies have systematically explored hybrid approaches that integrate both model-centric and data-centric perspectives. Moreover, comparing custom architectures with established benchmarks under identical experimental conditions is critical for rigorously evaluating whether a proposed solution actually outperforms existing models. Such tools also give resource-constrained researchers a way to validate their solutions beyond the scope of established challenges or benchmarking initiatives.
Based on the above, we present FakeVoiceFinder, a library that provides the following contributions:
  • It enables comprehensive model-centric and data-centric experimentation by allowing the user to fix the architecture while varying the spectral representation, or to fix the representation while exploring multiple architectures. The library supports four transformations: mel-spectrogram, log-spectrogram, scalogram, and the Constant-Q Transform (CQT).
  • It offers a hybrid search space in which custom architectures and benchmark models can be systematically combined with the four available spectral representations. This unified experimentation pipeline facilitates controlled comparisons that are rarely addressed explicitly in the audio deepfake detection literature.
  • It allows rapid and reproducible comparison between custom solutions and benchmark architectures (e.g., evaluating a custom ConvNext model against ResNet, VGG, or EfficientNet-based baselines) under matched or varied transformation types and hyperparameter configurations. This design enables principled selection of optimal architecture–representation pairs for a given detection scenario.
  • It includes an inference module in which a trained model estimates the probability that an input audio sample is synthetic or natural. This supports standalone audio evaluation by end-users and enables robustness testing under adversarial or intentionally altered audio conditions.
To better situate our contribution, the rest of the paper is organized as follows. Section 2 provides a background on synthetic audio generation, detection methods, and commonly used datasets and metrics. Section 3 outlines the main open gaps in the literature, such as the absence of unified evaluation frameworks and limited benchmarking of custom models, which directly motivate the design of FakeVoiceFinder. Section 4 introduces this open-source framework, structured into two complementary perspectives: a data-centric approach, focused on input processing and transformations, and a model-centric approach, centered on architectural choices and training hyperparameters. The results are presented in Section 5, Section 6 discusses the implications of the framework, and Section 7 concludes the paper.

2. Background

This section first describes the main technologies for synthetic audio generation, and then introduces the key concepts of model-centric approaches (architectures, hyperparameter tuning, lighter architectures) and data-centric approaches (datasets, data transformation, and adversarial attacks). In addition, it compares Convolutional Neural Networks and Transformers, emphasizing their differences and relative advantages for detecting synthetic audio. This comparison justifies including both types of architecture in the FakeVoiceFinder framework, with the aim of supporting a fair and comprehensive evaluation.

2.1. Technologies of Synthetic Audio Generation (TTS and Voice Conversion)

Text-to-Speech (TTS) and Voice Conversion (VC, also called V2V) are the two main categories of methods for generating AI-based audio. In the case of TTS, a model maps text input (optionally enriched with style, prosody, or speaker embeddings) into an audio waveform. Modern approaches are usually end-to-end neural TTS systems, for instance, Tacotron [32,33], VITS [34], or FastSpeech [35], combined with high-quality neural vocoders such as HiFi-GAN or BigVGAN [36]. In VC, the objective is to transform the speech of a source speaker so that it resembles the voice of a target speaker, while preserving the linguistic content and modifying the speaker’s identity, accent, or speaking style [37,38]. Recent surveys present different encoder–decoder and disentanglement pipelines, zero-shot voice conversion strategies, and the use of latent representations in neural VC [39].
These technologies have improved markedly in naturalness, speaker similarity, and variability. Such improvements make synthetic audio harder to distinguish from real speech, even for human listeners, and pose growing challenges for automatic detection systems.

2.2. Data-Centric Solutions: Selection and Optimization of Spectral Representations

A fundamental component in audio deepfake detection is the transformation of the raw waveform into a spectral representation that can be processed by deep learning models. Data-centric approaches focus precisely on this step: selecting the most informative representation and tuning its hyperparameters to expose synthesis artifacts. Under this perspective, the architecture remains fixed, and performance gains arise from how the signal is represented rather than from modifications to the model itself.
Two core decisions define this approach:
  • Selection of the spectral representation. The representation determines the structure of the time–frequency information provided to the classifier. Common options include:
    STFT-based spectrograms: obtained using the Short-Time Fourier Transform, which decomposes the signal into frequency components across short temporal windows. Two widely adopted variants include
    Mel-spectrograms, which apply a mel-scale filterbank to approximate human auditory perception, offering high resolution at low frequencies and lower resolution at high frequencies. They capture formants, harmonics, and speech-relevant cues, and have been extensively used in spoofing detection challenges [40].
    Log-spectrograms, which emphasize spectral energy variations through log compression, highlighting subtle amplitude differences that may reveal artifacts introduced by vocoders or TTS systems [41].
    Wavelet-based scalograms (DWT-based): Multi-resolution representations computed using the Discrete Wavelet Transform (DWT). Compared with CWT-based scalograms, the DWT is computationally efficient and enables explicit control over the resolution at each decomposition level. Mother wavelets such as Daubechies or Symlets provide sensitivity to transient and non-stationary artifacts that may not be well captured by Fourier-based methods. Wavelet packet variants and scalogram-like representations have also shown promise in spoofing detection [42].
    CQT (Constant-Q Transform): A representation with logarithmically spaced frequency bins aligned with speech perception and harmonic structure. Its constant ratio of center frequency to bandwidth (constant Q) makes it suitable for capturing formant structure and harmonic distortions typical of synthetic audio [41].
  • Selection of representation hyperparameters. Each representation requires choosing a set of hyperparameters that determine its ability to reveal synthesis artifacts:
    STFT parameters: window size, hop length, FFT size, and window type (Hann, Hamming, Blackman).
    mel-spectrogram parameters: number of mel filters, window size, hop length, and FFT size.
    DWT-based scalogram parameters: mother wavelet (Daubechies, Symlet), number of decomposition levels, filter lengths, and dimensionality normalization rules per scale.
    CQT parameters: bins per octave, minimum frequency, window overlap, and kernel selection.
Each representation has distinct strengths: mel-spectrograms are perceptually grounded, log-spectrograms highlight energy-based anomalies, CQT provides harmonic resolution, and DWT-based scalograms capture multi-scale transient behavior. For this reason, the FakeVoiceFinder framework incorporates all these transformations, enabling systematic evaluation of how the choice of representation and its hyperparameters impacts deepfake detection. This integration helps disentangle whether improvements stem from the model architecture or from the data representation itself.
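To make these choices concrete, the following minimal sketch computes the four representations for a single clip using librosa and pywt. It is purely illustrative and does not reproduce the framework's internal preprocessing code; the file name and all hyperparameter values are placeholders.

import librosa
import numpy as np
import pywt

# Placeholder input; hyperparameter values are illustrative, not framework defaults.
y, sr = librosa.load("audio.wav", sr=22050, duration=3.0)

# Mel-spectrogram: mel filterbank applied to an STFT, log-compressed to dB.
mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128))

# Log-spectrogram: magnitude STFT with logarithmic (dB) compression.
log_spec = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=1024, hop_length=256)))

# Constant-Q Transform: logarithmically spaced frequency bins.
cqt = librosa.amplitude_to_db(
    np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)))

# DWT-based scalogram: per-level coefficient magnitudes resampled to a common
# width and stacked into a 2-D array.
coeffs = pywt.wavedec(y, wavelet="db4", level=5)
width = 256
scalogram = np.stack([
    np.interp(np.linspace(0, len(c) - 1, width), np.arange(len(c)), np.abs(c))
    for c in coeffs
])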

2.3. Model-Centric Solutions: Architecture Design, Training Protocols, and Performance Optimization

Model-centric approaches focus on improving the model itself, assuming that architectural selection and hyperparameter tuning are the primary factors driving performance in synthetic audio detection. Under this perspective, the spectral representation is typically fixed, and performance gains arise from designing or selecting an appropriate architecture and refining its training configuration. Two main decisions define this category:
  • Architecture selection. The choice of architecture determines how the model extracts discriminative patterns from time–frequency representations such as spectrograms or scalograms. Prior work has explored several families of convolutional and attention-based architectures, including the following:
    Sequential CNNs (AlexNet, VGG): simple yet computationally heavy architectures that remain effective for capturing low- and mid-level spectral cues introduced by vocoder artifacts [43].
    Residual CNNs (ResNet): introduce skip connections to enable deeper networks and alleviate vanishing gradients [44]. Their hierarchical representations help detect subtle synthesis distortions distributed across frequency bands.
    Multi-branch CNNs (Inception): apply parallel convolutional kernels of different sizes, making them capable of capturing multi-scale spectral patterns relevant to detecting artifacts occurring at varying resolutions [45].
    Lightweight CNNs (MobileNet, EfficientNet): architectures optimized for reduced parameter count and computational cost [46,47]. They facilitate real-time or resource-limited forensic deployments while maintaining competitive accuracy.
    Modern CNNs (ConvNext): redesign traditional convolutional blocks by integrating concepts from Transformers—such as large kernels, depthwise convolutions, and layer normalization—achieving performance comparable to state-of-the-art attention-based models [48].
    Transformers (ViT): use self-attention to model long-range dependencies across spectrogram patches. They are effective at capturing prosody, speaker consistency, and temporal correlations that extend beyond local convolutional filters, though they typically require larger datasets and more computational resources.
  • Training hyperparameters. Once an architecture is selected, model-centric optimization focuses on tuning its training hyperparameters. Key factors include learning rate schedules, batch size, optimizer choice, number of epochs, regularization strategies, and transfer learning techniques such as partial layer freezing or full fine-tuning [49]. These choices directly influence model convergence, training stability, and generalization capacity.
Although model-centric approaches often achieve strong performance in controlled intra-dataset settings, their robustness tends to decline under cross-dataset evaluation, with audio generated by newer synthesis technologies, or under manipulated conditions. This highlights the need for frameworks such as FakeVoiceFinder that enable systematic comparison of architectural families and hyperparameter configurations under consistent experimental conditions.
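As a minimal illustration of these two decisions, the sketch below builds a two-class spectrogram classifier with torchvision, switching between the pretrained (transfer learning) and scratch settings discussed above. The architectures shown and the training hyperparameter values are illustrative and are not taken from the framework's source code.

import torch
import torch.nn as nn
from torchvision import models

def build_detector(arch: str = "resnet18", pretrained: bool = True) -> nn.Module:
    """Two-class (real/fake) classifier on top of a torchvision backbone."""
    if arch == "resnet18":
        model = models.resnet18(weights="IMAGENET1K_V1" if pretrained else None)
        model.fc = nn.Linear(model.fc.in_features, 2)                  # replace the head
    elif arch == "vit_b_16":
        model = models.vit_b_16(weights="IMAGENET1K_V1" if pretrained else None)
        model.heads.head = nn.Linear(model.heads.head.in_features, 2)
    else:
        raise ValueError(f"Unsupported architecture: {arch}")
    return model

# "pretrained" corresponds to transfer learning from ImageNet weights; passing
# pretrained=False trains the same architecture from scratch. The optimizer,
# learning rate, and loss below are placeholder training hyperparameters.
model = build_detector("resnet18", pretrained=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()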

3. Gaps and Motivation

Despite notable progress in the study of synthetic audio detection, several methodological and practical gaps remain:
  • Absence of unified model-centric and data-centric comparison frameworks: Existing studies tend to emphasize either architectural improvements or the exploration of specific spectral representations, but seldom offer a structured environment to jointly analyze both dimensions. This lack of integration makes it difficult to disentangle how much of the detection performance is attributable to the model architecture versus the choice of data transformation, particularly now that four widely used representations (mel, log, scalogram, and CQT) coexist in modern pipelines.
  • Lack of standardized benchmarking for custom architectures: Researchers frequently design custom models tailored to specific datasets or constraints, yet few platforms allow these models to be directly and fairly compared against established benchmarks such as ResNet, VGG, EfficientNet, or ConvNext. In the absence of such controlled environments, evaluating the true merit of new architectures becomes inconsistent and often irreproducible.
  • Limited tools for robustness and adversarial vulnerability analysis: Although adversarial attacks on synthetic audio detectors are an emerging concern, current resources rarely provide mechanisms to expose trained models to intentional perturbations while monitoring probabilistic outputs. Without such tools, it remains challenging to assess the stability, vulnerability, and operational reliability of detectors under realistic threat conditions.
  • Fragmentation of datasets, models, and evaluation pipelines: Most available resources address isolated components (e.g., datasets for training, pre-trained models for inference, or scripts for metric evaluation), but few combine these elements within a single unified workflow. This fragmentation complicates reproducibility, slows down systematic benchmarking across architectures and representations, and limits transparent reporting of probabilistic detection outcomes.
Based on these gaps, we introduce FakeVoiceFinder, a library designed to provide a flexible and unified environment for systematic analysis of deepfake detection models. The toolkit supports both model-centric and data-centric experimentation by allowing controlled variation of architectures and audio transformations, now including four spectral representations: mel-spectrogram, log-spectrogram, scalogram, and the Constant-Q Transform (CQT). It also enables hybrid analyses in which custom architectures can be combined with any of the four transformations, facilitating controlled comparative studies that are rarely addressed in the literature. Furthermore, FakeVoiceFinder allows rapid benchmarking of custom solutions against established models (e.g., comparing a custom ConvNext to ResNet-, VGG-, or EfficientNet-based baselines) under matched or varied transformation types. Finally, the library includes an inference module that estimates the probability that an audio sample is synthetic or natural, supporting both user-level evaluation of specific files and robustness assessment under intentional or adversarial manipulations for vulnerability analysis.

4. FakeVoiceFinder

The architecture of FakeVoiceFinder is designed to streamline the entire pipeline of synthetic audio detection, from dataset preparation to model training and evaluation. The framework is organized into modular scripts that interact through a centralized configuration system (see Figure 1).
The process begins with the dataset preparation module (prepare_dataset.py), which takes as input the real and fake audio archives, the transformation parameters (e.g., mel-spectrograms, log-spectrograms, scalograms, and CQT), and user-defined options such as clipping duration or image size.
As part of this stage, the dataset is split into two subsets (training and validation) according to a user-defined ratio. To avoid methodological ambiguity, it is important to clarify that FakeVoiceFinder does not use these subsets for testing. The real.zip and fake.zip archives provided by the user are employed exclusively for training and validation under a stratified split, while all testing is performed on external and previously unseen audio samples through the inference module (Section 5.3). This design ensures a strict separation between training, validation, and test stages, supporting transparent and reproducible evaluation.
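As an illustration of the stratified split described above (and not the framework's internal code), the snippet below divides placeholder real/fake file lists into training and validation subsets; the 80/20 ratio and the seed stand in for the user-defined values.

from sklearn.model_selection import train_test_split

# Placeholder file lists standing in for the contents of real.zip and fake.zip.
files = [f"real_{i}.wav" for i in range(600)] + [f"fake_{i}.wav" for i in range(600)]
labels = [0] * 600 + [1] * 600                      # 0 = real, 1 = fake

# Stratified training/validation split; no test set is carved out here, since
# testing is performed later on external audio through the inference module.
train_files, val_files, y_train, y_val = train_test_split(
    files, labels, test_size=0.2, stratify=labels, random_state=42)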
Then, the spectral representations are obtained for every audio using the selected type of transformation and its corresponding hyperparameters (see Figure 2).
Next, the model loader (model_loader.py) manages the selection of architectures (e.g., ResNet18, VGG16) and training modes (scratch, pretrained, or both). The trainer module (trainer.py) then executes the training procedure, handling hyperparameters such as learning rate, batch size, number of epochs, and optimizer type, as well as hardware configurations (CPU/GPU).
Finally, the experiment manager (experiment.py) orchestrates the overall workflow by linking dataset preparation, model loading, and training into a reproducible process. Performance is summarized through the metrics reporter (metrics.py), which generates reports in multiple formats (CSV, JSON, plots).
This modular architecture ensures reproducibility, flexibility, and scalability, enabling researchers to adapt the framework to different datasets, transformations, and model architectures while providing end users with a reliable tool for deepfake audio detection.
The inference stage is handled by the inference runner (inference.py), which integrates the necessary inputs to evaluate new audio samples (see Figure 3). The module receives the audio path of the sample under analysis, the model path pointing to the trained network, and the selected transformation parameters (e.g., mel filterbank size, hop length) associated with the chosen feature extraction method (e.g., mel-spectrograms). These elements are combined to preprocess the input, apply the trained model, and compute the final prediction. The output is expressed as an inference score, typically represented as class probabilities (e.g., {'real': 0.87, 'fake': 0.13}), which quantify the likelihood of the audio being authentic or synthetic. This design guarantees that once models are trained, they can be consistently deployed to evaluate unseen samples, ensuring both reproducibility and practical applicability.
In addition, the framework provides three comparative visualization modes: (1) model-oriented, (2) transformation-oriented, and (3) hybrid, which facilitate the analysis of the best model–transformation combinations. Finally, once the models are trained, the framework allows the estimation of the probability that an external audio sample belongs to the fake category.
The subsequent section outlines the functionalities provided by the library for each of the considered approaches.

4.1. Data-Centric Approach

In this approach, the user must provide two compressed folders: one named real.zip and the other fake.zip. These should contain, respectively, the natural audio samples and the AI-generated ones.
The first step is to define the audio duration, which can be either fixed (default: 3 s) or automatically set to the minimum duration among all audio samples in the dataset. This parameter is configured through cfg.clip_seconds, either by selecting "min_sec" or by specifying a constant value. In the latter case, if shorter audio files are encountered, they are zero-padded prior to the transformation stage.
Once the duration has been established, the user must select the type of transformation to be applied to the audio files. Four time–frequency representations are available: mel-spectrograms, log-spectrograms, scalograms, and CQT. These options are specified in cfg.transform_list using the identifiers "mel", "log", "DWT", and "CQT".
Each transformation type includes its own hyperparameters. For the mel representation, relevant hyperparameters include the number of mel filterbanks (n_mels), the size of the FFT window (n_fft), and the hop length between STFT frames (hop_length). For the log and mel representations, the last two hyperparameters are shared. For scalograms, the main parameters are the choice of mother wavelet (wavelet), the number of decomposition levels (level), and the analysis mode (mode). In the case of the CQT representation, the key hyperparameters include the hop length between frames (hop_length), the total number of frequency bins (n_bins), the number of bins per octave (bins_per_octave), and the scaling option applied to the filter responses (scale). See Table 1 for details.
It is important to note that within the framework, the generated spectrograms, scalograms, or CQT representations are not plotted. Instead, they are stored directly as tensors, which are then used as input to the selected model. This design choice significantly reduces the preprocessing time of this stage.
In summary, in the data-centric approach, the user can adjust three categories of hyperparameters:
  • Clip duration: the fixed or minimum length applied to all audio samples.
  • Time–frequency transformation: the selected representation (mel, log-spectrogram, scalogram, or CQT).
  • Transformation hyperparameters: the specific parameters required to generate the chosen representation.
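The snippet below sketches the behaviour summarized above, i.e., clipping or zero-padding each audio file to the configured duration and storing the selected representation directly as a single-channel tensor. It is an illustrative reimplementation rather than the code of prepare_dataset.py, and the default values are placeholders.

import librosa
import numpy as np
import torch

def load_clip(path: str, clip_seconds: float = 3.0, sr: int = 22050) -> np.ndarray:
    """Load an audio file, truncate it to clip_seconds, and zero-pad shorter files."""
    y, _ = librosa.load(path, sr=sr)
    target = int(clip_seconds * sr)
    if len(y) < target:
        y = np.pad(y, (0, target - len(y)))          # zero-padding, as described above
    return y[:target]

def to_tensor(representation: np.ndarray) -> torch.Tensor:
    """Store the time-frequency representation as a (1, H, W) tensor, without plotting."""
    return torch.from_numpy(representation).float().unsqueeze(0)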

4.2. Model-Centric Approach

For the model-centric approach, FakeVoiceFinder provides a diverse set of benchmark architectures for image classification tasks, suitable for time-frequency representations of audio (e.g., spectrograms) rather than raw audio files. The architectures are organized into two groups, each with distinct characteristics.
  • Convolutional Neural Networks (CNNs): Particularly effective at capturing local patterns in spectrograms. The selected architectures range from classic models (e.g., AlexNet, VGG) to more advanced networks (e.g., EfficientNet). A special case is the ConvNext family, which, while remaining convolutional, incorporates Transformer-inspired design principles such as larger kernel sizes, depthwise convolutions, and layer normalization. This makes ConvNext models an attractive option, as they combine the efficiency of CNNs with performance levels comparable to state-of-the-art Transformers.
  • Transformers: Excel at modeling long-range dependencies through self-attention, enabling the capture of global relationships in spectrograms that CNNs may overlook. This makes them well-suited for complex audio signals with long-term temporal patterns.
Table 2 summarizes the available model architectures in FakeVoiceFinder, organized by type. This categorization highlights the diversity of architectures available for training in time-frequency representations of audio.
Next, a hyperparameter corresponding to the input image size has been included, since some models, such as ViT and ConvNext, may encounter issues if the input size does not exactly match the one they were originally designed for. This parameter is adjusted using cfg.image_size.
Finally, training hyperparameters have also been included, which allow the adjustment of experimental conditions, such as the number of epochs used to train each model. The list of these hyperparameters in FakeVoiceFinder, along with their corresponding type, description, and example, is presented in Table 3.
For the case of training_type, three options are available. The first, scratch, corresponds to using only the architecture without weight transfer. In this case, the model is trained from scratch. The second option, pretrained, applies transfer learning followed by fine-tuning. Finally, the both option combines both training strategies.
In summary, in the model-centric approach, the user can select from four types of model-related options:
  • Architecture type: the category of the model, such as CNN and Transformer.
  • Specific architecture: the particular model within the chosen category, e.g., ResNet18, ViT_B_16, ConvNext_base.
  • Training hyperparameters: values used to configure the training of the selected model, such as learning rate, batch size, and number of epochs.
  • Input image size: the dimensions of the input image expected by the model, which must match the architecture’s design to ensure proper functioning.
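A compact configuration covering both the data-centric options of Section 4.1 and the model-centric options above might look as follows. Only cfg.clip_seconds, cfg.transform_list, cfg.image_size, cfg.seed, and the training_type options are names documented in this paper; the remaining field names and all numeric values are assumptions used purely for illustration, with a SimpleNamespace standing in for the framework's configuration object.

from types import SimpleNamespace

cfg = SimpleNamespace()
# Data-centric options (documented attribute names).
cfg.clip_seconds = 3                                    # or "min_sec"
cfg.transform_list = ["mel", "log", "DWT", "CQT"]
# Model-centric options.
cfg.image_size = 224                                    # documented attribute name
cfg.seed = 42                                           # documented attribute name
cfg.architectures = ["resnet18", "vgg16", "vit_b_16", "convnext_tiny"]   # assumed name
cfg.training_type = "both"                              # "scratch", "pretrained", or "both"
cfg.epochs = 20                                         # assumed name and value
cfg.learning_rate = 1e-4                                # assumed name and value
cfg.batch_size = 32                                     # assumed name and value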

4.3. Custom Models

In addition to the benchmark architectures included in the framework, FakeVoiceFinder allows users to integrate custom models for experimentation and benchmarking purposes. This functionality provides flexibility for researchers who wish to evaluate novel architectures or lightweight designs under the same experimental conditions as standard CNN- or Transformer-based models.
A custom model must follow the standard structure of a PyTorch nn.Module, defining both the __init__() and forward() methods. Once implemented, the model can be placed in the designated models directory, where it will be automatically detected by the framework. Training is performed using the same configuration files and training pipeline employed by the predefined architectures, without requiring manual checkpoint handling or export steps.
The integration process, illustrated in Figure 4, enables users to incorporate their own architectures into FakeVoiceFinder in a straightforward manner. The user simply stores the Python definitions of the models in a dedicated folder (e.g., ../models) and specifies this path in the experiment configuration file. After executing the loader.prepare_user_models() function, all models in that directory are automatically loaded, validated, and registered as available architectures for experimentation. Once registered, custom models become fully compatible with the framework and can be used in any experimental configuration alongside the standard architectures (e.g., ResNet, ConvNext, or ViT). This procedure allows seamless integration of new architectures without modifying the internal structure of the library, ensuring reproducibility and comparability under identical experimental conditions.
As an example of this capability, Figure 5 presents the custom model used in this work: a lightweight convolutional neural network named SimpleCNN, designed specifically for binary classification of real and synthetic audio samples. The model includes three convolutional layers, each followed by a pooling operation, and a Global Average Pooling (GAP) layer that enables input-size agnosticism by aggregating spatial information before the final dense layer. The architecture is intentionally compact, allowing rapid experimentation and serving as a baseline for comparison with deeper or hybrid models. An example of creating and saving a custom CNN (TorchScript) compatible with FakeVoiceFinder is available in the following notebook: https://github.com/DEEP-CGPS/FakeVoiceFinder/blob/main/notebooks/Create_And_Save_Custom_CNN.ipynb (accessed on 5 October 2025).
Table 4 summarizes the intermediate tensor dimensions for an input of 224 × 224 × 1, corresponding to a single-channel time-frequency representation. The progressive reduction of spatial dimensions and expansion of feature channels can be observed through the convolutional and pooling layers, culminating in a 128-dimensional feature vector used for two-class classification.
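A minimal sketch of a SimpleCNN-style model consistent with the description above and with the dimensions reported in Table 4 is shown below. The kernel sizes and channel widths (32, 64, 128) are assumptions chosen to end in the 128-dimensional feature vector, and the TorchScript export mirrors the workflow in the linked notebook.

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Lightweight custom model: three conv+pool stages, GAP, and a dense head."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)       # Global Average Pooling: input-size agnostic
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.gap(x).flatten(1)               # (batch, 128) feature vector
        return self.classifier(x)

# Export as TorchScript so the file can be dropped into the models directory.
scripted = torch.jit.script(SimpleCNN())
scripted.save("simple_cnn.pt")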
This example demonstrates how FakeVoiceFinder allows users to include and evaluate their own architectures under controlled and comparable experimental conditions. By maintaining consistent preprocessing, transformation parameters, and evaluation metrics, the framework ensures that performance differences are attributable to architectural design rather than to variations in experimental configuration.

5. Results

This section is divided into three parts. The first presents the performance metrics obtained from the different models trained by selecting hyperparameters from both model-centric and data-centric approaches. Next, we provide comparative performance plots focused on the model, the data, and a hybrid combination of both. Finally, we report the prediction results on previously unseen audio samples, where the output corresponds to the probability of belonging to the fake category. A detailed breakdown of each of these subsections is presented below.

5.1. Metrics in FakeVoiceFinder

FakeVoiceFinder computes several performance metrics commonly used in classification models, for both balanced and imbalanced datasets:
  • Accuracy: the overall proportion of correct predictions.
  • Precision: the proportion of audio samples predicted as positive that truly belong to the positive class.
  • Recall: the proportion of positive audio samples that are correctly identified as positive.
  • F1: the harmonic mean of precision and recall.
  • F1micro: computed from the globally aggregated counts, so that each class contributes in proportion to its number of instances; it is especially informative for imbalanced datasets.
  • F1macro: the unweighted average of the per-class F1 scores, which evaluates the model equally across all classes regardless of class imbalance.
These metrics are obtained from the following equations:
Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1 = (2 · Precision · Recall) / (Precision + Recall),
F1micro = (2 · (TP_0 + TP_1)) / (2 · (TP_0 + TP_1) + (FP_0 + FP_1) + (FN_0 + FN_1)),
F1macro = (F1_0 + F1_1) / 2,
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The subscript indicates which class is treated as positive: TP_0 counts true positives when class “0” is the positive class, while TP_1 counts them when class “1” is the positive class. Similarly, F1_0 and F1_1 denote the F1 scores obtained when class “0” and class “1” are considered positive, respectively.
To exemplify the application of these metrics, we use the confusion matrix from Table 5 as an example.
Based on the confusion matrix, the following values are derived by applying the equations presented in this section.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (520 + 440) / (520 + 440 + 160 + 80) = 0.80
Precision (class 1) = TP / (TP + FP) = 520 / (520 + 160) = 0.7647
Recall (class 1) = TP / (TP + FN) = 520 / (520 + 80) = 0.8667
F1 (class 1) = (2 · Precision · Recall) / (Precision + Recall) = (2 · 0.7647 · 0.8667) / (0.7647 + 0.8667) = 0.8125
Precision (class 0) = TN / (TN + FN) = 440 / (440 + 80) = 0.8462
Recall (class 0) = TN / (TN + FP) = 440 / (440 + 160) = 0.7333
F1 (class 0) = (2 · 0.8462 · 0.7333) / (0.8462 + 0.7333) = 0.7857
F1micro = Accuracy = 0.80
F1macro = (F1 (class 0) + F1 (class 1)) / 2 = (0.7857 + 0.8125) / 2 = 0.7991
It should be emphasized that, in binary classification, F1 micro is equivalent to accuracy, irrespective of class balance.
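These values can be verified independently of the framework by expanding the confusion matrix of Table 5 into label vectors and using scikit-learn; the snippet below is purely illustrative.

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Reconstruct predictions from the confusion matrix in Table 5
# (class 1 = fake, class 0 = real): TP = 520, FN = 80, TN = 440, FP = 160.
y_true = np.array([1] * 600 + [0] * 600)
y_pred = np.array([1] * 520 + [0] * 80 + [0] * 440 + [1] * 160)

print(accuracy_score(y_true, y_pred))                 # 0.80
print(precision_score(y_true, y_pred))                # 0.7647 (class 1)
print(recall_score(y_true, y_pred))                   # 0.8667 (class 1)
print(f1_score(y_true, y_pred))                       # 0.8125 (class 1)
print(f1_score(y_true, y_pred, average="micro"))      # 0.80, equal to accuracy
print(f1_score(y_true, y_pred, average="macro"))      # 0.7991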
In summary, the values are presented in Table 6. A comparison of the metrics shows that, in this case, the model performs better at classifying fake audio samples (class = 1) than natural audio samples (class = 0): for the fake class, recall (0.8667) is higher than precision (0.7647), indicating that fakes are rarely missed but some natural samples are misclassified as fake.
Thus, upon execution, the library generates a CSV file containing the results of each experiment. The file includes the following columns: model (architecture used), variant (pretrain or scratch), transform (DWT, mel, log, CQT), accuracy, f1, f1_macro, f1_micro, precision, and recall. An example is shown in Figure 6.
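Because every experiment is appended to this CSV, the hybrid model–transformation view of the next subsection can be reproduced with a few lines of pandas; the file name below is a placeholder, while the column names are those listed above.

import pandas as pd

# Load the per-experiment results produced by the metrics reporter.
df = pd.read_csv("experiment_results.csv")

# Pivot into a (model + variant) x transform grid of accuracies, i.e., the
# structure visualized as a heatmap in the hybrid analysis.
pivot = df.assign(model_variant=df["model"] + "_" + df["variant"]).pivot_table(
    index="model_variant", columns="transform", values="accuracy")
print(pivot.round(3))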
In addition, the FakeVoiceFinder library allows users to specify different random seed values to run multiple experimental repetitions when desired. In the basic example, a fixed seed is used to reduce the computational cost associated with training a large number of models simultaneously. For instance, selecting the four spectral representations for fifteen architectures in both pretrained and scratch configurations results in a total of 120 models, assuming fixed data-related and model-related hyperparameters. This capability enables users to estimate variability or confidence intervals when required, while the fixed-seed configuration in the example prioritizes computational efficiency. Users can modify the experiment seed using the instruction cfg.seed = value, as demonstrated in Section 3 of the experiment notebook available at https://github.com/DEEP-CGPS/FakeVoiceFinder/blob/main/notebooks/3-%20Experiment_All_models_and_transforms.ipynb (accessed on 5 October 2025).
From the obtained values, it can be inferred whether the model is biased and, if so, which class is classified more effectively. This is evidenced by the precision and recall values: the greater the difference between them, the higher the bias of the model. Conversely, similar values indicate that the performance across both classes is comparable.

5.2. Performance Plots: Hybrid-Approach

In this section, three modes of visualization of the performance results obtained by the different models are presented:
  • Model-centric: In the first case, bar charts are used for a specific type of transformation for each of the different models obtained by combining architecture and training type (scratch/pretrain).
  • Data-centric: In the second case, bar charts are also used, but this time the model type (architecture + training type) is fixed across the four transformations (mel, log, DWT, CQT).
  • Hybrid-approach: Finally, a heatmap is used, in which both the model (architecture + training type) and the transformation type (mel, log, DWT, CQT) are varied.
To illustrate the three types of visualization, a Basic Experiment has been created within the framework, which is available at https://github.com/DEEP-CGPS/FakeVoiceFinder/blob/main/notebooks/3-%20Experiment_All_models_and_transforms.ipynb (accessed on 5 October 2025). The experiments presented in this example were conducted using a dataset of 600 synthetic audios and 600 real audios. The synthetic audios were obtained from the Mendeley dataset entitled Fake Audio Dataset (ElevenLabs & Respeecher) [50], which was created during the first semester of 2025. On the other hand, the real audios were obtained from public repositories or social networks.
Table 7 presents the metadata of the Fake Audio Dataset (ElevenLabs & Respeecher), available at https://data.mendeley.com/datasets/79g59sp69z/1 (accessed on 5 October 2025) [50]. This dataset was selected because it includes two widely used and publicly accessible synthetic audio generation tools, ElevenLabs and Respeecher, making it realistic for potential misuse scenarios. It contains both TTS and V2V audios, providing diversity and different artifacts that enrich the classification task and improve robustness. The inclusion of male and female voices helps mitigate gender bias, while the uniform sampling rate (22,050 Hz) ensures consistency without requiring additional preprocessing. Moreover, the audio duration (8–10 s) provides sufficient information for models to capture relevant patterns without incurring high computational costs, achieving a balance between effectiveness and efficiency.
The hyperparameters used for both the data-centric and the model-centric approach are presented in Section 2 (Experiment Configuration) of the Basic Example notebook. Specifically, the following architectures were used: alexnet, resnet18, vgg16, vit_b_16, and ConvNext_tiny, each trained under both the scratch and pretrained variants.
Figure 7 shows the results of the model-centric approach, where the transformation is fixed to mel and the trained models are compared across the selected architectures and training variants. The objective of this figure is not to identify the best-performing architecture, but to illustrate how the framework enables systematic comparison between CNN- and Transformer-based models, both with and without transfer learning. As larger datasets are used, models become increasingly capable of distinguishing natural from synthetic audio under a fixed representation, highlighting the value of controlled comparisons. This demonstrates the usefulness of FakeVoiceFinder for researchers developing synthetic audio detection solutions, as it allows custom models to be evaluated alongside benchmark architectures under identical experimental conditions. The only requirement is that custom models be provided in .pt or .pth format, which the framework fully supports.
For this particular experiment, the best option was ConvNext_tiny with transfer learning, followed by ResNet18 also with transfer learning, and then the same type of architecture but trained from scratch. Additionally, a significant difference can be observed between using transfer learning in models such as ConvNext_tiny compared to training them from scratch.
Now, Figure 8 presents the data-centric results. In this case, the AlexNet architecture with transfer learning was evaluated across the four available spectral representations. For the same architecture and dataset, the results vary substantially depending on the selected transformation. In this particular example, the performance difference between the best and worst cases exceeds 20%, with the CQT transform achieving the highest number of correctly classified audio samples.
Analyzing results separately with only the model-centric or data-centric option makes it difficult to identify optimal classifier conditions. Therefore, FakeVoiceFinder includes a third form of visualization, corresponding to the Hybrid-approach. To facilitate this type of analysis, a heatmap is used, which allows both previous approaches to be examined simultaneously (see Figure 9).
Now, Figure 9 presents the hybrid analysis, where each spectral representation is evaluated jointly with all architectures. This view shows that performance emerges from the interaction between data representation and model design rather than from either component alone.
In terms of the data representation:
  • The CQT representation favors pretrained convolutional models such as AlexNet, ConvNext Tiny, and ResNet18, all above 96%, while Transformer architectures drop notably, indicating limited compatibility with this transform.
  • With DWT, the pretrained and scratch variants of ResNet18 reach 90.8% and 89.6%, followed by VGG16 pretrained and AlexNet scratch near 89%, reflecting the advantage of strong mid-level feature extraction.
  • The log-spectrogram benefits classical convolutional models, with ResNet18 pretrained reaching 99.2% and VGG16 pretrained 98.8%, while SimpleCNN and ResNet18 scratch remain above 87%.
  • The mel-spectrogram provides stable high performance, with ResNet18 reaching 97.9% and 97.5% for pretrained and scratch variants, and VGG16 pretrained reaching 97.1%.
In terms of the architecture:
  • ResNet18 is the most robust architecture, consistently above 94% across all representations, indicating strong generalization.
  • VGG16 pretrained performs best under log and mel, while presenting moderate decreases under CQT and DWT.
  • AlexNet and ConvNext Tiny show competitive results when pretrained, particularly with CQT, but their scratch versions have more variable performance.
  • ViT B 16 is highly sensitive to the representation, performing moderately with mel and DWT but degrading sharply with CQT and log.
  • SimpleCNN remains stable between 85% and 88%, illustrating how lightweight custom models can also be systematically evaluated in the framework.
From the user’s perspective, the hybrid analysis demonstrates the importance of having both a broad set of architectures and multiple spectral representations. The library allows any custom model to be evaluated under identical conditions, making it possible to determine whether the observed performance variations originate from architectural choices, from the selected transform, or from their interaction. This systematic exploration is essential, since no representation and no architecture is universally superior, and only their combined assessment reveals the most suitable configuration for a given detection task.

5.3. Inference Module

The inference module of FakeVoiceFinder extends its applicability beyond academic research. Users can evaluate individual audio files with a pre-trained model, obtaining probabilistic scores for the classes real and fake. Starting from a checkpoint and a configurable audio transformation (mel, log, DWT, or CQT), the system performs pre-processing, generates the input representation, and applies the model to compute softmax probabilities, reported as percentages. To improve interpretability, the module integrates a gauge-style visualization that highlights the probability of fake with qualitative bands (low, medium, high). This dual role, serving as both a research platform and a practical tool, makes the framework useful in contexts such as digital forensics, media verification, or robustness testing.
In the inference demo of the proposed framework, available at https://github.com/DEEP-CGPS/FakeVoiceFinder/blob/main/notebooks/4-%20inference_demo_all.ipynb (accessed on 5 October 2025), the process for performing inference on a single audio file with a pre-trained model is presented. The first step is to configure the input parameters, which include: the path to the pre-trained model, the type of transformation used during training (mel, log, DWT, or CQT), the sampling rate of the audio, the duration of the audio clip in seconds, the size of the input image, and the transformation-specific hyperparameters applied in the training process (See Figure 10).
Once the configuration is defined, the system executes the preprocessing, generates the corresponding representation, and applies the model to obtain softmax probabilities for the classes real and fake, reported as percentages and displayed in the gauge-style visualization described above.
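The sketch below illustrates these inference steps for a single file outside the framework, assuming a TorchScript checkpoint trained on mel-spectrograms; the file names, transformation settings, input size, and class ordering are placeholders rather than the framework's actual defaults.

import librosa
import torch
import torch.nn.functional as F

# Load a trained model exported as TorchScript (placeholder file name).
model = torch.jit.load("resnet18_pretrained_mel.pt").eval()

# Preprocess the audio and build the same representation used during training.
y, sr = librosa.load("sample.wav", sr=22050, duration=3.0)
mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128))

x = torch.from_numpy(mel).float()[None, None]            # shape (1, 1, H, W)
x = F.interpolate(x, size=(224, 224), mode="bilinear")   # match the training image size

# Softmax probabilities for the two classes, reported as percentages.
with torch.no_grad():
    probs = F.softmax(model(x), dim=1).squeeze()
print({"real": round(probs[0].item() * 100, 2), "fake": round(probs[1].item() * 100, 2)})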
For the evaluation of the inference module, the TTS/V2V Audio Deepfake Dataset, available at https://data.mendeley.com/datasets/h4zbs27tkr/2 (accessed on 5 October 2025) was used [51]. This dataset contains both text-to-speech (TTS) and voice-to-voice (V2V) synthetic speech samples. Specifically, 32 trained models were used to evaluate 60 natural and 60 fake audio files.
First, the result for a single audio sample is presented in Figure 11. Probabilities above 50% are interpreted as fake, while values below this threshold are considered real. In this example, the file 592.wav is correctly classified as fake by the ResNet18_pretrained model, with a probabilistic score of 97.58%, which corresponds to its true label.
Subsequently, the 120 selected audio samples from the TTS/V2V Audio Deepfake Dataset were evaluated, and the results of the 32 models for each of the four spectral representations are presented in Figure 12.
Figure 12 shows a wide dispersion in accuracy values across both model architectures and spectral representations, highlighting the importance of evaluating multiple configurations under identical training conditions. The heatmap reveals that certain architectures, such as ConvNext Tiny (pretrained), exhibit large performance fluctuations depending on the spectral representation used: high accuracy under CQT transform, but substantially lower performance under DWT. This variability demonstrates that architectural strength alone does not guarantee robust performance, and that the choice of spectral representation can dramatically alter a model’s effectiveness in detecting synthetic audio.
In contrast, some models display consistently poor performance regardless of the representation, such as ViT_B_16 (pretrained), which remains among the lowest-performing architectures across all transforms. These cases underscore the value of benchmarking many models simultaneously: the framework makes it possible to identify not only strong global performers, but also architectures that are highly sensitive to feature design or fundamentally unsuitable for the task. Overall, the figure illustrates how model architecture and spectral representation interact in complex ways, reinforcing the necessity of comparative, controlled experimentation in audio deepfake detection research.
As a next step, only synthetic audios generated using TTS technology (Figure 13) and V2V technology (Figure 14) were evaluated, with the aim of determining whether the models exhibited greater difficulty when the synthetic samples originated from either of these two types.
One of the main advantages of training a large set of models under identical experimental conditions is the ability to extract reliable comparative insights regarding how architectures and spectral representations behave when confronted with different types of synthetic audio. Since the 32 models were trained using a dataset in which approximately 82% of the synthetic samples were generated with V2V technology, the evaluation highlights the extent to which each model can generalize beyond the dominant manipulation type seen during training.
The comparison between the TTS-only and V2V-only evaluations reveals several consistent patterns. First, most architectures achieve considerably higher accuracy when detecting V2V-generated audio, suggesting that V2V artifacts are more easily captured by the models, especially under the CQT and DWT representations. In contrast, the TTS-generated audios produce more variability and lower accuracy, particularly in weaker architectures trained from scratch, which indicates that TTS manipulations introduce signal characteristics that are more challenging to model under limited training diversity.
Second, systematic cross-representation analysis shows that DWT and CQT provide the most stable performance across both scenarios, whereas log and mel are more sensitive to the choice of architecture.
Finally, the side-by-side comparison makes it possible to identify models that generalize well across both manipulation types, as well as those that perform strongly under V2V but degrade substantially under TTS. These findings demonstrate the analytical potential of the framework: by enforcing identical hyperparameters and preprocessing choices, it becomes possible to uncover nuanced interactions between model architecture, spectral representation, and the nature of the synthetic audio to be detected.

5.4. Comparison with Existing Frameworks

Several open-source initiatives have addressed the problem of synthetic audio and deepfake voice detection. While many focus on specific architectures or evaluation protocols, few provide an integrated environment comparable to FakeVoiceFinder, where both model-centric and data-centric analyses can be performed systematically. Below we summarize some of the most representative frameworks and repositories.
The ASVspoof challenge repositories (2019, 2021, 2025) [5,6] provide baseline systems, datasets, and standardized evaluation protocols for anti-spoofing. They are the de facto benchmark in the field, but they are primarily focused on fixed datasets and do not provide modular tools for user-defined experimentation.
The more recent VoiceWukong project emphasizes benchmarking under realistic threat models, integrating diverse attacks and evaluation settings. However, its focus is on standardized security evaluation rather than providing a flexible environment for hybrid analysis.
Another relevant initiative is Deep-O-Meter (University at Buffalo), accessible at https://zinc.cse.buffalo.edu/ubmdfl/deep-o-meter/home_login (accessed on 5 October 2025). It is designed as a multimodal platform for deepfake detection (image, video, audio), where users can upload content and receive classification scores. Unlike FakeVoiceFinder, which emphasizes reproducibility and experimentation with user-defined datasets and architectures, Deep-O-Meter operates mainly as an evaluation interface rather than a research framework.
Other research-oriented repositories include implementations of specific architectures or strategies, such as CNN-based pipelines for audio deepfake detection [43], end-to-end Transformers with knowledge distillation, and frameworks oriented towards adversarial robustness testing. These are valuable contributions, but they generally address one methodological dimension in isolation.
As seen in Table 8, most existing solutions provide either datasets, baselines, or evaluation protocols, but not an integrated environment that allows researchers to systematically explore the interaction between data representations and architectures. In contrast, FakeVoiceFinder is explicitly designed as a flexible framework where users can upload their own datasets, test standard or custom architectures, and obtain comparable results across multiple experimental conditions.
This unified view, where both the architectural dimension and the signal-representation dimension can be explored systematically under identical preprocessing, splits, and metrics, is not supported by any of the existing open-source frameworks reviewed.

6. Discussion

The primary aim of FakeVoiceFinder is not to establish the superiority of a particular model or spectral representation, but to offer a flexible and reproducible environment in which users can systematically evaluate their own datasets and architectures. The results presented in this article serve as illustrative examples of the types of metrics, visualizations, and inference outputs that the framework can generate. The main strength of the library lies in its capacity to adapt to diverse experimental configurations while ensuring methodological consistency.
A key contribution of the framework is its ability to benchmark custom architectures against well-established baselines under strictly controlled and comparable conditions. Users can evaluate models using identical datasets, fixed training schedules, unified hyperparameters, and a standardized set of spectral representations (mel, log, DWT, and CQT). This addresses a recurrent limitation in the deepfake-audio literature, where differences in preprocessing steps, model configurations, or evaluation methodologies often hinder direct comparison of results across studies. By enforcing uniform experimental conditions, FakeVoiceFinder enables a rigorous and unbiased assessment of whether specific architectural choices provide genuine performance improvements.
The framework also supports model-centric, data-centric, and hybrid evaluation strategies. This allows researchers to analyze whether improvements in detection performance stem from architectural innovations, from the choice of spectral representation, or from the interaction between both. Such a comprehensive perspective remains uncommon in current research, where studies typically focus on one of these dimensions in isolation. Additionally, the possibility of specifying custom random seeds (cfg.seed) allows for repeated experimental runs when desired, supporting the estimation of variability or confidence intervals and strengthening reproducibility.
The probabilistic inference module expands the practical applicability of the library. Users can train models with their own data and subsequently deploy them to evaluate individual audio samples across all supported spectral transformations, including mel, log, DWT, and CQT. The resulting probabilistic scores are useful not only for academic research but also for applied scenarios such as digital forensics, media verification, and robustness evaluation against adversarial or signal-level perturbations.
Overall, FakeVoiceFinder provides a solid foundation for advancing reproducibility, comparability, and transparency in synthetic-audio detection research. Its open-source nature, modular design, compatibility with custom datasets and models, and coherent experimental workflow position it as a versatile tool for both scientific inquiry and practical analysis.

7. Conclusions

This work introduced FakeVoiceFinder, an open-source framework that enables systematic analysis of synthetic audio detection from both model-centric and data-centric perspectives. The library integrates multiple audio transformations (mel, log, DWT, and CQT) with diverse deep learning architectures (CNNs and Transformers), providing a flexible environment for benchmarking under consistent and reproducible experimental conditions.
The results obtained from the Basic Example highlight three main findings. First, detection performance depends not only on the choice of architecture but also on the selected spectral representation, with differences reaching up to 40% in some cases (for example, the ConvNext_tiny pretrained configuration). Second, the hybrid approach shows that no single transformation or architecture consistently outperforms the others, underscoring the importance of combined and comparative evaluations. Third, the probabilistic inference module extends the applicability of the framework beyond academic research, enabling practitioners to evaluate individual audio samples and assess model robustness under adversarial or signal-level manipulations.
A distinctive feature of the framework is that users can upload their own custom models and directly compare them against well-established benchmark architectures under identical experimental conditions, including the type of spectral transformation. This functionality addresses a recurrent difficulty in the field, namely the challenge of producing fair and meaningful comparisons. Instead of relying on performance metrics reported across heterogeneous publications, often based on different datasets, preprocessing steps, or evaluation protocols, researchers can rigorously determine whether a proposed solution offers genuine improvements or whether it falls short of competitive performance.
By making this framework openly available, we aim to support reproducibility, comparability, and methodological transparency in the study of synthetic and deepfake audio detection.

Author Contributions

Conceptualization, D.B. and C.P.; methodology, D.B.; software, C.P.; validation, C.P.; formal analysis, D.B.; investigation, C.P.; writing—original draft preparation, D.B.; writing—review and editing, C.P.; supervision, D.B.; funding acquisition, D.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was sponsored by the Vicerrectoría de Investigaciones of the Universidad Militar Nueva Granada under project INV-ING-4152.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code of FakeVoiceFinder is openly available at https://github.com/DEEP-CGPS/FakeVoiceFinder, and the original data presented in the study are openly available in Mendeley Data [50,51] (accessed on 5 October 2025).

Acknowledgments

The authors would like to thank the research assistants of the project, Miguel A. Beltrán Barrantes and Brayan F. González Velasco, who participated in the preparation of the datasets (Fake Audio Dataset and the TTS/V2V Audio Deepfake Dataset) used in the experimental phase of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACC   Accuracy
AI    Artificial Intelligence
AUC   Area Under the ROC Curve
CNN   Convolutional Neural Network
DL    Deep Learning
EER   Equal Error Rate
LSTM  Long Short-Term Memory
GRU   Gated Recurrent Unit
TTS   Text-to-Speech
V2V   Voice-to-Voice
ViT   Vision Transformer

References

  1. Huang, W.C.; Hayashi, T.; Wu, Y.C.; Kameoka, H.; Toda, T. Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 4676–4680. [Google Scholar] [CrossRef]
  2. Patel, A.; Madnani, H.; Tripathi, S.; Sharma, P.; Shukla, V. Real-Time Voice Cloning: Artificial Intelligence to Clone and Generate Human Voice. In Intelligent Solutions for Smart Adaptation in Digital Era (InCITe 2024); Lecture Notes in Electrical Engineering; Hasteer, N., Blum, C., Mehrotra, D., Pandey, H., Eds.; Springer: Singapore, 2025; Volume 1278. [Google Scholar] [CrossRef]
  3. Khan, A.A.; Laghari, A.A.; Inam, S.A.; Ullah, S.; Shahzad, M.; Syed, D. A survey on multimedia-enabled deepfake detection: State-of-the-art tools and techniques, emerging trends, current challenges & limitations, and future directions. Discov. Comput. 2025, 28, 48. [Google Scholar]
  4. Patel, Y.; Tanwar, S.; Gupta, R.; Bhattacharya, P.; Davidson, I.E.; Nyameko, R.; Aluvala, S.; Vimal, V. Deepfake generation and detection: Case study and challenges. IEEE Access 2023, 11, 143296–143323. [Google Scholar] [CrossRef]
  5. Yamagishi, J.; Todisco, M.; Sahidullah, M.; Delgado, H.; Wang, X.; Evans, N.; Kinnunen, T.; Lee, K.A.; Vestman, V.; Nautsch, A. ASVspoof 2019: The 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge Database. 2019. Available online: https://datashare.ed.ac.uk/handle/10283/3336 (accessed on 5 October 2025).
  6. Yamagishi, J.; Wang, X.; Todisco, M.; Sahidullah, M.; Patino, J.; Nautsch, A.; Liu, X.; Lee, K.A.; Kinnunen, T.; Evans, N.; et al. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. arXiv 2021, arXiv:2109.00537. [Google Scholar] [CrossRef]
  7. Delgado, H.; Evans, N.; Jung, J.W.; Kinnunen, T.; Kukanov, I.; Lee, K.A.; Liu, X.; Shim, H.j.; Sahidullah, M.; Tak, H.; et al. ASVspoof 5 Evaluation Plan. 2024. Available online: https://www.asvspoof.org/file/ASVspoof5___Evaluation_Plan_Phase2.pdf (accessed on 5 October 2025).
  8. Yan, Z.; Zhao, Y.; Wang, H. VoiceWukong: Benchmarking Deepfake Voice Detection. In Proceedings of the 34th USENIX Security Symposium (USENIX Security 25), Seattle, WA, USA, 13–15 August 2025; pp. 4561–4580. [Google Scholar]
  9. Dsouza, D.J.; Rodrigues, A.P.; Fernandes, R. Multi-modal Comparative Analysis on Audio Dub Detection using Artificial Intelligence. IEEE Access 2025, 13, 128856–128878. [Google Scholar] [CrossRef]
  10. Xie, Z.; Li, B.; Xu, X.; Liang, Z.; Yu, K.; Wu, M. FakeSound: Deepfake general audio detection. arXiv 2024, arXiv:2406.08052. [Google Scholar] [CrossRef]
  11. Cheng, H.; Li, K.; Ye, L.; Wang, J. EnvFake: An Initial Environmental-Fake Audio Dataset for Scene-Consistency Detection. In Proceedings of the 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), Beijing, China, 7–10 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 81–85. [Google Scholar]
  12. Sun, C.; Jia, S.; Hou, S.; Lyu, S. Ai-synthesized voice detection using neural vocoder artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 904–912. [Google Scholar]
  13. Ahmad, O.; Khan, M.S.; Jan, S.; Khan, I. Deepfake Audio Detection for Urdu Language Using Deep Neural Networks. IEEE Access 2025, 13, 97765–97778. [Google Scholar] [CrossRef]
  14. Pintelas, E.; Livieris, I.E. Convolutional neural network framework for deepfake detection: A diffusion-based approach. Comput. Vis. Image Underst. 2025, 257, 104375. [Google Scholar] [CrossRef]
  15. Tahaoglu, G.; Baracchi, D.; Shullani, D.; Iuliani, M.; Piva, A. Deepfake audio detection with spectral features and ResNeXt-based architecture. Knowl.-Based Syst. 2025, 323, 113726. [Google Scholar] [CrossRef]
  16. Gulsoy, T.; Gulsoy, E.K.; Ustubioglu, A.; Ustubioglu, B.; Kablan, E.B.; Ayas, S.; Ulutas, G.; Tahaoglu, G.; Elhoseny, M. Detecting audio splicing forgery: A noise-robust approach with Swin Transformer and cochleagram. J. Inf. Secur. Appl. 2025, 93, 104130. [Google Scholar] [CrossRef]
  17. Zaman, K.; Samiul, I.J.A.M.; Sah, M.; Direkoglu, C.; Okada, S.; Unoki, M. Hybrid Transformer Architectures With Diverse Audio Features for Deepfake Speech Classification. IEEE Access 2024, 12, 149221–149237. [Google Scholar] [CrossRef]
  18. Zaman, K.; Li, K.; Sah, M.; Direkoglu, C.; Okada, S.; Unoki, M. Transformers and audio detection tasks: An overview. Digit. Signal Process. 2025, 158, 104956. [Google Scholar]
  19. Tang, Y.; Mu, J. ConvTrans-DF: A Deep Fake Detection Method Combining CNN and Transformer. In Proceedings of the International Conference on Intelligent Computing, Ningbo, China, 26–29 July 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 334–345. [Google Scholar]
  20. Petmezas, G.; Vanian, V.; Konstantoudakis, K.; Almaloglou, E.E.; Zarpalas, D. Video deepfake detection using a hybrid CNN-LSTM-Transformer model for identity verification. Multimed. Tools Appl. 2025, 84, 40617–40636. [Google Scholar]
  21. Wang, C.; Yi, J.; Tao, J.; Zhang, C.; Zhang, S.; Chen, X. Detection of cross-dataset fake audio based on prosodic and pronunciation features. arXiv 2023, arXiv:2305.13700. [Google Scholar] [CrossRef]
  22. Liu, T.; Kukanov, I.; Pan, Z.; Wang, Q.; Sailor, H.B.; Lee, K.A. Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing. arXiv 2024, arXiv:2409.08346. [Google Scholar] [CrossRef]
  23. Shi, H.; Shi, X.; Dogan, S.; Alzubi, S.; Huang, T.; Zhang, Y. Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios. arXiv 2025, arXiv:2504.12423. [Google Scholar] [CrossRef]
  24. Ballesteros, D.M.; Rodriguez-Ortega, Y.; Renza, D.; Arce, G. Deep4SNet: Deep learning for fake speech classification. Expert Syst. Appl. 2021, 184, 115465. [Google Scholar] [CrossRef]
  25. Camacho, S.; Ballesteros, D.M.; Renza, D. Fake speech recognition using deep learning. In Proceedings of the Workshop on Engineering Applications, Medellín, Colombia, 6–8 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 38–48. [Google Scholar]
  26. Zhang, T.; Feng, G.; Liang, J.; An, T. Acoustic scene classification based on Mel spectrogram decomposition and model merging. Appl. Acoust. 2021, 182, 108258. [Google Scholar] [CrossRef]
  27. Wani, T.M.; Amerini, I. Deepfakes audio detection leveraging audio spectrogram and convolutional neural networks. In Proceedings of the International Conference on Image Analysis and Processing, Udine, Italy, 11–15 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 156–167. [Google Scholar]
  28. Mehra, S.; Ranga, V.; Agarwal, R. A deep learning approach to dysarthric utterance classification with BiLSTM-GRU, speech cue filtering, and log mel spectrograms. J. Supercomput. 2024, 80, 14520–14547. [Google Scholar]
  29. Fathan, A.; Alam, J.; Kang, W. Multiresolution decomposition analysis via wavelet transforms for audio deepfake detection. In Proceedings of the International Conference on Speech and Computer, Gurugram, India, 14–16 November 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 188–200. [Google Scholar]
  30. Zbezhkhovska, U.; Khapilin, O. Deepfake Audio Detection with Sinc and Wavelet Filters in RawNet2. In Proceedings of the International Conference on Information and Communication Technologies in Education, Research, and Industrial Applications, Lviv, Ukraine, 23–27 September 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 273–284. [Google Scholar]
  31. Singh, S.; Bharadwaj, N.K. Waveform and Mel-Frequency Cepstral Coefficients (MFCC) approach for Deepfake Audio Detection. In Proceedings of the 2025 IEEE International Conference on Emerging Technologies and Applications (MPSec ICETA), Gwalior, India, 21–23 February 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–6. [Google Scholar]
  32. Wang, Y.; Skerry-Ryan, R.; Stanton, D.; Wu, Y.; Weiss, R.J.; Jaitly, N.; Yang, Z.; Xiao, Y.; Chen, Z.; Bengio, S.; et al. Tacotron: Towards end-to-end speech synthesis. arXiv 2017, arXiv:1703.10135. [Google Scholar]
  33. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R.; et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4779–4783. [Google Scholar]
  34. Kim, J.; Kong, J.; Son, J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Breckenridge, CO, USA, 2021; pp. 5530–5540. [Google Scholar]
  35. Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.Y. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv 2020, arXiv:2006.04558. [Google Scholar]
  36. Lee, S.G.; Ping, W.; Ginsburg, B.; Catanzaro, B.; Yoon, S. Bigvgan: A universal neural vocoder with large-scale training. arXiv 2022, arXiv:2206.04658. [Google Scholar]
  37. Dhar, S.; Jana, N.D.; Das, S. Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements. arXiv 2025, arXiv:2504.19197. [Google Scholar] [CrossRef]
  38. Choi, J.E.; Schäfer, K.; Steinebach, M. The Sound of Language: A Bilingual Analysis of Voice Conversion and Text-to-Speech Synthesis. In Proceedings of the ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar]
  39. Walczyna, T.; Piotrowski, Z. Overview of voice conversion methods based on deep learning. Appl. Sci. 2023, 13, 3100. [Google Scholar] [CrossRef]
  40. Lavrentyeva, G.; Novoselov, S.; Tseren, A.; Volkova, M.; Gorlanov, A.; Kozlov, A. STC Anti-spoofing Systems for the ASVspoof2019 Challenge. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1033–1037. [Google Scholar] [CrossRef]
  41. Todisco, M.; Delgado, H.; Evans, N.W.D. Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification. In Proceedings of the Odyssey: The Speaker and Language Recognition Workshop, Bilbao, Spain, 21–24 June 2017; pp. 27–30. [Google Scholar] [CrossRef]
  42. Wu, Z.; Das, R.K.; Yang, J.; Li, H. Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 794–798. [Google Scholar] [CrossRef]
  43. Shaaban, O.A.; Yildirim, R. Audio Deepfake Detection Using Deep Learning. Eng. Rep. 2025, 7, e70087. [Google Scholar] [CrossRef]
  44. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  45. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
  46. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  47. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; PMLR: Breckenridge, CO, USA, 2019; Volume 97, pp. 6105–6114. [Google Scholar]
  48. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar] [CrossRef]
  49. Abuhmida, M.; Whittey, R.; Hossain, M.M. Enhancing Audio Deepfake Detection: A Study of Deep Learning Parameters. In Proceedings of the International Conference for Emerging Technologies in Computing, Essex, UK, 15–16 August 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 140–160. [Google Scholar]
  50. Beltran, M.; Ballesteros L, D.M. Fake Audio Dataset (ElevenLabs & Respeecher). Mendeley Data, V1. 2025. Available online: https://data.mendeley.com/datasets/79g59sp69z/1 (accessed on 5 October 2025).
  51. Gonzalez, B.; Ballesteros L, D.M. TTS/V2V Audio Deepfake Dataset. Mendeley Data, V2. 2025. Available online: https://data.mendeley.com/datasets/h4zbs27tkr/2 (accessed on 5 October 2025).
  52. An, W.; Li, R.; Ge, H.; Li, M.; Li, H. An End-to-End Audio Transformer with Multi-student Knowledge Distillation algorithm for Deepfake Speech Detection. In Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition, Tianjin, China, 25–27 October 2024; pp. 366–371. [Google Scholar]
  53. Rabhi, M.; Bakiras, S.; Di Pietro, R. Audio-deepfake detection: Adversarial attacks and countermeasures. Expert Syst. Appl. 2024, 250, 123941. [Google Scholar] [CrossRef]
Figure 1. Overview of the experimental pipeline and configuration flow used in the proposed framework. Colored blocks indicate the inputs associated with each main module: blue corresponds to experiment-level settings (experiment.py), green to model selection and initialization (model_loader.py), red to dataset preparation and feature extraction (prepare_dataset.py), and purple to training and optimization parameters (trainer.py). Ellipses (“…”) denote additional parameters of the same nature that are omitted for clarity and space reasons; these omissions do not affect the scientific interpretation of the workflow. The arrows represent data and configuration dependencies among modules, converging in a unified configuration object (ExperimentConfig), which orchestrates dataset preparation, model loading, training, validation, and metric reporting. No overlapping or incomplete elements compromise the understanding of the experimental process.
Figure 2. Example of the dataset preparation summary and generated experiment manifest. The figure shows the console output produced after executing the dataset preparation stage, including class distribution, train–test split statistics, and the number of samples generated for each spectral transformation. Text displayed in different font colors corresponds to standard console formatting used to visually distinguish commands, labels, and output values; font color does not encode additional semantic or experimental information and is shown for readability purposes only.
Figure 3. Inference workflow for synthetic audio detection. Colored blocks indicate the functional role of each component in the inference stage: green represents the set of input parameters provided to the inference script (inference.py), including the selected spectral transform, its associated parameters, the input audio file, and the trained model path. The central block corresponds to the inference engine (InferenceRunner), which processes the input audio using the specified transformation and the loaded model. The pink block denotes the inference output, expressed as class probability scores (e.g., real vs. fake). Arrows indicate the flow of information from inputs to the final inference score.
Figure 4. Workflow for integrating custom models into the training pipeline. The figure summarizes the main steps for incorporating user-defined and benchmark models into the experimental framework: configuration of experiment parameters, model loading, and unified training with checkpoint storage. Numbered labels indicate execution order, while dashed boxes group related operations for clarity. Partial code snippets and cropped elements are included for illustrative purposes only and do not affect scientific understanding. Font colors follow standard code-highlighting conventions and do not encode additional semantic information. Any references originally shown inside the figure have been moved to the caption in compliance with MDPI guidelines.
Figure 5. Architecture of the custom SimpleCNN model integrated into FakeVoiceFinder. The network receives a single-channel time–frequency input of 224 × 224 × 1 and processes it through three convolutional–pooling blocks that progressively reduce spatial dimensions ( 224 112 56 ) while increasing feature depth ( 1 32 64 128 ). A Global Average Pooling (GAP) layer condenses the final feature map into a 128-dimensional vector, which is then classified into two output classes: real and fake.
Figure 6. Example of the comparison capabilities of FakeVoiceFinder. The table reports performance metrics for both benchmark architectures (e.g., AlexNet, ConvNext, ResNet, VGG, ViT) and a custom user-defined model (i.e., usermodel_SimpleCNN), evaluated under two spectral representations (CQT and DWT). The results illustrate how the framework supports systematic benchmarking across models and transformations, enabling fair comparison under identical experimental conditions.
Figure 7. Model-centric results of the Basic Example within the FakeVoiceFinder framework, illustrating the classification accuracy obtained by different neural network architectures and training strategies (scratch and pretrain) when evaluated on the Fake Audio Dataset (ElevenLabs & Respeecher) using the mel-spectrogram representation. The figure compares convolutional and transformer-based models, including benchmark and user-defined architectures, highlighting the influence of model selection and initialization on synthetic audio detection performance.
Figure 8. Data-centric results illustrating the classification accuracy obtained by the AlexNet architecture trained on the Fake Audio Dataset (ElevenLabs & Respeecher) across different time–frequency representations. The bar chart compares the model performance using constant-Q transform (CQT), discrete wavelet transform (DWT), logarithmic spectrogram (log), and mel-spectrogram (mel) features, highlighting the impact of the selected spectral representation on synthetic audio detection accuracy.
Figure 9. Hybrid-approach results of the Basic Example, showing the classification accuracy obtained by different neural network architectures across multiple time–frequency transformations when evaluated on the Fake Audio Dataset (ElevenLabs & Respeecher). The figure jointly analyzes the impact of model selection and spectral representation, highlighting how architectural choices and input feature design interact in synthetic audio detection performance.
Figure 10. Example of the inference module. The code illustrates the definition of model paths, selection of the time–frequency transformation, and specification of transformation parameters used during inference, including sampling rate, clip duration, image size, and spectral settings. All numerical values are expressed using standard Python notation, where numbers are written without thousand separators or locale-specific delimiters.
Figure 11. Example of the FakeVoiceFinder inference module applied to an audio file (592.wav). The system outputs a probabilistic score (in this case, 97.58%) indicating the likelihood that the sample is synthetic. The visualization uses a color-coded gauge, where green denotes low probability (0–50%), yellow medium (50–75%), and red high (75–100%).
Figure 12. Overall accuracy of the 32 models trained on the Fake Audio Dataset (ElevenLabs & Respeecher) and evaluated on 120 natural and synthetic audio files from the TTS/V2V Audio Deepfake Dataset. Each cell reports the classification accuracy computed under identical training and preprocessing conditions. The heatmap highlights performance differences across the four spectral representations (CQT, DWT, log-spectrogram, and mel), illustrating how model architecture and feature design influence synthetic-audio detection accuracy.
Figure 13. Overall accuracy of the 32 models trained on the Fake Audio Dataset (ElevenLabs & Respeecher), where the majority of synthetic samples were generated using V2V technology, and evaluated on 30 TTS-generated audio files from the TTS/V2V Audio Deepfake Dataset. The heatmap shows the accuracy obtained across the four spectral representations (CQT, DWT, log-spectrogram, and mel), highlighting how model architecture and spectral transformation behave when detecting synthetic audio produced by a manipulation type that is poorly represented during training.
Figure 14. Overall accuracy of the 32 models trained on the Fake Audio Dataset (ElevenLabs & Respeecher), where the majority of synthetic samples were generated using V2V technology, and evaluated on 30 V2V-generated audio files from the TTS/V2V Audio Deepfake Dataset. The heatmap shows the accuracy obtained across the four spectral representations (CQT, DWT, log, and mel), highlighting how model architecture and spectral transformation behave when detecting synthetic audio produced by a manipulation type that is well represented during training.
Table 1. Hyperparameters for audio transformations.
| Transformation | Hyperparameter | Description | Default Value |
| mel-scale | n_fft | Size of the FFT window. | 2048 |
| mel-scale | hop_length | Number of samples between STFT frames. | 512 |
| mel-scale | n_mels | Number of mel bands. | 128 |
| Log | n_fft | Size of the FFT window. | 2048 |
| Log | hop_length | Number of samples between STFT frames. | 512 |
| Scalogram (DWT) | wavelet | Type of mother wavelet. | db4 |
| Scalogram (DWT) | level | Number of decomposition levels. | 4 |
| Scalogram (DWT) | mode | Boundary extension mode. | constant |
| CQT | hop_length | Number of samples between CQT frames. | 256 |
| CQT | n_bins | Total number of CQT frequency bins. | 96 |
| CQT | bins_per_octave | Resolution of the frequency axis. | 24 |
| CQT | scale | Scaling option producing a more stable spectral distribution. | True |
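For reference, the following sketch computes the four supported representations with the default hyperparameters of Table 1, using librosa and PyWavelets; the file name is a placeholder, and the framework's internal implementation (e.g., scaling and conversion to fixed-size images) may differ.

```python
import librosa
import numpy as np
import pywt

# Placeholder path; the sampling rate matches the datasets described later (22,050 Hz).
y, sr = librosa.load("sample.wav", sr=22050)

# mel-scale spectrogram (n_fft=2048, hop_length=512, n_mels=128)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# log-magnitude spectrogram (n_fft=2048, hop_length=512)
stft = librosa.stft(y, n_fft=2048, hop_length=512)
log_spec = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# DWT decomposition underlying the scalogram (wavelet=db4, level=4, mode=constant)
coeffs = pywt.wavedec(y, wavelet="db4", level=4, mode="constant")

# Constant-Q transform (hop_length=256, n_bins=96, bins_per_octave=24, scale=True)
cqt = librosa.cqt(y, sr=sr, hop_length=256, n_bins=96, bins_per_octave=24, scale=True)
cqt_db = librosa.amplitude_to_db(np.abs(cqt), ref=np.max)
```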
Table 2. Available model architectures in FakeVoiceFinder.
| Architecture Type | Available Options |
| CNN | AlexNet, ResNet18, ResNet34, VGG16, VGG19, DenseNet121, MobileNet_v2, EfficientNet_b0, SqueezeNet1_0, Inception_v3, ConvNext_tiny, ConvNext_small, ConvNext_base |
| Transformer | ViT_B_16 |
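As an illustration of how one of these benchmark backbones can be adapted to the binary real/fake task, the sketch below replaces the classification head of a torchvision ResNet18; the helper name is hypothetical and does not reflect how model_loader.py builds models internally.

```python
import torch.nn as nn
from torchvision import models


def build_resnet18(pretrained: bool = True) -> nn.Module:
    """Illustrative two-class ResNet18, roughly mirroring the scratch/pretrained options.

    Note: torchvision backbones expect 3-channel inputs, so single-channel
    time-frequency images are assumed to be replicated across channels.
    """
    weights = models.ResNet18_Weights.DEFAULT if pretrained else None
    model = models.resnet18(weights=weights)
    model.fc = nn.Linear(model.fc.in_features, 2)   # real vs. fake
    return model
```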
Table 3. Hyperparameters for training.
| Hyperparameter | Type | Description | Example |
| epochs | Integer | Number of training epochs. | 50 |
| lr | Float | Learning rate used by the optimizer. | 0.001 |
| bs | Integer | Batch size used during training. | 32 |
| optim_name | String | Optimizer to be used: sgd or adam. | adam |
| patience | Integer | Number of epochs without improvement before early stopping. | 10 |
| seed | Integer | Random seed for reproducibility of results. | 42 |
| type_train | String | Type of training: scratch, pretrained, or both. | pretrained |
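A hypothetical configuration object mirroring the fields of Table 3 is sketched below; the field names follow the table, but the actual ExperimentConfig may organize or validate them differently.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Training hyperparameters as listed in Table 3 (names illustrative)."""
    epochs: int = 50                 # number of training epochs
    lr: float = 0.001                # learning rate used by the optimizer
    bs: int = 32                     # batch size used during training
    optim_name: str = "adam"         # "sgd" or "adam"
    patience: int = 10               # epochs without improvement before early stopping
    seed: int = 42                   # random seed for reproducibility
    type_train: str = "pretrained"   # "scratch", "pretrained", or "both"


cfg = TrainConfig()                  # defaults reproduce the example column of Table 3
```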
Table 4. Intermediate tensor sizes for the custom SimpleCNN model for an input of 224 × 224 × 1 .
| Layer (Custom Model) | Height | Width | Channels | Filter Height | Filter Width | Vector Length |
| Input | 224 | 224 | 1 | | | |
| Conv1 | 224 | 224 | 32 | 3 | 3 | |
| MaxPool1 | 112 | 112 | 32 | 2 | 2 | |
| Conv2 | 112 | 112 | 64 | 3 | 3 | |
| MaxPool2 | 56 | 56 | 64 | 2 | 2 | |
| Conv3 | 56 | 56 | 128 | 3 | 3 | |
| GAP + Flatten | 1 | 1 | 128 | | | 128 |
| Linear (output) | | | | | | 2 |
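The following PyTorch definition is a reconstruction consistent with the tensor sizes in Table 4 (and Figure 5); activation functions are not listed in the table and are assumed to be ReLU, so the repository's exact SimpleCNN may differ in such details.

```python
import torch
import torch.nn as nn


class SimpleCNN(nn.Module):
    """Three conv-pool blocks, GAP, and a two-class linear head (cf. Table 4)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 224x224x1  -> 224x224x32
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # -> 112x112x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # -> 112x112x64
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # -> 56x56x64
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # -> 56x56x128
            nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)                 # global average pooling -> 1x1x128
        self.classifier = nn.Linear(128, num_classes)      # 128 -> 2 (real, fake)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.gap(x).flatten(1)                         # -> (batch, 128)
        return self.classifier(x)


# Sanity check against Table 4: a 224x224 single-channel input yields 2 logits.
logits = SimpleCNN()(torch.randn(1, 1, 224, 224))
assert logits.shape == (1, 2)
```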
Table 5. Confusion matrix (class 1 = fake, class 0 = real). Accuracy = 0.8.
| | Pred 0 (Real) | Pred 1 (Fake) |
| Actual 0 (real) | TN = 440 | FP = 160 |
| Actual 1 (fake) | FN = 80 | TP = 520 |
Table 6. Metrics for class 1 (fake) and global values computed from the confusion matrix in Table 5.
| Metric | Value |
| Precision (class 1, fake) | 0.7647 |
| Recall (class 1, fake) | 0.8667 |
| F1 (class 1, fake) | 0.8129 |
| Accuracy (global) | 0.8000 |
| F1 Macro | 0.7995 |
| F1 Micro | 0.8000 |
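The values in Tables 5 and 6 follow directly from the confusion-matrix counts and can be reproduced (up to small rounding differences) with a few lines of arithmetic, as in the sketch below.

```python
# Confusion-matrix counts from Table 5 (class 1 = fake, class 0 = real).
tn, fp, fn, tp = 440, 160, 80, 520

precision_fake = tp / (tp + fp)                    # 520/680 ≈ 0.7647
recall_fake = tp / (tp + fn)                       # 520/600 ≈ 0.8667
f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)

precision_real = tn / (tn + fn)
recall_real = tn / (tn + fp)
f1_real = 2 * precision_real * recall_real / (precision_real + recall_real)

accuracy = (tp + tn) / (tp + tn + fp + fn)         # 960/1200 = 0.80
f1_macro = (f1_fake + f1_real) / 2                 # unweighted mean over both classes
f1_micro = accuracy                                # equals accuracy for single-label binary tasks

print(f"precision={precision_fake:.4f} recall={recall_fake:.4f} "
      f"f1={f1_fake:.4f} acc={accuracy:.4f} f1_macro={f1_macro:.4f}")
```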
Table 7. Summary of the Fake Audio Dataset (ElevenLabs & Respeecher), available in Mendeley Data at https://data.mendeley.com/datasets/79g59sp69z/1 (accessed on 5 October 2025).
| Category | Detail | Count/Value |
| Generation Tool | ElevenLabs (V2V) | 282 |
| Generation Tool | ElevenLabs (TTS) | 53 |
| Generation Tool | Respeecher (V2V) | 210 |
| Generation Tool | Respeecher (TTS) | 55 |
| Total Audios | V2V | 492 |
| Total Audios | TTS | 108 |
| Gender | Male | 49% |
| Gender | Female | 51% |
| Other Characteristics | Duration | 8–10 s |
| Other Characteristics | Sampling rate | 22,050 Hz |
Table 8. Comparison of open-source frameworks for synthetic audio detection.
| Framework | Focus | Key Features |
| ASVspoof (2019–2025) | Benchmarking anti-spoofing | Fixed datasets, baselines (GMM, CNN, LCNN), standardized protocols |
| VoiceWukong (2025) | Security benchmarking | Multi-attack scenarios, robust evaluation |
| Deep-O-Meter (UB, 2024) | Multimodal deepfake detection | Online platform (image, video, audio); evaluation interface for uploaded content |
| Audio Deepfake Detection (2025 [43]) | CNN-based architectures | Detection of TTS and VC audio |
| End-to-End Audio Transformer (2024 [52]) | Transformer + distillation | End-to-end pipeline for deepfake detection |
| Adversarial Testing (2024 [53]) | Robustness analysis | Evaluation under adversarial attacks |
| FakeVoiceFinder (2025, own) | Model- and data-centric hybrid analysis | Flexible benchmarking with user datasets, probabilistic inference, visualization modes |
