Article

AudioFakeNet: A Model for Reliable Speaker Verification in Deepfake Audio

1 Department of Electrical Engineering, Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan
2 Department of Information and Communication Engineering, Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan
3 Department of Electrical Engineering, NFC Institute of Engineering & Technology, Multan 60000, Pakistan
4 Department of Biomedical Engineering, NFC Institute of Engineering & Technology, Multan 60000, Pakistan
* Authors to whom correspondence should be addressed.
Algorithms 2025, 18(11), 716; https://doi.org/10.3390/a18110716
Submission received: 29 September 2025 / Revised: 30 October 2025 / Accepted: 8 November 2025 / Published: 13 November 2025
(This article belongs to the Section Algorithms for Multidisciplinary Applications)

Abstract

Deepfake audio refers to voice recordings generated by deep neural networks that replicate a specific individual’s voice, often for deceptive or fraudulent purposes. Although this has been an active area of research for some time, deepfakes still pose substantial challenges for reliable speaker authentication. To address this issue, we propose AudioFakeNet, a hybrid deep learning architecture that combines Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) units, and a Multi-Head Attention (MHA) mechanism for robust deepfake detection. The CNN extracts spatial and spectral features, the LSTM captures temporal dependencies, and the MHA helps the model focus on the most informative audio segments. The model is trained using Mel-Frequency Cepstral Coefficients (MFCCs) from a publicly available dataset and validated on a self-collected dataset, ensuring reproducibility. Performance comparisons with state-of-the-art machine learning and deep learning models show that the proposed AudioFakeNet achieves higher accuracy, better generalization, and a lower Equal Error Rate (EER). Its modular design allows for broader adaptability in fake-audio detection tasks, offering significant potential across diverse speech synthesis applications.

1. Introduction

Deepfakes are synthetic media content created using deep learning techniques to produce realistic yet entirely fabricated content [1]. Audio deepfakes are a specific type of deepfakes that use Artificial Intelligence (AI) techniques to manipulate or generate audio recordings that sound authentic but are synthetically altered and deceptive [2]. These synthetic audio clips can mimic the voice, tone, and speech patterns of a specific individual [3]. Traditional speaker recognition systems rely on vocal characteristics and patterns for speaker authentication. As speaker authentication technology advances, so do the methods being explored and implemented to recreate or mimic the human voice [4].
In an era marked by increasing cybercrime and social media fraud, deepfakes introduce heightened security risks that underscore the need for more robust and adaptive speaker verification systems [5]. Audio deepfakes are generated using advanced deep learning techniques such as Generative Adversarial Networks (GANs) or autoencoders and fall into five main categories [6], listed below:
  • Voice Conversion (VC) involves transforming or mimicking the original voice in real time to produce the desired speaker’s voice [7].
  • Text-to-Speech (TTS) creates synthetic speech using text as input and generates an audio clip in the target speaker’s artificial voice. Once trained, the model can successfully synthesize any text in the speaker’s voice [8].
  • Emotion Fake modifies the emotional tone of speech (e.g., happy to sad) while preserving the speaker’s identity and content, using either parallel or nonparallel data-based methods.
  • Scene Fake alters the acoustic scene of speech (e.g., from an office to an airport) using speech enhancement technologies while maintaining the speaker’s identity and content.
  • Partially Fake audio modifies specific words in an utterance using genuine or synthesized clips, while preserving the original speaker’s identity.
A graphical description of the broad categories of deepfakes generated using artificial intelligence techniques is given in Figure 1.
In order to authenticate possible deepfake audio clips, a wide range of machine learning and deep network techniques have been developed, many of which originate from well-established architectures designed for image processing tasks [9,10]. Recent studies indicate that deep learning-based approaches generally outperform traditional machine learning methods in public benchmark datasets [3,11]. In addition, there has been a growing adoption of Recurrent Neural Networks (RNN) to model temporal dependencies within audio signals [12]. Furthermore, attention-based mechanisms have been employed to selectively emphasize the most important segments of the signal [13]. Despite these advances, accurately distinguishing between genuine and fake audio signals remains a significant and unresolved challenge [14].
It is also important to note that most deep learning models still rely on hand-crafted features, such as MFCCs [15], due to their long-standing effectiveness and proven reliability across diverse audio analysis tasks over several decades [6]. In parallel, end-to-end deep learning methods have also emerged [16], allowing networks to learn relevant features directly from raw audio data without the need for manual feature engineering.
In forensic contexts, speaker identification extends beyond classifying a speaker; it must also verify whether a voice is genuinely human or artificially generated. Recent studies [17,18,19] emphasize that authenticity verification is essential in modern speaker recognition systems. Addressing this necessity, the model proposed in this study is designed as a pre-identification module, ensuring that only authentic speech progresses to the speaker identification or verification stage, thereby enhancing overall system reliability. To this end, the work introduces a novel deep learning framework that integrates convolutional, recurrent, and attention mechanisms for robust fake-audio detection. By bridging the gap between deepfake detection and forensic speaker authentication, the proposed model not only determines whether a voice is real or fake but also lays the groundwork for integration into forensic speaker identification systems, distinguishing it from prior studies focused solely on classification. The primary contributions of the proposed approach are summarized as follows.
  • To effectively capture the long-range temporal dependencies inherent in speech signals, a Long Short-Term Memory (LSTM)-based recurrent architecture is employed.
  • To improve detection accuracy, an attention mechanism is integrated to selectively emphasize the most informative segments of the audio signal.
  • The performance of the proposed method is validated using both public and self-collected datasets, with results compared against other state-of-the-art methods.
The remainder of the paper is structured as follows: Section 2 reviews previous work in the field of audio deepfake detection. The methodology, datasets, and the proposed AudioFakeNet model are presented in Section 3. The experiments and results are discussed in Section 4 and Section 5, respectively, including a comparison with state-of-the-art and recent methods. Section 6 reports an ablation study of the model components. Finally, Section 7 provides concluding remarks and highlights potential future directions for this research.

2. Related Work

A wide range of approaches have been developed to detect synthetic or manipulated audio, particularly in response to the rise of AI-generated speech and voice conversion techniques [14]. Early methods relied on hand-crafted acoustic characteristics and statistical modeling [20], while recent research has shifted towards deep learning-based models, including CNNs, RNNs, and Transformer architectures [21]. These models aim to capture subtle anomalies in prosody, spectrogram patterns, and frequency distributions that differentiate synthetic audio from human speech.
A recent survey [22] investigates the challenges associated with deepfake detection and categorizes existing methods into uni-modal and multi-modal approaches. The study critically analyzes deepfake generation and detection models, focusing on four commonly used network architectures for fake audio generation—CNN, RNN, GAN, and encoder–decoder models. Furthermore, the review discusses conventional classification models alongside hybrid and multimodal frameworks, highlighting key insights, limitations, and future research directions.
In an attempt to counteract the threat of fake-generated speech, the authors of [1] presented a dataset, namely, the “DEEP VOICE” dataset, that comprises converted samples of real and converted audio of public figures. Using machine learning models, they performed classification on this public dataset to identify the imitated or fake voice.
End-to-end models have also been reported in the recent literature. They employ Deep Learning (DL)-based models, which require extensive training and preprocessing on large datasets. To counteract this, the authors of [23] explored methods based on Self-Supervised Learning (SSL) to identify fake Arabic audio. Their model performed well on their self-collected dataset, with a detection accuracy of 97% and an Equal Error Rate (EER) of 0.027%.
Acoustic features play a vital role in the detection of spoofed audio and are extracted using various modern techniques. Mirza et al. [24] used hand-crafted acoustic feature extraction and then applied the fused features to both SVM and deep neural network (DNN) classifiers for spoof detection. Feature fusion strategies were also explored to assess their effect on model performance on the balanced AVSpoof dataset.
Contextual features such as the speaker’s location, profile, and topic of speech have been shown to improve the effectiveness of deepfake detection in certain scenarios. To achieve this complex integration, various techniques, including CNNs, DNNs, and transfer learning-based approaches, have been employed. Shaaban et al. [25] conducted an extensive comparison between feature-based and image-based techniques for deepfake detection, using multiple datasets and evaluation metrics. The study emphasizes that transfer learning-based CNN models are robust and achieve a lower EER on the ASVspoof 2019 dataset [24], highlighting their effectiveness in audio deepfake detection.
To improve the accuracy of fake audio detection, Zaman et al. proposed a hybrid transformer-based model using three types of audio feature representation [8]. The findings showed that the hybrid transformer-based models outperformed the Short Time Fourier Transform (STFT)-based input features. Considering that the audio inputs are sequential, the authors used both CNNs and RNNs.
To address the limitations of existing methods to counter the rising threats of AI-generated voice impersonation, the trend has recently shifted towards CNNs and LSTM-based end-to-end models [26]. LSTM models accurately identify the long-term temporal features of speech signals [27]. However, attention mechanisms can handle the subtle changes that occur in fake audio. Omair et al. [16] use the combination of LSTM and attention with CNN models to classify deepfake audio in the Urdu language dataset. In another notable contribution, the authors [28] used multi-head attention in combination with LSTM to capture fine-grained temporal features to correctly identify fake videos. By performing experimentation and ablation studies, their model demonstrated robust performance in identifying forged videos.
It is pertinent to note that recent advancements in spoofing and deepfake detection have been largely driven by international challenges such as ASVspoof 2019–2021–2025, the Audio Deepfake Detection (ADD 2023) challenge, and the Spoofing-Aware Speaker Verification (SASV 2023) initiative. Leading architectures from these benchmarks, including RawNet2 [18,29,30] and SSL-based methods such as wav2vec 2.0 and HuBERT embeddings [31], have achieved state-of-the-art performance on ASVspoof datasets with clearly defined spoofing methods, such as replay, TTS, and VC. In contrast, the present study introduces a CNN–BiLSTM–Multi-Head Attention framework that addresses the detection of fake audio generated by diverse and uncontrolled sources, making it more suitable for practical forensic scenarios where the nature of the manipulation is unknown. Borodin et al. [18] highlighted that verifying speaker authenticity (real or fake) is a preliminary step in speaker identification, thereby bridging deepfake detection and identity verification.

3. Methodology

This section presents the detailed methodology adopted for the proposed audio deepfake detection framework. The approach leverages Mel-Frequency Cepstral Coefficients as input features to a three-stage deep learning architecture comprising CNN, LSTM, and MHA modules. A block diagram of the proposed approach is shown in Figure 2.
In the first stage, the CNN module automatically extracts local time-frequency features from the MFCC input. The second stage, implemented through an LSTM, is designed to capture long-term temporal dependencies inherent in speech signals. Finally, the MHA stage enables the model to focus selectively on the most informative regions of the input sequence. The attention block output is combined with the LSTM output using the residual connection, thereby enhancing the network’s ability to distinguish between real and fake audio samples.
Before discussing the proposed methodology, it is essential to describe the datasets used in this study and explain the process of extracting MFCCs, which serve as primary input features. Providing insight into the dataset composition and MFCC calculation not only contextualizes the model design but also helps to understand the nature and quality of the audio inputs used for training and evaluation.

3.1. Datasets

The proposed model was trained, validated, and tested on a large public dataset, the “Fake or Real” dataset [32]. To further demonstrate the model’s adaptability to unseen, real-world data, it was additionally validated on our self-collected dataset.

3.1.1. Self-Collected Dataset

The self-collected dataset is primarily used to validate the robustness and generalization of AudioFakeNet. It comprises 105 speech recordings, of which 35 are from female speakers and 70 are from male speakers. Each recording lasts between 14 and 16 s and was captured in a quiet university laboratory using mono-channel WAV format at a 48 kHz sampling rate with 16-bit resolution.
All participants repeated the same phrase, and informed consent was obtained. To generate voice-converted variants, each raw recording was passed through the Super Effect Studio mobile application (Version 2.1.8, available at https://play.google.com/store/apps/details?id=com.voicechanger.audioeffect.editor.funnyvoice&pcampaignid=web_share, accessed on 11 November 2025), which performs gender swapping through pitch shifting and formant modification in the time and frequency domains, respectively. Let x(t) be the raw speech signal. Pitch-shifting it by a factor α yields the signal x_α(t), as given by Equation (1).
$$x_{\alpha}(t) = \rho_{\alpha}\, x(t)$$
where
$$\alpha \;\begin{cases} > 1, & \text{if male} \rightarrow \text{female} \\ < 1, & \text{if female} \rightarrow \text{male} \end{cases}$$
The typical range of α is from 0.5 to 2.0. Formant modification alters the resonance frequencies of the audio signal by a ratio β, as shown in Equation (2).
$$X(\omega) = \mathcal{F}\{x_{\alpha}(t)\}$$
where X(ω) is the STFT of the pitch-shifted signal.
After performing the pitch- and formant-shifting operations, the signal is reconstructed using the Inverse Short-Time Fourier Transform (ISTFT). To introduce deepfake-style manipulations, time-domain variations were also added: time stretching and compression introduce nonlinear temporal artifacts, as depicted in Equation (3), where x(t) is the raw input signal as a function of time and γ is the temporal scaling factor.
$$x_{\gamma}(t) = x(\gamma t), \qquad \gamma \;\begin{cases} > 1, & \text{time compression} \\ < 1, & \text{time stretching} \\ = 1, & \text{no change} \end{cases}$$
where x_γ(t) is the time-stretched or compressed version of the original signal x(t). It is worth mentioning that the range of γ used in our dataset varies from 0.5 to 2.0.
By integrating pitch and formant modifications with time-domain scaling, the dataset more realistically simulates deepfake-style manipulations. Subsequently, the modified speech signals were processed through equalization and harmonic enhancement stages to produce more natural-sounding converted samples. Consequently, the self-collected dataset comprises both original (real) and synthetically transformed (fake) speech signals.
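As a rough illustration of the manipulation pipeline described above, the sketch below applies pitch shifting and time scaling to a recording using librosa. The function name, semitone steps, and stretch rates are illustrative assumptions; the paper’s actual transformations were produced with the Super Effect Studio application, not with this code.

```python
import librosa
import soundfile as sf

def make_fake_variant(path, n_steps=4.0, rate=1.5, out_path="converted.wav"):
    """Create a simple voice-converted variant of a recording.

    n_steps: pitch shift in semitones (positive roughly corresponds to alpha > 1, male -> female).
    rate:    temporal scaling factor gamma (>1 compresses, <1 stretches).
    Both values are illustrative assumptions, not the paper's exact settings.
    """
    y, sr = librosa.load(path, sr=48000, mono=True)                    # 48 kHz mono, as in the dataset
    y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)   # pitch/formant-style shift
    y_fake = librosa.effects.time_stretch(y_pitch, rate=rate)          # nonlinear temporal artifact
    sf.write(out_path, y_fake, sr)
    return y_fake

# Example call with gamma in [0.5, 2.0], as in the self-collected dataset:
# make_fake_variant("real_utterance.wav", n_steps=4.0, rate=0.8)
```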

3.1.2. Public Dataset

The experimentation is carried out on the publicly available “Fake or Real” dataset, a benchmark dataset for detecting synthetic speech. The dataset comprises 195,000 utterances of real human and computer-generated speech samples; this abundance of speech samples makes it suitable for training deep learning models. The dataset provides several versions of real human voice and fake audio samples generated using WaveNet and Deep Voice 3, intended for varying experimental settings. Each version consists of the same binary classes, i.e., real or fake, and differs only in duration and recording environment.
The details of these settings are given in Table 1.
To provide a more robust and realistic setting for fake-audio detection, experimentation is carried out on the for-rerec version of the dataset, which contains 12,452 re-recorded utterances for training and validation.

3.2. Computing MFCC Features

Audio data preparation techniques play a critical role in ensuring the effectiveness of the proposed model. The raw speech recordings are first preprocessed to remove noise and irrelevant segments, focusing on clean and informative speech signals. The audio is then segmented into manageable portions, followed by the extraction of key acoustic features such as amplitude waveforms, spectrograms, and MFCCs. This dataset is then used to train and evaluate the speaker identification model. The dynamics of original and fake audio vary significantly. As seen in Figure 3a, the overall variation of amplitude or energy of the original audio is smooth and natural. In contrast, disguised audio exhibits inconsistent energy levels, abrupt transitions, or unnatural silences, as can be seen in Figure 3b.
To examine frequency-related artifacts that are not visible in amplitude waveforms, spectrograms are used. The spectrogram comparison shows the magnitude of the different frequencies present in the original and fake audio samples over time, providing consistent time-frequency representations of the real and fake signals. Observing the MFCC spectrograms of real and fake audio reveals that the original voice sample has natural frequency transitions, as shown in Figure 4a, whereas fake audio often contains unnatural spikes, sudden artifacts, or noise, as can be seen in Figure 4b. Each spectrogram includes a color bar to indicate the frequency intensity. This visualization highlights the differences between the original and fake recordings, as shown in Figure 4. The Short-Time Fourier Transform is applied to the input audio signals, which are then mapped to 128 mel bands with the maximum frequency limited to 8 kHz to generate the resulting spectrograms.
We perform a step-by-step windowing operation that divides the audio signal into overlapping frames, as shown in Equation (4).
$$y_{n}(t) = y(t) \cdot w(t - nT)$$
where y(t) is the input audio signal, w(t) is the window function, and T is the frame shift (hop size).
Equations (5) and (6) represent the fundamental steps in generating Mel-spectrogram features: the first computes the Short-Time Fourier Transform (STFT) to obtain the time-frequency representation of the signal, while the second applies Mel-scale filter banks to compress the spectral information according to human auditory perception.
$$y(n, \omega) = \sum_{t=0}^{N-1} y_{n}(t) \cdot e^{-j\omega t}$$
$$S_{m} = \sum_{k=0}^{K-1} |X(k)|^{2} \cdot H_{m}(k)$$
The Discrete Cosine Transform (DCT) is then applied to retain the most significant spectral information, as shown in Equation (7).
$$\mathrm{MFCC}_{n} = \sum_{m=1}^{M} \log(S_{m}) \cdot \cos\!\left(\frac{\pi n (m - 0.5)}{M}\right), \quad n = 1, 2, \ldots, 13$$
The first 13 MFCCs provide the temporal information necessary to differentiate between real and deepfake audio, as shown in Figure 5. Smoother and more natural transitions can be observed for the real voice in Figure 5a, whereas the fake voice in Figure 5b lacks richness in the lower coefficients.
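A minimal sketch of this feature-extraction step is given below, assuming librosa is used. The loading parameters mirror those reported later in Section 4 (2.5 s duration, 0.6 s offset, 128 mel bands, 8 kHz maximum frequency, 13 MFCCs), while the helper name itself is hypothetical.

```python
import librosa

def extract_features(path):
    """Compute the log mel-spectrogram and the first 13 MFCCs of an utterance."""
    # Load 2.5 s of audio starting 0.6 s into the file (values from Section 4)
    y, sr = librosa.load(path, duration=2.5, offset=0.6)

    # Mel-spectrogram: 128 mel bands, maximum frequency limited to 8 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)
    log_mel = librosa.power_to_db(mel)

    # First 13 MFCCs (DCT of the log mel energies, Equation (7))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return log_mel, mfcc

# log_mel, mfcc = extract_features("sample.wav")
# print(mfcc.shape)  # (13, number_of_frames)
```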

3.3. The AudioFakeNet Model

The proposed framework uses a hybrid architecture that effectively combines CNN and RNN with MHA to differentiate between the true and fake speakers directly from the audio signal input. The block diagram of the proposed AudioFakeNet model is shown in Figure 2.
The model first extracts local spectral features from the input audio representation, the MFCC matrix, in which each element corresponds to a particular time frame and frequency coefficient. The convolutional operation processes this temporal and spectral information to generate a feature map y_{i,j,c}, expanding the receptive field without increasing the number of parameters, as shown in Equation (8):
$$y_{i,j,c} = \sum_{q=0}^{h-1} \sum_{r=0}^{w-1} x_{i+q,\,j+r,\,c} \cdot \mathrm{weight}_{q,r,c,k} + b_{k}$$
The convolutional operation in Equation (8) computes the value y_{i,j,c}, which corresponds to the output feature map at position (i, j) for channel c. The computation involves an element-wise multiplication between the convolutional filter weight_{q,r,c,k} and the input values x_{i+q,j+r,c} within the filter’s receptive field, followed by a summation; the bias term b_k is then added. Here, the filter has a height (h) and width (w) of 3.
The generated feature maps are then normalized using a batch normalization operation, and Leaky ReLU is applied as the activation function. Because of its time-frequency representation, the input MFCC matrix may contain negative values, so a small negative slope of α = 0.01 is allowed in the Leaky ReLU, preserving the gradient flow. A max-pooling operation is then applied to down-sample the spatial information and highlight the important features.
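The convolutional stage described above can be sketched in Keras roughly as follows. The filter counts are illustrative assumptions (the exact multi-scale configuration is shown in Figure 2), while the 3×3 kernels, batch normalization, Leaky ReLU with a slope of 0.01, and max pooling follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cnn_block(x, filters):
    """One convolutional stage: Conv2D -> BatchNorm -> LeakyReLU -> MaxPool."""
    x = layers.Conv2D(filters, kernel_size=(3, 3), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU(0.01)(x)                 # small negative slope preserves gradient flow
    return layers.MaxPooling2D(pool_size=(2, 2))(x)

# MFCC input of shape (40, 64, 1), as used in Section 4; filter counts are illustrative
inputs = tf.keras.Input(shape=(40, 64, 1))
features = cnn_block(cnn_block(inputs, 32), 64)
```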
The model extracts multi-scale feature information by varying the number of filters, as shown in Figure 2. To enhance the robustness of the model, a single LSTM layer is added to capture long-term, time-dependent audio features, allowing the model to distinguish the real voice by processing the sequential audio data. The output of the CNN stage contains sequential feature maps derived from the spectrograms, represented as X_T, a 4D tensor expressed as:
$$X_{T} = (T_{n}, H, W, C_{n})$$
where H, W, and C_n are the height, width, and number of channels, respectively, and T_n is the number of time steps in the audio sequence.
The spatial features are flattened independently at each time step t, where each time step has a feature map of size (H, W, C_n). This is carried out through a time-distributed layer, which processes the feature map of each time step to create a tensor Z_T, as shown in the following equation:
$$Z_{T} = (T_{n}, F_{n})$$
The feature sequence generated by the convolutional layers is passed to the LSTM layer to capture the temporal dependencies within the audio signal. Unlike Bidirectional LSTMs, which process sequences in both forward and backward directions, a single LSTM processes data in the forward temporal direction only.
Let z_t represent the input feature vector at time step t, obtained by flattening the spatial CNN feature maps over time into F_n features. The LSTM processes the sequence for t = 1, 2, ..., T_n, updating its hidden state at each step as shown in Equation (11).
$$hd_{t} = \mathrm{LSTM}(z_{t},\, hd_{t-1})$$
Here, hd_t is the hidden state at time step t, computed from the current input z_t and the previous hidden state hd_{t-1}.
This allows the model to learn temporal patterns from previous features, which in turn influence the current state. The output of the LSTM layer is then passed to the multi-head attention mechanism, enabling the model to focus selectively on different parts of the sequence while forming a global representation.
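Continuing the sketch, the CNN feature maps can be reshaped per time step and fed to a single forward LSTM, roughly as below. The tensor sizes are illustrative placeholders, and the 128 LSTM units follow Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical CNN output: T_n time steps, each carrying an (H, W, C_n) feature map
T_n, H, W, C_n = 10, 5, 8, 64                           # illustrative sizes only
cnn_sequence = tf.keras.Input(shape=(T_n, H, W, C_n))   # the 4D tensor X_T (per example)

# Time-distributed flattening turns every step into a vector of F_n features: Z_T = (T_n, F_n)
z = layers.TimeDistributed(layers.Flatten())(cnn_sequence)

# Single forward-direction LSTM with 128 units (Table 2); return_sequences=True keeps
# the hidden state hd_t at every step so the attention stage can attend over them
lstm_out = layers.LSTM(128, return_sequences=True)(z)
```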
To further enhance contextual understanding, an MHA layer is integrated, using four parallel attention heads that learn to focus on different parts of the input sequence simultaneously. The MHA step was implemented using TensorFlow’s MultiHeadAttention layer. The same input tensor X is used as query, key, and value, allowing self-attention. Each head computes scaled dot-product attention as given in Equation (12).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V$$
with Q = XW^Q, K = XW^K, and V = XW^V. For h heads (here, h = 4), the outputs are concatenated and linearly transformed:
$$Y_{\mathrm{MHA}} = \mathrm{Concat}(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h})\, W^{O}$$
A residual connection and layer normalization are then applied as given in Equation (14), where Y_MHA and Y_LSTM are the outputs of the MHA and LSTM layers, respectively.
$$\mathrm{Output}_{\mathrm{residual}} = Y_{\mathrm{MHA}} + Y_{\mathrm{LSTM}}$$
This allows the model to focus on different temporal dependencies simultaneously, enriching feature representation for downstream classification.
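A hedged sketch of this attention stage is shown below, using the TensorFlow MultiHeadAttention layer named in the text. The sequence length and key_dim are assumptions, since the paper does not report them.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical LSTM output: a sequence of 128-dimensional hidden states hd_t
lstm_out = tf.keras.Input(shape=(10, 128))        # (T_n, units); sizes illustrative

# Four attention heads with the same tensor as query, key, and value (self-attention);
# key_dim is an assumption
mha_out = layers.MultiHeadAttention(num_heads=4, key_dim=32)(
    query=lstm_out, value=lstm_out, key=lstm_out)

# Residual connection of Equation (14), followed by layer normalization
residual = layers.Add()([mha_out, lstm_out])
normalized = layers.LayerNormalization()(residual)
```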
The Global Average Pooling (GAP) layer further averages the feature maps across the temporal dimension, producing a flattened and parameter-efficient representation. Dropout layers are incorporated to reduce overfitting and control model complexity. Finally, a Softmax activation function is applied to generate the output probabilities, classifying each input as either real or fake. The final output layer, therefore, provides class probabilities for each speaker, as illustrated in Figure 2.
The model is compiled using the Adam optimizer, which dynamically adjusts the learning rate throughout the training process. It is trained on the training dataset using categorical cross-entropy as the loss function, and its performance is subsequently evaluated on the test set.
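As a rough illustration of the classification head and training configuration described above, the snippet below applies global average pooling, dropout, a softmax output, and the Adam optimizer with categorical cross-entropy. The input shape is an illustrative placeholder; the dropout rate and learning rate are taken from Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical attention-stage output: a (T_n, d_model) sequence; sizes illustrative
attention_out = tf.keras.Input(shape=(10, 128))

x = layers.GlobalAveragePooling1D()(attention_out)    # average over the temporal dimension
x = layers.Dropout(0.3)(x)                            # dropout rate from Table 2
outputs = layers.Dense(2, activation="softmax")(x)    # class probabilities: real vs. fake

head = tf.keras.Model(attention_out, outputs)
head.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),  # learning rate from Table 2
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```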

4. Experimentation

All experiments were conducted using TensorFlow and Keras frameworks on the Google Colab environment.
The model was trained using the rerecorded (for-rerec) version of the “Fake-or-Real dataset” mentioned in Table 1. This version of the dataset simulates real-world audio communication scenarios. The MFCC spectrograms of size (40, 64, 1) are used as input features, capturing both spectral and temporal characteristics of the audio signals.
The dataset is split into training and validation sets using an 80:20 ratio. Training epochs are set to 20 with a batch size of 32. To avoid overfitting and reduce training time, early stopping is employed with a patience value of 5. Model checkpoints are saved based on validation accuracy during training to retain the best-performing weights.
The dataset is organized into separate directories for real and fake audio samples. Mel-spectrograms were acquired by loading 2.5 s of audio with an offset of 0.6 s, thereby focusing on the most informative portion of each utterance. Spectrograms with 128 mel bands were computed as discussed in Section 3.2. The class labels are then one-hot encoded to make them compatible with the model. The optimized hyperparameters of the model are summarized in Table 2.
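A hedged sketch of this training setup is shown below. The feature and label file names are hypothetical, while the 80:20 split, 20 epochs, batch size of 32, early-stopping patience of 5, accuracy-based checkpointing, and one-hot labels follow the text and Table 2.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Hypothetical pre-extracted features: X has shape (N, 40, 64, 1); y holds 0 (real) / 1 (fake)
X = np.load("mfcc_features.npy")
y = np.load("labels.npy")

y_onehot = tf.keras.utils.to_categorical(y, num_classes=2)        # one-hot encoding
X_train, X_val, y_train, y_val = train_test_split(
    X, y_onehot, test_size=0.2, random_state=42)                  # 80:20 split

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint(
        "audiofakenet_best.keras", monitor="val_accuracy", save_best_only=True),
]

# `model` is assumed to be the compiled AudioFakeNet from Section 3.3
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=20, batch_size=32, callbacks=callbacks)
```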

5. Results and Discussion

The proposed AudioFakeNet model outperformed other classification models for speaker identification in AI-generated speech. The model was trained, validated, and tested on the “Fake or Real” dataset and further validated on our self-collected dataset to verify its robustness. The “Fake or Real” dataset provides preprocessed audio classification data, while the self-collected dataset adds diversity and robustness to the evaluation.
For the public “Fake or Real” dataset, the proposed AudioFakeNet achieved an accuracy of 96%, while a validation accuracy of 88% was achieved on the self-collected dataset. The plots in Figure 6 illustrate the training loss and accuracy versus the number of epochs for both the public and the self-collected datasets.
The performance of the proposed AudioFakeNet model was benchmarked against several state-of-the-art deepfake audio detection methods. Comparative experiments were conducted using the same dataset, and a detailed comparison of the performance metrics is presented in Table 3. Key evaluation metrics used for this comparison are Precision, F1-score, Test Accuracy, and Equal Error Rate (EER). These metrics provide critical insights into the model’s classification effectiveness and its ability to balance false acceptances and false rejections.
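For reference, the evaluation metrics reported in Table 3 and Table 4 can be computed roughly as follows. The EER estimate via the ROC curve is a common approximation and not necessarily the exact procedure used by the authors.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve

def evaluate(y_true, y_prob, threshold=0.5):
    """y_true: 0/1 ground truth (1 = fake); y_prob: predicted probability of the 'fake' class."""
    y_pred = (y_prob >= threshold).astype(int)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    # EER: the operating point where the false-positive and false-negative rates are (nearly) equal
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    fnr = 1.0 - tpr
    eer = fpr[np.nanargmin(np.abs(fnr - fpr))]
    return {"precision": precision, "recall": recall, "f1": f1, "eer": eer}
```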
Further, the self-collected dataset was used to evaluate the generalization ability of the proposed model; its validation results are given in Table 4.
The analysis in Table 3 highlights the performance improvements from traditional machine learning models to modern deep learning architectures, leading to the development of the AudioFakeNet model. The superior experimental results of AudioFakeNet can be attributed to its ability to capture temporal-spectral inconsistencies present in AI-generated speech through the combined use of LSTM and multi-head attention mechanisms. Classical classifiers like Random Forest, SVM, and XGBoost demonstrate moderate detection capabilities due to their reliance on hand-crafted spectral features. In contrast, CNN-based models effectively handle spatial details but lack temporal modeling, while LSTM models capture temporal dependencies but not the spectral details needed to differentiate real and fake audio signals. The proposed model shows better accuracy because it captures both short-term spectral cues and long-term temporal inconsistencies in AI-generated speech.
In both the machine learning and deep learning models, MFCCs were extracted using an STFT with a window length of 2048 samples and a hop length of 512 samples. However, the feature-based machine learning models averaged the MFCCs over time, losing the temporal information, whereas the deep learning models retained the entire MFCC spectrogram, enabling the CNN-BiLSTM networks to use the temporal dynamics of speech signals, which is crucial for distinguishing real from fake distortions.
In our experiments, the CNN-BiLSTM, LCNN, and RNN models achieved accuracy comparable to that of the proposed AudioFakeNet under the same experimental conditions. The proposed model shows consistent generalization, as can be seen in Figure 6, with a balanced Equal Error Rate (EER) and accuracy, making it a more reliable option for forensic deepfake detection.
State-of-the-art architectures from ASVspoof challenges, such as RawNet2 [30], AASIST3 [18], and LCNN [41], enhance deepfake detection using advanced feature-level attention mechanisms. However, AudioFakeNet, with its hybrid structure and Attention framework, achieves the best results, i.e., an F1-score of 0.94 and an EER of 0.14. The proposed model utilizes the effectiveness of combining convolutional, recurrent, and multi-head attention layers to capture diverse speech dynamics.
Figure 7 presents the confusion matrices of the proposed AudioFakeNet for both the public and self-collected datasets, showcasing its ability to differentiate between real and fake speech samples in a binary classification scenario.
The matrix indicates that, of the real utterances, 398 were accurately classified as real, while only 10 were mistakenly labeled as fake. In contrast, among the fake utterances, 279 were correctly identified as fake, while 129 were misclassified as real. This suggests that the model demonstrates high precision and recall for genuine speech, indicating a strong ability to detect authentic audio, even in re-recorded situations. However, the model shows a notable false negative rate when identifying fake samples, likely due to the acoustic similarities between re-recorded fake samples and real utterances in this version of the dataset. The color bar on the right represents the intensity of the correct classifications, with darker shades indicating higher values.
Figure 8 shows the ROC curve for the proposed AudioFakeNet model, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds. The Area Under the Curve (AUC) is 0.9593 , indicating that the model has a high ability to discriminate between real and fake audio recordings. An AUC value closer to 1 signifies better model performance; thus, reflecting excellent performance in detecting deepfake voices.
Figure 9 shows the Precision-Recall (PR) curve for the model, which has an Average Precision (AP) score of 0.9608. The curve illustrates a strong balance between precision and recall across different thresholds. A high AP score indicates that the model effectively minimizes false positives while accurately identifying most instances of fake audio. This level of performance is particularly crucial in forensic scenarios, where it is essential to reduce false alarms while ensuring that actual deep-fake cases are not overlooked.

6. Ablation Study of Model Components

Ablation studies are commonly used in Machine Learning (ML) and Deep Learning (DL) research to evaluate the contributions of specific architectural components within a model. Sheikholeslami et al. [42] highlight the importance of these experiments in understanding how individual modules affect overall performance and justifying architectural enhancements. Following this approach, an ablation study was conducted on AudioFakeNet to assess the impact of its key components, namely the BiLSTM and Multi-Head Attention (MHA) modules.
The results, summarized in Table 5, illustrate the importance of each component in the proposed model. The baseline CNN model (V1) captures basic spectral artifacts but lacks temporal discrimination. Adding BiLSTM (V2) improves Recall and Accuracy, indicating that sequential modeling helps detect inconsistencies in synthetic speech. Combining CNN with MHA attention (V3) further reduces the Equal Error Rate (EER) by focusing on key spectral regions. The full model (V4), which combines BiLSTM and Attention, achieves the best performance and lowest EER, showing that both approaches together enhance deepfake detection reliability.
Figure 10 presents an ablation study illustrating the impact of BiLSTM and self-attention. The full AudioFakeNet architecture achieves the highest F1-score and accuracy, highlighting the complementary benefits of both methods in detecting subtle deepfake artifacts.
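The four ablation variants in Table 5 can be expressed as a single parameterized builder, sketched below under the same layer assumptions as the earlier code fragments. For simplicity the sketch uses the forward LSTM described in Section 3.3 in place of the BiLSTM named in Table 5; toggling the two flags reproduces configurations V1 through V4.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_variant(use_lstm=True, use_mha=True, input_shape=(40, 64, 1)):
    """V1: both False; V2: recurrent layer only; V3: attention only; V4: full model."""
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(32, (3, 3), padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU(0.01)(x)
    x = layers.MaxPooling2D((2, 2))(x)

    # Collapse the frequency/channel axes so the remaining axis acts as the time axis
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    if use_lstm:
        x = layers.LSTM(128, return_sequences=True)(x)
    if use_mha:
        attn = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
        x = layers.LayerNormalization()(layers.Add()([attn, x]))

    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(2, activation="softmax")(layers.Dropout(0.3)(x))
    return tf.keras.Model(inputs, outputs)

# variants = {name: build_variant(*flags) for name, flags in
#             {"V1": (False, False), "V2": (True, False),
#              "V3": (False, True), "V4": (True, True)}.items()}
```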

7. Conclusions and Future Work

This work highlights that deepfake audio detection, while crucial, remains a relatively underexplored area. We have presented an effective scheme to differentiate between authentic and artificially generated audio samples, employing a three-stage architecture comprising CNN, RNN, and MHA stages applied to MFCCs. In this framework, the CNN automatically extracts discriminative features from the MFCCs, the RNN captures long-term temporal dependencies, and the MHA selectively focuses on speech segments containing critical cues to distinguish deepfake audio from genuine voice recordings. The proposed system was implemented as an open-source module on the Google Colab platform and was evaluated using two datasets: a self-collected dataset and the publicly available “Fake or Real” dataset. Experimental results demonstrate that our approach outperforms several state-of-the-art and recently proposed methods in terms of accuracy, establishing it as a strong candidate for forensic and security applications.
As future research, we wish to focus on leveraging cross-language datasets to enhance the robustness of deepfake audio detection, accounting for variations in accent and linguistic tone across regions. We also aim to ensure real-time applicability of detection models by optimizing latency for practical deployment scenarios. Another future direction is the exploration of multimodal input techniques that combine either raw audio waveforms with transformed representations or the outputs of various transformation techniques to preserve and exploit spectral features. As voice conversion tools and deepfake generation methods continue to advance, integrating more sophisticated detection strategies will become essential.

Author Contributions

Conceptualization, S.D. and M.A.Q.; methodology, S.D. and S.K.N.; software, S.D., M.A.Q. and S.K.N.; validation, S.D., S.K.N. and A.M.; formal analysis, S.D., S.K.N. and A.M.; investigation, M.A.Q.; resources, S.D. and S.K.N.; data curation, M.A.Q. and A.M.; writing—original draft preparation, S.D. and M.A.Q.; writing—review and editing, S.K.N., M.A.Q. and A.M.; supervision, M.A.Q. and S.K.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset will be shared on request.

Conflicts of Interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.

References

  1. Bird, J.J.; Lotfi, A. Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion. arXiv 2023, arXiv:2308.12734. [Google Scholar]
  2. Biswas, D.; Gil, J.-M. Design and Implementation for Research Paper Classification Based on CNN and RNN Models. J. Internet Technol. 2024, 25, 637–645. [Google Scholar] [CrossRef]
  3. Rabhi, M.; Bakiras, S.; Di Pietro, R. Audio-Deepfake Detection: Adversarial Attacks and Countermeasures. Expert Syst. Appl. 2024, 250, 123941. [Google Scholar] [CrossRef]
  4. Sun, C.; Jia, S.; Hou, S.; Lyu, S. AI-Synthesized Voice Detection Using Neural Vocoder Artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 904–912. [Google Scholar]
  5. Rana, S.; Qureshi, M.A.; Majeed, A.; Noon, S.K. Identification of true speakers from disguised voices in anti-forensic scenarios using an efficient framework. Signal Image Video Process. 2024, 18, 7455–7471. [Google Scholar] [CrossRef]
  6. Chitale, M.; Dhawale, A.; Dubey, M.; Ghane, S. A Hybrid CNN-LSTM Approach for Deepfake Audio Detection. In Proceedings of the 2024 3rd International Conference on Artificial Intelligence for Internet of Things (AIIoT), Vellore, India, 3–4 May 2024; pp. 1–6. [Google Scholar]
  7. Ashraf, M.; Abid, F.; Din, I.U.; Rasheed, J.; Yesiltepe, M.; Yeo, S.F.; Ersoy, M.T. A Hybrid CNN and RNN Variant Model for Music Classification. Appl. Sci. 2023, 13, 1476. [Google Scholar] [CrossRef]
  8. Zaman, K.; Islam, J.S.; Sah, M.; Direkoglu, C.; Okada, S.; Unoki, M. Hybrid Transformer Architectures with Diverse Audio Features for Deepfake Speech Classification. IEEE Access 2024, 12, 149221–149237. [Google Scholar] [CrossRef]
  9. Rana, S.; Qureshi, M.A. A Comprehensive Review of Forensic Phonetics Techniques. Asian Bull. Big Data Manag. 2024, 4, 284–301. [Google Scholar] [CrossRef]
  10. Akhtar, Z.; Pendyala, T.L.; Athmakuri, V.S. Video and audio deepfake datasets and open issues in deepfake technology: Being ahead of the curve. Forensic Sci. 2024, 4, 289–377. [Google Scholar] [CrossRef]
  11. Ye, J.; Yan, D.; Fu, S.; Ma, B.; Xia, Z. One-Class Network Leveraging Spectro-Temporal Features for Generalized Synthetic Speech Detection. Speech Commun. 2025, 169, 103200. [Google Scholar] [CrossRef]
  12. Bendiab, G.; Haiouni, H.; Moulas, I.; Shiaeles, S. Deepfakes in Digital Media Forensics: Generation, AI-Based Detection and Challenges. J. Inf. Secur. Appl. 2025, 88, 103935. [Google Scholar] [CrossRef]
  13. Bisogni, C.; Loia, V.; Nappi, M.; Pero, C. Acoustic Features Analysis for Explainable Machine Learning-Based Audio Spoofing Detection. Comput. Vis. Image Underst. 2024, 249, 104145. [Google Scholar] [CrossRef]
  14. Li, X.; Chen, P.-Y.; Wei, W. Where Are We in Audio Deepfake Detection? A Systematic Analysis over Generative and Detection Models. ACM Trans. Internet Technol. 2025, 25, 1–19. [Google Scholar] [CrossRef]
  15. Nanmalar, M.; Joysingh, S.J.; Vijayalakshmi, P.; Nagarajan, T. A Feature Engineering Approach for Literary and Colloquial Tamil Speech Classification Using 1D-CNN. Speech Commun. 2025, 173, 103254. [Google Scholar] [CrossRef]
  16. Ahmad, O.; Khan, M.S.; Jan, S.; Khan, I. Deepfake Audio Detection for Urdu Language Using Deep Neural Networks. IEEE Access 2025, 13, 97765–97778. [Google Scholar] [CrossRef]
  17. Ahmadiadli, Y.; Zhang, X.-P.; Khan, N. Beyond Identity: A Generalizable Approach for Deepfake Audio Detection. arXiv 2025, arXiv:2505.06766. Available online: https://arxiv.org/abs/2505.06766 (accessed on 15 October 2025). [CrossRef]
  18. Borodin, K.; Kudryavtsev, V.; Korzh, D.; Efimenko, A.; Mkrtchian, G.; Gorodnichev, M.; Rogov, O.Y. AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection Using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge. arXiv 2024, arXiv:2408.17352. [Google Scholar] [CrossRef]
  19. Pianese, A.; Cozzolino, D.; Poggi, G.; Verdoliva, L. Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models. In Proceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security (IH & MMSec 2024), Baiona, Spain, 24–26 June 2024; ACM: New York, NY, USA, 2024; pp. 289–294. Available online: https://arxiv.org/abs/2405.02179 (accessed on 15 October 2025).
  20. Hamza, A.; Javed, A.R.; Iqbal, F.; Kryvinska, N.; Almadhor, A.S.; Jalil, Z.; Borghol, R. Deepfake audio detection via MFCC features using machine learning. IEEE Access 2022, 10, 134018–134028. [Google Scholar] [CrossRef]
  21. Zhang, Q.; Zhang, X.; Sun, M.; Yang, J. A transformer-based deep learning approach for recognition of forgery methods in spoofing speech attribution. Appl. Soft Comput. 2025, 171, 112798. [Google Scholar] [CrossRef]
  22. Kumar, A.; Singh, D.; Jain, R.; Jain, D.K.; Gan, C.; Zhao, X. Advances in DeepFake Detection Algorithms: Exploring Fusion Techniques in Single and Multi-Modal Approach. Inf. Fusion 2025, 118, 102993. [Google Scholar] [CrossRef]
  23. Almutairi, Z.M.; Elgibreen, H. Detecting Fake Audio of Arabic Speakers Using Self-Supervised Deep Learning. IEEE Access 2023, 11, 72134–72147. [Google Scholar] [CrossRef]
  24. Mirza, A.R.; Al-Talabani, A.K. Spoofing Countermeasure for Fake Speech Detection Using Brute Force Features. Comput. Speech Lang. 2025, 90, 101732. [Google Scholar] [CrossRef]
  25. Shaaban, O.A.; Yildirim, R.; Alguttar, A.A. Audio Deepfake Approaches. IEEE Access 2023, 11, 132652–132682. [Google Scholar] [CrossRef]
  26. Liang, R.; Xie, Y.; Cheng, J.; Pang, C.; Schuller, B. A Non-Invasive Speech Quality Evaluation Algorithm for Hearing Aids with Multi-Head Self-Attention and Audiogram-Based Features. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2166–2176. [Google Scholar] [CrossRef]
  27. Akter, R.; Islam, M.R.; Debnath, S.K.; Sarker, P.K.; Uddin, M.K. A Hybrid CNN-LSTM Model for Environmental Sound Classification: Leveraging Feature Engineering and Transfer Learning. Digit. Signal Process. 2025, 163, 105234. [Google Scholar] [CrossRef]
  28. Xiong, D.; Wen, Z.; Zhang, C.; Ren, D.; Li, W. BMNet: Enhancing Deepfake Detection Through BiLSTM and Multi-Head Self-Attention Mechanism. IEEE Access 2025, 13, 21547–21556. [Google Scholar] [CrossRef]
  29. Lavrentyeva, G.; Novoselov, S.; Tseren, A.; Volkova, M.; Gorlanov, A.; Kozlov, A. STC Antispoofing Systems for the ASVspoof2019 Challenge. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1033–1037. Available online: https://www.isca-archive.org/interspeech_2019/lavrentyeva19_interspeech.html (accessed on 15 October 2025).
  30. Huang, L.; Pun, C.-M. Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection. arXiv 2024, arXiv:2401.05614. [Google Scholar]
  31. Xie, Y.; Cheng, H.; Wang, Y.; Ye, L. Domain Generalization via Aggregation and Separation for Audio Deepfake Detection. IEEE Trans. Inf. Forensics Secur. 2023, 19, 344–358. [Google Scholar] [CrossRef]
  32. Abdeldayem, M.; Mohamed, A. The Fake or Real Dataset. 2023. Available online: https://www.kaggle.com/datasets/mohammedabdeldayem/the-fake-or-real-dataset/data (accessed on 13 August 2025).
  33. Yi, J.; Wang, C.; Tao, J.; Zhang, X.; Zhang, C.Y.; Zhao, Y. Audio Deepfake Detection: A Survey. arXiv 2023, arXiv:2308.14970. [Google Scholar] [CrossRef]
  34. Karthikeyan, V.; Suja Priyadharsini, S. Adaptive Boosted Random Forest-Support Vector Machine Based Classification Scheme for Speaker Identification. Appl. Soft Comput. 2022, 131, 109826. [Google Scholar] [CrossRef]
  35. Liu, T.; Yan, D.; Wang, R.; Yan, N.; Chen, G. Identification of Fake Stereo Audio Using SVM and CNN. Information 2021, 12, 263. [Google Scholar] [CrossRef]
  36. Chau, H.-H.; Chau, Y. Audio-Based Classification of Mild Cognitive Impairment Using XGBoost. In Proceedings of the 2024 IEEE 6th Eurasia Conference on Biomedical Engineering, Healthcare and Sustainability (ECBIOS), Tainan, Taiwan, 14–16 June 2024; pp. 263–265. [Google Scholar]
  37. Wani, T.M.; Qadri, S.A.A.; Comminiello, D.; Amerini, I. Detecting Audio Deepfakes: Integrating CNN and BiLSTM with Multi-Feature Concatenation. In Proceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security, Baiona, Spain, 24–26 June 2024; pp. 271–276. [Google Scholar]
  38. Doan, T.P.; Hong, K.; Jung, S. GAN Discriminator Based Audio Deepfake Detection. In Proceedings of the 2nd Workshop on Security Implications of Deepfakes and Cheapfakes, Melbourne, VIC, Australia, 10–14 July 2023; pp. 29–32. [Google Scholar]
  39. Lapates, J.M.; Gerardo, B.D.; Medina, R.P. Performance Evaluation of Enhanced DCGANs for Detecting Deepfake Audio across Selected FoR Datasets. In Proceedings of the 2024 15th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 16–18 October 2024; pp. 54–59. [Google Scholar]
  40. Wijethunga, R.; Matheesha, D.; Al Noman, A.; De Silva, K.; Tissera, M.; Rupasinghe, L. Deepfake Audio Detection: A Deep Learning Based Solution for Group Conversations. In Proceedings of the 2020 2nd International Conference on Advancements in Computing (ICAC), Colombo, Sri Lanka, 10–11 December 2020; pp. 192–197. [Google Scholar]
  41. Volkova, M.; Andzhukaev, T.; Lavrentyeva, G.; Novoselov, S.; Kozlov, A. Light CNN Architecture Enhancement for Different Types of Spoofing Attack Detection. In Proceedings of the International Conference on Speech and Computer, Istanbul, Turkey, 20–25 August 2019; Springer: Cham, Switzerland, 2019. [Google Scholar]
  42. Sheikholeslami, S.; Ghasemirahni, H.; Payberah, A.H.; Wang, T.; Dowling, J.; Vlassov, V. Utilizing Large Language Models for Ablation Studies in Machine Learning and Deep Learning. In Proceedings of the 5th Workshop on Machine Learning and Systems, Rotterdam, The Netherlands, 30 March–3 April 2025; pp. 230–237. [Google Scholar]
Figure 1. Classification of audio deepfake generation methods.
Figure 2. A block diagram of the proposed AudioFakeNet model.
Figure 3. Waveform comparison showing amplitude variation in both original and disguised voice samples.
Figure 4. Spectrogram comparison of original and disguised voice samples.
Figure 5. MFCC comparison of both original and disguised voice samples.
Figure 6. Accuracy and loss on the public dataset versus validation accuracy and loss on the self-collected dataset.
Figure 7. Confusion matrix of the AudioFakeNet model on various datasets.
Figure 8. Receiver Operating Characteristic (ROC) curve of the AudioFakeNet model.
Figure 9. Precision-Recall (PR) curve of the AudioFakeNet model.
Figure 10. Ablation study: contribution of BiLSTM and Attention.
Table 1. Different versions of The Fake or Real Dataset [32,33].

Version & Source | Detail | Significance
for-original | Unprocessed audio samples (i.e., original) | Baseline version with class and gender imbalance.
for-norm | Normalized audio (adjusted sample rate, volume, and channels) | Reduces class and gender bias; useful for generalization.
for-2sec | 2-s truncated clips from the for-norm set | Fixed-length inputs allow uniform temporal modeling.
for-rerec | Re-recorded version of for-2sec via external devices | Simulates real-world distortions; useful for robustness testing.
Table 2. Optimized hyperparameters for the CNN-RNN model.

Hyperparameter | Value
Optimizer | Adam
Loss Function | Categorical cross-entropy
Activation Function | Leaky ReLU (CNN)
Epochs | 20
Batch Size | 32
Dropout Rate | 0.3
LSTM Units | 128
Learning Rate | 1 × 10−6
Table 3. A comparison of the performance of the AudioFakeNet model with various classification models on the “Fake or Real” public dataset.

Model | Precision | Recall | F1-Score | EER | Accuracy
Random Forest [34] | 0.79 | 0.81 | 0.80 | 0.36 | 0.81
SVM [35] | 0.80 | 0.88 | 0.84 | 0.32 | 0.84
MLP [8] | 0.82 | 0.78 | 0.80 | 0.31 | 0.78
XGBoost [36] | 0.79 | 0.84 | 0.81 | 0.34 | 0.84
CNN [5] | 0.84 | 0.82 | 0.83 | 0.29 | 0.88
CNN-BiLSTM [37] | 0.93 | 0.89 | 0.90 | 0.18 | 0.93
GAN [38] | 0.48 | 0.45 | 0.46 | 0.77 | 0.46
DCGAN [39] | 0.52 | 0.49 | 0.50 | 0.67 | 0.50
Dense model [39] | 0.70 | 0.77 | 0.73 | 0.38 | 0.79
RNN [40] | 0.88 | 0.90 | 0.90 | 0.18 | 0.92
RawNet2 [30] | 0.81 | 0.83 | 0.81 | 0.28 | 0.81
AASIST3 [18] | 0.87 | 0.87 | 0.87 | 0.22 | 0.88
LCNN [41] | 0.90 | 0.92 | 0.92 | 0.15 | 0.94
Proposed AudioFakeNet | 0.95 | 0.92 | 0.94 | 0.14 | 0.96
Table 4. AudioFakeNet validation performance on the self-collected dataset.

Metric | Value
Precision | 0.86
Recall | 0.85
F1-Score | 0.85
EER | 0.23
Accuracy | 0.88
Table 5. Ablation study showing the contribution of BiLSTM and Multi-Head Attention in the proposed AudioFakeNet model.

Model Variant | Precision | Recall | F1-Score | EER | Accuracy
V1. CNN Only | 0.84 | 0.82 | 0.83 | 0.29 | 0.88
V2. CNN + BiLSTM (No MHA) | 0.93 | 0.89 | 0.90 | 0.18 | 0.93
V3. CNN + MHA (No BiLSTM) | 0.90 | 0.88 | 0.89 | 0.16 | 0.95
V4. Proposed AudioFakeNet (Full Model) | 0.95 | 0.92 | 0.94 | 0.14 | 0.96
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


