Sensors
  • Article
  • Open Access

24 July 2023

Gaussian-Filtered High-Frequency-Feature Trained Optimized BiLSTM Network for Spoofed-Speech Classification

1 Electrical Engineering Department, Prince Mohammad bin Fahd University, P.O. Box 1664, Al Khobar 31952, Saudi Arabia
2 Department of Computer Engineering, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
3 Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
4 Department of Software Engineering, College of Computer Science and Engineering, University of Jeddah, P.O. Box 34, Jeddah 21959, Saudi Arabia
This article belongs to the Section Sensor Networks

Abstract

Voice-controlled devices are in demand due to their hands-free controls. However, using voice-controlled devices in sensitive scenarios like smartphone applications and financial transactions requires protection against fraudulent attacks referred to as “speech spoofing”. The algorithms used in spoof attacks are practically unknown; hence, further analysis and development of spoof-detection models for improving spoof classification are required. A study of the spoofed-speech spectrum suggests that high-frequency features are able to discriminate genuine speech from spoofed speech well. Typically, linear or triangular filter banks are used to obtain high-frequency features. However, a Gaussian filter can extract more global information than a triangular filter. In addition, MFCC features are preferable among other speech features because of their lower covariance. Therefore, in this study, the use of a Gaussian filter is proposed for the extraction of inverted MFCC (iMFCC) features, providing high-frequency features. Complementary features are integrated with iMFCC to strengthen the features that aid in the discrimination of spoofed speech. Deep learning has been proven to be efficient in classification applications, but the selection of its hyper-parameters and architecture is crucial and directly affects performance. Therefore, a Bayesian algorithm is used to optimize the BiLSTM network. Thus, in this study, we build a high-frequency-based optimized BiLSTM network to classify the spoofed-speech signal, and we present an extensive investigation using the ASVSpoof 2017 dataset. The optimized BiLSTM model was successfully trained in only a few epochs and achieved a 99.58% validation accuracy. The proposed algorithm achieved a 6.58% EER on the evaluation dataset, a relative improvement of 78% over the baseline spoof-identification system.

1. Introduction

Automation plays an essential role in enabling more responsive and efficient operations and tighter fraud-detection compliance. It saves time, effort, and money while decreasing manual errors and letting users focus on their primary goals. Automatic speaker authentication is a system that uses samples of human audio signals to recognize people. Entry control to restricted locations, access to confidential information, and banking applications, including cash transfers, credit card authorizations, voice banking, and other transactions, can all benefit from speaker verification. With the increasing popularity of smartphones and voice-controlled intelligent devices, all of which contain a microphone, speaker authentication technology is expected to become even more prevalent in the future.
However, this technology’s vulnerability to manipulation of the voice using presentation attacks, also known as voice spoofing, poses a challenge. Various spoofs, such as speech synthesis (SS), voice conversion (VC), replay speech, and imitation, can be used to spoof automated voice-detection systems [1]. Possible attacks on speaker-based automation systems, for example, attacks at the microphone, feature-extraction, classifier, or decision level, have been intensively examined in Reference [2]. In a replay attack, the perpetrator attempts to gain access by playing back previously recorded speech of a registered speaker. The system is particularly vulnerable to replay attacks, as voices can easily be recorded in person or through a telephone conversation and then replayed to manipulate the system. Since replay attacks require little training or equipment, they are the most common and most likely to occur. The ASVspoof 2017 dataset addresses the issue of replay spoofing detection. Previous works have extracted features that reflect the acoustic-level difference between genuine and spoofed speech for replay speech detection.
Mel frequency cepstral coefficients (MFCCs), linear prediction coefficients (LPCs), linear prediction cepstral coefficients (LPCCs), line spectral frequencies (LSFs), discrete wavelet transform (DWT) [3,4], and perceptual linear prediction (PLP) are speech feature extractions commonly used in speaker recognition as well as speaker spoofing identification [5]. A wavelet transform was used to obtain spectral features, and these features were integrated with CNN spatial features in Reference [6] for ECG classification. In Reference [7], the authors analyzed a 6–8 kHz high-frequency subband using CQCC features to investigate re-recording distortion. To capture the distortions caused by the playback device, Singh et al. [8] derived the MFCC from the residual signal. A low-frequency frame-wise normalization in the constant Q transform (CQT) domain was suggested in Reference [9] to capture the playback speech artifacts. In addition to these hand-crafted features, neural networks that learn deep features have also been studied for the recognition of playback speech. For instance, a Siamese architecture was used to learn deep features from spectrogram and group-delay representations for CNN-based detection [10]. However, such feature extraction is highly dependent on DNN training, and it may generalize poorly to ASV tasks performed outside the intended domain.
For feature extraction, the speech signal is processed in ten-millisecond frames without overlap. The signal is divided into two zones: the silent zone and the speech zone. A region with low energy and a high zero-crossing rate is considered a silent zone, whereas a region with high energy is regarded as a speech zone. Huang and Pun [11] experimented with the same person’s genuine and spoofed speech signals using a replay attack. Figure 1 shows the genuine and replayed speech signals, and it is observed that there is a difference in the silent segment shown in the red box. Thus, using the silent zone together with the high-frequency region can discriminate the spoofed speech easily. A precise recording system is required for a replay attack. The background noise of a recording device is easily noticeable in a silent zone due to its low energy relative to a highly energized speech zone. However, finding a silent zone accurately is tricky. Therefore, the endpoint method based on the zero-crossing rate and energy can be used to approximate the silent zone [12]. By adjusting the thresholds for the zero-crossing rate and short-term energy, speech and silent zones can be identified systematically.
Figure 1. Time-domain speech signal of the same person’s (upper) genuine waveform and (lower) spoofed waveform.
Although the MFCC and CQCC are considered reliable features, the classifier’s performance can be significantly improved by combining them with complementary features, which can be done at the feature or score level [13]. Pitch, residual phase, and dialectical features are a few examples of complementary features. These complementary features, i.e., high pitch and corresponding phase, can easily be obtained at high frequencies [14].
The objective of this paper is to identify spoofed speech against replay attacks. This paper presents a new approach to concatenating high-frequency features with complementary features for better classification results. Generally, linear or triangular filter banks and their inverted versions are used to obtain features from the high-frequency speech regions. In the proposed work, we emphasize the high-frequency features using a Gaussian filter.
One of the properties of the Gaussian filter is that it is a linear filter, which means that it preserves the correlation between different frequency components of the input signal. This is important in subband processing because it allows the different subbands to retain their correlation, which can be helpful in subsequent processing steps such as feature extraction or classification. The Gaussian function gives the frequency response of a Gaussian filter:
$$H(f) = e^{-\frac{f^2}{2\sigma^2}}$$
where f is the frequency and σ is a parameter known as the standard deviation. This function has a bell-shaped curve centered at zero frequency and has a smooth transition to zero as the frequency increases. When the input signal is passed through a Gaussian filter, the filter response is applied uniformly across all frequencies, with higher frequencies being attenuated more than lower frequencies. This means that the different subbands created by the Gaussian filter will have a similar frequency response, with each subband containing a range of frequencies filtered by a bandpass filter but with the same Gaussian shape. This merit of the Gaussian filter can improve the accuracy of the processing result.
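To make this behavior concrete, the following sketch evaluates a small bank of Gaussian frequency responses and reports the overlap between adjacent subbands; the center frequencies and the overlap factor are illustrative assumptions, not the values used later in the paper.

```python
import numpy as np

def gaussian_response(f, center, sigma):
    """Gaussian magnitude response centered at `center` with spread `sigma`."""
    return np.exp(-((f - center) ** 2) / (2.0 * sigma ** 2))

# Hypothetical subband centers (Hz) over an 8 kHz band; alpha controls the overlap.
freqs = np.linspace(0, 8000, 4001)
centers = np.array([1000.0, 3000.0, 5000.0, 7000.0])
alpha = 2.5                                   # assumed overlap-control factor
sigma = np.diff(centers).mean() / alpha

bank = np.stack([gaussian_response(freqs, c, sigma) for c in centers])

# Relative overlap between adjacent subbands (area of the product of their responses).
for lower, upper in zip(bank[:-1], bank[1:]):
    overlap = np.trapz(lower * upper, freqs) / np.trapz(lower, freqs)
    print(f"relative overlap with the next band: {overlap:.3f}")
```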
In summary, the following characteristics of Gaussian filters motivated us to use them for feature extraction: Firstly, the Gaussian filter allows for smooth switching between the subbands, maintaining the correlation between bands. Secondly, independent selection of the Gaussian filter’s mean and variance allows for controlled overlap between two consecutive subbands. Finally, the filter design parameters can easily be calculated from the midpoints and endpoints of the original triangular filter bank used in MFCC feature extraction. Nowadays, the convolutional neural network (CNN) performs remarkably well in classification applications. However, the architecture and hyper-parameters of the CNN are critical because they control the training algorithm’s behavior and significantly impact performance and training time. Therefore, the BiLSTM network, which considers the connection between past and present features, was selected and optimized using the Bayesian algorithm. The optimization of the BiLSTM gave a minimal architecture with optimized hyper-parameters and significantly improved the classification accuracy. The proposed model was validated with the well-known ASVSpoof 2017 dataset. The key contributions are summarized as follows:
  • A deep study on various feature sets used for classification is presented, and a multiple-feature integrated BiLSTM network is proposed.
  • Conventional MFCCs are obtained using a linear filter bank. We propose a new Gaussian-filtered inverted MFCC feature that, compared with the conventional MFCC, provides a smooth transition between the subbands and maintains correlation within the same subband.
  • RNN is an effective method for spoof classification because it can handle short-term spectral features while responding to long-term temporal occurrences. LSTM networks overcome the vanishing-gradient and long-term dependency problems of RNNs. BiLSTM was used in the proposed algorithm because the bidirectional strategy further improves recognition quality compared with the unidirectional approach.
  • A Bayesian optimization algorithm is presented to optimize the hyper-parameters of the BiLSTM while reducing its computational complexity and the number of hidden layers.
  • We used cutting-edge deep learning algorithms and compared their performance using assessment metrics.
  • We present several issues based on the experimental evaluation and recommend possible solutions.

3. Materials and Methods

3.1. Materials

This paper uses the ASVSpoof 2017 dataset provided by [2] for spoofing classification. This dataset focuses on replay attacks faced under unseen conditions, i.e., playback devices and replay environments, and is a mixture of both known and unknown scenarios. The audio data were digitized using 16-bit PCM with a 16 kHz sampling rate. Spoofed audio was created in the wild using different microphones and playback devices in different environments. Because the replayed spoofing attacks were created under an uncontrolled setup, the dataset is difficult to analyze. The dataset contains replay attacks recorded under different environments. High-quality recording devices can record audio under very low noise conditions, and the resulting replays are therefore difficult to identify. The data were divided into three categories: training, development, and evaluation datasets. The speech of the training dataset was captured at a single site, whereas the development set was gathered at two additional sites in addition to the training dataset’s location. A total of 42 speakers participated in dataset creation. The training (train) and development (dev) datasets were created with 3 and 10 types of spoofing attacks, respectively. The evaluation (eval) dataset contains audio recorded with 57 types of spoofing attacks and was gathered at two further locations. The dataset has a total of 3565 genuine recordings and 14,465 spoofed recordings. The summary of the dataset is presented in Table 1 [58].
Table 1. ASVSpoof2017 dataset summary and its utilization in the experiment.
The spectral properties of spoofed audio signals for various replay attacks are depicted in Figure 2 using high-, mid-, and weak-quality recording and playback devices. The spoofing classification for speech recorded using a high-quality recorder under very low noise conditions (i.e., Figure 2c) is more challenging. The system or model must not rely on knowledge of the specific attack in the detection process. The model’s capability depends strongly on the sorts of attacks that are represented by similar patterns in the training data. However, this requires prior knowledge of the attack type, which is not a reasonable assumption. Thus, the system must detect spoofed audio even when data for that attack were not used to train the model.
Figure 2. Spectrogram of speech signal under various attacks. (a) Genuine speech. (b) Spoofed speech using high-quality recording and playback. (c) Spoofed speech using high-quality recording and weak-quality playback. (d) Spoofed speech using mid-quality recording and low-quality playback.

3.2. Proposed Method

Feature Extraction Techniques

Genuine and spoofed audio can be discriminated using spectral characteristics. The authors in Reference [17] presented several spectral features to classify genuine audio from spoofed audio. The proposed method uses a combination of several features to train the BiLSTM network. These features are listed below:
Gaussian-Filtered Inverted MFCC (GIMFCC): MFCC computation imitates the working principle of the human ear, replicating the human hearing system. Bandpass filters with linear spacing at low frequencies and logarithmic spacing at high frequencies are employed in MFCC calculation to retain the phonetically crucial aspects of the speech signal. The speech signal consists of tones with varying frequencies, and each tone has a fundamental frequency and a corresponding Mel-scale-based subjective pitch.
Let $\{y[n]\}_{n=1}^{N}$ represent a speech frame obtained after framing and Hamming windowing. Its power spectrum is obtained from the $N$-point discrete Fourier transform (DFT) as follows:
$$Y(k) = \sum_{n=1}^{N} y[n]\, e^{-j 2\pi k n / N}$$
In the calculation of MFCC, triangle filters spaced uniformly in Mel scales were used [59]. The response of the triangle filter is as follows:
$$H_i(k) = \begin{cases} 0, & k \le k_{i-1} \\[4pt] \dfrac{k - k_{i-1}}{k_i - k_{i-1}}, & k_{i-1} \le k \le k_i \\[4pt] \dfrac{k_{i+1} - k}{k_{i+1} - k_i}, & k_i \le k \le k_{i+1} \\[4pt] 0, & \text{otherwise} \end{cases}$$
where $1 \le i \le Q$, $Q$ is the number of filters, and $k_{i-1}$, $k_i$, and $k_{i+1}$ are the boundary frequencies of the $i$th filter. The triangular filters do not account for the correlation between subbands [13]. Gaussian filters, which provide symmetric and gradually decaying responses, can compensate for this possible correlation loss. The Gaussian filter for the same band can be expressed as
$$H_i^{GMFCC}(k) = e^{-\frac{(k - k_i)^2}{2\sigma_i^2}}$$
where $\sigma_i = \frac{k_{i+1} - k_i}{\alpha}$ is the standard deviation, $\alpha$ controls the variance, and $k_i$ is the frequency point representing the mean of the $i$th Gaussian filter.
In calculating conventional MFCC, more emphasis is given to the region of low-frequency signals. However, more promising results can be obtained from the high-frequency regions [11,60]. Therefore, the new filter-bank response was obtained by flipping the original response at the mid-point of the frequency range of the speech signal. The flipped response can be expressed as:
$$\hat{H}_i(k) = H_{Q+1-i}\!\left(\tfrac{N}{2} + 1 - k\right)$$
The power spectrum obtained from Equation (5) is passed through these flipped Gaussian filter banks, yielding the Gaussian-inverted MFCCs (GIMFCCs), and the filter-bank output can be expressed as
$$E_{GIMFCC}(i) = \sum_{k=1}^{N/2} |Y(k)|^2 \, \hat{H}_i(k)$$
Later, the cepstrum coefficients are calculated from the filter bank’s output using Equation (7).
$$C_{GIMFCC}(k) = \sqrt{\frac{2}{Q}} \sum_{i=1}^{Q-1} \log\!\left(E_{GIMFCC}(i+1)\right) \cos\!\left(\frac{\pi k (i - 0.5)}{Q}\right)$$
Dynamic GIMFCC: The above cepstral coefficients contain the information for the given frame and are considered static features. Further information can be obtained by calculating the first (i.e., delta) and second (delta-delta) derivatives of these coefficients, where delta–GIMFCC provides the speech-rate information and delta-delta-GIMFCC provides speech-acceleration information. These delta features can be calculated using Equation (8).
$$\Delta C_{GIMFCC}(k) = \frac{\sum_{i=-T}^{T} i \, C_{GIMFCC}(k+i)}{\sum_{i=-T}^{T} |i|}$$
where $T$ is the number of frames used in the coefficient calculation. These two features are therefore appended to the GIMFCC in the proposed method to improve the performance at negligible computational cost. Figure 3 shows the inverted MFCC feature-extraction process.
Figure 3. Block diagram representing extraction of Inverted MFCC features emphasizing high-frequency regions.
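A minimal Python sketch of this pipeline is given below. It is not the authors’ MATLAB implementation; the Mel-scaled filter spacing, the number of filters, the value of alpha, and the DCT convention are assumptions made for illustration.

```python
import numpy as np
from scipy.fftpack import dct

def gimfcc_frame(frame, sr=16000, num_filters=20, alpha=2.0, num_ceps=13):
    """Compute GIMFCC-like coefficients for one Hamming-windowed frame (illustrative sketch)."""
    n = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n))) ** 2      # |Y(k)|^2

    # Gaussian filter bank: centers spread on the Mel scale (assumed), sigma_i = (k_{i+1} - k_i) / alpha
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), num_filters + 2))
    bins = np.floor((n + 1) * edges_hz / sr).astype(int)

    k = np.arange(len(power))
    bank = np.zeros((num_filters, len(power)))
    for i in range(num_filters):
        sigma = max(bins[i + 2] - bins[i + 1], 1) / alpha
        bank[i] = np.exp(-((k - bins[i + 1]) ** 2) / (2.0 * sigma ** 2))

    bank = bank[::-1, ::-1]                  # flip filter order and frequency axis (inverted MFCC)
    energies = np.maximum(bank @ power, 1e-10)
    return dct(np.log(energies), type=2, norm='ortho')[:num_ceps]

coeffs = gimfcc_frame(np.random.randn(512))  # example: one random 512-sample frame
```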
GTCC and its variants: The human auditory response has been modeled using Gammatone filters in many past applications. The Gammatone filter is the product of a sinusoidal tone centered at the frequency $F_c$ and a Gamma distribution function, which can be written as
$$g(t) = A\, t^{\,n-1} e^{-2\pi B t} \cos(\omega_c t + \theta)$$
where $A$ is the amplitude; $n$ is the filter order; $\omega_c$ is the central frequency; $\theta$ is the phase shift; and $B$ represents the filter bandwidth.
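As an illustration, the Gammatone impulse response can be evaluated as sketched below; the amplitude, order, bandwidth, and center frequency are placeholder values, not the filter-bank settings used in the paper.

```python
import numpy as np

def gammatone(t, A=1.0, n=4, B=125.0, fc=1000.0, theta=0.0):
    """Gammatone impulse response g(t) = A * t^(n-1) * exp(-2*pi*B*t) * cos(2*pi*fc*t + theta)."""
    return A * t ** (n - 1) * np.exp(-2.0 * np.pi * B * t) * np.cos(2.0 * np.pi * fc * t + theta)

sr = 16000
t = np.arange(0, 0.05, 1.0 / sr)   # 50 ms support
g = gammatone(t)                   # one filter of an assumed Gammatone bank
```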
The GTCC coefficients are computed similarly to the MFCC coefficients, with Gammatone filters used instead of triangular filters. The coefficients can be expressed as:
$$GTCC(k) = \sqrt{\frac{2}{Q}} \sum_{i=1}^{Q-1} \log\!\left(E_{GTCC}(i+1)\right) \cos\!\left(\frac{\pi k (i - 0.5)}{Q}\right)$$
where $E_{GTCC}$ is the power spectrum obtained by passing the speech frame through the Gammatone filters $g(t)$. Further features, $\Delta GTCC$ and $\Delta\Delta GTCC$, are also extracted, as explained in the previous subsection.
Spectral Features: Emotion is another influencing factor in the speech signal; it changes the vocal-cord vibration, and the speech signal’s spectral features vary accordingly. The spectral centroid gives the geometric center of the spectrum; it is a weighted sum of frequency components where the weights are assigned based on the normalized energy. The spectral flatness determines the degree of periodicity. The spectral entropy assesses the randomness of the spectral probability distribution; it can be analyzed for both voiced and unvoiced frames and thus helps to characterize the high-frequency regions. In addition, other spectral features, including skewness, roll-off points, and pitch, are also incorporated to improve the classification.
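A compact sketch of how such frame-level descriptors can be computed from a power spectrum is given below; the exact feature set and normalizations used in the paper may differ.

```python
import numpy as np

def spectral_descriptors(power, freqs):
    """Spectral centroid, flatness, and (normalized) entropy of one frame's power spectrum."""
    p = power / (power.sum() + 1e-12)                      # normalized energy weights
    centroid = np.sum(freqs * p)                           # weighted geometric center of the spectrum
    flatness = np.exp(np.mean(np.log(power + 1e-12))) / (np.mean(power) + 1e-12)
    entropy = -np.sum(p * np.log2(p + 1e-12)) / np.log2(len(p))
    return centroid, flatness, entropy

# Example usage on a random power spectrum over 0-8 kHz
freqs = np.linspace(0, 8000, 257)
print(spectral_descriptors(np.abs(np.random.randn(257)) ** 2, freqs))
```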
The overall process of extracting various features from the speech signal is shown in Figure 4. These features are used in the proposed optimized BiLSTM network for further classification.
Figure 4. Process flow for feature extraction from the speech signal.
Before applying the proposed inverted high-frequency MFCC features to construct ASVSpoof systems, their ability to distinguish between genuine and spoofed voice segments was studied using t-stochastic neighborhood embedding (t-SNE) [61] visualization. The t-SNE was plotted from the feature sets of the training dataset. Figure 5 shows that the genuine- and spoof-class features are separated. Thus, the t-SNE plot indicates a clear difference between the two classes.
Figure 5. Features’ t-SNE visualization for training dataset.
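This visualization can be reproduced along the following lines; the feature matrix and labels below are random placeholders standing in for the training-set features, and the perplexity setting is illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data: per-utterance feature vectors X and labels y (0 = genuine, 1 = spoof)
X = np.random.randn(500, 428)
y = np.random.randint(0, 2, 500)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(emb[y == 0, 0], emb[y == 0, 1], s=8, label='genuine')
plt.scatter(emb[y == 1, 0], emb[y == 1, 1], s=8, label='spoof')
plt.legend()
plt.title('t-SNE of training features')
plt.show()
```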

3.3. Proposed Optimized BiLSTM

The recurrent neural network (RNN) is widespread in sequential data processing and classification. The conventional deep network uses the features independently, ignoring the correlation between the present and the following features. In contrast, the RNN incorporates the dependencies between the features by storing previous input information at each hidden layer to generate the subsequent output. However, a plain RNN is only effective over short time spans; the LSTM is designed to handle long time spans. A BiLSTM network consisting of two LSTMs is intended to improve the performance further. One LSTM processes the input in the forward direction, and the second LSTM processes the data in the backward direction. Thus, using data in both the forward and backward directions makes the network more efficient than other deep networks. Hence, the proposed method uses a BiLSTM-based deep learning model.
First, let us define the input sequence as $x = (x_1, x_2, \ldots, x_T)$, where $T$ is the sequence length. The forward LSTM layer takes this input sequence and produces a hidden state sequence $h_f = (h_{f1}, h_{f2}, \ldots, h_{fT})$, while the backward LSTM layer takes the input sequence in reverse order and produces a hidden state sequence $h_b = (h_{b1}, h_{b2}, \ldots, h_{bT})$.
The hidden state sequence $h_f$ is computed using the following equations:
$$\begin{aligned}
i_t &= \mathrm{sigmoid}(W_{xi} x_t + W_{hi} h_{t-1} + b_i)\\
f_t &= \mathrm{sigmoid}(W_{xf} x_t + W_{hf} h_{t-1} + b_f)\\
o_t &= \mathrm{sigmoid}(W_{xo} x_t + W_{ho} h_{t-1} + b_o)\\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)\\
c_t &= f_t \times c_{t-1} + i_t \times g_t\\
h_t &= o_t \times \tanh(c_t)
\end{aligned}$$
where $i$, $f$, $o$, and $g$ are the input, forget, output, and cell gates, respectively; $W_{xi}$, $W_{hi}$, $W_{xf}$, $W_{hf}$, $W_{xo}$, $W_{ho}$, $W_{xg}$, and $W_{hg}$ are weight matrices; $b_i$, $b_f$, $b_o$, and $b_g$ are bias vectors; and $\tanh$ and sigmoid are activation functions.
Similarly, the hidden state sequence $h_b$ is computed using the following equations:
$$\begin{aligned}
i'_t &= \mathrm{sigmoid}(W'_{xi} x'_t + W'_{hi} h'_{t+1} + b'_i)\\
f'_t &= \mathrm{sigmoid}(W'_{xf} x'_t + W'_{hf} h'_{t+1} + b'_f)\\
o'_t &= \mathrm{sigmoid}(W'_{xo} x'_t + W'_{ho} h'_{t+1} + b'_o)\\
g'_t &= \tanh(W'_{xg} x'_t + W'_{hg} h'_{t+1} + b'_g)\\
c'_t &= f'_t \times c'_{t+1} + i'_t \times g'_t\\
h'_t &= o'_t \times \tanh(c'_t)
\end{aligned}$$
where $x'_t = x_{T-t+1}$ and $h'_t = h_{T-t+1}$ are the inputs to the backward LSTM layer, and the weight matrices and bias vectors carry a prime symbol to distinguish them from those of the forward LSTM layer.
Finally, the output of the BiLSTM network was obtained by concatenating the hidden state sequences from both directions:
$$h_t^{BiLSTM} = [h_t, h'_t]$$
where $[\cdot,\cdot]$ denotes concatenation.
The BiLSTM can capture long-term dependencies in the data, thereby avoiding the vanishing/exploding gradient problem. Its internal long-term memory helps the BiLSTM learn the data more explicitly than standard deep learning models. However, the BiLSTM is more complex, and its use of two LSTMs increases the computational cost. Therefore, an optimal network with few hidden layers and optimal hyper-parameters must be identified before classification.
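For reference, a minimal BiLSTM classifier of this kind can be sketched in PyTorch as follows; the input dimension, hidden size, and the use of the last time step for classification are illustrative choices, and the paper’s own network was built and optimized in MATLAB.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Two stacked bidirectional LSTM layers followed by a fully connected classifier."""
    def __init__(self, input_dim=428, hidden_dim=50, num_classes=2):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)   # forward + backward states concatenated

    def forward(self, x):                  # x: (batch, time, input_dim)
        out, _ = self.bilstm(x)
        return self.fc(out[:, -1, :])      # classify from the last time step

model = BiLSTMClassifier()
logits = model(torch.randn(4, 20, 428))    # e.g., sequences of 20 feature vectors
```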
A deep learning architecture has two kinds of model parameters: trainable parameters, i.e., the weights adjusted in the training process, and hyper-parameters, i.e., the learning rate, the number of hidden layers, the number of neurons in each layer, etc. The selection of these parameters is critical and plays an essential role in the algorithm’s performance. Traditionally, the grid search algorithm has been used to evaluate these parameters. In the grid search method, the parameter ranges are divided into even grids, and the model is assessed for all possible combinations of these parameters; therefore, it is highly time-consuming. Another alternative is random search, where a random assortment of parameters is evaluated to find the best values. However, it does not guarantee optimum values due to the random selection of combinations.
Bayesian optimization is a well-known approach for optimizing discrete, non-differentiable, and time-consuming objective functions. It is based on evaluations of the objective function, which is modeled using a Gaussian process. It employs this Gaussian probabilistic model to predict performance and maximize the efficiency of the following samples [62]. Thus, it can find the best parameters in minimal time, in contrast to grid search and random search. In addition, grid and random searches do not learn from the already-evaluated sets of parameters during the tuning process [63]. Therefore, Bayesian optimization is used in the proposed method to find the best architecture and corresponding parameters for the BiLSTM network. The overall process for spoofed-audio classification with the optimized BiLSTM network is shown in Figure 6. The dataset is split into training, testing, and evaluation datasets. The training dataset trains the model (the optimal BiLSTM), and the model’s parameters are adjusted in accordance with the findings of the comparison and the learning. The fitted model is then used to predict the observations in a second dataset, known as the test dataset. While the model’s hyper-parameters are adjusted, the test dataset evaluates the model fitted on the training dataset. Ultimately, the final model is objectively assessed using the evaluation dataset. Various features, including the MFCC and its variants, the GTCC and its variants, spectral features, and pitch information, were used to train the optimized BiLSTM network.
Figure 6. Process flow of the proposed algorithm.
Optimization of BiLSTM using Gaussian-oriented Bayesian approach: A Bayesian model can find the optimum values of all parameters in minimal time. The process of optimization is categorized into three steps:
Step 1: The first step is to build the Gaussian model for predicting the results of adjusting the parameters. Initially, the hyper-parameters are completely unknown; the Gaussian model learns the scale over which the hyper-parameters should be set. This scale helps to determine how much deviation is needed to obtain drastically different results. For each prediction, the model generates a Gaussian-distributed curve, whose sharpness and variance indicate how consistent the prediction is with the training observations. A sharp Gaussian curve with a small variance indicates better consistency in the prediction and better tuning of the parameters. Let $f(x)$ be the model function over the known values of the training data $x$ and $f(x^*)$ be the model function over a set of test data $x^*$. Then, the multivariate Gaussian distribution function can be expressed as
$$\begin{bmatrix} f(x) \\ f(x^*) \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu(x) \\ \mu(x^*) \end{bmatrix}, \begin{bmatrix} \gamma(x, x) & \gamma(x, x^*) \\ \gamma(x^*, x) & \gamma(x^*, x^*) \end{bmatrix} \right)$$
where $\mu(x)$ is the mean and $\gamma(x, x)$ is the covariance of the training data, and $\mu(x^*)$ and $\gamma(x^*, x^*)$ are the mean and covariance of the test data.
This calculation can be extended for a large set of datasets or to multiple dimensions as follows:
$$\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_t) \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu(x_1) \\ \vdots \\ \mu(x_t) \end{bmatrix}, \begin{bmatrix} \gamma(x_1, x_1) & \cdots & \gamma(x_1, x_t) \\ \vdots & \ddots & \vdots \\ \gamma(x_t, x_1) & \cdots & \gamma(x_t, x_t) \end{bmatrix} \right)$$
The posterior distribution at the $t$th point is calculated using the previously obtained $t-1$ observation points as follows:
$$\mu = \gamma(x, x_{1:t-1}) \, \gamma(x_{1:t-1}, x_{1:t-1})^{-1} f(x_{1:t-1})$$
$$\sigma^2 = \gamma(x, x) - \gamma(x, x_{1:t-1}) \, \gamma(x_{1:t-1}, x_{1:t-1})^{-1} \gamma(x_{1:t-1}, x)$$
The above two equations for the mean and variance compute the distribution at any desired point $x$. This constitutes the Bayesian model of the given objective function.
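A sketch of these posterior computations with a squared-exponential covariance is shown below; the kernel choice, length scale, and example observations are assumptions made for illustration.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential covariance gamma(a, b) between two sets of scalar points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_new, x_obs, f_obs, length=1.0, noise=1e-6):
    """Posterior mean and variance at x_new given the observations (x_obs, f_obs)."""
    K = rbf(x_obs, x_obs, length) + noise * np.eye(len(x_obs))
    k_star = rbf(x_new, x_obs, length)
    K_inv = np.linalg.inv(K)
    mu = k_star @ K_inv @ f_obs
    var = rbf(x_new, x_new, length).diagonal() - np.sum(k_star @ K_inv * k_star, axis=1)
    return mu, var

x_obs = np.array([0.1, 0.4, 0.9])        # previously evaluated hyper-parameter values (assumed)
f_obs = np.array([0.30, 0.12, 0.25])     # corresponding validation errors (assumed)
mu, var = gp_posterior(np.linspace(0, 1, 5), x_obs, f_obs)
```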
Step 2: The second step is to find the hyper-parameters that maximize the objective function, i.e., accuracy, or minimize it, i.e., error. In the proposed model, the classification error on the testing dataset is used as the objective function; therefore, we model the error function using the Gaussian process. After the $n$th iteration, the minimum value $f^*$ of the posterior function over the test data $x^*$ is obtained. After one more iteration, the posterior value of the function is updated, i.e., the updated posterior value at point $x$ is $f(x)$. If the maximum number of iterations has been reached, the iteration stops here, and the optimum value is obtained as $\min(f^*, f(x))$. Thus, the expected difference between the updated posterior value $f(x)$ and the previous best value $f^*$ indicates the reduction in the error. This is expressed as
$$\text{Expected improvement} = \mathbb{E}\!\left[ f^* - \min\!\left(f^*, f(x)\right) \right]$$
Equation (18) can be rewritten as Equation (19), where $n$ indicates the completed iterations, $f^*$ is the optimum value obtained through $n$ iterations, $p$ is the probability density function, and $P$ is the cumulative density function. The objective function is then evaluated at the point maximizing the expected improvement.
$$\text{Expected improvement} = \left[f^* - \mu_n(x)\right]^{+} + \sigma_n(x)\, p\!\left(\frac{f^* - \mu_n(x)}{\sigma_n(x)}\right) - \left|f^* - \mu_n(x)\right|\, P\!\left(-\frac{\left|f^* - \mu_n(x)\right|}{\sigma_n(x)}\right)$$
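In code, this closed-form expected improvement for minimization can be evaluated as below; this is a sketch, and in practice `mu` and `sigma` would come from the Gaussian-process posterior of the validation error.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Expected improvement of candidate points when minimizing the objective."""
    sigma = np.maximum(sigma, 1e-12)
    delta = f_best - mu                      # predicted improvement over the incumbent best
    z = delta / sigma
    return np.maximum(delta, 0.0) + sigma * norm.pdf(z) - np.abs(delta) * norm.cdf(-np.abs(z))

# Example: candidate with posterior mean 0.10 and std 0.03, incumbent error 0.12
print(expected_improvement(np.array([0.10]), np.array([0.03]), 0.12))
```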
Step 3: All hyper-parameters obtained in all iterations are stored for evaluation. This step requires evaluating the objective function using these stored parameters. The smallest objective value over all iterations gives the optimum values of the network’s hyper-parameters.
The overall algorithm to find optimum parameters for the BiLSTM network is as follows (Algorithm 1):
Algorithm 1 Bayesian optimization algorithm to tune BiLSTM parameters
1: Specify the range of parameters to be optimized
2: Define the objective function using the network's validation accuracy and error
3: while n > 0 do
4:    Search the hyper-parameter space.
5:    Create a BiLSTM network using the sampled hyper-parameters.
6:    Train the BiLSTM using the training dataset.
7:    Evaluate the performance of the BiLSTM using the testing dataset.
8:    Record the score history of the BiLSTM.
9:    Update the probability model of the BiLSTM.
10: end while
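Algorithm 1 can be prototyped with an off-the-shelf Bayesian optimizer. The sketch below uses scikit-optimize with illustrative search ranges and a dummy surrogate standing in for the BiLSTM training and testing steps (steps 5–8); it is not the paper’s MATLAB implementation.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real
from skopt.utils import use_named_args

space = [
    Integer(16, 128, name='hidden_units'),                      # illustrative ranges, not the paper's
    Real(1e-4, 1e-1, prior='log-uniform', name='learning_rate'),
    Real(0.5, 0.99, name='momentum'),
    Real(1e-6, 1e-2, prior='log-uniform', name='l2_regularization'),
]

def train_and_validate(hidden_units, learning_rate, momentum, l2_regularization):
    # Placeholder for steps 5-8 of Algorithm 1: build, train, and test the BiLSTM,
    # then return the validation classification error. A dummy surrogate is used here.
    return (learning_rate - 1e-2) ** 2 + 1.0 / hidden_units

@use_named_args(space)
def objective(**params):
    return train_and_validate(**params)      # quantity being minimized (validation error)

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print('best error:', result.fun, 'best hyper-parameters:', result.x)
```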

4. Experimentation and Analysis

4.1. Experiment Setup

Traditional methods use only a training dataset to train the model. In the proposed model, the training and development datasets are combined and used to train and test the BiLSTM. A total of 3779 (i.e., 80%) audio signals from the training and development datasets were used for training, and the remaining 945 (i.e., 20%) audio signals were used to test the model. The evaluation dataset was used to quantify the accuracy of the proposed model. Traditional statistical methods, e.g., the Gaussian mixture model, use the log-likelihood ratio to classify a given speech signal as genuine or spoofed. However, the Equal Error Rate (EER) is the most popular metric for quantifying a classifier’s performance in machine learning. EER computation involves false acceptances and false rejections, i.e., spoofed signals identified as genuine and genuine signals identified as spoofed. The confusion matrix [64] provides all the quantitative parameters needed to validate the model. The confusion matrix is based on four parameters: true positive (TP), false positive (FP), true negative (TN), and false negative (FN), where TP gives the correct prediction of the positive class, i.e., genuine audio signals, TN gives the correct prediction of the negative class, i.e., spoofed audio, FP gives the incorrect prediction of the positive class, and FN gives the incorrect prediction of the negative class. The EER is the operating point at which the false acceptance rate and the false rejection rate are equal; a lower EER indicates better classification accuracy.
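For reference, the EER can be computed from per-utterance scores as sketched below, assuming higher scores indicate genuine speech; the variable names and example values are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where the false acceptance and false rejection rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = genuine, 0 = spoof
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.2, 0.6, 0.7])
print(equal_error_rate(labels, scores))
```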
The first step is the segmentation of the speech signal. A short-time zero-crossing rate is calculated over 10 ms Hamming windows. Based on a threshold, silent and tail segments of 512 ms are detected. If a segment is shorter than this length, it is duplicated to reach the required length. Features are then extracted from each segment and used in the BiLSTM network. During the training step, the weights are initialized randomly. The parameter ranges are initialized and supplied to the Bayesian model for optimization. The proposed BiLSTM algorithm was implemented in MATLAB R2022 on a PC with an Intel i7 processor (1.90 GHz) and 8 GB RAM.
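The zero-crossing-rate and energy-based detection of silent frames described above can be sketched as follows; the thresholds are illustrative assumptions rather than the tuned values used in the experiments.

```python
import numpy as np

def silent_mask(signal, sr=16000, frame_ms=10, zcr_thresh=0.25, energy_thresh=0.02):
    """Mark 10 ms frames as silent when short-term energy is low and zero-crossing rate is high."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    window = np.hamming(frame_len)
    mask = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len] * window
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        mask[i] = (energy < energy_thresh) and (zcr > zcr_thresh)
    return mask   # True where the frame is judged silent

mask = silent_mask(np.random.randn(16000))   # example: one second of random audio
```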

4.2. Analysis

The top layer of the BiLSTM network is the sequence input layer. We created the BiLSTM structure by stacking two BiLSTM layers, followed by a fully connected layer of 50 neurons, a softmax layer, and a classification layer. Finding optimal parameters is a challenge when structuring neural networks; therefore, the design incorporates the Bayesian optimization model. The layer information of the BiLSTM is shown in Table 2.
Table 2. Proposed BiLSTM network’s parameters.
The Bayesian approach needs a range for each parameter to be optimized. The proposed method optimizes the number of hidden layers, the learning rate, the momentum, and the regularization parameters. Minimization of the validation error was used as the objective function to find the optimum values of these parameters. A feature matrix of size 400 × 428 was used, where 400 is the sequence length and 428 is the feature dimension. At training time, the weights were randomly initialized. We buffered the feature vectors into sequences of 20 feature vectors with an overlap of 10. The mini-batch size was kept at 256. An adaptive learning rate was used, and the drop factor of the learning rate was kept at 0.1. Figure 7 represents the minimization of the objective function based on the tuning of the various parameters in the BiLSTM network.
Figure 7. Minimization of the objective function over a number of iterations to find optimum parameters of BiLSTM network.
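Buffering the frame-level features into overlapping sequences, as described above (length 20, overlap 10), can be sketched as follows; the feature matrix shape is an assumption for illustration.

```python
import numpy as np

def buffer_sequences(features, seq_len=20, overlap=10):
    """Stack frame-level feature vectors into overlapping sequences for the BiLSTM."""
    step = seq_len - overlap
    starts = range(0, len(features) - seq_len + 1, step)
    return np.stack([features[s:s + seq_len] for s in starts])

frames = np.random.randn(400, 428)          # 400 frames, 428-dimensional features (assumed)
sequences = buffer_sequences(frames)        # shape: (num_sequences, 20, 428)
```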
The initial range and optimum value obtained using the approach are presented in the following Table 3. These parameters were used in the BiLSTM structure for classification. After optimization, the number of trainable parameters was only 252,202.
Table 3. Range of hyper-parameters initialized and obtained optimum value.
Figure 8 represents the training and validation accuracy obtained over the epochs, and the confusion matrix for the validation dataset is shown in Figure 9. In the confusion matrix, diagonal cells show correctly classified observations, and off-diagonal cells show incorrectly classified observations. Because of the optimization, the obtained BiLSTM structure was minimal, and within only four epochs, the training of the network was completed with a 99.58% validation accuracy and a 2.64% EER.
Figure 8. Analysis of accuracy and loss over epochs in the training of the optimized BiLSTM network using the ASVSpoof2017 dataset.
Figure 9. Confusion matrix for validation dataset.
The training dataset contains genuine speech and replay-attack spoofed speech produced using high-quality recording devices. The validation dataset contains genuine speech and spoofed speech, where the spoofed speech is generated using both mid- and high-quality recording devices. The BiLSTM model was trained using both the training and development datasets to verify its accuracy against replay attacks. The confusion matrices plotted in Figures 9 and 10 show the model’s success in tackling all types of attacks on the validation and evaluation datasets.
Figure 10. Confusion matrix for evaluation dataset including all attacks.
The evaluation dataset contains speech recorded and played back using devices of all three qualities (i.e., low, mid, and high), which makes it challenging. Out of 1298 genuine speech samples, 94.9% were classified correctly, and 5.1% were misclassified.
The accuracy, precision, and recall analysis for the two categorical attacks is listed in Table 4. The experiment achieved a higher recall than precision; thus, the proposed model succeeded in detecting attacks reliably. The ROC curve, obtained by plotting the true-positive rate against the false-positive rate, is shown in Figure 11. This curve validates the classification accuracy, with an area under the curve of 96%.
Table 4. Accuracy, precision and recall analysis for development and evaluation datasets.
Figure 11. Receiver operating characteristic curve.
In our method, feature integration and BiLSTM optimization play an important role. Initially, we explored the impact of the features on the optimized classification model. Three main feature groups were used in the experiment: MFCC, GTCC, and spectral features. The results for MFCC, the integration of MFCC with GTCC, and the proposed Gaussian-oriented MFCC feature sets and their integration with the remaining features are shown in Figure 12. The MFCC features alone did not work well; integrating MFCC-oriented high-frequency features with GTCC performed well for the development dataset, while the Gaussian-filtered features provided the best overall performance.
Figure 12. Performance of the optimized BiLSTM using different feature sets.
Table 5 shows a comparison between the proposed model and other models in the literature. In Reference [65], the authors adopted the conventional CNN-based LCNN-FFT model, where the local interpretable model-agnostic explanations (LIME) algorithm was used to find characteristics from the audio signals’ spectral and temporal data. Later, they preprocessed the audio by removing artifacts from the signals and improved the EER on the evaluation dataset to 7.8%. This LIME-based model is computationally complex and needs audio preprocessing. Yoon et al. [66] proposed a new replay attack for the speaker verification system. Initially, they tested various models on the ASVSpoof2017 dataset. They observed that the LSTM classifier using spectrogram features and the ResNet-18 classifier using GD-gram features failed to differentiate genuine speech from spoofed speech. According to them, the authors of the dataset have not addressed the new replay attacks through their statistical or prototype models.
In Reference [67], a cochlear filter is proposed to obtain cepstral coefficients using instantaneous frequency (CFCCIF). Further, they used quadrature-phase signals using the Hilbert transform with CFCC to improve the feature. They used various CNN models, including light-CNN and ResNet. They obtained a 2.33% EER on the development dataset. However, the EER for the evaluation dataset was 12.88%. One of the weaknesses of this model is that it does not resolve all paradoxes related to instantaneous frequency and its estimation. Bharath and Rajesh Kumar [68] also presented a high-frequency feature-based algorithm. Initially, the glottal excitation spectrum is obtained using adaptive inverse filtering. Finally, CQCC features are extracted from the obtained spectrum for spoofing classification. The EER for the development and evaluation datasets was reduced greatly by the suggested approach to 3.68% and 8.32%, respectively. One of the weaknesses of the CNN is its uncertainty, which may occur due to the use of the softmax layer in the CNN. A Bayesian approach was proposed in Reference [69], which strengthened the assessment of uncertainty in CNN. However, their EER was quite high for a development dataset.
Table 5. Comparison of % EER for various features and classifier models using development and evaluation datasets.

| Ref | Features | Classifier | Dev EER (%) | Eval EER (%) |
| --- | --- | --- | --- | --- |
| [70] | Inverted constant Q-features and CQCC | DNN | 2.629 | 7.777 |
| [67] | CFCCIF and quadrature-phase signals | ResNET | 2.33 | 12.88 |
| [71] | LC-GRNN features from spectrogram | PLDA | 3.26 | 6.08 |
| [72] | MFCC + Fbank | LSTM, GRU RNN | 6.32 | 9.81 |
| [68] | Iterative adaptive inverse filtering-based glottal information + CQCC | Gaussian mixture model | 3.68 | 8.32 |
| [69] | CQCC | LCNN | 21.73 | 8.20 |
| [73] | Normalized log-power magnitude spectrum using Q-transform and FFT | Conventional CNN + RNN | 3.95 | 6.73 |
| [65] | Local interpretable model-agnostic explanations | LCNN-FFT | 7.6 | 10.6 |
| [66] | Spectrogram | LSTM | - | 21.0602 |
| [66] | Group delay-gram | ResNET-18 | - | 35.35 |
| [11] | Linear filter bank-based high-frequency features | DenseNET + BiLSTM | 2.79 | 6.43 |
| [74] | MFCC + CQCC high-frequency features | DenseNET + LSTM | 3.62 | 8.84 |
| [47] | Improved TEO | LCNN | 6.98 | 13.54 |
| [43] | CQCC + deep learning features | LSTM | - | 7.73 |
| Proposed | Static + dynamic GIMFCC + GTCC + spectral features | Optimized BiLSTM | 1.02 | 6.58 |
In Reference [47], TEO features were improved with signal-mass consideration. The TEO operator only uses energy in low-frequency regions, and the signal mass compensates for the energy of high-frequency regions, providing a more precise signal-energy estimation. A lightweight CNN model achieved 6.98% and 13.54% EERs. In contrast, the GMM model performed well, with a 10.75% EER on the evaluation dataset. Chen et al. [72] validated two recurrent models, i.e., the LSTM and the Gated Recurrent Unit (GRU) RNN. They evaluated these models using the MFCC and FBank features. The GRU model performed best with FBank features, with EERs of 6.32% and 9.81% for the development and evaluation datasets. In Reference [74], a silent segment was extracted from the speech signal using the zero-crossing rate, and CQCC high-frequency features in the 3 to 8 kHz frequency range were obtained. In addition, they proposed a DenseNet-LSTM network as a classifier, compared with a CNN-RNN. This model succeeded in lowering the EER to 3.62% and 8.84% on the development and evaluation datasets, respectively. Later, the authors in Reference [11] modified the classifier to a DenseNet-BiLSTM network with an attention mechanism. Their model outperformed the LSTM and GMM models with an EER of 6.43% on the evaluation dataset. However, in both models, the high-frequency features rely heavily on segment-based linear filter banks, which are sensitive to the background sound and playback devices. A convolutional feature using a light convolutional gated RNN (LC-GRNN) was extracted in Reference [71]. Three machine learning models were tested for all possible attacks, including the SVM, linear discriminant analysis (LDA), and probabilistic LDA. Among all classifiers, probabilistic LDA performed best. However, the major drawbacks of the RNN are its vanishing gradient and long-term dependency problems. Huang et al. [75] used CQCC features together with features learned from the raw data using deep learning. These two feature sets were used in an LSTM network for classification and achieved a 7.73% EER on the evaluation dataset.
High-frequency features work well, as shown in Reference [11], which used a triangular linear filter bank to obtain high-frequency features and a combination of DenseNET and BiLSTM as the classifier, achieving a 6.43% EER. Due to the prominent characteristics of the Gaussian filter explained in Section 1, the Gaussian-filter-based high-frequency features combined with the complementary features worked well with the optimized BiLSTM model. Because the high-frequency features discriminate between spoofed and genuine speech signals, the BiLSTM model could classify speech with the best accuracy and the lowest EER. In addition, the BiLSTM overcomes the weakness of the RNN by tackling the vanishing-gradient problem and therefore performed well compared with the RNN. The proposed model obtained overall EERs of 1.02% and 6.58% on the validation and evaluation datasets, the lowest among all the models presented in Table 5.

5. Conclusions

The manipulation of genuine speech is a common threat and a prime concern for system vulnerability. Though baseline features are reliable, the fusion of features and better classifiers is expected to strengthen the system. The present study reveals that high-frequency features can easily differentiate spoofed speech from genuine speech. In addition, complementary features based on the high-frequency region of the speech signal can help to improve the classification accuracy. This work presents the use of high-frequency features for spoof-speech detection. A Gaussian filter, offering good correlation among the subbands and allowing a smooth transition between them, is proposed to extract high-frequency features, in contrast to linear or triangular filter banks. Inverted MFCC-based high-frequency features using a Gaussian filter are obtained from the high-frequency region of the speech signals. To use these features efficiently, a BiLSTM architecture offering a correlation between past and present features is explored. Further, optimizing the BiLSTM network using the Bayesian algorithm yielded a minimal architecture with the best hyper-parameters. The integrated features with the optimized BiLSTM network performed better than the baseline and state-of-the-art algorithms.
The proposed model was evaluated using the ASVSpoof 2017 dataset. In the baseline GMM algorithm, only the training dataset was used to train the model. In contrast, in the proposed approach, both the training and development datasets were used in the training process, and performance was evaluated on both the development and evaluation datasets. The proposed model achieved 1.02% and 6.58% EERs, a relative reduction of 78%. The state-of-the-art comparison with other CNN-classifier-based algorithms shows that the proposed model achieved better performance with a minimal architecture.
The limitation of our work is that we did not investigate the performance of our method in cross-dataset scenarios. Future research will enhance the anti-spoofing countermeasures’ capacity to generalize under many circumstances. In addition, the effect of noise is more considerable at high frequencies; because this region carries the discriminative information, even small amounts of noise significantly impact the signal [11]. Although the experimental results show the efficiency of the proposed features, they may perform worse for noisy signals. Features describing the glottal source are another alternative to the MFCC and can be calculated using single-frequency filtering, zero-time windowing, and zero-frequency filtering. Therefore, further investigation in the presence of noise, as well as the adoption of alternative features, is required.

Author Contributions

Conceptualization, H.M. and J.F.A.-A.; methodology, H.M. and Q.N.; software, H.M. and F.A.A.; validation, A.H.K., S.E.-N. and N.A.A.; investigation, F.A.A. and Q.N.; resources, S.E.-N.; writing—original draft preparation, N.A.A.; writing—review and editing, H.M. and J.F.A.-A.; supervision, H.M.; project administration, N.A.A. and Q.N.; funding acquisition, N.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023R410), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article [2].

Acknowledgments

The authors are thankful to Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023R410), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
MFCC: Mel Frequency Cepstral Coefficients
GIMFCC: Gaussian-Inverted MFCC
GTCC: Gammatone Cepstral Coefficients
LPC: Linear Prediction Coefficients
LFCC: Linear Frequency Cepstral Coefficients
ECG: Electrocardiogram
CQCC: Constant Q Cepstral Coefficients
CQT: Constant Q Transform
GMM: Gaussian Mixture Model
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
BiLSTM: Bidirectional LSTM
SVM: Support Vector Machine
EER: Equal Error Rate
XGBoost: Extreme Gradient Boosting
ReLU: Rectified Linear Unit
ResNET: Residual Neural Network
PCM: Pulse Code Modulation
t-SNE: t-Stochastic Neighborhood Embedding

References

  1. Wu, Z.; Evans, N.; Kinnunen, T.; Yamagishi, J.; Alegre, F.; Li, H. Spoofing and countermeasures for speaker verification: A survey. Speech Commun. 2015, 66, 130–153. [Google Scholar] [CrossRef]
  2. Kinnunen, T.; Sahidullah, M.; Delgado, H.; Todisco, M.; Evans, N.; Yamagishi, J.; Lee, K.A. The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection; The International Speech Communication Association: Berlin, Germany, 2017. [Google Scholar]
  3. Ghaderpour, E.; Pagiatakis, S.D.; Hassan, Q.K. A survey on change detection and time series analysis with applications. Appl. Sci. 2021, 11, 6141. [Google Scholar] [CrossRef]
  4. Mewada, H.K.; Patel, A.V.; Chaudhari, J.; Mahant, K.; Vala, A. Wavelet features embedded convolutional neural network for multiscale ear recognition. J. Electron. Imaging 2020, 29, 043029. [Google Scholar] [CrossRef]
  5. Alim, S.A.; Rashid, N.K.A. Some Commonly Used Speech Feature Extraction Algorithms; IntechOpen: London, UK, 2018. [Google Scholar]
  6. Mewada, H. 2D-wavelet encoded deep CNN for image-based ECG classification. In Multimedia Tools and Applications; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1–17. [Google Scholar]
  7. Witkowski, M.; Kacprzak, S.; Zelasko, P.; Kowalczyk, K.; Galka, J. Audio Replay Attack Detection Using High-Frequency Features. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 27–31. [Google Scholar]
  8. Singh, M.; Pati, D. Usefulness of linear prediction residual for replay attack detection. AEU-Int. J. Electron. Commun. 2019, 110, 152837. [Google Scholar] [CrossRef]
  9. Yang, J.; Das, R.K. Low frequency frame-wise normalization over constant-Q transform for playback speech detection. Digit. Signal Process. 2019, 89, 30–39. [Google Scholar] [CrossRef]
  10. Sriskandaraja, K.; Sethu, V.; Ambikairajah, E. Deep siamese architecture based replay detection for secure voice biometric. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 671–675. [Google Scholar]
  11. Huang, L.; Pun, C.M. Audio Replay Spoof Attack Detection by Joint Segment-Based Linear Filter Bank Feature Extraction and Attention-Enhanced DenseNet-BiLSTM Network. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1813–1825. [Google Scholar] [CrossRef]
  12. Zaw, T.H.; War, N. The combination of spectral entropy, zero crossing rate, short time energy and linear prediction error for voice activity detection. In Proceedings of the 2017 20th International Conference of Computer and Information Technology (ICCIT), Dhaka, Bangladesh, 22–24 December 2017; pp. 1–5. [Google Scholar]
  13. Singh, S.; Rajan, E. Vector quantization approach for speaker recognition using MFCC and inverted MFCC. Int. J. Comput. Appl. 2011, 17, 1–7. [Google Scholar] [CrossRef]
  14. Singh, S.; Rajan, D.E. A Vector Quantization approach Using MFCC for Speaker Recognition. In Proceedings of the International Conference Systemic, Cybernatics and Informatics ICSCI under the Aegis of Pentagram Research Centre Hyderabad, Hyderabad, India, 4–7 January 2007; pp. 786–790. [Google Scholar]
  15. Chakroborty, S.; Saha, G. Improved text-independent speaker identification using fused MFCC & IMFCC feature sets based on Gaussian filter. Int. J. Signal Process. 2009, 5, 11–19. [Google Scholar]
  16. Jelil, S.; Das, R.K.; Prasanna, S.M.; Sinha, R. Spoof detection using source, instantaneous frequency and cepstral features. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 22–26. [Google Scholar]
  17. Sahidullah, M.; Kinnunen, T.; Hanilçi, C. A comparison of features for synthetic speech detection. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015. [Google Scholar]
  18. Loweimi, E.; Barker, J.; Hain, T. Statistical normalisation of phase-based feature representation for robust speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5310–5314. [Google Scholar]
  19. Pal, M.; Paul, D.; Saha, G. Synthetic speech detection using fundamental frequency variation and spectral features. Comput. Speech Lang. 2018, 48, 31–50. [Google Scholar] [CrossRef]
  20. Patil, A.T.; Patil, H.A.; Khoria, K. Effectiveness of energy separation-based instantaneous frequency estimation for cochlear cepstral features for synthetic and voice-converted spoofed speech detection. Comput. Speech Lang. 2022, 72, 101301. [Google Scholar] [CrossRef]
